Re: Network namespaces a path to mergable code.

2006-06-27 Thread Eric W. Biederman
Sam Vilain <[EMAIL PROTECTED]> writes:

> Eric W. Biederman wrote:
>> In general it is possible to get file descriptors opened by someone
>> else because unix domain sockets allow file descriptor passing.  Similarly
>> I think there are cases in both unshare and fork that allow you to keep
>> sockets opened before you entered a namespace.
>>   
>
> This is an interesting point; it is known to be possible to do this on a
> traditional system, because with a Unix Domain socket, the other end is
> always in the same Unix Domain.
>
> However what we're doing is saying that, well, the other end of the
> socket might not be in the same Unix Domain. In fact, we've already
> smashed to pieces this monolithic concept of a Unix Domain, to the point
> where the other end might be in a different network domain, but is in
> the same filesystem domain, for instance. Does it get to pass file
> descriptors through?

Despite what it might look like, unix domain sockets do not live in the
filesystem.  They store a cookie in the filesystem that roughly
corresponds to the port number of an AF_INET socket.  When you open a
socket the lookup is done by the cookie retrieved from the filesystem.
So except for their cookies unix domain sockets are always in the
network stack.
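
A minimal sketch of that point, assuming Linux and a writable /tmp (the
path and the function name are made up for illustration).  It binds a
socket to a name, then checks that the filesystem entry is a socket-type
inode that open() refuses, which is what you would expect if the path is
only a lookup key:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Bind a Unix domain socket to a temporary path, then check that the
 * filesystem entry is only a name: stat() reports a socket inode, but
 * open() on it is refused with ENXIO.  Returns 0 if both hold, -1 otherwise.
 */
int unix_path_is_only_a_name(void)
{
    struct sockaddr_un addr;
    struct stat st;
    int sock, fd, ok = -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    /* Hypothetical path, made unique with the pid. */
    snprintf(addr.sun_path, sizeof(addr.sun_path),
             "/tmp/cookie-demo-%ld.sock", (long)getpid());
    unlink(addr.sun_path);

    sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;
    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(sock);
        return -1;
    }

    /* The bound name shows up as a socket-type inode... */
    if (stat(addr.sun_path, &st) == 0 && S_ISSOCK(st.st_mode)) {
        /* ...but it cannot be opened like a regular file. */
        fd = open(addr.sun_path, O_RDWR);
        if (fd < 0 && errno == ENXIO)
            ok = 0;
        if (fd >= 0)
            close(fd);
    }

    unlink(addr.sun_path);
    close(sock);
    return ok;
}
```

The only way to reach the socket behind the name is connect() or
sendto(), both of which go through the network stack.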

Which means it is a royal pain to create a unix domain socket between
namespaces.  Which is the generally desired behavior.

> We would appear to be stretching the definition of "Unix Domain"
> somewhat if we allow these sockets to exist between network namespaces.
> Maybe it doesn't matter; this is just a VFS namespace feature/caveat.

Unless I am mistaken this is something that can only be created (given
my described semantics) when you create the container.  So if you want
it you got it but you can't create it if you never had it.

Eric
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TOE, etc.

2006-06-27 Thread David Miller
From: Herbert Xu <[EMAIL PROTECTED]>
Date: Wed, 28 Jun 2006 15:35:54 +1000

> With their RDMA NIC, we'll have TCP/SCTP connections that bypass
> netfilter, tc, IPsec, AF_PACKET/tcpdump and the rest of our stack
> while at the same time it is using the same IP address as us and
> deciding what packets we will or won't see.

That's true.  I don't think we should really add any more
help for these kinds of things then.


Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Sam Vilain
Eric W. Biederman wrote:
> Having a few more network interfaces for a layer 2 solution
> is fundamental.  Believing without proof and after arguments
> to the contrary that you have not contradicted that a layer 2
> solution is inherently slower is non-productive.  Arguing
> that a layer 2 only solution must prove itself on guest to guest
> communication is also non-productive.
>   

Yes, it does break what some people consider to be a sanity condition
when you don't have loopback anymore within a guest. I once experimented
with using 127.* addresses for per-guest loopback devices with vserver
to fix this, but that couldn't work without fixing glibc to not make
assumptions deep in the bowels of the resolver. I logged a fault with
gnu.org and you can guess where it went :-).

I don't think it's just the performance issue, though. Consider also
that if you only have one set of interfaces to manage, the overall
configuration of the network stack is simpler. `ip addr list' on the
host shows all the addresses on the system, you only have one routing
table to manage, one set of iptables, etc.

That being said, perhaps if each guest got its own interface, and from
some suitably privileged context you could see them all, perhaps it
would be nicer and maybe just as fast. Perhaps then *devices* could get
their own routing namespaces, and routing namespaces could get iptables
namespaces, or something like that, to give the most options.

> With a guest with 4 IPs 
> 10.0.0.1 192.168.0.1 172.16.0.1 127.0.0.1
> How do you make INADDR_ANY work with just filtering at bind time?
>   

It used to just bind to the first one. Don't know if it still does.

Sam.


Re: Network namespaces a path to mergable code.

2006-06-27 Thread Sam Vilain
Eric W. Biederman wrote:
> In general it is possible to get file descriptors opened by someone
> else because unix domain sockets allow file descriptor passing.  Similarly
> I think there are cases in both unshare and fork that allow you to keep
> sockets opened before you entered a namespace.
>   

This is an interesting point; it is known to be possible to do this on a
traditional system, because with a Unix Domain socket, the other end is
always in the same Unix Domain.

However what we're doing is saying that, well, the other end of the
socket might not be in the same Unix Domain. In fact, we've already
smashed to pieces this monolithic concept of a Unix Domain, to the point
where the other end might be in a different network domain, but is in
the same filesystem domain, for instance. Does it get to pass file
descriptors through?

We would appear to be stretching the definition of "Unix Domain"
somewhat if we allow these sockets to exist between network namespaces.
Maybe it doesn't matter; this is just a VFS namespace feature/caveat.

Sam.


Re: Network namespaces a path to mergable code.

2006-06-27 Thread Abdallah Chatila
On Tue, Jun 27, 2006 at 10:33:48PM -0600, Eric W. Biederman wrote:
> 
> Something to examine here is whether, if both network devices and sockets
> are tagged, that still allows implicit network namespace passing.

I think avoiding implicit network namespace passing gives more
power/flexibility, and it would make it clearer which
container/namespace a given network resource belongs to.

From our experience with an implementation of network containers [Virtual
Routing for ipv4/ipv6, with complete isolation between containers, where IP
addresses can overlap...], there are problem domains in which you cannot
afford to duplicate a process/daemon in each container [a big process for
instance, scalability w.r.t. number of containers etc.]

By having a proper namespace tag per socket, this can be solved by allowing
a process running in the host context to create sockets in that namespace
and then move them to the target guest namespaces [via a special setsockopt
for instance, or a unix domain socket as you said].


Regards

> 
> Eric


[Patch 1/1] AF_UNIX Datagram getpeersec (minor fix)

2006-06-27 Thread Catherine Zhang
Hi,

Minor fix (un-export selinux_get_sock_sid()).

thanks,
Catherine

--

From: [EMAIL PROTECTED]

This patch implements an API whereby an application can determine the 
label of its peer's Unix datagram sockets via the auxiliary data mechanism of 
recvmsg.

Patch purpose:

This patch enables a security-aware application to retrieve the
security context of the peer of a Unix datagram socket.  The application 
can then use this security context to determine the security context for 
processing on behalf of the peer who sent the packet. 

Patch design and implementation:

The design and implementation is very similar to the UDP case for INET 
sockets.  Basically we build upon the existing Unix domain socket API for
retrieving user credentials.  Linux offers the API for obtaining user
credentials via ancillary messages (i.e., out of band/control messages
that are bundled together with a normal message).  To retrieve the security
context, the application first tells the kernel that it wants it by
setting the SO_PASSSEC option via setsockopt.  Then the application
retrieves the security context using the auxiliary data mechanism.

An example server application for a Unix datagram socket should look like this:

    toggle = 1;
    toggle_len = sizeof(toggle);

    /* setsockopt() takes the option length by value, not by pointer */
    setsockopt(sockfd, SOL_SOCKET, SO_PASSSEC, &toggle, toggle_len);
    recvmsg(sockfd, &msg_hdr, 0);
    if (msg_hdr.msg_controllen > sizeof(struct cmsghdr)) {
        cmsg_hdr = CMSG_FIRSTHDR(&msg_hdr);
        if (cmsg_hdr->cmsg_len <= CMSG_LEN(sizeof(scontext)) &&
            cmsg_hdr->cmsg_level == SOL_SOCKET &&
            cmsg_hdr->cmsg_type == SCM_SECURITY) {
            /* copy only the bytes the kernel actually supplied */
            memcpy(&scontext, CMSG_DATA(cmsg_hdr),
                   cmsg_hdr->cmsg_len - CMSG_LEN(0));
        }
    }

sock_setsockopt is enhanced with a new socket option SO_PASSSEC (recorded
as the SOCK_PASSSEC socket flag) to allow a server socket to receive the
security context of the peer.

Testing:

We have tested the patch by setting up Unix datagram client and server
applications.  We verified that the server can retrieve the security context 
using the auxiliary data mechanism of recvmsg.


---

 include/asm-alpha/socket.h   |1 +
 include/asm-arm/socket.h |1 +
 include/asm-arm26/socket.h   |1 +
 include/asm-cris/socket.h|1 +
 include/asm-frv/socket.h |1 +
 include/asm-h8300/socket.h   |1 +
 include/asm-i386/socket.h|1 +
 include/asm-ia64/socket.h|1 +
 include/asm-m32r/socket.h|1 +
 include/asm-m68k/socket.h|1 +
 include/asm-mips/socket.h|1 +
 include/asm-parisc/socket.h  |1 +
 include/asm-powerpc/socket.h |1 +
 include/asm-s390/socket.h|1 +
 include/asm-sh/socket.h  |1 +
 include/asm-sparc/socket.h   |1 +
 include/asm-sparc64/socket.h |1 +
 include/asm-v850/socket.h|1 +
 include/asm-x86_64/socket.h  |1 +
 include/asm-xtensa/socket.h  |1 +
 include/linux/net.h  |1 +
 include/net/af_unix.h|6 ++
 include/net/scm.h|   17 +
 net/core/sock.c  |   11 +++
 net/unix/af_unix.c   |   27 +++
 security/selinux/hooks.c |   11 ---
 26 files changed, 90 insertions(+), 3 deletions(-)

diff -puN include/asm-alpha/socket.h~lsm-secpeer-unix include/asm-alpha/socket.h
--- linux-2.6.17-rc6-mm2-JM/include/asm-alpha/socket.h~lsm-secpeer-unix	2006-06-27 18:14:52.0 -0400
+++ linux-2.6.17-rc6-mm2-JM-cxzhang/include/asm-alpha/socket.h	2006-06-27 18:16:31.0 -0400
@@ -51,6 +51,7 @@
 #define SCM_TIMESTAMP  SO_TIMESTAMP
 
 #define SO_PEERSEC 30
+#define SO_PASSSEC 34
 
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION 19
diff -puN include/asm-arm/socket.h~lsm-secpeer-unix include/asm-arm/socket.h
--- linux-2.6.17-rc6-mm2-JM/include/asm-arm/socket.h~lsm-secpeer-unix	2006-06-27 18:15:10.0 -0400
+++ linux-2.6.17-rc6-mm2-JM-cxzhang/include/asm-arm/socket.h	2006-06-27 18:16:31.0 -0400
@@ -48,5 +48,6 @@
 #define SO_ACCEPTCONN  30
 
 #define SO_PEERSEC 31
+#define SO_PASSSEC 34
 
 #endif /* _ASM_SOCKET_H */
diff -puN include/asm-arm26/socket.h~lsm-secpeer-unix include/asm-arm26/socket.h
--- linux-2.6.17-rc6-mm2-JM/include/asm-arm26/socket.h~lsm-secpeer-unix	2006-06-27 18:15:10.0 -0400
+++ linux-2.6.17-rc6-mm2-JM-cxzhang/include/asm-arm26/socket.h	2006-06-27 18:16:31.0 -0400
@@ -48,5 +48,6 @@
 #define SO_ACCEPTCONN  30
 
 #define SO_PEERSEC 31
+#define SO_PASSSEC 34
 
 #endif /* _ASM_SOCKET_H */
diff -puN include/asm-cris/socket.h~lsm-secpeer-unix include/asm-cris/socket.h
--- linux-2.6.17-rc6-mm2-JM/include/asm-cris/socket.h~lsm-secpeer-unix	2006-06-27 18:15:10.0 -0400
+++ linux-2.6.17-rc6-mm2-JM-cxzhang/include/asm-cris/socket.h	2006-06-27 18:16:31.0 -0400
@@ -50,6 +50,7 @@
 #define SO_ACCEPTCONN  30
 
 #define SO_PEERSEC 31
+#define 

Re: TOE, etc.

2006-06-27 Thread Herbert Xu
On Tue, Jun 27, 2006 at 09:43:23PM -0700, David Miller wrote:
> 
> Socket state, and that is one thing I don't see them doing yet.

I wonder what happens when the Linux TCP stack attempts to open a
connection to a remote host when that connection is already open
in the RDMA NIC?  For that matter what happens if a Linux application
decides to listen on a TCP port already listened on by the RDMA
NIC?

The only saving grace is that they're only doing RDMA rather than
arbitrary TCP.  However, exactly the same infrastructure can be used
to do arbitrary TCP should they wish to.
 
> But we have to realize they've already been given 95% of the
> interfaces they need to speak IP using our routes and our neighbour
> entries.
> 
> Right?

Yes, however I think the same argument could be applied to TOE.

With their RDMA NIC, we'll have TCP/SCTP connections that bypass
netfilter, tc, IPsec, AF_PACKET/tcpdump and the rest of our stack
while at the same time it is using the same IP address as us and
deciding what packets we will or won't see.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: tg3 driver and interrupt coalescence questions

2006-06-27 Thread Robert Iakobashvili

Hi Cris,


> I'm looking to decrease the interrupt load on the system.  During the
> test I mentioned above I had some interesting and confusing results.
> The changes from the default settings to the settings I posted resulted
> in a 100% performance increase (counted by the number of VoIP audio
> streams the tested server could support).  With default settings one of
> the two CPUs in the system maxed out at 99% cpu usage handling
> interrupts, while the second CPU was not maxed out, but we started to
> drop packets and the VoIP call setups started showing retransmits (which
> is the measurement for failure in this test) at about 300 streams.  With
> the new settings we were able to hit 600 streams.
>
> So I definitely recognized a significant improvement.  However I'd still
> like to get more improvement.  At 600 streams and 20ms packets we are
> looking at 30,000 pps.  The % of cpu (1 CPU as apparently the interrupts
> can't be shared across multiple CPUs) used for interrupt handling at
> this 600 stream limit was 88.0%.
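
The packet-rate figure quoted above follows from the packetization
interval: each stream emits one packet every 20 ms, i.e. 50 packets per
second per direction.  A small sketch of that arithmetic (the function
name is made up for illustration, and it counts one direction only):

```c
#include <assert.h>

/*
 * Packets per second in one direction when every stream sends one
 * packet each ptime_ms milliseconds.  Returns 0 for a zero interval.
 */
unsigned long streams_to_pps(unsigned long streams, unsigned long ptime_ms)
{
    if (ptime_ms == 0)
        return 0;
    return streams * 1000UL / ptime_ms;
}
```

With 600 streams and 20 ms packets this gives 30,000 pps; the 300-stream
limit seen with the default settings corresponds to 15,000 pps.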


Interrupts may or may not be balanced across multiple CPUs.
It depends on the following areas:
1. whether that option was enabled in the kernel at compile time;
2. whether a user-space interrupt-balancing service is running
("irqbalance" on Red Hat; nothing similar on Debian);
3. whether CPU affinity is set for the irq.

Normally, irq affinity for a NIC interrupt is considered good, but if a CPU
is overloaded you may try irq balancing.
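
For the affinity case, the kernel takes a hexadecimal CPU bitmask in
/proc/irq/<N>/smp_affinity.  A rough sketch of building that mask from a
list of CPU numbers follows; the helper name is made up, and it only
handles CPUs 0..31 to keep the example short:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the hex bitmask string for /proc/irq/<N>/smp_affinity from a
 * list of CPU numbers.  Only CPUs 0..31 are handled in this sketch.
 * Returns 0 on success, -1 on bad input or if out is too small.
 */
int cpus_to_affinity_mask(const int *cpus, int ncpus,
                          char *out, size_t out_size)
{
    unsigned long mask = 0;
    int i, n;

    for (i = 0; i < ncpus; i++) {
        if (cpus[i] < 0 || cpus[i] > 31)
            return -1;
        mask |= 1UL << cpus[i];
    }
    if (mask == 0)
        return -1;      /* an empty mask would be rejected anyway */

    n = snprintf(out, out_size, "%lx", mask);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return 0;
}
```

For CPUs 0 and 1 this yields "3".  Writing that string to the
smp_affinity file of the NIC's irq (as root, with the irq number taken
from /proc/interrupts) allows the interrupt on both CPUs; a running
irqbalance may overwrite the setting.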


> Now what was interesting was on the test generation side (same hardware
> exactly) of things, I was using the SIPP software to generate the VoIP
> streams, and each blade in the blade server was only able to generate
> ~200 streams, with default settings in ethtool, one of the CPUs would
> hit max usage for interrupt handling at that point.  So I modified the
> ethtool settings to match those I listed above and there was no
> discernable difference.  It was identical performance to the default
> settings.


Generating RTP streams can burn CPU cycles, as can sending them out to
the network, so balancing the load among the CPUs (irq balancing) may
improve things.

--
Sincerely,
--
Robert Iakobashvili, coroberti at gmail dot com
Navigare necesse est, vivere non est necesse.
--


Re: [PATCH]NET: Add ECN support for TSO

2006-06-27 Thread Herbert Xu
On Tue, Jun 27, 2006 at 09:54:39PM -0700, Michael Chan wrote:
>
> Assuming that we'll later have GSO_TCPV6, isn't it better to check for
> TCPV4 explicitly now?  Or just change it later when necessary.

Good point, I suppose you never know whether a V6 TSO-capable card is going
to handle ECN correctly in both cases.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH]NET: Add ECN support for TSO

2006-06-27 Thread Michael Chan
On Wed, 2006-06-28 at 14:42 +1000, Herbert Xu wrote:
> On Tue, Jun 27, 2006 at 09:37:01PM -0700, Michael Chan wrote:

> > @@ -56,6 +55,9 @@ static inline void TCP_ECN_send(struct s
> > if (tp->ecn_flags&TCP_ECN_QUEUE_CWR) {
> > tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
> > skb->h.th->cwr = 1;
> > +   if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
> > +   skb_shinfo(skb)->gso_type |=
> > +   SKB_GSO_TCPV4_ECN;
> 
> As a byte-pincher I must suggest that you turn this check into something
> like 
> 
>   if (skb_shinfo(skb)->gso_type)
> 
> or even
> 
>   if (skb_shinfo(skb)->gso_size)
> 
Assuming that we'll later have GSO_TCPV6, isn't it better to check for
TCPV4 explicitly now?  Or just change it later when necessary.
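
To make the trade-off concrete, here is a userspace sketch of the two
checks.  The enum mirrors the gso_type bits; SKB_GSO_DODGY and
SKB_GSO_TCPV4_ECN are taken from the patch, the first two values are my
assumption about the current headers, and the function names are made up:

```c
#include <assert.h>

/* Userspace mirror of the gso_type bits (first two values assumed). */
enum {
    SKB_GSO_TCPV4     = 1 << 0,
    SKB_GSO_UDPV4     = 1 << 1,
    SKB_GSO_DODGY     = 1 << 2,
    SKB_GSO_TCPV4_ECN = 1 << 3,
};

/* Explicit check, as in the patch: only TCPv4 GSO packets get the ECN bit. */
int mark_ecn_explicit(int gso_type)
{
    if (gso_type & SKB_GSO_TCPV4)
        gso_type |= SKB_GSO_TCPV4_ECN;
    return gso_type;
}

/* Shortened check: any GSO packet at all gets the ECN bit. */
int mark_ecn_any(int gso_type)
{
    if (gso_type)
        gso_type |= SKB_GSO_TCPV4_ECN;
    return gso_type;
}
```

Both versions agree for a TCPv4 GSO packet and for a non-GSO packet.
They differ only when some other type bit is set, which cannot happen in
TCP_ECN_send today; SKB_GSO_UDPV4 stands in here for a future type such
as TCPv6.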



Re: TOE, etc.

2006-06-27 Thread David Miller
From: Herbert Xu <[EMAIL PROTECTED]>
Date: Wed, 28 Jun 2006 14:29:59 +1000

> On Wed, Jun 28, 2006 at 12:18:25AM -0400, Jeff Garzik wrote:
> > 
> > A PCI device that presents itself as a SCSI controller, but under the 
> > hood is really iSCSI-over-TCP smells like TOE.  Running a virtualized 
> > Linux guest on top of a proprietary stack [which provides networking 
> > services to guests] also smells like TOE.  :)
> 
> Agreed.  However, when they start adding hooks to the ARP table, the
> routing table, and PMTU management, it begs the question what more is
> there to add for TOE (well, user-space driven TOE at least)?

Socket state, and that is one thing I don't see them doing yet.

> Put it another way, I think the dividing line between TOE and iSCSI or
> virtualisation is exactly the interface between them and the Linux kernel.
> If the interface is an existing one such as SCSI or standard IP then it's
> OK.  However, when it starts poking in the guts of the Linux stack I'd say
> that it has crossed the line.

Yeah, it's starting to smell really bad.

But we have to realize they've already been given 95% of the
interfaces they need to speak IP using our routes and our neighbour
entries.

Right?


Re: [PATCH]NET: Add ECN support for TSO

2006-06-27 Thread Herbert Xu
On Tue, Jun 27, 2006 at 09:37:01PM -0700, Michael Chan wrote:
> 
> Signed-off-by: Michael Chan <[EMAIL PROTECTED]>

Looks good to me too!

> @@ -56,6 +55,9 @@ static inline void TCP_ECN_send(struct s
>   if (tp->ecn_flags&TCP_ECN_QUEUE_CWR) {
>   tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
>   skb->h.th->cwr = 1;
> + if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
> + skb_shinfo(skb)->gso_type |=
> + SKB_GSO_TCPV4_ECN;

As a byte-pincher I must suggest that you turn this check into something
like 

if (skb_shinfo(skb)->gso_type)

or even

if (skb_shinfo(skb)->gso_size)

:)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism)

2006-06-27 Thread Jeff Garzik

Herbert Xu wrote:
> On Wed, Jun 28, 2006 at 12:18:25AM -0400, Jeff Garzik wrote:
>> A PCI device that presents itself as a SCSI controller, but under the 
>> hood is really iSCSI-over-TCP smells like TOE.  Running a virtualized 
>> Linux guest on top of a proprietary stack [which provides networking 
>> services to guests] also smells like TOE.  :)
>
> Agreed.  However, when they start adding hooks to the ARP table, the
> routing table, and PMTU management, it begs the question what more is
> there to add for TOE (well, user-space driven TOE at least)?

Well, you've always been able to implement a userspace (or otherwise 
completely-virtualized) network stack.  tuntap and the packet socket 
enable that, if nothing else.  But, like you characterize below, those 
are existing, well-defined, easily contained interfaces.

> Put it another way, I think the dividing line between TOE and iSCSI or
> virtualisation is exactly the interface between them and the Linux kernel.
> If the interface is an existing one such as SCSI or standard IP then it's
> OK.  However, when it starts poking in the guts of the Linux stack I'd say
> that it has crossed the line.


Strongly agreed.

Jeff




Re: [PATCH]NET: Add ECN support for TSO

2006-06-27 Thread Michael Chan
On Wed, 2006-06-28 at 13:48 +1000, Herbert Xu wrote:

> I think you're mixing up GSO the mechanism with GSO the flag.  The GSO
> flag simply tells the TCP stack whether TSO should be used or not, even
> if the hardware does not support TSO at all.  The GSO mechanism on the
> other hand is ALWAYS present.  So regardless of the presence of the GSO
> flag, you can always rely on the GSO mechanism to pick up the pieces (or
> rather generate the pieces as the case may be :)
> 
Thanks, that was my confusion.  Here's the revised patch:

[NET]: Add ECN support for TSO

In the current TSO implementation, NETIF_F_TSO and ECN cannot be
turned on together in a TCP connection.  The problem is that most
hardware that supports TSO does not handle CWR correctly if it is set
in the TSO packet.  Correct handling requires CWR to be set in the
first packet only if it is set in the TSO header.

This patch adds the ability to turn on NETIF_F_TSO and ECN using 
GSO if necessary to handle TSO packets with CWR set.  Hardware
that handles CWR correctly can turn on NETIF_F_TSO_ECN in the dev->
features flag.

All TSO packets with CWR set will have the SKB_GSO_TCPV4_ECN set.  If
the output device does not have the NETIF_F_TSO_ECN feature set, GSO
will split the packet up correctly with CWR only set in the first
segment.
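
As an illustration only, the splitting rule can be modelled in userspace
like this.  The struct and function below are hypothetical and much
simpler than the real GSO code; they only show CWR being kept on the
first segment and cleared on the rest:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified segment descriptor; not a kernel structure. */
struct seg {
    size_t len;   /* payload bytes carried by this segment */
    int cwr;      /* 1 if the TCP CWR flag is set on this segment */
};

/*
 * Split payload_len bytes into MSS-sized segments.  The CWR flag of the
 * large packet is copied to the first segment only.  Returns the number
 * of segments written, or 0 if mss is zero or out[] is too small.
 */
size_t split_with_cwr(size_t payload_len, size_t mss, int cwr,
                      struct seg *out, size_t max_segs)
{
    size_t n = 0;

    if (mss == 0)
        return 0;
    while (payload_len > 0) {
        size_t chunk = payload_len < mss ? payload_len : mss;

        if (n == max_segs)
            return 0;
        out[n].len = chunk;
        out[n].cwr = (n == 0) ? cwr : 0;
        payload_len -= chunk;
        n++;
    }
    return n;
}
```

A 3000-byte payload with an MSS of 1448 becomes three segments of 1448,
1448 and 104 bytes, and only the first one carries CWR.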

With help from Herbert Xu <[EMAIL PROTECTED]>.

Since ECN can always be enabled with TSO, the SOCK_NO_LARGESEND sock
flag is completely removed.

Signed-off-by: Michael Chan <[EMAIL PROTECTED]>


diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 84b0f0d..a42a9f4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -316,6 +316,7 @@ struct net_device
 #define NETIF_F_TSO		(SKB_GSO_TCPV4 << NETIF_F_GSO_SHIFT)
 #define NETIF_F_UFO		(SKB_GSO_UDPV4 << NETIF_F_GSO_SHIFT)
 #define NETIF_F_GSO_ROBUST	(SKB_GSO_DODGY << NETIF_F_GSO_SHIFT)
+#define NETIF_F_TSO_ECN	(SKB_GSO_TCPV4_ECN << NETIF_F_GSO_SHIFT)
 
 #define NETIF_F_GEN_CSUM   (NETIF_F_NO_CSUM | NETIF_F_HW_CSUM)
 #define NETIF_F_ALL_CSUM   (NETIF_F_IP_CSUM | NETIF_F_GEN_CSUM)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5fb72da..e74c294 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -175,6 +175,9 @@ enum {
 
/* This indicates the skb is from an untrusted source. */
SKB_GSO_DODGY = 1 << 2,
+
+   /* This indicates the tcp segment has CWR set. */
+   SKB_GSO_TCPV4_ECN = 1 << 3,
 };
 
 /** 
diff --git a/include/net/sock.h b/include/net/sock.h
index 2d8d6ad..7136bae 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -383,7 +383,6 @@ enum sock_flags {
SOCK_USE_WRITE_QUEUE, /* whether to call sk->sk_write_space in sock_wfree */
SOCK_DBG, /* %SO_DEBUG setting */
SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
-   SOCK_NO_LARGESEND, /* whether to sent large segments or not */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
 };
@@ -1033,7 +1032,7 @@ static inline void sk_setup_caps(struct 
if (sk->sk_route_caps & NETIF_F_GSO)
sk->sk_route_caps |= NETIF_F_TSO;
if (sk->sk_route_caps & NETIF_F_TSO) {
-   if (sock_flag(sk, SOCK_NO_LARGESEND) || dst->header_len)
+   if (dst->header_len)
sk->sk_route_caps &= ~NETIF_F_TSO;
else 
sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
diff --git a/include/net/tcp_ecn.h b/include/net/tcp_ecn.h
index c6b8439..7bb366f 100644
--- a/include/net/tcp_ecn.h
+++ b/include/net/tcp_ecn.h
@@ -31,10 +31,9 @@ static inline void TCP_ECN_send_syn(stru
struct sk_buff *skb)
 {
tp->ecn_flags = 0;
-   if (sysctl_tcp_ecn && !(sk->sk_route_caps & NETIF_F_TSO)) {
+   if (sysctl_tcp_ecn) {
TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_ECE|TCPCB_FLAG_CWR;
tp->ecn_flags = TCP_ECN_OK;
-   sock_set_flag(sk, SOCK_NO_LARGESEND);
}
 }
 
@@ -56,6 +55,9 @@ static inline void TCP_ECN_send(struct s
if (tp->ecn_flags&TCP_ECN_QUEUE_CWR) {
tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
skb->h.th->cwr = 1;
+   if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
+   skb_shinfo(skb)->gso_type |=
+   SKB_GSO_TCPV4_ECN;
}
} else {
/* ACK or retransmitted segment: clear ECT|CE */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 94fe5b1..7fa0b4a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4178,8 +4178,6 @@ static int tcp_rcv_synsent_state_process
 */
 
TCP_ECN_rcv_synack(tp, th);
- 

Re: Network namespaces a path to mergable code.

2006-06-27 Thread Eric W. Biederman
Sam Vilain <[EMAIL PROTECTED]> writes:

> It sounds then like it would be a good start to have general socket
> namespaces, if it would merge more easily - perhaps then network device
> namespaces would fall into place more easily.


I guess I really see both sockets and devices as the fundamental
entities of a network namespace.  Sockets need to be tagged because
in the general case there is no guarantee that a socket that you are
using was created in the network namespace of your current process.

In general it is possible to get file descriptors opened by someone
else because unix domain sockets allow file descriptor passing.  Similarly
I think there are cases in both unshare and fork that allow you to keep
sockets opened before you entered a namespace.
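
For reference, descriptor passing looks roughly like this from
userspace.  This is a single-process sketch over a socketpair, with
made-up helper names; in the namespace case the two ends would belong to
different processes, but the mechanism (an SCM_RIGHTS control message)
is the same:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send descriptor fd over the connected Unix socket chan. */
static int send_fd(int chan, int fd)
{
    char byte = 'x';    /* one byte of ordinary data carries the cmsg */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from chan; returns the new fd or -1. */
static int recv_fd(int chan)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(chan, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/*
 * Pass a socket across a socketpair and check that the received
 * descriptor is a new number for the same underlying socket.
 * Returns 0 on success, -1 otherwise.
 */
int fd_passing_demo(void)
{
    int chan[2], orig, copy = -1, ok = -1;
    struct stat a, b;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, chan) < 0)
        return -1;
    orig = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (orig >= 0 && send_fd(chan[0], orig) == 0)
        copy = recv_fd(chan[1]);
    if (copy >= 0 && copy != orig &&
        fstat(orig, &a) == 0 && fstat(copy, &b) == 0 &&
        a.st_dev == b.st_dev && a.st_ino == b.st_ino)
        ok = 0;

    if (copy >= 0)
        close(copy);
    if (orig >= 0)
        close(orig);
    close(chan[0]);
    close(chan[1]);
    return ok;
}
```

The receiver ends up holding the sender's socket itself, not a copy,
which is why the socket has to carry its own namespace tag.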

Since you can't create a new socket in a different network namespace
I can't see any real problems with allowing them to be used, but they
are something to be careful about in container creation code.

Something to examine here is whether, if both network devices and sockets
are tagged, that still allows implicit network namespace passing.

Eric


Re: TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism)

2006-06-27 Thread Herbert Xu
On Wed, Jun 28, 2006 at 12:18:25AM -0400, Jeff Garzik wrote:
> 
> A PCI device that presents itself as a SCSI controller, but under the 
> hood is really iSCSI-over-TCP smells like TOE.  Running a virtualized 
> Linux guest on top of a proprietary stack [which provides networking 
> services to guests] also smells like TOE.  :)

Agreed.  However, when they start adding hooks to the ARP table, the
routing table, and PMTU management, it begs the question what more is
there to add for TOE (well, user-space driven TOE at least)?
 
> Unfortunately I don't have more details, so you just get a generalized 
> rant :)

OK, the patch under discussion here adds hooks to all the stuff in the
previous paragraph for the purpose of RDMA over TCP (well I must say
that the exact RDMA application/hardware has never been clearly given
but this is what I can gather from the previous posts).

Put it another way, I think the dividing line between TOE and iSCSI or
virtualisation is exactly the interface between them and the Linux kernel.
If the interface is an existing one such as SCSI or standard IP then it's
OK.  However, when it starts poking in the guts of the Linux stack I'd say
that it has crossed the line.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: Network namespaces a path to mergable code.

2006-06-27 Thread Eric W. Biederman
Andrey Savochkin <[EMAIL PROTECTED]> writes:

> Eric,
>
> On Tue, Jun 27, 2006 at 11:20:40AM -0600, Eric W. Biederman wrote:
>> 
>> Thinking about this I am going to suggest a slightly different direction
>> to get a patchset we can merge.
>> 
>> First we concentrate on the fundamentals.
>> - How we mark a device as belonging to a specific network namespace.
>> - How we mark a socket as belonging to a specific network namespace.
>
> I agree with the direction of your thoughts.
> I was trying to do a similar thing, define clear steps in network
> namespace merging.
>
> My first patchset covers devices but not sockets.
> The only difference from what you're suggesting is ipv4 routing.
> For me, it is not less important than devices and sockets.  Maybe even
> more important, since routing exposes design deficiencies less obvious at
> socket level.

I agree we need to do it.  I mostly want a base that allows us to 
not need to convert the whole network stack at once and still be able
to merge code all the way to the stable kernel.

The routing code is important for understanding design choices.  It
isn't important for merging if that makes sense.   

For everyone looking at routing choices the IPv6 routing table is
interesting because it does not use a hash table, and seems quite
possibly to be an equally fast structure that scales better.

There is something to think about there.

>> As part of the fundamentals we add a patch to the generic socket code
>> that by default will disable it for protocol families that do not indicate
>> support for handling network namespaces, on a non-default network namespace.
>
> Fine
>
> Can you summarize your objections against my way of handling devices, please?
> And what was the typo you referred to in your letter to Kirill Korotaev?

I have no fundamental objections to the content I have seen so far.

Please read the first email Kirill responded to.  I quoted a couple
of sections of code and described the bugs I saw with the patch.

All minor things.  The typo I was referring to was a section where the
original iteration was on an ifp variable and you called it dev
without changing the rest of the code in that section.  

The only big issue was that the patch was too big, and should be split
into a patchset for better review.  One patch for the new functions,
and then an additional patch for each driver/subsystem hunk describing
why that chunk needed to be changed.

I'm still curious why many of those chunks can't use existing helper
functions, to be cleaned up.

Eric


TOE, etc. (was Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism)

2006-06-27 Thread Jeff Garzik

Herbert Xu wrote:
> On Tue, Jun 27, 2006 at 11:24:25PM -0400, Jeff Garzik wrote:
> >
> > I don't see how that position has changed?
> >
> > http://linux-net.osdl.org/index.php/TOE
>
> Well I must say that RDMA over TCP smells very much like TOE.  They've
> got an ARP table, a routing table, and presumably a TCP stack.


A PCI device that presents itself as a SCSI controller, but under the 
hood is really iSCSI-over-TCP smells like TOE.  Running a virtualized 
Linux guest on top of a proprietary stack [which provides networking 
services to guests] also smells like TOE.  :)


If a TOE vendor wants to do TOE in a way that is transparent to the 
kernel, more power to them.  Such non-Linux TCP stack solutions still 
suffer many of the problems listed at the web page above, but at least 
they impose no burden on kernel maintenance.


i.e. we really _do not_ want to get into the habit of co-managing arp 
tables, routing tables, filtering rules, and dozens of other such 
resources with multiple remote, independent TCP stacks.  We have enough 
complexity as it is today, coordinating between the random variations of 
SMP, uniprocessor, and NUMA machines out there.  Not to mention 
competing with under-the-hood firmware actions (ASF) on NICs.


As an aside, RDMA over TCP just seems silly.  TCP was _not_ meant to do 
the things that RDMA users want.  The infiniband/RDMA programming model 
is an ultra-low-latency polling model where one or two apps are allowed 
to completely consume the machine, either busy-waiting or processing 
messages.


Unfortunately I don't have more details, so you just get a generalized 
rant :)


Jeff





Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Eric W. Biederman
Herbert Poetzl <[EMAIL PROTECTED]> writes:

> On Tue, Jun 27, 2006 at 10:29:39AM -0600, Eric W. Biederman wrote:
>> Herbert Poetzl <[EMAIL PROTECTED]> writes:
>
>> I watched the linux-vserver irc channel for a while and almost
>> every network problem was caused by the change in semantics 
>> vserver provides.
>
> the problem here is not the change in semantics compared
> to a real linux system (as there basically is none) but
> compared to _other_ technologies like UML or QEMU, which
> add the need for bridging and additional interfaces, while
> Linux-VServer only focuses on the IP layer ...

Not being able to bind to INADDR_ANY is a huge semantic change.
Unless things have changed recently you get that change when
you have two IP addresses in Linux-Vserver.

Talking to the outside world through the loopback interface
is a noticeable semantics change.

Having to be careful of who uses INADDR_ANY on the host
when you have guests is essentially a semantics change.

Being able to talk to the outside world with a server
bound only to the loopback IP is a weird semantic
change.

And I suspect I missed something; it is weird and peculiar, and
I don't care to remember all of the exceptions.

Having a few more network interfaces for a layer 2 solution
is fundamental.  Believing, without proof and after arguments
to the contrary that you have not contradicted, that a layer 2
solution is inherently slower is non-productive.  Arguing
that a layer 2 only solution must prove itself on guest to guest
communication is also non-productive.

So just to sink one additional nail in the coffin of the silly
guest to guest communication issue.  For any two guests where
fast communication between them is really important I can run
an additional interface pair that requires no routing or bridging.
Given that the implementation of the tunnel device is essentially
the same as the loopback interface and that I make only one
trip through the network stack there will be no performance overhead.
Similarly for any critical guest communication to the outside world
I can give the guest a real network adapter.

That said I don't think those things will be necessary and that if
they are it is an optimization opportunity to make various bits
of the network stack faster.

Bridging or routing between guests is an exercise in simplicity
and control not a requirement.

>> In this case when you allow a guest more than one IP your hack 
>> while easy to maintain becomes much more complex. 
>
> why? a set of IPs is quite similar to a single IP (which
> is actually a subset), so no real change there, only
> IP_ANY means something different for a guest ...

Which simply filtering at bind time makes impossible.

With a guest with 4 IPs 
10.0.0.1 192.168.0.1 172.16.0.1 127.0.0.1
How do you make INADDR_ANY work with just filtering at bind time?

The host has at least the additional IPs.
10.0.0.2 192.168.0.2 172.16.0.2 127.0.0.1
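
To make the question concrete, here is a minimal user-space sketch of
what an application does when it binds to INADDR_ANY.  It is not code
from either implementation, and the helper name bind_wildcard is made
up for illustration.  The application hands the kernel one wildcard
address, so there is no per-address bind call for a filter to inspect:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind a UDP socket to INADDR_ANY on an ephemeral
 * port and report the local address the kernel recorded for it.
 * Returns 0 on success, -1 on error.
 */
static int bind_wildcard(struct in_addr *bound)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* 0.0.0.0 */
	sin.sin_port = 0;				/* let the kernel pick */

	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    getsockname(fd, (struct sockaddr *)&sin, &len) < 0) {
		close(fd);
		return -1;
	}

	*bound = sin.sin_addr;
	close(fd);
	return 0;
}
```

On an unmodified kernel the recorded address stays the wildcard, and
which of the local addresses a packet actually arrived on is only
decided at receive time.  A filter that acts only at bind time has to
either refuse the call or rewrite it to a single address, and with the
four guest addresses above neither is what the application asked for.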

Herbert I suspect we are talking about completely different
implementations otherwise I can't possibly see how we have
such different perceptions of their capabilities.

I am talking precisely about filtering at connect or bind time
the IP addresses that a guest can use.  Which as I recall is
what vserver implements.  If you are thinking of your ngnet
implementation that would explain things.

>> Especially as you address each case people care about one at a time.
>
> hmm?

Multiple IPs, IPv6, additional protocols, firewalls, etc.

>> In one shot this goes the entire way. Given how many people miss that
>> you do the work at layer 2 than at layer 3 I would not call this the
>> straight forward approach. The straight forward implementation yes,
>> but not the straight forward approach.
>
> seems I lost you here ...


>> > for example, you won't have multiple routing tables
>> > in a kernel where this feature is disabled, no?
>> > so why should it affect a guest, or require modified
>> > apps inside a guest when we would decide to provide
>> > only a single routing table?
>> >
>> >> From my POV, fully virtualized namespaces are the future. 
>> >
>> > the future is already there, it's called Xen or UML, or QEMU :)
>> 
>> Yep.  And now we need it to run fast.
>
> hmm, maybe you should try to optimize linux for Xen then,
> as I'm sure it will provide the optimal virtualization
> and has all the features folks are looking for (regarding
> virtualization)
>
> I thought we are trying to figure a light-weight subset
> of isolation and virtualization technologies and methods
> which make sense to have in mainline ...

And you presume doing things at layer 2 is more expensive than
layer 3.

From what I have seen of layer 3 solutions it is a 
bloody maintenance nightmare, and an inflexible mess.

>> >> It is what makes virtualization solution usable (w/o apps
>> >> modifications), provides all the features and doesn't require much
>> >> efforts from people to be used.
>> >
>> > and what if they want to use virtualization inside
>> > their guests? where

Re: [PATCH]NET: Add ECN support for TSO

2006-06-27 Thread Herbert Xu
On Tue, Jun 27, 2006 at 08:40:34PM -0700, Michael Chan wrote:
>
> We need to turn off NETIF_F_TSO for a connection that has negotiated to
> turn on ECN if the output device cannot handle TSO and ECN.  In other
> words, if the output device does not have either GSO or TSO_ECN feature
> set.

I think you're mixing up GSO the mechanism with GSO the flag.  The GSO
flag simply tells the TCP stack whether TSO should be used or not, even
if the hardware does not support TSO at all.  The GSO mechanism on the
other hand is ALWAYS present.  So regardless of the presence of the GSO
flag, you can always rely on the GSO mechanism to pick up the pieces (or
rather generate the pieces as the case may be :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH]NET: Add ECN support for TSO

2006-06-27 Thread Michael Chan
On Wed, 2006-06-28 at 13:10 +1000, Herbert Xu wrote:
> On Tue, Jun 27, 2006 at 08:06:47PM -0700, Michael Chan wrote:
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 2d8d6ad..2c75172 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -1033,7 +1033,8 @@ static inline void sk_setup_caps(struct 
> > if (sk->sk_route_caps & NETIF_F_GSO)
> > sk->sk_route_caps |= NETIF_F_TSO;
> > if (sk->sk_route_caps & NETIF_F_TSO) {
> > -   if (sock_flag(sk, SOCK_NO_LARGESEND) || dst->header_len)
> > +   if ((sock_flag(sk, SOCK_NO_LARGESEND) &&
> > +   !tso_ecn_capable(sk->sk_route_caps)) || dst->header_len)
> > sk->sk_route_caps &= ~NETIF_F_TSO;
> 
> Why turn it off? With GSO in place the stack will handle it just fine
> (even your description says so :)  We should instead remove all code
> that turns off TSO/ECN when the other is present.
> 
We need to turn off NETIF_F_TSO for a connection that has negotiated to
turn on ECN if the output device cannot handle TSO and ECN.  In other
words, if the output device does not have either GSO or TSO_ECN feature
set.
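
As a rough illustration of that rule, the check in the patched
sk_setup_caps() can be modelled in plain C.  This is only a sketch:
the F_* bit values and the setup_caps() name are invented here and do
not match the real NETIF_F_* constants, ecn_negotiated stands in for
sock_flag(sk, SOCK_NO_LARGESEND), and the else branch that adds
NETIF_F_SG and NETIF_F_HW_CSUM is left out.

```c
#include <stdbool.h>

/* Illustrative bit values only; the real NETIF_F_* constants differ. */
#define F_GSO      (1u << 0)
#define F_TSO      (1u << 1)
#define F_TSO_ECN  (1u << 2)

/* Mirrors tso_ecn_capable() from the patch: GSO or TSO_ECN is enough. */
static bool tso_ecn_capable(unsigned int features)
{
	return (features & F_GSO) || (features & F_TSO_ECN);
}

/*
 * Model of the TSO decision in the patched sk_setup_caps().
 * Returns the capability mask after the check.
 */
static unsigned int setup_caps(unsigned int caps, bool ecn_negotiated,
			       bool has_header_len)
{
	if (caps & F_GSO)
		caps |= F_TSO;
	if (caps & F_TSO) {
		if ((ecn_negotiated && !tso_ecn_capable(caps)) ||
		    has_header_len)
			caps &= ~F_TSO;
	}
	return caps;
}
```

With this model, a device that has only the TSO bit loses it once ECN
is negotiated, while a device with either the GSO or the TSO_ECN bit
keeps it.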



Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Eric W. Biederman
Alexey Kuznetsov <[EMAIL PROTECTED]> writes:

> Hello!
>
>> It may look weird, but do application really *need* to see eth0 rather
>> than eth858354?
>
> Applications do not care, humans do. :-)
>
> What's about applications they just need to see exactly the same device
> after migration. Not only name, but f.e. also its ifindex. If you do not
> create a separate namespace for netdevices, you will inevitably end up
> with some strange hack sort of VPIDs to translate (or to partition) ifindices
> or to tell that "ping -I eth858354 xxx" is too complicated application
> to survive migration.


Actually there are applications with peculiar licensing practices that
do look at devices like eth0 to verify you have the appropriate mac, and
do really weird things if you don't have an eth0.

Plus there are other cases where it can be simpler to hard code things
if it is allowable. (The human factor)  Otherwise your configuration
must be done through hotplug scripts.

But yes there are misguided applications that care.
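
For anyone who has not looked at how applications hold on to a device,
here is a small user-space sketch of the two identifiers involved, the
name and the ifindex.  The function name check_ifindex_roundtrip is
made up for illustration; if_nameindex(), if_nametoindex() and
if_freenameindex() are the standard <net/if.h> calls.

```c
#include <net/if.h>

/*
 * Walk the interfaces visible to this process and check that each
 * name maps back to the same ifindex.  Returns the number of
 * mismatches, or -1 if the interface list could not be read.
 */
static int check_ifindex_roundtrip(void)
{
	struct if_nameindex *list = if_nameindex();
	struct if_nameindex *p;
	int mismatches = 0;

	if (!list)
		return -1;

	/* The list is terminated by an entry with if_index == 0. */
	for (p = list; p->if_index != 0; p++) {
		if (if_nametoindex(p->if_name) != p->if_index)
			mismatches++;
	}

	if_freenameindex(list);
	return mismatches;
}
```

An application that cached either value before a migration expects the
same pairs afterwards, which is the point being made above about names
and ifindices.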

Eric


Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism

2006-06-27 Thread Herbert Xu
On Tue, Jun 27, 2006 at 11:24:25PM -0400, Jeff Garzik wrote:
>
> I don't see how that position has changed?
> 
> http://linux-net.osdl.org/index.php/TOE

Well I must say that RDMA over TCP smells very much like TOE.  They've
got an ARP table, a routing table, and presumably a TCP stack.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism

2006-06-27 Thread Jeff Garzik

Herbert Xu wrote:
> On Wed, Jun 28, 2006 at 12:54:10PM +1000, Herbert Xu wrote:
> >
> > Please give more specific reasons for needing these events because it
> > is certainly far from obvious from reading those documents.
>
> Never mind, I've found your earlier messages on the list which explains
> your reasons more clearly.  It would be nice if you could include those
> explanations in your patch description.
>
> BTW, does this mean that we're now comfortable with full TOE?


I don't see how that position has changed?

http://linux-net.osdl.org/index.php/TOE

Jeff




Re: [PATCH]NET: Add ECN support for TSO

2006-06-27 Thread Herbert Xu
On Tue, Jun 27, 2006 at 08:06:47PM -0700, Michael Chan wrote:
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 2d8d6ad..2c75172 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1033,7 +1033,8 @@ static inline void sk_setup_caps(struct 
>   if (sk->sk_route_caps & NETIF_F_GSO)
>   sk->sk_route_caps |= NETIF_F_TSO;
>   if (sk->sk_route_caps & NETIF_F_TSO) {
> - if (sock_flag(sk, SOCK_NO_LARGESEND) || dst->header_len)
> + if ((sock_flag(sk, SOCK_NO_LARGESEND) &&
> + !tso_ecn_capable(sk->sk_route_caps)) || dst->header_len)
>   sk->sk_route_caps &= ~NETIF_F_TSO;

Why turn it off? With GSO in place the stack will handle it just fine
(even your description says so :)  We should instead remove all code
that turns off TSO/ECN when the other is present.

Otherwise the patch looks good.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH]bnx2: Add NETIF_F_TSO_ECN

2006-06-27 Thread Michael Chan
Add NETIF_F_TSO_ECN feature for all bnx2 hardware.

Signed-off-by: Michael Chan <[EMAIL PROTECTED]>


diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 7635736..e89d5df 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -5128,6 +5128,16 @@ bnx2_set_rx_csum(struct net_device *dev,
return 0;
 }
 
+static int
+bnx2_set_tso(struct net_device *dev, u32 data)
+{
+   if (data)
+   dev->features |= NETIF_F_TSO | NETIF_F_TSO_ECN;
+   else
+   dev->features &= ~(NETIF_F_TSO | NETIF_F_TSO_ECN);
+   return 0;
+}
+
 #define BNX2_NUM_STATS 46
 
 static struct {
@@ -5445,7 +5455,7 @@ static struct ethtool_ops bnx2_ethtool_o
.set_sg = ethtool_op_set_sg,
 #ifdef BCM_TSO
.get_tso= ethtool_op_get_tso,
-   .set_tso= ethtool_op_set_tso,
+   .set_tso= bnx2_set_tso,
 #endif
.self_test_count= bnx2_self_test_count,
.self_test  = bnx2_self_test,
@@ -5926,7 +5936,7 @@ bnx2_init_one(struct pci_dev *pdev, cons
dev->features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
 #endif
 #ifdef BCM_TSO
-   dev->features |= NETIF_F_TSO;
+   dev->features |= NETIF_F_TSO | NETIF_F_TSO_ECN;
 #endif
 
netif_carrier_off(bp->dev);




[PATCH]NET: Add ECN support for TSO

2006-06-27 Thread Michael Chan
In the current TSO implementation, NETIF_F_TSO and ECN cannot be
turned on together in a TCP connection.  The problem is that most
hardware that supports TSO does not handle CWR correctly if it is set
in the TSO packet.  Correct handling requires CWR to be set in the
first packet only if it is set in the TSO header.

This patch adds the ability to turn on NETIF_F_TSO and ECN using 
GSO if necessary to handle TSO packets with CWR set.  Hardware
that handles CWR correctly can turn on NETIF_F_TSO_ECN in the dev->
features flag.

All TSO packets with CWR set will have the SKB_GSO_TCPV4_ECN set.  If
the output device does not have the NETIF_F_TSO_ECN feature set, GSO
will split the packet up correctly with CWR only set in the first
segment.

It is further assumed that all hardware will handle ECE properly by
replicating the ECE flag in all segments.  If that is not the case, a
simple extension of the logic will be required.
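
The flag handling described above can be sketched in a few lines of
stand-alone C.  This is not the GSO code itself: struct seg_flags and
segment_flags() are hypothetical names used only to show which flags
each resulting segment should carry.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the TCP flags on one segment. */
struct seg_flags {
	bool cwr;
	bool ece;
};

/*
 * Split a super-packet of payload_len bytes into MSS-sized segments and
 * fill in the flags each segment should carry: CWR on the first segment
 * only, ECE copied to all of them.  Returns the number of segments
 * produced (at most max_segs).
 */
static size_t segment_flags(size_t payload_len, size_t mss,
			    bool cwr, bool ece,
			    struct seg_flags *out, size_t max_segs)
{
	size_t nsegs, i;

	if (mss == 0)
		return 0;

	nsegs = (payload_len + mss - 1) / mss;	/* round up */
	if (nsegs > max_segs)
		nsegs = max_segs;

	for (i = 0; i < nsegs; i++) {
		out[i].cwr = cwr && i == 0;
		out[i].ece = ece;
	}
	return nsegs;
}
```

For a 4000 byte payload and an MSS of 1460 this yields three segments,
with CWR set on the first one only.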


Signed-off-by: Michael Chan <[EMAIL PROTECTED]>


diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index efd1e2a..f393de2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -316,6 +316,7 @@ struct net_device
 #define NETIF_F_TSO(SKB_GSO_TCPV4 << NETIF_F_GSO_SHIFT)
 #define NETIF_F_UFO(SKB_GSO_UDPV4 << NETIF_F_GSO_SHIFT)
 #define NETIF_F_GSO_ROBUST (SKB_GSO_DODGY << NETIF_F_GSO_SHIFT)
+#define NETIF_F_TSO_ECN(SKB_GSO_TCPV4_ECN << NETIF_F_GSO_SHIFT)
 
 #define NETIF_F_GEN_CSUM   (NETIF_F_NO_CSUM | NETIF_F_HW_CSUM)
 #define NETIF_F_ALL_CSUM   (NETIF_F_IP_CSUM | NETIF_F_GEN_CSUM)
@@ -1002,6 +1003,11 @@ static inline int netif_needs_gso(struct
return !skb_gso_ok(skb, dev->features);
 }
 
+static inline int tso_ecn_capable(unsigned long features)
+{
+   return ((features & NETIF_F_GSO) || (features & NETIF_F_TSO_ECN));
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_DEV_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5fb72da..e74c294 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -175,6 +175,9 @@ enum {
 
/* This indicates the skb is from an untrusted source. */
SKB_GSO_DODGY = 1 << 2,
+
+   /* This indicates the tcp segment has CWR set. */
+   SKB_GSO_TCPV4_ECN = 1 << 3,
 };
 
 /** 
diff --git a/include/net/sock.h b/include/net/sock.h
index 2d8d6ad..2c75172 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1033,7 +1033,8 @@ static inline void sk_setup_caps(struct 
if (sk->sk_route_caps & NETIF_F_GSO)
sk->sk_route_caps |= NETIF_F_TSO;
if (sk->sk_route_caps & NETIF_F_TSO) {
-   if (sock_flag(sk, SOCK_NO_LARGESEND) || dst->header_len)
+   if ((sock_flag(sk, SOCK_NO_LARGESEND) &&
+   !tso_ecn_capable(sk->sk_route_caps)) || dst->header_len)
sk->sk_route_caps &= ~NETIF_F_TSO;
else 
sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
diff --git a/include/net/tcp_ecn.h b/include/net/tcp_ecn.h
index c6b8439..871dca2 100644
--- a/include/net/tcp_ecn.h
+++ b/include/net/tcp_ecn.h
@@ -31,7 +31,8 @@ static inline void TCP_ECN_send_syn(stru
struct sk_buff *skb)
 {
tp->ecn_flags = 0;
-   if (sysctl_tcp_ecn && !(sk->sk_route_caps & NETIF_F_TSO)) {
+   if (sysctl_tcp_ecn && (!(sk->sk_route_caps & NETIF_F_TSO) ||
+  tso_ecn_capable(sk->sk_route_caps))) {
TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_ECE|TCPCB_FLAG_CWR;
tp->ecn_flags = TCP_ECN_OK;
sock_set_flag(sk, SOCK_NO_LARGESEND);
@@ -56,6 +57,9 @@ static inline void TCP_ECN_send(struct s
if (tp->ecn_flags&TCP_ECN_QUEUE_CWR) {
tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
skb->h.th->cwr = 1;
+   if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
+   skb_shinfo(skb)->gso_type |=
+   SKB_GSO_TCPV4_ECN;
}
} else {
/* ACK or retransmitted segment: clear ECT|CE */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index bdd71db..c4a4dba 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2044,7 +2044,8 @@ struct sk_buff * tcp_make_synack(struct 
memset(th, 0, sizeof(struct tcphdr));
th->syn = 1;
th->ack = 1;
-   if (dst->dev->features&NETIF_F_TSO)
+   if ((dst->dev->features & NETIF_F_TSO) &&
+   !tso_ecn_capable(dst->dev->features))
ireq->ecn_ok = 0;
TCP_ECN_make_synack(req, th);
th->source = inet_sk(sk)->sport;




Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism

2006-06-27 Thread Herbert Xu
On Wed, Jun 28, 2006 at 12:54:10PM +1000, Herbert Xu wrote:
> 
> Please give more specific reasons for needing these events because it
> is certainly far from obvious from reading those documents.

Never mind, I've found your earlier messages on the list which explains
your reasons more clearly.  It would be nice if you could include those
explanations in your patch description.

BTW, does this mean that we're now comfortable with full TOE?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism

2006-06-27 Thread Herbert Xu
Steve Wise <[EMAIL PROTECTED]> wrote:
> 
> The reason these devices need update events is because they typically
> cache this information in hardware and need to be notified when this
> information has been updated.  For information on RDMA protocols, see:
> http://www.ietf.org/html.charters/rddp-charter.html.

Please give more specific reasons for needing these events because it
is certainly far from obvious from reading those documents.

Without reasons these invasive changes may turn out to be completely
inappropriate.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [Patch 1/1] AF_UNIX Datagram getpeersec (with latest updates)

2006-06-27 Thread Xiaolan Zhang
Got it.  Will send a new patch soon.

Catherine

James Morris <[EMAIL PROTECTED]> wrote on 06/27/2006 10:13:48 PM:

> On Tue, 27 Jun 2006, Xiaolan Zhang wrote:
> 
> > > Just one more thing, we don't need to export this function now.
> > 
> > You mean moving it to security/selinux/hooks.c and making it static?
> 
> Yep.
> 
> > I think conceptually this is where it should reside -- auditing system 
> > might need it in the future, for example.
> 
> We can export it then.
> 
> 
> 
> - James
> -- 
> James Morris
> <[EMAIL PROTECTED]>



Re: [Patch 1/1] AF_UNIX Datagram getpeersec (with latest updates)

2006-06-27 Thread James Morris
On Tue, 27 Jun 2006, James Morris wrote:

> > I think conceptually this is where it should reside -- auditing system 
> > might need it in the future, for example.
> 
> We can export it then.

To clarify, we can export it if the audit system needs it, in the future.



- James
-- 
James Morris
<[EMAIL PROTECTED]>


Re: [Patch 1/1] AF_UNIX Datagram getpeersec (with latest updates)

2006-06-27 Thread James Morris
On Tue, 27 Jun 2006, Xiaolan Zhang wrote:

> > Just one more thing, we don't need to export this function now.
> 
> You mean moving it to security/selinux/hooks.c and making it static?

Yep.

> I think conceptually this is where it should reside -- auditing system 
> might need it in the future, for example.

We can export it then.



- James
-- 
James Morris
<[EMAIL PROTECTED]>


Re: [NET]: Added GSO header verification

2006-06-27 Thread Michael Chan
On Wed, 2006-06-28 at 08:31 +1000, Herbert Xu wrote:

> [NET]: Fix logical error in skb_gso_ok
> 
> The test in skb_gso_ok is backwards.  Noticed by Michael Chan
> <[EMAIL PROTECTED]>.
> 
> Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>

Acked-by: Michael Chan <[EMAIL PROTECTED]>



Re: [RFC] cirrus ep93xx ethernet driver

2006-06-27 Thread Lennert Buytenhek
On Mon, Jun 26, 2006 at 04:59:24AM +0200, Lennert Buytenhek wrote:

> The cirrus ep93xx is an ARM SoC that includes an ethernet MAC --
> this patch adds a driver for that ethernet MAC.

Attached is a new version that optimises interrupt handling somewhat.
Since we clear RX status as the first thing we do in the poll handler,
we might as well read the read-to-clear version of the interrupt status
register in the interrupt handler and avoid the explicit clear in the
poll handler.  This shaves close to a second off a 128M sendfile() test.

At ~40 seconds for a 128M sendfile (~3.2MB/sec), the network performance
of this CPU isn't impressive by any means, but given that the CPU only
runs at 200MHz and that the MAC doesn't do checksum offloading and insists
on 4-byte buffer alignment, we can't really do a whole lot better.  The
performance with this driver is still a good deal better than with the
vendor driver, though -- for this particular test (128M sendfile), the
vendor driver needs 1m21s.

Apart from that it still uses numeric chip register addresses, I'm
quite happy with the driver as it is, it survives heavy beating and
is pretty stable.
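
For readers unfamiliar with read-to-clear registers, the difference
between the two schemes can be shown with a small software model.  This
is only a sketch: struct intr_reg and the three helpers are invented
for illustration and say nothing about the actual EP93xx register
layout.

```c
#include <stdint.h>

/* Hypothetical software model of an interrupt status register. */
struct intr_reg {
	uint32_t pending;
};

/* Plain read: returns the pending bits and leaves them set. */
static uint32_t intr_read(const struct intr_reg *r)
{
	return r->pending;
}

/* Explicit clear: the extra register access the new version avoids. */
static void intr_clear(struct intr_reg *r, uint32_t bits)
{
	r->pending &= ~bits;
}

/* Read-to-clear: returns the pending bits and clears them in one access. */
static uint32_t intr_read_to_clear(struct intr_reg *r)
{
	uint32_t status = r->pending;

	r->pending = 0;
	return status;
}
```

The old scheme corresponds to intr_read() in the interrupt handler
followed by intr_clear() in the poll handler; the new one is a single
intr_read_to_clear() in the interrupt handler.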

Index: linux-2.6.17-git10/drivers/net/arm/Kconfig
===
--- linux-2.6.17-git10.orig/drivers/net/arm/Kconfig
+++ linux-2.6.17-git10/drivers/net/arm/Kconfig
@@ -39,3 +39,10 @@ config ARM_AT91_ETHER
help
  If you wish to compile a kernel for the AT91RM9200 and enable
  ethernet support, then you should always answer Y to this.
+
+config EP93XX_ETH
+   tristate "EP93xx Ethernet support"
+   depends on NET_ETHERNET && ARM && ARCH_EP93XX
+   help
+ This is a driver for the ethernet hardware included in EP93xx CPUs.
+ Say Y if you are building a kernel for EP93xx based devices.
Index: linux-2.6.17-git10/drivers/net/arm/Makefile
===
--- linux-2.6.17-git10.orig/drivers/net/arm/Makefile
+++ linux-2.6.17-git10/drivers/net/arm/Makefile
@@ -8,3 +8,4 @@ obj-$(CONFIG_ARM_ETHERH)+= etherh.o
 obj-$(CONFIG_ARM_ETHER3)   += ether3.o
 obj-$(CONFIG_ARM_ETHER1)   += ether1.o
 obj-$(CONFIG_ARM_AT91_ETHER)   += at91_ether.o
+obj-$(CONFIG_EP93XX_ETH)   += ep93xx_eth.o
Index: linux-2.6.17-git10/drivers/net/arm/ep93xx_eth.c
===
--- /dev/null
+++ linux-2.6.17-git10/drivers/net/arm/ep93xx_eth.c
@@ -0,0 +1,668 @@
+/*
+ * EP93xx ethernet network device driver
+ * Copyright (C) 2006 Lennert Buytenhek <[EMAIL PROTECTED]>
+ * Dedicated to Marija Kulikova.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "ep93xx_eth.h"
+
+#define DRV_MODULE_VERSION "0.1"
+
+#define RX_QUEUE_ENTRIES   64
+#define TX_QUEUE_ENTRIES   8
+
+struct ep93xx_descs
+{
+   struct ep93xx_rdesc rdesc[RX_QUEUE_ENTRIES];
+   struct ep93xx_tdesc tdesc[TX_QUEUE_ENTRIES];
+   struct ep93xx_rstat rstat[RX_QUEUE_ENTRIES];
+   struct ep93xx_tstat tstat[TX_QUEUE_ENTRIES];
+};
+
+struct ep93xx_priv
+{
+   struct resource *res;
+   void*base_addr;
+   int irq;
+
+   struct ep93xx_descs *descs;
+   dma_addr_t  descs_dma_addr;
+
+   void*rx_buf[RX_QUEUE_ENTRIES];
+   void*tx_buf[TX_QUEUE_ENTRIES];
+
+   int rx_pointer;
+   int tx_clean_pointer;
+   int tx_pointer;
+   int tx_pending;
+
+   struct net_device_stats stats;
+};
+
+#define rdb(ep, off)   __raw_readb((ep)->base_addr + (off))
+#define rdw(ep, off)   __raw_readw((ep)->base_addr + (off))
+#define rdl(ep, off)   __raw_readl((ep)->base_addr + (off))
+#define wrb(ep, off, val)  __raw_writeb((val), (ep)->base_addr + (off))
+#define wrw(ep, off, val)  __raw_writew((val), (ep)->base_addr + (off))
+#define wrl(ep, off, val)  __raw_writel((val), (ep)->base_addr + (off))
+
+static int ep93xx_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+   struct ep93xx_priv *ep = netdev_priv(dev);
+   int entry;
+
+   if (unlikely(skb->len > PAGE_SIZE)) {
+   ep->stats.tx_dropped++;
+   dev_kfree_skb(skb);
+   return 0;
+   }
+
+   entry = ep->tx_pointer;
+   ep->tx_pointer = (ep->tx_pointer + 1) % TX_QUEUE_ENTRIES;
+
+   ep->descs->tdesc[entry].tdesc1 =
+   TDESC1_EOF | (entry << 16) | (skb->len & 0xfff);
+   s

Re: [Patch 1/1] AF_UNIX Datagram getpeersec (with latest updates)

2006-06-27 Thread Xiaolan Zhang
James Morris <[EMAIL PROTECTED]> wrote on 06/27/2006 09:33:17 PM:

> On Tue, 27 Jun 2006, Catherine Zhang wrote:
> 
> > diff -puN security/selinux/exports.c~lsm-secpeer-unix 
> security/selinux/exports.c
> > --- linux-2.6.17-rc6-mm2-JM/security/selinux/exports.c~lsm-
> secpeer-unix   2006-06-27 18:15:10.914669944 -0400
> > +++ linux-2.6.17-rc6-mm2-JM-cxzhang/security/selinux/exports.c 
> 2006-06-27 18:16:31.502418744 -0400
> > @@ -17,6 +17,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > 
> >  #include "security.h"
> >  #include "objsec.h"
> > @@ -72,6 +73,16 @@ void selinux_get_task_sid(struct task_st
> > *sid = 0;
> >  }
> > 
> > +void selinux_get_sock_sid(struct socket *sock, u32 *sid)
> > +{
> > +   if (selinux_enabled) {
> > +  const struct inode *inode = SOCK_INODE(sock);
> > +  selinux_get_inode_sid(inode, sid);
> > +  return;
> > +   }
> > +   *sid = 0;
> > +}
> > +
> 
> 
> Just one more thing, we don't need to export this function now.

You mean moving it to security/selinux/hooks.c and making it static?

I think conceptually this is where it should reside -- auditing system 
might need it in the future, for example.

thanks,
Catherine




Re: [Patch 1/1] AF_UNIX Datagram getpeersec (with latest updates)

2006-06-27 Thread James Morris
On Tue, 27 Jun 2006, Catherine Zhang wrote:

> diff -puN security/selinux/exports.c~lsm-secpeer-unix 
> security/selinux/exports.c
> --- linux-2.6.17-rc6-mm2-JM/security/selinux/exports.c~lsm-secpeer-unix   
> 2006-06-27 18:15:10.914669944 -0400
> +++ linux-2.6.17-rc6-mm2-JM-cxzhang/security/selinux/exports.c
> 2006-06-27 18:16:31.502418744 -0400
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "security.h"
>  #include "objsec.h"
> @@ -72,6 +73,16 @@ void selinux_get_task_sid(struct task_st
>   *sid = 0;
>  }
>  
> +void selinux_get_sock_sid(struct socket *sock, u32 *sid)
> +{
> + if (selinux_enabled) {
> + const struct inode *inode = SOCK_INODE(sock);
> + selinux_get_inode_sid(inode, sid);
> + return;
> + }
> + *sid = 0;
> +}
> +


Just one more thing, we don't need to export this function now.



- James
-- 
James Morris
<[EMAIL PROTECTED]>


[Patch 1/1] AF_UNIX Datagram getpeersec (with latest updates)

2006-06-27 Thread Catherine Zhang
Hi,

This patch combines all previous updates.  Many thanks to James, Dave,
and Stephen for their modifications and comments!

cheers,
Catherine

--

From: [EMAIL PROTECTED]

This patch implements an API whereby an application can determine the 
label of its peer's Unix datagram sockets via the auxiliary data mechanism of 
recvmsg.

Patch purpose:

This patch enables a security-aware application to retrieve the
security context of the peer of a Unix datagram socket.  The application 
can then use this security context to determine the security context for 
processing on behalf of the peer who sent the packet. 

Patch design and implementation:

The design and implementation is very similar to the UDP case for INET 
sockets.  Basically we build upon the existing Unix domain socket API for
retrieving user credentials.  Linux offers the API for obtaining user
credentials via ancillary messages (i.e., out of band/control messages
that are bundled together with a normal message).  To retrieve the security 
context, the application first indicates to the kernel such desire by 
setting the SO_PASSSEC option via setsockopt.  Then the application 
retrieves the security context using the auxiliary data mechanism.  

An example server application for a Unix datagram socket should look like this:

toggle = 1;
toggle_len = sizeof(toggle);

/* setsockopt takes the option length by value, not by pointer */
setsockopt(sockfd, SOL_SOCKET, SO_PASSSEC, &toggle, toggle_len);
recvmsg(sockfd, &msg_hdr, 0);
if (msg_hdr.msg_controllen > sizeof(struct cmsghdr)) {
	cmsg_hdr = CMSG_FIRSTHDR(&msg_hdr);
	if (cmsg_hdr->cmsg_len <= CMSG_LEN(sizeof(scontext)) &&
	    cmsg_hdr->cmsg_level == SOL_SOCKET &&
	    cmsg_hdr->cmsg_type == SCM_SECURITY) {
		memcpy(&scontext, CMSG_DATA(cmsg_hdr), sizeof(scontext));
	}
}

sock_setsockopt is enhanced with a new socket option, SO_PASSSEC, to allow
a server socket to receive the security context of the peer.  

Testing:

We have tested the patch by setting up Unix datagram client and server
applications.  We verified that the server can retrieve the security context 
using the auxiliary data mechanism of recvmsg.


---

 include/asm-alpha/socket.h   |1 +
 include/asm-arm/socket.h |1 +
 include/asm-arm26/socket.h   |1 +
 include/asm-cris/socket.h|1 +
 include/asm-frv/socket.h |1 +
 include/asm-h8300/socket.h   |1 +
 include/asm-i386/socket.h|1 +
 include/asm-ia64/socket.h|1 +
 include/asm-m32r/socket.h|1 +
 include/asm-m68k/socket.h|1 +
 include/asm-mips/socket.h|1 +
 include/asm-parisc/socket.h  |1 +
 include/asm-powerpc/socket.h |1 +
 include/asm-s390/socket.h|1 +
 include/asm-sh/socket.h  |1 +
 include/asm-sparc/socket.h   |1 +
 include/asm-sparc64/socket.h |1 +
 include/asm-v850/socket.h|1 +
 include/asm-x86_64/socket.h  |1 +
 include/asm-xtensa/socket.h  |1 +
 include/linux/net.h  |1 +
 include/linux/selinux.h  |   15 +++
 include/net/af_unix.h|6 ++
 include/net/scm.h|   17 +
 net/core/sock.c  |   11 +++
 net/unix/af_unix.c   |   27 +++
 security/selinux/exports.c   |   11 +++
 security/selinux/hooks.c |8 +++-
 28 files changed, 115 insertions(+), 1 deletion(-)

diff -puN include/asm-alpha/socket.h~lsm-secpeer-unix include/asm-alpha/socket.h
--- linux-2.6.17-rc6-mm2-JM/include/asm-alpha/socket.h~lsm-secpeer-unix 
2006-06-27 18:14:52.586456256 -0400
+++ linux-2.6.17-rc6-mm2-JM-cxzhang/include/asm-alpha/socket.h  2006-06-27 
18:16:31.488420872 -0400
@@ -51,6 +51,7 @@
 #define SCM_TIMESTAMP  SO_TIMESTAMP
 
 #define SO_PEERSEC 30
+#define SO_PASSSEC 34
 
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION 19
diff -puN include/asm-arm/socket.h~lsm-secpeer-unix include/asm-arm/socket.h
--- linux-2.6.17-rc6-mm2-JM/include/asm-arm/socket.h~lsm-secpeer-unix   
2006-06-27 18:15:10.052800968 -0400
+++ linux-2.6.17-rc6-mm2-JM-cxzhang/include/asm-arm/socket.h2006-06-27 
18:16:31.489420720 -0400
@@ -48,5 +48,6 @@
 #define SO_ACCEPTCONN  30
 
 #define SO_PEERSEC 31
+#define SO_PASSSEC 34
 
 #endif /* _ASM_SOCKET_H */
diff -puN include/asm-arm26/socket.h~lsm-secpeer-unix include/asm-arm26/socket.h
--- linux-2.6.17-rc6-mm2-JM/include/asm-arm26/socket.h~lsm-secpeer-unix 
2006-06-27 18:15:10.095794432 -0400
+++ linux-2.6.17-rc6-mm2-JM-cxzhang/include/asm-arm26/socket.h  2006-06-27 
18:16:31.489420720 -0400
@@ -48,5 +48,6 @@
 #define SO_ACCEPTCONN  30
 
 #define SO_PEERSEC 31
+#define SO_PASSSEC 34
 
 #endif /* _ASM_SOCKET_H */
diff -puN include/asm-cris/socket.h~lsm-secpeer-unix include/asm-cris/socket.h
--- linux-2.6.17-rc6-mm2-JM/include/asm-cris/socket.h~lsm-secpeer-unix  
2006-06-27 18:15:10.132788808 -0400
+++ linux-2.6.17-rc6-mm2-J

Please pull 'upstream' branch of wireless-2.6 (revised)

2006-06-27 Thread John W. Linville
On Mon, Jun 26, 2006 at 05:25:52PM -0400, John W. Linville wrote:

> Michael Buesch:
>   bcm43xx: suspend MAC while executing long pwork

The above patch ruffled some feathers on netdev.  In the interest of
moving things along, I have pulled that patch out of wireless-2.6.
I expect it will be back soon, probably with some additional changes
to satisfy concerns raised on the mailing list.

NOTE: While I was mucking around, I pulled a bunch of patches from
the master branch out into driver-specific branches for adm8211,
prism54usb, tiacx, and zd1211rw.  Then I rebuilt the master branch
by pulling from the driver branches.  This is intended to ease the
merging of individual drivers upstream (e.g. zd1211rw and maybe tiacx
in the near future).

Those working off my upstream branch or off Linus' tree should be
unaffected.  Anyone who works off my master branch may need to rebase,
especially if they want me to be able to pull from them (due to dirty
history).  I apologize for the hassle and appreciate your cooperation!

Thanks,

John
---

The following changes since commit fcc18e83e1f6fd9fa6b333735bf0fcd530655511:
  Malcolm Parsons:
uclinux: use PER_LINUX_32BIT in binfmt_flat

are found in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git upstream

Daniel Drake:
  bcm43xx: use softmac-suggested TX rate
  bcm43xx: enable shared key authentication

Eric Sesterhenn:
  skb used after passing to netif_rx in net/ieee80211/ieee80211_rx.c

Faidon Liambotis:
  Add two PLX device IDs

Hong Liu:
  ieee80211: fix not allocating IV+ICV space when usingencryption in 
ieee80211_tx_frame

Horms:
  CONFIG_WIRELESS_EXT is neccessary after all

John W. Linville:
  softmac: fix build-break from 881ee6999d66c8fc903b429b73bbe6045b38c549

Joseph Jezak:
  SoftMAC: Prevent multiple authentication attempts on the same network
  SoftMAC: Add network to ieee80211softmac_call_events when associate times 
out

Larry Finger:
  Convert bcm43xx-softmac to use the ieee80211_is_valid_channel routine
  2.6.17 missing a call to ieee80211softmac_capabilities from 
ieee80211softmac_assoc_req

Michael Buesch:
  bcm43xx: workaround init_board vs. IRQ race

 drivers/net/wireless/bcm43xx/bcm43xx_main.c|   31 -
 drivers/net/wireless/bcm43xx/bcm43xx_main.h|   24 
 drivers/net/wireless/bcm43xx/bcm43xx_radio.c   |7 +
 drivers/net/wireless/bcm43xx/bcm43xx_wx.c  |2 +
 drivers/net/wireless/bcm43xx/bcm43xx_xmit.c|5 +++
 drivers/net/wireless/hostap/hostap_plx.c   |2 +
 include/net/ieee80211softmac.h |1 +
 net/ieee80211/ieee80211_rx.c   |4 ++-
 net/ieee80211/ieee80211_tx.c   |   15 +++---
 net/ieee80211/softmac/ieee80211softmac_assoc.c |   31 -
 net/ieee80211/softmac/ieee80211softmac_auth.c  |4 +--
 net/ieee80211/softmac/ieee80211softmac_io.c|3 ++
 net/ieee80211/softmac/ieee80211softmac_wx.c|   36 +++-
 13 files changed, 105 insertions(+), 60 deletions(-)

diff --git a/drivers/net/wireless/bcm43xx/bcm43xx_main.c 
b/drivers/net/wireless/bcm43xx/bcm43xx_main.c
index 085d785..1cd47c5 100644
--- a/drivers/net/wireless/bcm43xx/bcm43xx_main.c
+++ b/drivers/net/wireless/bcm43xx/bcm43xx_main.c
@@ -1885,6 +1885,15 @@ static irqreturn_t bcm43xx_interrupt_han
 
spin_lock(&bcm->irq_lock);
 
+   /* Only accept IRQs, if we are initialized properly.
+* This avoids an RX race while initializing.
+* We should probably not enable IRQs before we are initialized
+* completely, but some careful work is needed to fix this. I think it
+* is best to stay with this cheap workaround for now... .
+*/
+   if (unlikely(bcm43xx_status(bcm) != BCM43xx_STAT_INITIALIZED))
+   goto out;
+
reason = bcm43xx_read32(bcm, BCM43xx_MMIO_GEN_IRQ_REASON);
if (reason == 0x) {
/* irq not for us (shared irq) */
@@ -1906,19 +1915,11 @@ static irqreturn_t bcm43xx_interrupt_han
 
bcm43xx_interrupt_ack(bcm, reason);
 
-   /* Only accept IRQs, if we are initialized properly.
-* This avoids an RX race while initializing.
-* We should probably not enable IRQs before we are initialized
-* completely, but some careful work is needed to fix this. I think it
-* is best to stay with this cheap workaround for now... .
-*/
-   if (likely(bcm43xx_status(bcm) == BCM43xx_STAT_INITIALIZED)) {
-   /* disable all IRQs. They are enabled again in the bottom half. 
*/
-   bcm->irq_savedstate = bcm43xx_interrupt_disable(bcm, 
BCM43xx_IRQ_ALL);
-   /* save the reason code and call our bottom half. */
-   bcm->irq_reason = reason;
-   tasklet_schedule(&bcm->isr_tasklet);
-   }
+   /* disable all IRQs. They are en

Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Alexey Kuznetsov
Hello!

> It may look weird, but do applications really *need* to see eth0 rather
> than eth858354?

Applications do not care, humans do. :-)

As for applications, they just need to see exactly the same device after
migration: not only its name but also, for example, its ifindex. If you do
not create a separate namespace for netdevices, you will inevitably end up
with some strange hack, a sort of VPIDs, to translate (or to partition)
ifindices, or with declaring that "ping -I eth858354 xxx" is too complicated
an application to survive migration.

Alexey


Re: [Repost PATCH 6/6] PMC MSP85x0 gigabit ethernet driver

2006-06-27 Thread Francois Romieu

Kiran Thota <[EMAIL PROTECTED]> :
[...]
> +/*
> + * Allocate the SKBs for the Rx ring. Also used
> + * for refilling the queue
> + */
> +
> +static int msp85x0_ge_rx_task(struct net_device *netdev,
> + msp85x0_ge_port_info *msp85x0_ge_eth)
> +{
> + struct device *device = 
> &msp85x0_ge_device[msp85x0_ge_eth->port_num]->dev;
> + volatile msp85x0_ge_rx_desc *rx_desc;
> + struct sk_buff *skb;
> + int rx_used_desc;
> + int count = 0;
> + oom_flag=0;

Global variable.

[...]
> + if((rx_used_desc + 1) == MSP85x0_GE_RX_QUEUE)
> + msp85x0_ge_eth->rx_used_desc_q =0;
> + else
> + msp85x0_ge_eth->rx_used_desc_q = (rx_used_desc + 1);

Consider grepping drivers/net for NEXT_TX or RING_NEXT.
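
For illustration, the wrap-around that the quoted if/else spells out by hand
can be folded into one helper. This is a sketch only: ring_next() and
RX_RING_SIZE are hypothetical names, not the macros found in other drivers,
and the size of 8 stands in for MSP85x0_GE_RX_QUEUE.

```c
#include <assert.h>

#define RX_RING_SIZE 8u   /* hypothetical; stands in for MSP85x0_GE_RX_QUEUE */

/* Advance a ring index by one slot, wrapping back to 0 at the end. */
static unsigned int ring_next(unsigned int idx, unsigned int size)
{
    assert(size > 0 && idx < size);
    return (idx + 1 == size) ? 0 : idx + 1;
}
```

With such a helper the quoted branch would collapse to a single assignment of
ring_next(rx_used_desc, RX_RING_SIZE) to the used-descriptor index.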

[...]
> +static void msp85x0_port_init(struct net_device *netdev,
> + msp85x0_ge_port_info * msp85x0_ge_eth)
> +{
> + unsigned long reg_data;
> + unsigned int port_num;  
> +
> + port_num = msp85x0_ge_eth->port_num;
> + for (port_num = 0; port_num < NO_PORTS; port_num++)

There is something strange with port_num here.

[...]
> +static int start_tx_and_rx_activity(struct net_device *netdev)
> +{

The returned value is not used.

[...]
> +static int trtg_block_enable(struct net_device *netdev)
> +{

The returned value is not used.

[...]
> +static int enable_tx_and_rx_interrupts(struct net_device *netdev)
> +{

The returned value is not used.

[...]
> +static int xdma_config(struct net_device *netdev)
> +{

The indentation of this function is mostly broken.

[...]
> +static int msp85x0_ge_port_start(struct net_device *netdev)
> +{

The returned value is not used.

[...]
> +static int msp85x0_eth_setup_tx_rx_fifo(struct net_device *dev)
> +{

The returned value is not used.

[...]
> +static int msp85x0_ge_eth_open(struct net_device *netdev)
> +{
[...]
> + /* Fill the Rx ring with the SKBs */
> + msp85x0_ge_port_start(netdev);
[...]
> + if (!(phy_reg & 0x0400)) {
> + netif_carrier_off(netdev);
> + netif_stop_queue(netdev);
> + return MSP85x0_ERROR;

skb leak

[...]
> +int msp85x0_ge_start_xmit(struct sk_buff *skb, struct net_device *netdev)
> +{

static

This function ought to use NETDEV_TX_OK/NETDEV_TX_BUSY (should not happen).
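
A userspace mock of that convention, for illustration only. Nothing below is
kernel code: MOCK_TX_OK and MOCK_TX_BUSY merely stand in for
NETDEV_TX_OK/NETDEV_TX_BUSY, and struct mock_port, MOCK_TX_RING and both
functions are hypothetical. The point it sketches is why BUSY "should not
happen": the driver stops the queue as soon as the ring fills, so the stack
does not hand it another packet until a completion wakes the queue.

```c
#include <assert.h>

/* Stand-ins for the kernel's return codes; the values are illustrative. */
enum { MOCK_TX_OK = 0, MOCK_TX_BUSY = 1 };

#define MOCK_TX_RING 4   /* hypothetical number of Tx descriptors */

struct mock_port {
    int in_flight;       /* descriptors currently owned by the hardware */
    int queue_stopped;   /* models netif_stop_queue()/netif_wake_queue() */
};

/* Models the transmit entry point: queue one packet or report BUSY. */
static int mock_start_xmit(struct mock_port *p)
{
    if (p->in_flight == MOCK_TX_RING) {
        /* Should not happen: the queue is stopped before the ring fills. */
        p->queue_stopped = 1;
        return MOCK_TX_BUSY;
    }
    p->in_flight++;
    if (p->in_flight == MOCK_TX_RING)
        p->queue_stopped = 1;   /* stop early so the stack never sees BUSY */
    return MOCK_TX_OK;
}

/* Models a Tx completion: one descriptor is returned and the queue wakes. */
static void mock_tx_complete(struct mock_port *p)
{
    assert(p->in_flight > 0);
    p->in_flight--;
    p->queue_stopped = 0;
}
```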

[...]
> +static int msp85x0_ge_free_tx_queue(struct net_device *netdev)
> +{
> + msp85x0_ge_port_info *msp85x0_ge_eth = netdev_priv(netdev);
> + int pkts,port_num = msp85x0_ge_eth->port_num;
> + int tx_desc_used;
> + struct sk_buff *skb;
> +
> + /* Take the lock */
> + pkts=get_tx_pkt_count(port_num);
> + while(pkts)
> + {
> + pkts--;
> + tx_desc_used = msp85x0_ge_eth->tx_used_desc_q;
> +
> + /* return right away */
> + if (tx_desc_used == msp85x0_ge_eth->tx_curr_desc_q)
> + break;
> + 
> + skb = msp85x0_ge_eth->tx_skb[tx_desc_used];
> + dev_kfree_skb_irq(skb);

msp85x0_ge_free_tx_queue() is issued in msp85x0_ge_start_xmit(), thus
not in irq context.

[...]
> +static int msp85x0_ge_receive_queue(struct net_device *netdev)
> +{

Indentation needs to be fixed in this function.

[...]
> + if (packet.cmd_sts & (MSP85x0_GE_RX_PERR | 
> MSP85x0_GE_RX_OVERFLOW_ERROR | MSP85x0_GE_RX_TRUNC | MSP85x0_GE_RX_CRC_ERROR))
> + {
> + if(packet.cmd_sts & MSP85x0_GE_RX_OVERFLOW_ERROR)
> + stats->rx_over_errors++; 
> + else if(packet.cmd_sts & MSP85x0_GE_RX_TRUNC)
> + stats->rx_frame_errors++;
> + else
> + stats->rx_errors++;
> + dev_kfree_skb_any(skb);

It's called in ->poll(), outside of in_irq().

dev->last_rx should be updated after netif_receive_skb().

[...]
> +static int msp85x0_ge_poll(struct net_device *netdev, int *budget)
> +{
[...]
> + spin_lock_irqsave(&msp85x0_ge_eth->lock,flags);

Afaik poll takes place with irq enabled: no need to save/restore.

[...]
> +/* Don't Re-Initialize the port, Just start from where it stops */ 
> +static int msp85x0_ge_eth_reopen(struct net_device *netdev)  
> +{
> + msp85x0_ge_port_info *msp85x0_ge_eth = netdev_priv(netdev);
> + unsigned int reg_data,irq;
> + int retval;
> +
> +irq = MSP85x0_ETH_PORT_IRQ;
> +
> + retval = request_irq(irq, INTERRUPT_HANDLER,
> +  SA_INTERRUPT | SA_SAMPLE_RANDOM | SA_SHIRQ, netdev->name, 
> netdev);

/me scratches head...

msp85x0_ge_change_mtu() does _not_ call free_irq(), yet it issues
msp85x0_ge_eth_reopen(), which requests the irq again.

I noticed this comment in msp85x0_ge_eth_stop():

/* This to work around to solve the msp85x0 shutdown and bringup sequence */

Can you elaborate ?

Random remarks:
- drivers/net/msp85x0_ge.h includes a lot of
  #define MSP85x0_GE_MSTATX_SOMETHING

  Your customers would surely appreciate extended

Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Dave Hansen
On Wed, 2006-06-28 at 00:52 +0200, Herbert Poetzl wrote:
> seriously, what I think Eric meant was that it
> might be nice (especially for migration purposes)
> to keep the device namespace completely virtualized
> and not just isolated ...

It might be nice, but it is probably unneeded for an initial
implementation.  In practice, a cluster doing
checkpoint/restart/migration will already have a system in place for
assigning unique IPs or other identifiers to each container.  It could
just as easily make sure to assign unique network device names to
containers.

The issues really only come into play when you have an unstructured set
of machines and you want to migrate between them without having prepared
them with any kind of unique net device names beforehand.

It may look weird, but do applications really *need* to see eth0 rather
than eth858354?

-- Dave



Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Herbert Poetzl
On Tue, Jun 27, 2006 at 10:29:39AM -0600, Eric W. Biederman wrote:
> Herbert Poetzl <[EMAIL PROTECTED]> writes:
> 
> > On Tue, Jun 27, 2006 at 01:54:51PM +0400, Kirill Korotaev wrote:
> >> >>My point is that if you make namespace tagging at routing time, and
> >> >>your packets are being routed only once, you lose the ability
> >> >>to have separate routing tables in each namespace.
> >> >
> >> >
> >> >Right. What is the advantage of having separate the routing tables ?
> >
> >> it is impossible to have bridged networking, tun/tap and many other 
> >> features without it. I even doubt that it is possible to introduce 
> >> private netfilter rules w/o virtualization of routing.
> >
> > why? iptables work quite fine on a typical linux
> > system when you 'delegate' certain functionality
> > to certain chains (i.e. doesn't require access to
> > _all_ of them)
> >
> >> The question is do we want to have fully featured namespaces which
> >> allow to create isolated virtual environments with semantics and
> >> behaviour of standalone linux box or do we want to introduce some
> >> hacks with new rules/restrictions to meet ones goals only?
> >
> > well, sometimes 'hacks' are not only simpler but also 
> > a much better solution for a given problem than the
> > straight forward approach ... 
> 
> Well I would like to see a hack that qualifies.  

> I watched the linux-vserver irc channel for a while and almost
> every network problem was caused by the change in semantics 
> vserver provides.

the problem here is not the change in semantics compared
to a real linux system (as there basically is none) but
compared to _other_ technologies like UML or QEMU, which
add the need for bridging and additional interfaces, while
Linux-VServer only focuses on the IP layer ...

> In this case when you allow a guest more than one IP your hack 
> while easy to maintain becomes much more complex. 

why? a set of IPs is quite similar to a single IP (which
is actually a subset), so no real change there, only
IP_ANY means something different for a guest ...

> Especially as you address each case people care about one at a time.

hmm?

> In one shot this goes the entire way. Given how many people miss that
> you do the work at layer 2 rather than at layer 3 I would not call this the
> straight forward approach. The straight forward implementation yes,
> but not the straight forward approach.

seems I lost you here ...

> > for example, you won't have multiple routing tables
> > in a kernel where this feature is disabled, no?
> > so why should it affect a guest, or require modified
> > apps inside a guest when we would decide to provide
> > only a single routing table?
> >
> >> From my POV, fully virtualized namespaces are the future. 
> >
> > the future is already there, it's called Xen or UML, or QEMU :)
> 
> Yep.  And now we need it to run fast.

hmm, maybe you should try to optimize linux for Xen then,
as I'm sure it will provide the optimal virtualization
and has all the features folks are looking for (regarding
virtualization)

I thought we are trying to figure a light-weight subset
of isolation and virtualization technologies and methods
which make sense to have in mainline ...

> >> It is what makes virtualization solution usable (w/o apps
> >> modifications), provides all the features and doesn't require much
> >> efforts from people to be used.
> >
> > and what if they want to use virtualization inside
> > their guests? where do you draw the line?
> 
> The implementation doesn't have any problems with guests inside
> of guests.
> 
> The only reason to restrict guests inside of guests is because
> we aren't certain which permissions make sense.

well, we have not even touched the permission issues yet

best,
Herbert

> Eric


[RFC 8/8] NetLabel: tie NetLabel into the Kconfig system

2006-06-27 Thread paul . moore
Modify the net/Kconfig file to enable selecting the NetLabel Kconfig options.
---
 net/Kconfig |2 ++
 1 files changed, 2 insertions(+)

Index: linux-2.6.17.i686-quilt/net/Kconfig
===
--- linux-2.6.17.i686-quilt.orig/net/Kconfig
+++ linux-2.6.17.i686-quilt/net/Kconfig
@@ -228,6 +228,8 @@ source "net/tux/Kconfig"
 config WIRELESS_EXT
bool
 
+source "net/netlabel/Kconfig"
+
 endif   # if NET
 endmenu # Networking
 

--
paul moore
linux security @ hp


[RFC 7/8] NetLabel: unlabeled packet handling

2006-06-27 Thread paul . moore
Add unlabeled packet support to the NetLabel subsystem.  NetLabel does not do
any processing on unlabeled packets, but it must support passing unlabeled
packets on both the inbound and outbound sides.
---
 net/netlabel/netlabel_unlabeled.c |  258 ++
 1 files changed, 258 insertions(+)

Index: linux-2.6.17.i686-quilt/net/netlabel/netlabel_unlabeled.c
===
--- /dev/null
+++ linux-2.6.17.i686-quilt/net/netlabel/netlabel_unlabeled.c
@@ -0,0 +1,258 @@
+/*
+ * NetLabel Unlabeled Support
+ *
+ * This file defines functions for dealing with unlabeled packets for the
+ * NetLabel system.  The NetLabel system manages static and dynamic label
+ * mappings for network protocols such as CIPSO and RIPSO.
+ *
+ * Author: Paul Moore <[EMAIL PROTECTED]>
+ *
+ */
+
+/*
+ * (c) Copyright Hewlett-Packard Development Company, L.P., 2006
+ *
+ * This program is free software;  you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY;  without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
+ * the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program;  if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include "netlabel_user.h"
+#include "netlabel_domainhash.h"
+#include "netlabel_unlabeled.h"
+
+/* Accept unlabeled packets flag */
+static atomic_t netlabel_unlabel_accept_flg = ATOMIC_INIT(0);
+
+/* NetLabel Generic NETLINK unlabeled family */
+static struct genl_family netlbl_unlabel_gnl_family = {
+   .id = GENL_ID_GENERATE,
+   .hdrsize = 0,
+   .name = NETLBL_NLTYPE_UNLABELED_NAME,
+   .version = NETLBL_PROTO_VERSION,
+   .maxattr = 0,
+};
+
+
+/*
+ * Local Prototypes
+ */
+
+static void netlbl_unlabel_send_ack(const struct genl_info *info,
+   const u32 ret_code);
+
+
+/*
+ * NetLabel Command Handlers
+ */
+
+/**
+ * netlbl_unlabel_accept - Handle an ACCEPT message
+ * @skb: the NETLINK buffer
+ * @info: the Generic NETLINK info block
+ *
+ * Description:
+ * Process a user generated ACCEPT message and set the accept flag accordingly.
+ * Returns zero on success, negative values on failure.
+ *
+ */
+static int netlbl_unlabel_accept(struct sk_buff *skb, struct genl_info *info)
+{
+   int ret_val;
+   unsigned char *msg = netlbl_netlink_payload_data(skb);
+   u32 value;
+
+   ret_val = netlbl_netlink_cap_check(skb, CAP_NET_ADMIN);
+   if (ret_val != 0)
+   return ret_val;
+
+   if (netlbl_netlink_payload_len(skb) == 4) {
+   value = netlbl_get_u32(msg);
+   if (value == 1 || value == 0) {
+   atomic_set(&netlabel_unlabel_accept_flg, value);
+   netlbl_unlabel_send_ack(info, NETLBL_E_OK);
+   return 0;
+   }
+   }
+
+   netlbl_unlabel_send_ack(info, EINVAL);
+   return -EINVAL;
+}
+
+
+/*
+ * NetLabel Generic NETLINK Command Definitions
+ */
+
+static struct genl_ops netlbl_unlabel_genl_c_accept = {
+   .cmd = NLBL_UNLABEL_C_ACCEPT,
+   .flags = 0,
+   .doit = netlbl_unlabel_accept,
+   .dumpit = NULL,
+};
+
+/*
+ * NetLabel Generic NETLINK Protocol Functions
+ */
+
+/**
+ * netlbl_unlabel_send_ack - Send an ACK message
+ * @info: the generic NETLINK information
+ * @ret_code: return code to use
+ *
+ * Description:
+ * This function sends an ACK message to the sender of the NETLINK message
+ * specified by @info.
+ *
+ */
+static void netlbl_unlabel_send_ack(const struct genl_info *info,
+   const u32 ret_code)
+{
+   size_t msg_size;
+   size_t data_size;
+   struct sk_buff *skb;
+   unsigned char *data;
+
+   data_size = GENL_HDRLEN + 8;
+   msg_size = NLMSG_SPACE(data_size);
+
+   skb = alloc_skb(msg_size, GFP_KERNEL);
+   if (skb == NULL)
+   return;
+
+   data = netlbl_netlink_hdr_put(skb,
+ info->snd_pid,
+ 0,
+ 0,
+ netlbl_unlabel_gnl_family.id,
+ NLBL_UNLABEL_C_ACK,
+ data_size);
+   if (data == NULL)
+   goto send_ack_failure;
+
+   netlbl_putinc_u32(&data, info->snd_seq);
+   netlbl_putinc_u32(&data, 

[RFC 0/8] NetLabel: updated to use generic netlink

2006-06-27 Thread paul . moore
An updated patch set with some small changes as well as one big one - NetLabel
now uses the generic netlink interface for its kernel-userland communication
as opposed to its own dedicated netlink type.  Needless to say this requires
an updated userland configuration tool, so for those of you running this patch
please grab version 0.14 (or later) of the netlabel_tools which can be found
here:

 * http://free.linux.hp.com/~pmoore/projects/linux_cipso

Thanks.

--
paul moore
linux security @ hp


[RFC 5/8] NetLabel: SELinux support

2006-06-27 Thread paul . moore
Add NetLabel support to the SELinux LSM and modify the socket_post_create() LSM
hook to return an error code.  The most significant part of this patch is the
addition of NetLabel hooks into the following SELinux LSM hooks:

 * selinux_file_permission()
 * selinux_socket_sendmsg()
 * selinux_socket_post_create()
 * selinux_socket_post_accept() [NEW]
 * selinux_socket_sock_rcv_skb()
 * selinux_socket_getpeersec_stream()
 * selinux_socket_getpeersec_dgram()

The basic reasoning behind this patch is that outgoing packets are "NetLabel'd"
by labeling their socket and the NetLabel security attributes are checked via
the additional hook in selinux_socket_sock_rcv_skb().  NetLabel itself is only
a labeling mechanism, similar to filesystem extended attributes; it is up to
the SELinux enforcement mechanism to perform the actual access checks.

In addition to the changes outlined above this patch also includes some changes
to the extended bitmap (ebitmap) and multi-level security (mls) code to import
and export SELinux TE/MLS attributes into and out of NetLabel.
---
 include/linux/security.h|   25 -
 net/socket.c|   13 
 security/dummy.c|6 
 security/selinux/hooks.c|   59 ++
 security/selinux/include/objsec.h   |   11 
 security/selinux/include/selinux_netlabel.h |   94 
 security/selinux/ss/Makefile|1 
 security/selinux/ss/ebitmap.c   |  155 +++
 security/selinux/ss/ebitmap.h   |6 
 security/selinux/ss/mls.c   |  160 +++
 security/selinux/ss/mls.h   |   25 +
 security/selinux/ss/selinux_netlabel.c  |  574 
 security/selinux/ss/services.c  |   12 
 security/selinux/ss/services.h  |2 
 14 files changed, 1113 insertions(+), 30 deletions(-)

Index: linux-2.6.17.i686-quilt/include/linux/security.h
===
--- linux-2.6.17.i686-quilt.orig/include/linux/security.h
+++ linux-2.6.17.i686-quilt/include/linux/security.h
@@ -1267,8 +1267,8 @@ struct security_operations {
int (*unix_may_send) (struct socket * sock, struct socket * other);
 
int (*socket_create) (int family, int type, int protocol, int kern);
-   void (*socket_post_create) (struct socket * sock, int family,
-   int type, int protocol, int kern);
+   int (*socket_post_create) (struct socket * sock, int family,
+  int type, int protocol, int kern);
int (*socket_bind) (struct socket * sock,
struct sockaddr * address, int addrlen);
int (*socket_connect) (struct socket * sock,
@@ -2677,13 +2677,13 @@ static inline int security_socket_create
return security_ops->socket_create(family, type, protocol, kern);
 }
 
-static inline void security_socket_post_create(struct socket * sock, 
-  int family,
-  int type, 
-  int protocol, int kern)
+static inline int security_socket_post_create(struct socket * sock,
+ int family,
+ int type,
+ int protocol, int kern)
 {
-   security_ops->socket_post_create(sock, family, type,
-protocol, kern);
+   return security_ops->socket_post_create(sock, family, type,
+   protocol, kern);
 }
 
 static inline int security_socket_bind(struct socket * sock, 
@@ -2809,11 +2809,12 @@ static inline int security_socket_create
return 0;
 }
 
-static inline void security_socket_post_create(struct socket * sock, 
-  int family,
-  int type, 
-  int protocol, int kern)
+static inline int security_socket_post_create(struct socket * sock,
+ int family,
+ int type,
+ int protocol, int kern)
 {
+   return 0;
 }
 
 static inline int security_socket_bind(struct socket * sock, 
Index: linux-2.6.17.i686-quilt/net/socket.c
===
--- linux-2.6.17.i686-quilt.orig/net/socket.c
+++ linux-2.6.17.i686-quilt/net/socket.c
@@ -976,11 +976,18 @@ int sock_create_lite(int family, int typ
goto out;
}
 
-   security_socket_post_create(sock, family, type, protocol, 1);
sock->type = type;
+   err = security_socket_post_create(sock, family, type, protocol, 1);
+   if (err)
+   goto out_rel

[RFC 1/8] NetLabel: documentation

2006-06-27 Thread paul . moore
Documentation for the NetLabel system; this includes a basic overview of how
NetLabel works and how LSM developers can integrate it into their favorite
LSM.  Also, due to the difficulty of finding expired IETF drafts, I am
including the IETF CIPSO draft that is the basis of the NetLabel CIPSO
implementation.
---
 CREDITS   |7 
 Documentation/00-INDEX|2 
 Documentation/netlabel/00-INDEX   |   10 
 Documentation/netlabel/cipso_ipv4.txt |   48 
 Documentation/netlabel/draft-ietf-cipso-ipsecurity-01.txt |  791 ++
 Documentation/netlabel/introduction.txt   |   53 
 Documentation/netlabel/lsm_interface.txt  |   47 
 7 files changed, 958 insertions(+)

Index: linux-2.6.17.i686-quilt/CREDITS
===
--- linux-2.6.17.i686-quilt.orig/CREDITS
+++ linux-2.6.17.i686-quilt/CREDITS
@@ -2383,6 +2383,13 @@ N: Thomas Molina
 E: [EMAIL PROTECTED]
 D: bug fixes, documentation, minor hackery
 
+N: Paul Moore
+E: [EMAIL PROTECTED]
+D: NetLabel author
+S: Hewlett-Packard
+S: 110 Spit Brook Road
+S: Nashua, NH 03062
+
 N: James Morris
 E: [EMAIL PROTECTED]
 W: http://namei.org/
Index: linux-2.6.17.i686-quilt/Documentation/00-INDEX
===
--- linux-2.6.17.i686-quilt.orig/Documentation/00-INDEX
+++ linux-2.6.17.i686-quilt/Documentation/00-INDEX
@@ -184,6 +184,8 @@ mtrr.txt
- how to use PPro Memory Type Range Registers to increase performance.
 nbd.txt
- info on a TCP implementation of a network block device.
+netlabel/
+   - directory with information on the NetLabel subsystem.
 networking/
- directory with info on various aspects of networking with Linux.
 nfsroot.txt
Index: linux-2.6.17.i686-quilt/Documentation/netlabel/00-INDEX
===
--- /dev/null
+++ linux-2.6.17.i686-quilt/Documentation/netlabel/00-INDEX
@@ -0,0 +1,10 @@
+00-INDEX
+   - this file.
+cipso_ipv4.txt
+   - documentation on the IPv4 CIPSO protocol engine.
+draft-ietf-cipso-ipsecurity-01.txt
+   - IETF draft of the CIPSO protocol, dated 16 July 1992.
+introduction.txt
+   - NetLabel introduction, READ THIS FIRST.
+lsm_interface.txt
+   - documentation on the NetLabel kernel security module API.
Index: linux-2.6.17.i686-quilt/Documentation/netlabel/cipso_ipv4.txt
===
--- /dev/null
+++ linux-2.6.17.i686-quilt/Documentation/netlabel/cipso_ipv4.txt
@@ -0,0 +1,48 @@
+NetLabel CIPSO/IPv4 Protocol Engine
+===================================
+Paul Moore, [EMAIL PROTECTED]
+
+May 17, 2006
+
+ * Overview
+
+The NetLabel CIPSO/IPv4 protocol engine is based on the IETF Commercial IP
+Security Option (CIPSO) draft from July 16, 1992.  A copy of this draft can be
+found in this directory, consult '00-INDEX' for the filename.  While the IETF
+draft never made it to an RFC standard it has become a de-facto standard for
+labeled networking and is used in many trusted operating systems.
+
+ * Outbound Packet Processing
+
+The CIPSO/IPv4 protocol engine applies the CIPSO IP option to packets by
+adding the CIPSO label to the socket.  This causes all packets leaving the
+system through the socket to have the CIPSO IP option applied.  The socket's
+CIPSO label can be changed at any point in time, however, it is recommended
+that it is set upon the socket's creation.  The LSM can set the socket's CIPSO
+label by using the NetLabel security module API; if the NetLabel "domain" is
+configured to use CIPSO for packet labeling then a CIPSO IP option will be
+generated and attached to the socket.
+
+ * Inbound Packet Processing
+
+The CIPSO/IPv4 protocol engine validates every CIPSO IP option it finds at the
+IP layer without any special handling required by the LSM.  However, in order
+to decode and translate the CIPSO label on the packet the LSM must use the
+NetLabel security module API to extract the security attributes of the packet.
+This is typically done at the socket layer using the 'socket_sock_rcv_skb()'
+LSM hook.
+
+ * Label Translation
+
+The CIPSO/IPv4 protocol engine contains a mechanism to translate CIPSO security
+attributes such as sensitivity level and category to values which are
+appropriate for the host.  These mappings are defined as part of a CIPSO
+Domain Of Interpretation (DOI) definition and are configured through the
+NetLabel user space communication layer.  Each DOI definition can have a
+different security attribute mapping table.
+
+ * Label Translation Cache
+
+The NetLabel system provides a framework for caching security attribute
+mappings from the network labels to the corresponding LSM identifiers.  The
+CIPSO/IPv4 protocol engine supports this ca

[RFC 3/8] NetLabel: CIPSOv4 engine

2006-06-27 Thread paul . moore
Add support for the Commercial IP Security Option (CIPSO) to the IPv4 network
stack.  CIPSO has become a de-facto standard for trusted/labeled networking
amongst existing Trusted Operating Systems such as Trusted Solaris, HP-UX CMW,
etc.  This implementation is designed to be used with the NetLabel subsystem
to provide explicit packet labeling to LSM developers.

The CIPSO/IPv4 packet labeling works by the LSM calling a NetLabel API function
which attaches a CIPSO label (IPv4 option) to a given socket; this in turn
attaches the CIPSO label to every packet leaving the socket without any extra
processing on the outbound side.  On the inbound side the individual packet's
sk_buff is examined through a call to a NetLabel API function to determine if a
CIPSO/IPv4 label is present and if so the security attributes of the CIPSO
label are returned to the caller of the NetLabel API function.
---
 net/ipv4/cipso_ipv4.c | 1749 ++
 1 files changed, 1749 insertions(+)

Index: linux-2.6.17.i686-quilt/net/ipv4/cipso_ipv4.c
===
--- /dev/null
+++ linux-2.6.17.i686-quilt/net/ipv4/cipso_ipv4.c
@@ -0,0 +1,1749 @@
+/*
+ * CIPSO - Commercial IP Security Option
+ *
+ * This is an implementation of the CIPSO 2.2 protocol as specified in
+ * draft-ietf-cipso-ipsecurity-01.txt with additional tag types as found in
+ * FIPS-188, copies of both documents can be found in the Documentation
+ * directory.  While CIPSO never became a full IETF RFC standard many vendors
+ * have chosen to adopt the protocol and over the years it has become a
+ * de-facto standard for labeled networking.
+ *
+ * Author: Paul Moore <[EMAIL PROTECTED]>
+ *
+ */
+
+/*
+ * (c) Copyright Hewlett-Packard Development Company, L.P., 2006
+ *
+ * This program is free software;  you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY;  without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
+ * the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program;  if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct cipso_v4_domhsh_entry {
+   char *domain;
+   u32 valid;
+   struct list_head list;
+   struct rcu_head rcu;
+};
+
+/* List of available DOI definitions */
+/* XXX - Updates should be minimal so having a single lock for the
+   cipso_v4_doi_list and the cipso_v4_doi_list->dom_list should be
+   okay. */
+/* XXX - This currently assumes a minimal number of different DOIs in use,
+   if in practice there are a lot of different DOIs this list should
+   probably be turned into a hash table or something similar so we
+   can do quick lookups. */
+DEFINE_SPINLOCK(cipso_v4_doi_list_lock);
+static struct list_head cipso_v4_doi_list = LIST_HEAD_INIT(cipso_v4_doi_list);
+
+/* Label mapping cache */
+#define CIPSO_V4_CACHE_BUCKETBITS 7
+#define CIPSO_V4_CACHE_BUCKETS    (1 << CIPSO_V4_CACHE_BUCKETBITS)
+#define CIPSO_V4_CACHE_BUCKETSIZE 10
+#define CIPSO_V4_CACHE_REORDERLIMIT   10
+/* PM - the number of cache buckets should probably be a compile time option */
+struct cipso_v4_map_cache_bkt {
+   spinlock_t lock;
+   u32 size;
+   struct list_head list;
+};
+struct cipso_v4_map_cache_entry {
+   u32 hash;
+   unsigned char *key;
+   u32 key_len;
+
+   struct netlbl_lsm_cache lsm_data;
+
+   u32 activity;
+   struct list_head list;
+};
+static u32 cipso_v4_cache_size = 0;
+static struct cipso_v4_map_cache_bkt *cipso_v4_cache = NULL;
+#define CIPSO_V4_CACHE_ENABLED (cipso_v4_cache_size > 0)
+
+/*
+ * Helper Functions
+ */
+
+/**
+ * cipso_v4_bitmap_walk - Walk a bitmap looking for a bit
+ * @bitmap: the bitmap
+ * @bitmap_len: length in bits
+ * @offset: starting offset
+ * @state: if non-zero, look for a set (1) bit else look for a cleared (0) bit
+ *
+ * Description:
+ * Starting at @offset, walk the bitmap from left to right until either the
+ * desired bit is found or we reach the end.  Return the bit offset, -1 if
+ * not found, or -2 if error.
+ */
+static int cipso_v4_bitmap_walk(const unsigned char *bitmap,
+   const u32 bitmap_len,
+   const u32 offset,
+   const u8 state)
+{
+   u32 bit_spot;
+   u32 byte_offset;
+   unsigned char bitmask;
+   unsigned char byte;
+
+   /
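
The body of cipso_v4_bitmap_walk() is cut off above. As a rough user-space sketch of the walk its kernel-doc describes (not the patch's code), the following assumes that "left to right" means most-significant bit first within each byte and that an out-of-range offset is the error case; the name sketch_bitmap_walk is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch only: scan a bitmap for the first bit equal to `state`,
 * starting at bit `offset`.  Bit 0 is assumed to be the MSB of byte 0.
 * Returns the bit offset, -1 if no such bit exists, -2 on bad input.
 */
static int sketch_bitmap_walk(const unsigned char *bitmap,
                              uint32_t bitmap_len, /* length in bits */
                              uint32_t offset,
                              uint8_t state)
{
    uint32_t bit;

    if (bitmap == NULL || offset > bitmap_len)
        return -2;

    for (bit = offset; bit < bitmap_len; bit++) {
        unsigned char mask = (unsigned char)(0x80 >> (bit % 8));
        int is_set = (bitmap[bit / 8] & mask) != 0;

        /* Stop at the first bit whose value matches the request. */
        if (is_set == (state != 0))
            return (int)bit;
    }
    return -1;
}
```

Walking bit by bit keeps the sketch short; a real implementation could skip whole bytes that cannot contain the wanted bit.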

[RFC 6/8] NetLabel: CIPSOv4 integration

2006-06-27 Thread paul . moore
Add CIPSO/IPv4 support and management to the NetLabel subsystem.  These changes
integrate the CIPSO/IPv4 configuration into the existing NetLabel code and
enable the use of CIPSO/IPv4 within the overall NetLabel framework.
---
 net/netlabel/netlabel_cipso_v4.c |  634 +++
 1 files changed, 634 insertions(+)

Index: linux-2.6.17.i686-quilt/net/netlabel/netlabel_cipso_v4.c
===
--- /dev/null
+++ linux-2.6.17.i686-quilt/net/netlabel/netlabel_cipso_v4.c
@@ -0,0 +1,634 @@
+/*
+ * NetLabel CIPSO/IPv4 Support
+ *
+ * This file defines the CIPSO/IPv4 functions for the NetLabel system.  The
+ * NetLabel system manages static and dynamic label mappings for network
+ * protocols such as CIPSO and RIPSO.
+ *
+ * Author: Paul Moore <[EMAIL PROTECTED]>
+ *
+ */
+
+/*
+ * (c) Copyright Hewlett-Packard Development Company, L.P., 2006
+ *
+ * This program is free software;  you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY;  without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
+ * the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program;  if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "netlabel_user.h"
+#include "netlabel_cipso_v4.h"
+
+/* NetLabel Generic NETLINK CIPSOv4 family */
+static struct genl_family netlbl_cipsov4_gnl_family = {
+   .id = GENL_ID_GENERATE,
+   .hdrsize = 0,
+   .name = NETLBL_NLTYPE_CIPSOV4_NAME,
+   .version = NETLBL_PROTO_VERSION,
+   .maxattr = 0,
+};
+
+
+/*
+ * Local Prototypes
+ */
+
+static void netlbl_cipsov4_send_ack(const struct genl_info *info,
+   const u32 ret_code);
+
+
+/*
+ * Helper Functions
+ */
+
+/**
+ * netlbl_cipsov4_doi_free - Frees a CIPSO V4 DOI definition
+ * @entry: the entry's RCU field
+ *
+ * Description:
+ * This function is designed to be used as a callback to the call_rcu()
+ * function so that the memory allocated to the DOI definition can be released
+ * safely.
+ *
+ */
+static void netlbl_cipsov4_doi_free(struct rcu_head *entry)
+{
+   struct cipso_v4_doi *ptr;
+
+   ptr = container_of(entry, struct cipso_v4_doi, rcu);
+   switch (ptr->type) {
+   case CIPSO_V4_MAP_STD:
+   if (ptr->map.std->lvl.cipso_size > 0)
+   kfree(ptr->map.std->lvl.cipso);
+   if (ptr->map.std->lvl.local_size > 0)
+   kfree(ptr->map.std->lvl.local);
+   if (ptr->map.std->cat.cipso_size > 0)
+   kfree(ptr->map.std->cat.cipso);
+   if (ptr->map.std->cat.local_size > 0)
+   kfree(ptr->map.std->cat.local);
+   break;
+   }
+   kfree(ptr);
+}
+
+
+/*
+ * NetLabel Command Handlers
+ */
+
+/**
+ * netlbl_cipsov4_add_std - Adds a CIPSO V4 DOI definition
+ * @doi: the DOI value
+ * @msg: the ADD message data
+ * @msg_size: the size of the ADD message buffer
+ *
+ * Description:
+ * Create a new CIPSO_V4_MAP_STD DOI definition based on the given ADD message
+ * and add it to the CIPSO V4 engine.  Return zero on success and non-zero on
+ * error.
+ *
+ */
+static int netlbl_cipsov4_add_std(const u32 doi,
+ const unsigned char *msg,
+ const u32 msg_size)
+{
+   int ret_val = -EPERM;
+   unsigned char *msg_ptr = (unsigned char *)msg;
+   u32 msg_len = msg_size;
+   u32 num_tags;
+   u32 num_lvls;
+   u32 num_cats;
+   struct cipso_v4_doi *doi_def = NULL;
+   u32 iter;
+   u32 tmp_val_a;
+   u32 tmp_val_b;
+
+   if (msg_len < 4)
+   goto add_std_failure;
+   num_tags = netlbl_getinc_u32(&msg_ptr);
+   msg_len -= 4;
+   if (num_tags == 0 || num_tags > CIPSO_V4_TAG_MAXCNT)
+   goto add_std_failure;
+
+   doi_def = kmalloc(sizeof(*doi_def), GFP_KERNEL);
+   if (doi_def == NULL) {
+   ret_val = -ENOMEM;
+   goto add_std_failure;
+   }
+   doi_def->map.std = kzalloc(sizeof(*doi_def->map.std),
+  GFP_KERNEL);
+   if (doi_def->map.std == NULL) {
+   ret_val = -ENOMEM;
+   goto add_std_failure;
+   }
+   doi_def->type = CIPSO_V4_MAP_STD;
+
+   if (msg_len < num_tags)
+   goto add_std_failure;
+   msg_len -= num_tags;
+   for (iter = 0; iter < n
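
The parsing above follows a check-length, read, then subtract pattern around netlbl_getinc_u32(), whose definition is not part of this excerpt. A minimal user-space sketch of the same idea, folding the length check into the read, might look like this; struct sketch_cursor and sketch_get_u32() are hypothetical names, and reading the value in host byte order is an assumption.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical cursor over a message buffer: position plus bytes left. */
struct sketch_cursor {
    const unsigned char *ptr;
    uint32_t len;
};

/*
 * Read one u32 and advance the cursor.  Returns 0 on success, or -1
 * (leaving the cursor untouched) if fewer than 4 bytes remain.
 * Host byte order is assumed here; the real helper may differ.
 */
static int sketch_get_u32(struct sketch_cursor *c, uint32_t *val)
{
    if (c->len < sizeof(*val))
        return -1;
    memcpy(val, c->ptr, sizeof(*val)); /* avoids unaligned access */
    c->ptr += sizeof(*val);
    c->len -= sizeof(*val);
    return 0;
}
```

Keeping the remaining length next to the pointer means a caller cannot advance one without the other.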

[RFC 2/8] NetLabel: core network changes

2006-06-27 Thread paul . moore
Changes to the core network stack to support the NetLabel subsystem.  This
includes changes to the IPv4 option handling to support CIPSO labels, and a new
NetLabel hook in inet_accept() to handle NetLabel attributes across
accept()s done by in-kernel daemons.
---
 include/linux/ip.h   |1 
 include/net/cipso_ipv4.h |  251 
 include/net/inet_sock.h  |2 
 include/net/netlabel.h   |  488 +++
 net/ipv4/Makefile|1 
 net/ipv4/af_inet.c   |3 
 net/ipv4/ah4.c   |2 
 net/ipv4/ip_options.c|   19 +
 8 files changed, 765 insertions(+), 2 deletions(-)

Index: linux-2.6.17.i686-quilt/include/linux/ip.h
===
--- linux-2.6.17.i686-quilt.orig/include/linux/ip.h
+++ linux-2.6.17.i686-quilt/include/linux/ip.h
@@ -57,6 +57,7 @@
 #define IPOPT_SEC  (2 |IPOPT_CONTROL|IPOPT_COPY)
 #define IPOPT_LSRR (3 |IPOPT_CONTROL|IPOPT_COPY)
 #define IPOPT_TIMESTAMP (4 |IPOPT_MEASUREMENT)
+#define IPOPT_CIPSO (6 |IPOPT_CONTROL|IPOPT_COPY)
 #define IPOPT_RR   (7 |IPOPT_CONTROL)
 #define IPOPT_SID  (8 |IPOPT_CONTROL|IPOPT_COPY)
 #define IPOPT_SSRR (9 |IPOPT_CONTROL|IPOPT_COPY)
Index: linux-2.6.17.i686-quilt/include/net/cipso_ipv4.h
===
--- /dev/null
+++ linux-2.6.17.i686-quilt/include/net/cipso_ipv4.h
@@ -0,0 +1,251 @@
+/*
+ * CIPSO - Commercial IP Security Option
+ *
+ * This is an implementation of the CIPSO 2.2 protocol as specified in
+ * draft-ietf-cipso-ipsecurity-01.txt with additional tag types as found in
+ * FIPS-188, copies of both documents can be found in the Documentation
+ * directory.  While CIPSO never became a full IETF RFC standard many vendors
+ * have chosen to adopt the protocol and over the years it has become a
+ * de-facto standard for labeled networking.
+ *
+ * Author: Paul Moore <[EMAIL PROTECTED]>
+ *
+ */
+
+/*
+ * (c) Copyright Hewlett-Packard Development Company, L.P., 2006
+ *
+ * This program is free software;  you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY;  without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
+ * the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program;  if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ */
+
+#ifndef _CIPSO_IPV4_H
+#define _CIPSO_IPV4_H
+
+#include 
+#include 
+#include 
+#include 
+
+/* known doi values */
+#define CIPSO_V4_DOI_UNKNOWN  0x
+
+/* tag types */
+#define CIPSO_V4_TAG_INVALID  0
+#define CIPSO_V4_TAG_RBITMAP  1
+#define CIPSO_V4_TAG_ENUM 2
+#define CIPSO_V4_TAG_RANGE    5
+#define CIPSO_V4_TAG_PBITMAP  6
+#define CIPSO_V4_TAG_FREEFORM 7
+
+/* doi mapping types */
+#define CIPSO_V4_MAP_UNKNOWN  0
+#define CIPSO_V4_MAP_STD  1
+#define CIPSO_V4_MAP_PASS 2
+
+/* limits */
+#define CIPSO_V4_MAX_REM_LVLS 256
+#define CIPSO_V4_INV_LVL  0x8000
+#define CIPSO_V4_MAX_LOC_LVLS (CIPSO_V4_INV_LVL - 1)
+#define CIPSO_V4_MAX_REM_CATS 65536
+#define CIPSO_V4_INV_CAT  0x8000
+#define CIPSO_V4_MAX_LOC_CATS (CIPSO_V4_INV_CAT - 1)
+
+/*
+ * CIPSO DOI definitions
+ */
+
+/* DOI definition struct */
+#define CIPSO_V4_TAG_MAXCNT   5
+struct cipso_v4_doi {
+   u32 doi;
+   u32 type;
+   union {
+   struct cipso_v4_std_map_tbl *std;
+   } map;
+   u8 tags[CIPSO_V4_TAG_MAXCNT];
+
+   u32 valid;
+   struct list_head list;
+   struct rcu_head rcu;
+   struct list_head dom_list;
+};
+
+/* Standard CIPSO mapping table */
+/* NOTE: the highest order bit (i.e. 0x8000) is an 'invalid' flag, if the
+ *   bit is set then consider that value as unspecified, meaning the
+ *   mapping for that particular level/category is invalid */
+struct cipso_v4_std_map_tbl {
+   struct {
+   u32 *cipso;
+   u32 *local;
+   u32 cipso_size;
+   u32 local_size;
+   } lvl;
+   struct {
+   u32 *cipso;
+   u32 *local;
+   u32 cipso_size;
+   u32 local_size;
+   } cat;
+};
+
+/*
+ * Helper Functions
+ */
+
+#define CIPSO_V4_OPTEXIST(x) (IPCB(x)->opt.cipso != 0)
+#define CIPSO_V4_OPTPTR(x) ((x)->nh.raw + IPCB(x)->opt.cipso)
+
+/*
+ * DOI List Functions
+ */
+
+#ifdef CONFIG_NETLABEL
+int cipso_v4_doi_add(struct cipso_v4_doi *doi_def);
+int

Re: [PATCH 00/21] e1000: driver update to 7.1.9-k2

2006-06-27 Thread Auke Kok


Jeff,

after comments I've made some adjustments. I'll list them below against the 
old summary. The changes are available from our git-server:


Please pull from:

git://lost.foo-projects.org/~ahkok/git/netdev-2.6 upstream

These patches are against
netdev-2.6#upstream 612eff0e3715a6faff5ba1b74873b99e036c59fe
(Brian Haley <[EMAIL PROTECTED]> / [PATCH] s2io: netpoll support)



Summary of patches:

[01]: fix loopback ethtool test
[02]: rework driver hardware reset locking
[03]: Make PHY powerup/down a function
[04]: fix CONFIG_PM blocks
[05]: small performance tweak by removing double code
[06]: add smart power down code
[07]: change printk into DPRINTK
[08]: recycle skb
[09]: rework module param code with uninitialized values
[10]: force register write flushes to circumvent broken platforms


Unmodified. See comments here:
http://marc.theaimsgroup.com/?l=linux-netdev&m=115142459725123&w=2 [1]


[11]: disable CRC stripping workaround


Removed all references to SECRC (crc stripping) instead of leaving it commented.


[12]: fix adapter led blinking inconsistency
[13]: add E1000_BIG_ENDIAN symbol


Dropped this patch entirely


[14]: M88 PHY workaround
[15]: check return value of _get_speed_and_duplex
[16]: disable ERT
[17]: add ich8lan core functions
[18]: integrate ich8 support into driver
[19]: allow user to disable ich8 lock loss workaround
[20]: add ich8lan device ID's
[21]: increase version to 7.1.9-k2



[1] I can drop #11 in case someone throws a fit ;) - like everyone, I'd really
like to see patches 17->20 queued for 2.6.18 for obvious reasons - this is the
most important section of these patches!


Cheers,

Auke


---
 drivers/net/e1000/e1000.h |   10
 drivers/net/e1000/e1000_ethtool.c |  143 +--
 drivers/net/e1000/e1000_hw.c  | 1770 +++---
 drivers/net/e1000/e1000_hw.h  |  398 
 drivers/net/e1000/e1000_main.c|  384 +---
 drivers/net/e1000/e1000_osdep.h   |   13
 drivers/net/e1000/e1000_param.c   |  213 ++--
 7 files changed, 2530 insertions(+), 401 deletions(-)
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Herbert Poetzl
On Tue, Jun 27, 2006 at 09:07:38AM -0700, Ben Greear wrote:
> Ben Greear wrote:
> >Herbert Poetzl wrote:
> >
> >>On Mon, Jun 26, 2006 at 03:13:17PM -0700, Ben Greear wrote:
> >
> >>yes, that sounds good to me, any numbers how that
> >>affects networking in general (performance wise and
> >>memory wise, i.e. caches and hashes) ...
> >
> >I'll run some tests later today.  Based on my previous tests,
> >I don't remember any significant overhead.
> 
> Here's a quick benchmark using my redirect devices (RDD). Each RDD
> comes in a pair...when you tx on one, the pkt is rx'd on the peer.
> The idea is that it is exactly like two physical ethernet interfaces
> connected by a cross-over cable.
>
> My test system is a 64-bit dual-core Intel system, 3.013 Ghz processor
> with 1GB RAM. Fairly standard stuff..it's one of the Shuttle XPC
> systems. Kernel is 2.6.16.16 (64-bit).
> 
> 
> Test setup is:  rdd1 -- rdd2   [bridge]   rdd3 -- rdd4
> 
> I am using my proprietary module for the bridge logic...and the
> default bridge should be at least this fast. I am injecting 1514 byte
> packets on rdd1 and rdd4 with pktgen (bi-directional flow). My pktgen
> is also receiving the pkts and gathering stats.
>
> This setup sustains 1.7Gbps of generated and received traffic between
> rdd1 and rdd4.
>
> Running only the [bridge] between two 10/100/1000 ports on an Intel
> PCI-E NIC will sustain about 870Mbps (bi-directional) on this system,
> so the virtual devices are quite efficient, as suspected.
>
> I have not yet had time to benchmark the mac-vlans...hopefully later
> today.

hmm, maybe you could also benchmark loopback connections
(and their throughput) on your system?

my (not so fancy) PIII, 32bit, 2.6.17.1 seems to do
roughly 2Gbs on the loopback device (tested with dd
and netcat)

best,
Herbert

> Thanks,
> Ben
> 
> -- 
> Ben Greear <[EMAIL PROTECTED]>
> Candela Technologies Inc  http://www.candelatech.com


Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Herbert Poetzl
On Tue, Jun 27, 2006 at 10:19:23AM -0700, Ben Greear wrote:
> Eric W. Biederman wrote:
> >Herbert Poetzl <[EMAIL PROTECTED]> writes:
> >
> >
> >>On Tue, Jun 27, 2006 at 05:52:52AM -0600, Eric W. Biederman wrote:
> >>
> >>>Inside the containers I want all network devices named eth0!
> >>
> >>huh? even if there are two of them? also tun?
> >>
> >>I think you meant, you want to be able to have eth0 in
> >>_more_ than one guest where eth0 in a guest can also
> >>be/use/relate to eth1 on the host, right?
> >
> >
> >Right I want to have an eth0 in each guest where eth0 is
> >its own network device and need have no relationship to
> >eth0 on the host.
> 
> How does that help anything?  Do you envision programs
> that make special decisions on whether the interface is
> called eth0 v/s eth151?

well, those poor folks who do not have ethernet
devices for networking :)

seriously, what I think Eric meant was that it
might be nice (especially for migration purposes)
to keep the device namespace completely virtualized
and not just isolated ...

I'm fine with that, as long as it does not add
overhead or complicate handling, and as far as I
can tell, it should not do that ...

best,
Herbert

> Ben
> 
> 
> -- 
> Ben Greear <[EMAIL PROTECTED]>
> Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH REPOST 0/2][RFC] Network Event Notifier Mechanism

2006-06-27 Thread Herbert Xu
On Tue, Jun 27, 2006 at 09:31:57AM -0500, Steve Wise wrote:
> 
> > I'd like to know more about what the RDMA device is going to do with this
> > information.  I thought RDMA was for receiving packets? Most of the info
> > here pertains to transmission.
> 
> RDMA Ethernet devices adhere to a set of protocols defined by the IETF.
> See the RDDP WG (http://www.ietf.org/html.charters/rddp-charter.html)
> for the Internet Drafts that define the protocols.

Would it be possible for you to give us a quick summary of the relevant
points?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: Network namespaces a path to mergable code.

2006-06-27 Thread Sam Vilain
Andrey Savochkin wrote:
> On Tue, Jun 27, 2006 at 11:20:40AM -0600, Eric W. Biederman wrote:
>   
>> Thinking about this I am going to suggest a slightly different direction
>> to get a patchset we can merge.
>>
>> First we concentrate on the fundamentals.
>> - How we mark a device as belonging to a specific network namespace.
>> - How we mark a socket as belonging to a specific network namespace.
>> 
>
> I agree with the direction of your thoughts.
> I was trying to do a similar thing, define clear steps in network
> namespace merging.
>
> My first patchset covers devices but not sockets.
> The only difference from what you're suggesting is ipv4 routing.
> For me, it is no less important than devices and sockets.  Maybe even
> more important, since routing exposes design deficiencies less obvious at
> socket level.
>   

It sounds then like it would be a good start to have general socket
namespaces, if it would merge more easily - perhaps then network device
namespaces would fall into place more easily.

AIUI socket namespaces are also necessary for situations where you want
containers to share IP addresses. AIUI, PlanetLab do something like this
with a module atop of VServer already (but read
http://openvz.org/pipermail/devel/2006-June/000666.html for a proper
explanation from Mark Huang)

>> As part of the fundamentals we add a patch to the generic socket code
>> that by default will disable it for protocol families that do not indicate
>> support for handling network namespaces, on a non-default network namespace.
>> 
>
> Fine
>
> Can you summarize your objections against my way of handling devices, please?
>   
There were many objections, the major one being the patch was too large for 
certainty of adequate review.

Quoting what I perceived as a summary from Eric:
> When I went through this, my patchset just added an explicit
> continue if the device was not in the appropriate namespace.
> I actually prefer the multiple list implementation but at
> the same time I think it is harder to get a clean implementation
> out of it.


You offered to re-do the patch without separate lists - I suggest that
this go ahead. No-one should really care; splitting it out into separate
lists can then be considered a performance optimization for later.
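
As a rough illustration of the single-list approach (a sketch with made-up names, not code from either patchset), every walk over the one global device list simply skips devices that belong to a different namespace:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the namespace and device structures. */
struct sketch_net_ns {
    int id;
};

struct sketch_net_device {
    const char *name;
    const struct sketch_net_ns *ns; /* namespace that owns this device */
    struct sketch_net_device *next; /* single global device list */
};

/* Count the devices visible from namespace `ns`. */
static size_t sketch_count_devices(const struct sketch_net_device *head,
                                   const struct sketch_net_ns *ns)
{
    const struct sketch_net_device *dev;
    size_t n = 0;

    for (dev = head; dev != NULL; dev = dev->next) {
        if (dev->ns != ns)
            continue; /* the "explicit continue": not ours, skip it */
        n++;
    }
    return n;
}
```

Two devices named eth0 can coexist on the list as long as they sit in different namespaces. The cost is that every walk touches every device, which is what per-namespace lists would later optimize away.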

> And what was the typo you referred to in your letter to Kirill Korotaev?
>   
I think this is the comment he refers to:
> These hunks should use for_each_netdev(ifp);


Both quotes are from http://lkml.org/lkml/2006/6/26/147

Though, in Kirill's defense, it seems a bit strange to expect him to
raise a fault that was just raised by Eric, in a reply to the message
where he raised it.

Sam.


Re: [NET]: Added GSO header verification

2006-06-27 Thread Herbert Xu
On Tue, Jun 27, 2006 at 01:46:35PM -0700, Michael Chan wrote:
> On Tue, 2006-06-27 at 22:07 +1000, Herbert Xu wrote:
> 
> > [NET]: Added GSO header verification
> >
> > @@ -2166,10 +2166,14 @@ struct sk_buff *tcp_tso_segment(struct s
> > if (!pskb_may_pull(skb, thlen))
> > goto out;
> >  
> > +   segs = NULL;
> > +   if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST))
> > +   goto out;
> > +
> 
> This logic doesn't look right to me.  Perhaps it's backwards and should
> be:
> 
> if (!skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST))

Oops, you're absolutely right.  Here is the fix.

[NET]: Fix logical error in skb_gso_ok

The test in skb_gso_ok is backwards.  Noticed by Michael Chan
<[EMAIL PROTECTED]>.

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 84b0f0d..efd1e2a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -994,12 +994,12 @@ static inline int skb_gso_ok(struct sk_b
 {
int feature = skb_shinfo(skb)->gso_size ?
  skb_shinfo(skb)->gso_type << NETIF_F_GSO_SHIFT : 0;
-   return (features & feature) != feature;
+   return (features & feature) == feature;
 }
 
 static inline int netif_needs_gso(struct net_device *dev, struct sk_buff *skb)
 {
-   return skb_gso_ok(skb, dev->features);
+   return !skb_gso_ok(skb, dev->features);
 }
 
 #endif /* __KERNEL__ */
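
For readers following the logic, here is a small user-space sketch of the corrected test. The constants and function names are illustrative stand-ins, not the kernel's definitions: a packet is "ok" when every GSO feature bit it requires is present in the feature mask, and a device needs software GSO exactly when that is not the case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; the kernel's real constants may differ. */
#define SKETCH_GSO_SHIFT 16
#define SKETCH_GSO_TCPV4 (1u << 0)

/*
 * True when `features` covers every GSO bit the packet requires.
 * A packet with gso_size == 0 is not a GSO packet and requires nothing.
 */
static bool sketch_gso_ok(uint32_t gso_size, uint32_t gso_type,
                          uint32_t features)
{
    uint32_t feature = gso_size ? gso_type << SKETCH_GSO_SHIFT : 0;

    return (features & feature) == feature;
}

/* Software segmentation is needed when the device cannot take the packet. */
static bool sketch_needs_gso(uint32_t gso_size, uint32_t gso_type,
                             uint32_t dev_features)
{
    return !sketch_gso_ok(gso_size, gso_type, dev_features);
}
```

With the original `!=` comparison the helper answered the opposite question, which is why the call site in tcp_tso_segment() looked inverted.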


[PATCH Round 3 0/2][RFC] Network Event Notifier Mechanism

2006-06-27 Thread Steve Wise

Round 3 Changes:

- changed netlink msg for neighbour change to (RTM_NEIGHUPD)
- added netlink msg for PMTU change events (RTM_ROUTEUPD)
- added netlink messages for redirect (RTM_DELROUTE + RTM_NEWROUTE)
- tested neighbour change events via netlink for ipv4 and ipv6.
- tested redirect change events via netlink for ipv4.

Round 2 Changes:

- cleaned up event structures per review feedback.
- began integration with netlink (see neighbour changes in patch 2).
- added IPv6 support.

TODO: 

- review feedback changes, if any
- more testing
- retest with RDMA NIC

--

This patch implements a mechanism that allows interested clients to
register for notification of certain network events. The intended use
is to allow RDMA devices (linux/drivers/infiniband) to be notified of
neighbour updates, ICMP redirects, path MTU changes, and route changes.

The reason these devices need update events is because they typically
cache this information in hardware and need to be notified when this
information has been updated.  For information on RDMA protocols, see:
http://www.ietf.org/html.charters/rddp-charter.html.

The key events of interest are:

- neighbour mac address change 
- routing redirect (the next hop neighbour changes for a dst_entry)
- path mtu change (the path mtu for a dst_entry changes).
- route add/deletes

NOTE: These new netevents are also passed up to user space via netlink.

We would like to get this or similar functionality included in 2.6.19
and request comments.

This patchset consists of 2 patches:

1) New files implementing the Network Event Notifier
2) Core network changes to generate network event notifications

Signed-off-by: Tom Tucker <[EMAIL PROTECTED]>
Signed-off-by: Steve Wise <[EMAIL PROTECTED]>


[PATCH Round 3 1/2] Network Event Notifier Mechanism.

2006-06-27 Thread Steve Wise

This patch uses notifier blocks to implement a network event
notifier mechanism.

Clients register their callback function by calling
register_netevent_notifier() like this:

static struct notifier_block nb = {
.notifier_call = my_callback_func
};

...

register_netevent_notifier(&nb);
---

 include/net/netevent.h |   49 +++
 net/core/netevent.c|   68 
 2 files changed, 117 insertions(+), 0 deletions(-)

diff --git a/include/net/netevent.h b/include/net/netevent.h
new file mode 100644
index 000..22214c8
--- /dev/null
+++ b/include/net/netevent.h
@@ -0,0 +1,49 @@
+#ifndef _NET_EVENT_H
+#define _NET_EVENT_H
+
+/*
+ * Generic netevent notifiers
+ *
+ * Authors:
+ *  Tom Tucker  <[EMAIL PROTECTED]>
+ *
+ * Changes:
+ */
+
+#ifdef __KERNEL__
+
+#include 
+
+/* 
+ * Generic route info structure.
+ *
+ * Family      Data ptr type
+ *
+ * AF_INET     - struct fib_info *
+ * AF_INET6    - struct rt6_info *
+ * AF_DECnet   - struct dn_route *
+ */
+struct netevent_route_info {
+   u16 family;
+   void *data; 
+};
+
+struct netevent_redirect {
+   struct dst_entry *old;
+   struct dst_entry *new;
+};
+
+enum netevent_notif_type {
+   NETEVENT_NEIGH_UPDATE = 1, /* arg is struct neighbour ptr */
+   NETEVENT_ROUTE_ADD,        /* arg is struct netevent_route_info ptr */
+   NETEVENT_ROUTE_DEL,        /* arg is struct netevent_route_info ptr */
+   NETEVENT_PMTU_UPDATE,      /* arg is struct dst_entry ptr */
+   NETEVENT_REDIRECT,         /* arg is struct netevent_redirect ptr */
+};
+
+extern int register_netevent_notifier(struct notifier_block *nb);
+extern int unregister_netevent_notifier(struct notifier_block *nb);
+extern int call_netevent_notifiers(unsigned long val, void *v);
+
+#endif
+#endif
diff --git a/net/core/netevent.c b/net/core/netevent.c
new file mode 100644
index 000..e995751
--- /dev/null
+++ b/net/core/netevent.c
@@ -0,0 +1,68 @@
+/*
+ * Network event notifiers
+ *
+ * Authors:
+ *  Tom Tucker <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Fixes:
+ */
+
+#include 
+#include 
+
+static ATOMIC_NOTIFIER_HEAD(netevent_notif_chain);
+
+/**
+ * register_netevent_notifier - register a netevent notifier block
+ * @nb: notifier
+ *
+ * Register a notifier to be called when a netevent occurs.
+ * The notifier passed is linked into the kernel structures and must
+ * not be reused until it has been unregistered. A negative errno code
+ * is returned on a failure.
+ */
+int register_netevent_notifier(struct notifier_block *nb)
+{
+   int err;
+
+   err = atomic_notifier_chain_register(&netevent_notif_chain, nb);
+   return err;
+}
+
+/**
+ * unregister_netevent_notifier - unregister a netevent notifier block
+ * @nb: notifier
+ *
+ * Unregister a notifier previously registered by
+ * register_netevent_notifier(). The notifier is unlinked from the
+ * kernel structures and may then be reused. A negative errno code
+ * is returned on a failure.
+ */
+
+int unregister_netevent_notifier(struct notifier_block *nb)
+{
+   return atomic_notifier_chain_unregister(&netevent_notif_chain, nb);
+}
+
+/**
+ * call_netevent_notifiers - call all netevent notifier blocks
+ *  @val: value passed unmodified to notifier function
+ *  @v:   pointer passed unmodified to notifier function
+ *
+ * Call all netevent notifier blocks.  Parameters and return value
+ * are as for atomic_notifier_call_chain().
+ */
+
+int call_netevent_notifiers(unsigned long val, void *v)
+{
+   return atomic_notifier_call_chain(&netevent_notif_chain, val, v);
+}
+
+EXPORT_SYMBOL_GPL(register_netevent_notifier);
+EXPORT_SYMBOL_GPL(unregister_netevent_notifier);
+EXPORT_SYMBOL_GPL(call_netevent_notifiers);
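
For readers who have not used notifier chains, here is a minimal user-space sketch of the register/unregister/call pattern that the three functions above wrap. It is an illustration only: every demo_* name is hypothetical, the chain is a plain singly linked list, and it leaves out the locking and priority ordering that the kernel's atomic notifier chains provide.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model of a notifier chain. These are hypothetical stand-ins,
 * not the kernel's struct notifier_block API. */
struct demo_notifier {
	int (*call)(struct demo_notifier *nb, unsigned long event, void *arg);
	struct demo_notifier *next;
};

static struct demo_notifier *demo_chain;

/* Link nb at the head of the chain. Returns 0 on success. */
static int demo_register(struct demo_notifier *nb)
{
	assert(nb != NULL && nb->call != NULL);
	nb->next = demo_chain;
	demo_chain = nb;
	return 0;
}

/* Unlink nb from the chain. Returns 0 on success, -1 if nb was not found. */
static int demo_unregister(struct demo_notifier *nb)
{
	struct demo_notifier **p;

	for (p = &demo_chain; *p; p = &(*p)->next) {
		if (*p == nb) {
			*p = nb->next;
			nb->next = NULL;
			return 0;
		}
	}
	return -1;
}

/* Pass event and arg unmodified to every registered callback.
 * Returns the number of callbacks invoked. */
static int demo_call_chain(unsigned long event, void *arg)
{
	struct demo_notifier *nb;
	int called = 0;

	for (nb = demo_chain; nb; nb = nb->next) {
		nb->call(nb, event, arg);
		called++;
	}
	return called;
}

/* Example callback: counts events in the int that arg points to. */
static int demo_count_events(struct demo_notifier *nb, unsigned long event,
			     void *arg)
{
	(void)nb;
	(void)event;
	(*(int *)arg)++;
	return 0;
}
```

A subscriber would fill in a demo_notifier with its callback, register it once, and unregister it before the structure goes away, which mirrors the "must not be reused until it has been unregistered" rule in the comments above.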
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH Round 3 2/2] Core network changes to support network event notification.

2006-06-27 Thread Steve Wise

This patch adds netevent and netlink calls for neighbour change, route
add/del, pmtu change, and routing redirect events.

Netlink Details:

Neighbour change events are broadcast as a new ndmsg type RTM_NEIGHUPD.

Path mtu change events are broadcast as a new rtmsg type RTM_ROUTEUPD.

Routing redirect events are broadcast as a pair of rtmsgs, RTM_DELROUTE
and RTM_NEWROUTE.
---

 include/linux/rtnetlink.h |4 ++
 net/core/Makefile |2 +
 net/core/neighbour.c  |   37 ---
 net/ipv4/fib_semantics.c  |9 +
 net/ipv4/route.c  |   86 ++--
 net/ipv6/route.c  |   87 +
 6 files changed, 213 insertions(+), 12 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index facd9ee..340ca4f 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -35,6 +35,8 @@ #define RTM_NEWROUTE  RTM_NEWROUTE
 #define RTM_DELROUTE   RTM_DELROUTE
RTM_GETROUTE,
 #define RTM_GETROUTE   RTM_GETROUTE
+   RTM_ROUTEUPD,
+#define RTM_ROUTEUPD   RTM_ROUTEUPD
 
RTM_NEWNEIGH= 28,
 #define RTM_NEWNEIGH   RTM_NEWNEIGH
@@ -42,6 +44,8 @@ #define RTM_NEWNEIGH  RTM_NEWNEIGH
 #define RTM_DELNEIGH   RTM_DELNEIGH
RTM_GETNEIGH,
 #define RTM_GETNEIGH   RTM_GETNEIGH
+   RTM_NEIGHUPD,
+#define RTM_NEIGHUPD   RTM_NEIGHUPD
 
RTM_NEWRULE = 32,
 #define RTM_NEWRULERTM_NEWRULE
diff --git a/net/core/Makefile b/net/core/Makefile
index e9bd246..2645ba4 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -7,7 +7,7 @@ obj-y := sock.o request_sock.o skbuff.o 
 
 obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 
-obj-y   += dev.o ethtool.o dev_mcast.o dst.o \
+obj-y   += dev.o ethtool.o dev_mcast.o dst.o netevent.o \
neighbour.o rtnetlink.o utils.o link_watch.o filter.o
 
 obj-$(CONFIG_XFRM) += flow.o
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 50a8c73..bf70981 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -30,9 +30,11 @@ #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
+#include 
 
 #define NEIGH_DEBUG 1
 
@@ -59,6 +61,7 @@ static void neigh_app_notify(struct neig
 #endif
 static int pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev);
 void neigh_changeaddr(struct neigh_table *tbl, struct net_device *dev);
+static void rtm_neigh_change(struct neighbour *n);
 
 static struct neigh_table *neigh_tables;
 #ifdef CONFIG_PROC_FS
@@ -755,6 +758,7 @@ #endif
neigh->nud_state = NUD_STALE;
neigh->updated = jiffies;
neigh_suspect(neigh);
+   notify = 1;
}
} else if (state & NUD_DELAY) {
if (time_before_eq(now, 
@@ -763,6 +767,7 @@ #endif
neigh->nud_state = NUD_REACHABLE;
neigh->updated = jiffies;
neigh_connect(neigh);
+   notify = 1;
next = neigh->confirmed + neigh->parms->reachable_time;
} else {
NEIGH_PRINTK2("neigh %p is probed.\n", neigh);
@@ -820,6 +825,8 @@ #endif
 out:
write_unlock(&neigh->lock);
}
+   if (notify)
+   rtm_neigh_change(neigh);
 
 #ifdef CONFIG_ARPD
if (notify && neigh->parms->app_probes)
@@ -927,9 +934,7 @@ int neigh_update(struct neighbour *neigh
 {
u8 old;
int err;
-#ifdef CONFIG_ARPD
int notify = 0;
-#endif
struct net_device *dev;
int update_isrouter = 0;
 
@@ -949,9 +954,7 @@ #endif
neigh_suspect(neigh);
neigh->nud_state = new;
err = 0;
-#ifdef CONFIG_ARPD
notify = old & NUD_VALID;
-#endif
goto out;
}
 
@@ -1023,9 +1026,7 @@ #endif
if (!(new & NUD_CONNECTED))
neigh->confirmed = jiffies -
  (neigh->parms->base_reachable_time << 1);
-#ifdef CONFIG_ARPD
notify = 1;
-#endif
}
if (new == old)
goto out;
@@ -1056,7 +1057,11 @@ out:
(neigh->flags | NTF_ROUTER) :
(neigh->flags & ~NTF_ROUTER);
}
+
write_unlock_bh(&neigh->lock);
+
+   if (notify)
+   rtm_neigh_change(neigh);
 #ifdef CONFIG_ARPD
if (notify && neigh->parms->app_probes)
neigh_app_notify(neigh);
@@ -2370,9 +2375,27 @@ static void neigh_app_notify(struct neig
NETLINK_CB(skb).dst_group  = RTNLGRP_NEIGH;
netlink_broadcast(rtnl, skb, 0, RTNLGRP_NEIGH, GFP_ATOMIC);
 }
-
 #endif /* CONFIG_ARPD */
 
+static void rtm_neigh_change(struct neighbour *n)
+{
+   struct nlmsghdr *nlh;
+   int size = NLMSG_SPACE(sizeof(struct ndmsg) + 256);
+ 

Re: [NET]: Added GSO header verification

2006-06-27 Thread Michael Chan
On Tue, 2006-06-27 at 22:07 +1000, Herbert Xu wrote:

> [NET]: Added GSO header verification
>
> @@ -2166,10 +2166,14 @@ struct sk_buff *tcp_tso_segment(struct s
> if (!pskb_may_pull(skb, thlen))
> goto out;
>  
> +   segs = NULL;
> +   if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST))
> +   goto out;
> +

This logic doesn't look right to me.  Perhaps it's backwards and should
be:

if (!skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST))



Re: Please pull 'upstream' branch of wireless-2.6

2006-06-27 Thread Larry Finger

Michael Buesch wrote:
> On Tuesday 27 June 2006 22:06, Larry Finger wrote:
> > John,
> >
> > I would like to find a diplomatic solution to this impasse between Michael and Jeff, which is why
> > I'm writing to you privately. Michael is correct in that the loop in question will not usually delay
>
> private?

I meant it to be private, but screwed up.

> > long; however, on some hardware it takes longer than on his. On mine, I have seen delays as long as
> > 550 usec.
>
> What's the chip?

bcm43xx: Chip ID 0x4306, rev 0x2
bcm43xx: Number of cores: 6
bcm43xx: Core 0: ID 0x800, rev 0x2, vendor 0x4243, enabled
bcm43xx: Core 1: ID 0x812, rev 0x4, vendor 0x4243, disabled
bcm43xx: Core 2: ID 0x80d, rev 0x1, vendor 0x4243, enabled
bcm43xx: Core 3: ID 0x807, rev 0x1, vendor 0x4243, disabled
bcm43xx: Core 4: ID 0x804, rev 0x7, vendor 0x4243, enabled
bcm43xx: Core 5: ID 0x812, rev 0x4, vendor 0x4243, disabled
bcm43xx: Ignoring additional 802.11 core.
bcm43xx: Detected PHY: Version: 1, Type 2, Revision 1
bcm43xx: Detected Radio: ID: 2205017f (Manuf: 17f Ver: 2050 Rev: 2)

> > In any case, I think that the following code fragment would work and pass Jeff's criticism:
> >
> > for (i=5000; i; i--) {
> > 	..
> > 	usleep(1);
>
> usleep? Can't find that in my kernel tree.
> In fact, I think the lowest possible sleep time
> depends on HZ and is 1msec on 1000HZ.

I meant udelay, of course.

> Additionally, we are holding a spinlock at this time, so it is
> not as easy as simply replacing udelay() by some sleeping function.

I know that.

> > This would make the worst-case delay be 5 msec, but would provide a cushion of 10X the longest I
> > have seen and should be safe.
> >
> > Do you have any suggestions on what should be done next?
>
> Leave it as is and find out why it takes so long for your strange card. ;)

I once offered you my second, duplicate card for testing, but never heard back. Do you have any
ideas regarding diagnostics to see why it takes so long? Remember, this card used to time-out on the
1 second delay before the periodic work was restructured.

Larry





Re: [PATCH Round 2 0/2][RFC] Network Event Notifier Mechanism

2006-06-27 Thread David Miller
From: Steve Wise <[EMAIL PROTECTED]>
Date: Tue, 27 Jun 2006 15:33:19 -0500

> From my experimentation with netlink, RTM_NEWROUTE and RTM_DELROUTE
> messages do not get sent up for redirect events.  I have, in fact, added
> this with the new patch I'll send out soon.  So either way I need to
> change the IPv[46] code to generate a notification for redirects.  With
> the single NETEVENT_REDIRECT call, the RDMA driver can, in one sweep,
> update all the connections.  It seems more efficient.  At the place
> where I've hooked redirect, both the old route and the new route are
> already created.

Ok, let's see what it looks like.


Re: [PATCH] Export accept queue len of a TCP listening socket via rx_queue

2006-06-27 Thread David Miller
From: Sridhar Samudrala <[EMAIL PROTECTED]>
Date: Thu, 22 Jun 2006 10:38:17 -0700

> On Thu, 2006-06-22 at 10:50 +1000, Herbert Xu wrote:
> > Sridhar Samudrala <[EMAIL PROTECTED]> wrote:
> > >> 
> > >> What about using the same fields (rqueue/wqueue) as you did for /proc?
> > > 
> > > I meant extending tcp_info structure to add new fields. I think the user
> > > space also uses this structure.
> > 
> > What about putting it into inet_idiag_msg.idiag_[rw]queue instead?
> 
> OK. I was under the mistaken assumption that [rw]queue fields are exported
> via tcp_info. This makes it pretty simple to support netlink users also. 
> Here is the updated patch.

This looks fine.  Applied, thanks a lot Sridhar.


Re: [PKT_SCHED]: PSCHED_TADD() and PSCHED_TADD2() can result,tv_usec >= 1000000

2006-06-27 Thread David Miller
From: Shuya MAEDA <[EMAIL PROTECTED]>
Date: Wed, 21 Jun 2006 09:16:03 +0900

> Thank you for the comment.
> I made the patch that used the loop instead of the divide and modulus.
> Are there any comments?

Your email client has corrupted the patch, turning tab characters
into spaces, and also turning lines containing only spaces into
empty lines.

Therefore, I cannot apply your patch, please send your patch properly
so that I may apply it.

Thank you.
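
For context, the patch itself is not shown in this part of the thread, so the following is only a sketch of the technique the quoted text describes: after adding microseconds to a timeval, carry whole seconds out of tv_usec with a subtraction loop instead of a divide and modulus. The demo_* names are hypothetical and this is not the PSCHED_TADD() macro.

```c
#include <assert.h>

#define DEMO_USEC_PER_SEC 1000000L

/* Hypothetical stand-in for struct timeval. */
struct demo_timeval {
	long tv_sec;
	long tv_usec;
};

/* Add delta_usec (>= 0) to tv, then normalize so that tv_usec ends up
 * below one second. The carry is done with a loop rather than with
 * a divide and modulus. */
static void demo_tadd(struct demo_timeval *tv, long delta_usec)
{
	assert(delta_usec >= 0);
	tv->tv_usec += delta_usec;
	while (tv->tv_usec >= DEMO_USEC_PER_SEC) {
		tv->tv_usec -= DEMO_USEC_PER_SEC;
		tv->tv_sec++;
	}
}
```

The loop is cheap when the added delta is small, since it then runs at most once or twice; a large delta makes it iterate once per whole second carried.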


Re: [NET]: Make illegal_highdma more anal

2006-06-27 Thread David Miller
From: Herbert Xu <[EMAIL PROTECTED]>
Date: Wed, 21 Jun 2006 09:49:38 +1000

> [NET]: Make illegal_highdma more anal
> 
> Rather than having illegal_highdma as a macro when HIGHMEM is off, we
> can turn it into an inline function that returns zero.  This will catch
> callers that give it bad arguments.
> 
> Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>

Looks sane, applied.
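
A small sketch of why the inline-function form catches more than the macro form. The struct and function names below are hypothetical stand-ins rather than the kernel definitions; the point is only that a macro discards its arguments before the compiler sees them, while a function prototype forces them to be type-checked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, nearly empty stand-ins for the kernel types. */
struct demo_net_device { int unused; };
struct demo_sk_buff { int unused; };

/* Macro form: the arguments vanish during preprocessing, so the compiler
 * never checks their types. */
#define demo_illegal_highdma_macro(dev, skb) (0)

/* Inline-function form: still a constant zero, but every caller's
 * arguments are checked against the prototype. */
static inline int demo_illegal_highdma(struct demo_net_device *dev,
				       struct demo_sk_buff *skb)
{
	(void)dev;
	(void)skb;
	return 0;
}

/* Self-check: both forms evaluate to zero for valid arguments. */
static int demo_forms_agree(void)
{
	struct demo_net_device dev = { 0 };
	struct demo_sk_buff skb = { 0 };

	assert(demo_illegal_highdma_macro(&dev, &skb) == 0);
	return demo_illegal_highdma(&dev, &skb) == 0;
}
```

With the macro, a call such as demo_illegal_highdma_macro("oops", 42) compiles silently; passing the same arguments to the inline function draws a compiler diagnostic.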


Re: [PATCH Round 2 0/2][RFC] Network Event Notifier Mechanism

2006-06-27 Thread Steve Wise
On Tue, 2006-06-27 at 13:21 -0700, David Miller wrote:
> From: Steve Wise <[EMAIL PROTECTED]>
> Date: Tue, 27 Jun 2006 15:19:08 -0500
> 
> > For an RDMA NIC, all this logic is in HW, which is why we need the event
> > notification; to tell the HW to change its next hop information.
> 
> Back to the route change notification, I still think you can
> get what you need by just looking for the route delete.
> 
> You can match if any RDMA connection is using the deleted
> route, mark it "update pending" or something like that,
> and when you get the "new route" event you can walk the
> "pending" list and try to relookup the route for those
> connections.

From my experimentation with netlink, RTM_NEWROUTE and RTM_DELROUTE
messages do not get sent up for redirect events.  I have, in fact, added
this with the new patch I'll send out soon.  So either way I need to
change the IPv[46] code to generate a notification for redirects.  With
the single NETEVENT_REDIRECT call, the RDMA driver can, in one sweep,
update all the connections.  It seems more efficient.  At the place
where I've hooked redirect, both the old route and the new route are
already created.
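
To make the "one sweep" point concrete, here is a rough user-space sketch of what a handler for a single redirect event could look like when it is handed both routes at once. All demo_* types are hypothetical; a real driver would walk its own connection table under its own locking and would also have to push the new next hop to the hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: a route cache entry and an offloaded connection. */
struct demo_dst { int id; };
struct demo_conn { struct demo_dst *dst; };

/* Payload of the redirect event: both routes already exist when it fires. */
struct demo_redirect {
	struct demo_dst *old_dst;
	struct demo_dst *new_dst;
};

/* Repoint every connection that uses r->old_dst at r->new_dst in one pass.
 * Returns the number of connections updated. */
static size_t demo_handle_redirect(struct demo_conn *conns, size_t n,
				   const struct demo_redirect *r)
{
	size_t i, updated = 0;

	assert(r != NULL && r->old_dst != NULL && r->new_dst != NULL);
	for (i = 0; i < n; i++) {
		if (conns[i].dst == r->old_dst) {
			conns[i].dst = r->new_dst;
			updated++;
		}
	}
	return updated;
}
```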


Steve.



RE: [PATCH Round 2 0/2][RFC] Network Event Notifier Mechanism

2006-06-27 Thread Caitlin Bestler
[EMAIL PROTECTED] wrote:
> From: Steve Wise <[EMAIL PROTECTED]>
> Date: Tue, 27 Jun 2006 10:02:19 -0500
> 
>> For the RDMA kernel subsystem, however, we still need a specific
>> event. We need both the old and new dst_entry struct ptrs to figure
>> out which active connections were using the old dst_entry and should
>> be updated to use the new dst_entry.
> 
> This change isn't truly atomic from a kernel standpoint either.
> 
> The new dst won't be selected by the socket until later, when
> the socket tries to send something, notices the old dst is
> obsolete, and looks up a new one.
> 
> Your code could do the same thing.

The request to "send something" is posted directly from user
mode to a mapped memory ring that is reaped by the hardware.
Having the hardware fault, report that fault, and wait for
the host to update it with the new mapping is somewhat clumsy.
It also won't work at all for existing hardware.

The best you could do is to have the driver invalidate the old
entry, then *presume* that the hardware will want the replacement
and look that up, and then forward that answer to the hardware.



Re: [NET]: Added GSO header verification

2006-06-27 Thread David Miller
From: Herbert Xu <[EMAIL PROTECTED]>
Date: Tue, 27 Jun 2006 22:07:14 +1000

> This feature is only needed by Xen but most of the code here is useful
> for other things like TCPv4 ECN support.
> 
> [NET]: Added GSO header verification

Looks sane, applied.

Thanks Herbert.


Re: [PATCHSET] Towards accurate incoming interface information

2006-06-27 Thread David Miller
From: Thomas Graf <[EMAIL PROTECTED]>
Date: Tue, 27 Jun 2006 17:07:27 +0200

> * Thomas Graf <[EMAIL PROTECTED]> 2006-06-26 16:54
> > This patchset transforms skb->input_dev based on a device
> > reference to skb->iif based on an interface index moving
> > towards accurate iif information for routing and classification
> > through the following changesets:
> 
> Hold on with this, I haven't noticed this ifb device
> go in and thus missed to update it. I'll post an
> updated patch shortly

Ok.


Re: [patch 1/1] netlink: encapsulate eff_cap usage within security framework

2006-06-27 Thread David Miller
From: Stephen Smalley <[EMAIL PROTECTED]>
Date: Mon, 26 Jun 2006 13:19:05 -0400

> This patch encapsulates the usage of eff_cap (in netlink_skb_params) within
> the security framework by extending security_netlink_recv to include a required
> capability parameter and converting all direct usage of eff_caps outside
> of the lsm modules to use the interface.  It also updates the SELinux
> implementation of the security_netlink_send and security_netlink_recv
> hooks to take advantage of the sid in the netlink_skb_params struct.
> This also enables SELinux to perform auditing of netlink capability checks.
> Please apply, for 2.6.18 if possible.
> 
> Signed-off-by: Darrel Goeddel <[EMAIL PROTECTED]>
> Signed-off-by: Stephen Smalley <[EMAIL PROTECTED]>
> Acked-by:  James Morris <[EMAIL PROTECTED]>

Applied, thanks a lot.


Re: [PATCH Round 2 0/2][RFC] Network Event Notifier Mechanism

2006-06-27 Thread David Miller
From: Steve Wise <[EMAIL PROTECTED]>
Date: Tue, 27 Jun 2006 10:02:19 -0500

> For the RDMA kernel subsystem, however, we still need a specific event.
> We need both the old and new dst_entry struct ptrs to figure out which
> active connections were using the old dst_entry and should be updated to
> use the new dst_entry.

This change isn't truly atomic from a kernel standpoint either.

The new dst won't be selected by the socket until later,
when the socket tries to send something, notices the old dst
is obsolete, and looks up a new one.

Your code could do the same thing.


Re: Please pull 'upstream' branch of wireless-2.6

2006-06-27 Thread Michael Buesch
On Tuesday 27 June 2006 22:06, Larry Finger wrote:
> John,
> 
> I would like to find a diplomatic solution to this impasse between Michael and Jeff, which is why
> I'm writing to you privately. Michael is correct in that the loop in question will not usually delay

private?

> long; however, on some hardware it takes longer than on his. On mine, I have seen delays as long as
> 550 usec.

What's the chip?

> In any case, I think that the following code fragment would work and pass Jeff's criticism:
> 
> for (i=5000; i; i--) {
> 	..
> 	usleep(1);

usleep? Can't find that in my kernel tree.
In fact, I think the lowest possible sleep time
depends on HZ and is 1msec on 1000HZ.

Additionally, we are holding a spinlock at this time, so it is
not as easy as simply replacing udelay() by some sleeping function.

> This would make the worst-case delay be 5 msec, but would provide a cushion of 10X the longest I
> have seen and should be safe.
> 
> Do you have any suggestions on what should be done next?

Leave it as is and find out why it takes so long for your strange card. ;)

-- 
Greetings Michael.


Re: [PATCH Round 2 0/2][RFC] Network Event Notifier Mechanism

2006-06-27 Thread Steve Wise
On Tue, 2006-06-27 at 13:14 -0700, David Miller wrote:
> This change isn't truly atomic from a kernel standpoint either.
> 
> The new dst won't be selected by the socket until later,
> when the socket tries to send something, notices the old dst
> is obsolete, and looks up a new one.
> 
> Your code could do the same thing.
> 

For an RDMA NIC, all this logic is in HW, which is why we need the event
notification; to tell the HW to change its next hop information.





Re: [PATCH Round 2 0/2][RFC] Network Event Notifier Mechanism

2006-06-27 Thread David Miller
From: Steve Wise <[EMAIL PROTECTED]>
Date: Tue, 27 Jun 2006 15:19:08 -0500

> For an RDMA NIC, all this logic is in HW, which is why we need the event
> notification; to tell the HW to change its next hop information.

Back to the route change notification, I still think you can
get what you need by just looking for the route delete.

You can match if any RDMA connection is using the deleted
route, mark it "update pending" or something like that,
and when you get the "new route" event you can walk the
"pending" list and try to relookup the route for those
connections.
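
A rough user-space sketch of this two-phase idea, for comparison with a single combined event. Everything named demo_* is hypothetical, including the lookup callback, which stands in for whatever route lookup the driver would really perform.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a route and an RDMA connection. */
struct demo_route { int id; };
struct demo_rconn {
	struct demo_route *route;
	int pending;	/* set while the connection waits for a new route */
	int dest;	/* hypothetical destination key for the re-lookup */
};

typedef struct demo_route *(*demo_lookup_fn)(int dest);

/* Phase 1, on "route deleted": flag every connection using that route.
 * Returns the number of connections marked pending. */
static size_t demo_on_route_del(struct demo_rconn *conns, size_t n,
				const struct demo_route *deleted)
{
	size_t i, marked = 0;

	assert(deleted != NULL);
	for (i = 0; i < n; i++) {
		if (conns[i].route == deleted) {
			conns[i].route = NULL;
			conns[i].pending = 1;
			marked++;
		}
	}
	return marked;
}

/* Phase 2, on "new route": walk the pending connections and look each
 * destination up again. Returns the number of connections resolved. */
static size_t demo_on_route_new(struct demo_rconn *conns, size_t n,
				demo_lookup_fn lookup)
{
	size_t i, resolved = 0;

	for (i = 0; i < n; i++) {
		struct demo_route *r;

		if (!conns[i].pending)
			continue;
		r = lookup(conns[i].dest);
		if (r) {
			conns[i].route = r;
			conns[i].pending = 0;
			resolved++;
		}
	}
	return resolved;
}

/* Example lookup for experimentation: only even destinations resolve. */
static struct demo_route demo_replacement = { 2 };

static struct demo_route *demo_lookup_even(int dest)
{
	return (dest % 2 == 0) ? &demo_replacement : NULL;
}
```

Compared with a single event carrying both routes, this version needs per-connection pending state and a second lookup, and a connection stays unresolved until a matching "new route" event arrives.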


Re: Please pull 'upstream' branch of wireless-2.6

2006-06-27 Thread Larry Finger

Michael Buesch wrote:
> On Tuesday 27 June 2006 21:33, John W. Linville wrote:
> > On Tue, Jun 27, 2006 at 06:31:01PM +0200, Michael Buesch wrote:
> > > On Tuesday 27 June 2006 18:12, Jeff Garzik wrote:
> > > > Michael Buesch wrote:
> > > > > So, I will submit a patch to lower the udelay(10) to udelay(1)
> > > > > and we can close the discussion? ;)
> > > >
> > > > No, that totally avoids my point.  Your "otherwise idle machine" test is
> > > > probably nowhere near worst case in the field, for loops that can
> > > > potentially lock the CPU for a long time upon hardware fault.  And then
> > > > there are the huge delays in specific functions that I pointed out...
> > >
> > > wtf are you requesting from me?
> > > 1) I proved you that the loop does only spin _once_ or even _less_.
> > > 2) If the hardware is faulty, the user must replace it.
> > >    Because, if the hardware is faulty, it can crash the whole
> > >    machine anyway, obviously.
> > >
> > > 3) There is no "huge delay". I proved it with my logs.
> > >    -> No CPU hog => Nothing to fix.
> >
> > Michael,
> >
> > I think Jeff's concern is that by using udelay you are busy-waiting.
> > And, the for loop limit of 10 means you could freeze the kernel
> > for up to a whole second.  Granted that this won't happen very often
>
> s/very often/ever/
>
> It won't happen, as long as the driver is not buggy, or the device
> is hardware broken. So, if it happens, something has to be fixed.
> In fact, it did happen _never_ for me.
> If it triggers, the device does not work _at all_ anyway.
>
> > and in the grand scheme of things a second isn't all _that_ long,
> > but still it would be better to avoid a delay like that -- a second
> > could be the time it takes to avoid a meltdown at the nuclear power
> > plant. :-)
> >
> > Could you not use msleep instead of udelay (and scale the for loop
> > appropriately)?  What would be the problem with that?  It would get
> > rid of the busy waiting.
>
> Because it horribly _increases_ the delay.
> We "spin" for _at most_ 10 usecs here. Please always remember that.
> We are talking about a 10 usec delay here. And I already sent a
> patch to even reduce this to under 10 usec.
>
> > To be fair, this code was already in the driver and was only being
> > moved by this patch.  Still, what better time to fix it than now? :-)
>
> If it ain't broken, don't fix it.
>
> > I'll go ahead and reshuffle wireless-2.6 to drop this patch.  A new
> > patch that passes muster w/ Jeff will be most welcome! :-)
>
> A new patch won't appear, as there is no problem with this
> delay.
> Please don't drop anything and apply the following patch on top
> of it:

John,

I would like to find a diplomatic solution to this impasse between Michael and Jeff, which is why
I'm writing to you privately. Michael is correct in that the loop in question will not usually delay
long; however, on some hardware it takes longer than on his. On mine, I have seen delays as long as
550 usec. In any case, I think that the following code fragment would work and pass Jeff's criticism:

for (i=5000; i; i--) {
	..
	usleep(1);
}

This would make the worst-case delay be 5 msec, but would provide a cushion of 10X the longest I
have seen and should be safe.

Do you have any suggestions on what should be done next?

Larry


Re: Please pull 'upstream' branch of wireless-2.6

2006-06-27 Thread Michael Buesch
On Tuesday 27 June 2006 21:33, John W. Linville wrote:
> On Tue, Jun 27, 2006 at 06:31:01PM +0200, Michael Buesch wrote:
> > On Tuesday 27 June 2006 18:12, Jeff Garzik wrote:
> > > Michael Buesch wrote:
> > > > So, I will submit a patch to lower the udelay(10) to udelay(1)
> > > > and we can close the discussion? ;)
> > > 
> > > No, that totally avoids my point.  Your "otherwise idle machine" test is 
> > > probably nowhere near worst case in the field, for loops that can 
> > > potentially lock the CPU for a long time upon hardware fault.  And then 
> > > there are the huge delays in specific functions that I pointed out...
> > 
> > wtf are you requesting from me?
> > 1) I proved you that the loop does only spin _once_ or even _less_.
> > 2) If the hardware is faulty, the user must replace it.
> >Because, if the hardware is faulty, it can crash the whole
> >machine anyway, obviously.
> > 
> > 3) There is no "huge delay". I proved it with my logs.
> >-> No CPU hog => Nothing to fix.
> 
> Michael,
> 
> I think Jeff's concern is that by using udelay you are busy-waiting.
> And, the for loop limit of 10 means you could freeze the kernel
> for up to a whole second.  Granted that this won't happen very often

s/very often/ever/

It won't happen, as long as the driver is not buggy, or the device
is hardware broken. So, if it happens, something has to be fixed.
In fact, it did happen _never_ for me.
If it triggers, the device does not work _at all_ anyway.

> and in the grand scheme of things a second isn't all _that_ long,
> but still it would be better to avoid a delay like that -- a second
> could be the time it takes to avoid a meltdown at the nuclear power
> plant. :-)
> 
> Could you not use msleep instead of udelay (and scale the for loop
> appropriately)?  What would be the problem with that?  It would get
> rid of the busy waiting.

Because it horribly _increases_ the delay.
We "spin" for _at most_ 10 usecs here. Please always remember that.
We are talking about a 10 usec delay here. And I already sent a
patch to even reduce this to under 10 usec.

> To be fair, this code was already in the driver and was only being
> moved by this patch.  Still, what better time to fix it than now? :-)

If it ain't broken, don't fix it.

> I'll go ahead and reshuffle wireless-2.6 to drop this patch.  A new
> patch that passes muster w/ Jeff will be most welcome! :-)

A new patch won't appear, as there is no problem with this
delay.
Please don't drop anything and apply the following patch on top
of it:

--

Microoptimization:
This reduces the udelay in bcm43xx_mac_suspend.

Signed-off-by: Michael Buesch <[EMAIL PROTECTED]>

Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
===
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c   2006-06-27 17:47:24.0 +0200
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c2006-06-27 17:53:29.0 +0200
@@ -2328,7 +2328,7 @@
tmp = bcm43xx_read32(bcm, BCM43xx_MMIO_GEN_IRQ_REASON);
if (tmp & BCM43xx_IRQ_READY)
goto out;
-   udelay(10);
+   udelay(1);
}
printkl(KERN_ERR PFX "MAC suspend failed\n");
}


-- 
Greetings Michael.


Re: [redhat-lspp] Re: [RFC 3/7] NetLabel: CIPSOv4 engine

2006-06-27 Thread Klaus Weidner
On Mon, Jun 26, 2006 at 08:33:57PM -0400, James Morris wrote:
> On Mon, 26 Jun 2006, Joe Nall wrote:
> > For all of the EAL4 LSPP Linux evaluation work is being done by Red
> > Hat/IBM/HP/atsec and others to be useful to integrators, there has to be basic
> > (e.g. CIPSO) multilevel network interoperability with existing multilevel
> > systems and good (e.g IPSec) multilevel networking between SELinux systems.
> 
> Just to be clear, my understanding is that the native xfrm labeling is 
> suitable for LSPP evaluation, as distinct from CIPSO being desired by 
> system integrators from an interoperability point of view.

It's not quite that distinct, the two solutions overlap in some areas but
neither can replace the other.

CIPSO would also be suitable for LSPP evaluation since it is capable of
exporting and importing labeled data. It requires a trusted network since
it doesn't encrypt or authenticate, so the evaluation would need to
restrict the environment accordingly.

The native IPSEC/xfrm approach is useful for more hostile environments
where you can't fully trust the network, but it's not interoperable with
existing deployed systems so it's not a replacement for CIPSO.

From an evaluation point of view, either CIPSO or IPSEC/xfrm would be
able to meet LSPP requirements but with different restrictions on the
environment.

-Klaus


Re: Please pull 'upstream' branch of wireless-2.6

2006-06-27 Thread John W. Linville
On Tue, Jun 27, 2006 at 06:31:01PM +0200, Michael Buesch wrote:
> On Tuesday 27 June 2006 18:12, Jeff Garzik wrote:
> > Michael Buesch wrote:
> > > So, I will submit a patch to lower the udelay(10) to udelay(1)
> > > and we can close the discussion? ;)
> > 
> > No, that totally avoids my point.  Your "otherwise idle machine" test is 
> > probably nowhere near worst case in the field, for loops that can 
> > potentially lock the CPU for a long time upon hardware fault.  And then 
> > there are the huge delays in specific functions that I pointed out...
> 
> wtf are you requesting from me?
> 1) I proved you that the loop does only spin _once_ or even _less_.
> 2) If the hardware is faulty, the user must replace it.
>Because, if the hardware is faulty, it can crash the whole
>machine anyway, obviously.
> 
> 3) There is no "huge delay". I proved it with my logs.
>-> No CPU hog => Nothing to fix.

Michael,

I think Jeff's concern is that by using udelay you are busy-waiting.
And, the for loop limit of 10 means you could freeze the kernel
for up to a whole second.  Granted that this won't happen very often
and in the grand scheme of things a second isn't all _that_ long,
but still it would be better to avoid a delay like that -- a second
could be the time it takes to avoid a meltdown at the nuclear power
plant. :-)

Could you not use msleep instead of udelay (and scale the for loop
appropriately)?  What would be the problem with that?  It would get
rid of the busy waiting.
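
As a neutral illustration of what is being traded off, here is a user-space sketch of a bounded polling loop whose delay primitive is passed in. The demo_* names are hypothetical. In the driver the delay hook would correspond to something like udelay() (busy-waits) or msleep() (sleeps); whether a sleeping call is permitted at that point depends on the locking context, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hooks: poll the device, and wait between polls. */
typedef int (*demo_ready_fn)(void *ctx);
typedef void (*demo_delay_fn)(unsigned int units);

/* Poll until ready() is true or max_polls attempts are spent.
 * Returns the number of delays taken before success, or -1 on timeout.
 * The worst-case wait is max_polls * delay_units, in whatever unit
 * delay() uses. */
static int demo_poll_ready(demo_ready_fn ready, void *ctx,
			   demo_delay_fn delay, unsigned int delay_units,
			   int max_polls)
{
	int i;

	assert(ready != NULL && delay != NULL && max_polls > 0);
	for (i = 0; i < max_polls; i++) {
		if (ready(ctx))
			return i;
		delay(delay_units);
	}
	return -1;
}

/* Example hooks for experimentation: the device becomes ready after
 * *ctx failed polls, and the fake delay only adds up the requested wait. */
static unsigned long demo_total_wait;

static int demo_ready_after(void *ctx)
{
	int *remaining = ctx;

	if (*remaining == 0)
		return 1;
	(*remaining)--;
	return 0;
}

static void demo_fake_delay(unsigned int units)
{
	demo_total_wait += units;
}
```

Scaling the loop "appropriately" means keeping max_polls * delay_units roughly constant when the delay unit changes; a coarser delay unit cuts the number of polls but raises the minimum time spent waiting whenever the first poll fails.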

To be fair, this code was already in the driver and was only being
moved by this patch.  Still, what better time to fix it than now? :-)

I'll go ahead and reshuffle wireless-2.6 to drop this patch.  A new
patch that passes muster w/ Jeff will be most welcome! :-)

Thanks,

John
-- 
John W. Linville
[EMAIL PROTECTED]


[PATCH] bcm43xx: opencoded locking

2006-06-27 Thread Michael Buesch
As many people don't seem to like the locking "obfuscation"
in the bcm43xx driver, this patch removes it.

Signed-off-by: Michael Buesch <[EMAIL PROTECTED]>

Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx.h
===
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx.h2006-06-27 17:47:24.0 +0200
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx.h 2006-06-27 20:44:27.0 +0200
@@ -647,6 +647,19 @@
 #define bcm43xx_status(bcm)atomic_read(&(bcm)->init_status)
 #define bcm43xx_set_status(bcm, stat)  atomic_set(&(bcm)->init_status, (stat))
 
+/**** THEORY OF LOCKING ***
+ *
+ * We have two different locks in the bcm43xx driver.
+ * => bcm->mutex:General sleeping mutex. Protects struct bcm43xx_private
+ *   and the device registers. This mutex does _not_ protect
+ *   against concurrency from the IRQ handler.
+ * => bcm->irq_lock: IRQ spinlock. Protects against IRQ handler concurrency.
+ *
+ * Please note that, if you only take the irq_lock, you are not protected
+ * against concurrency from the periodic work handlers.
+ * Most times you want to take _both_ locks.
+ */
+
 struct bcm43xx_private {
struct ieee80211_device *ieee;
struct ieee80211softmac_device *softmac;
@@ -657,7 +670,6 @@
 
void __iomem *mmio_addr;
 
-   /* Locking, see "theory of locking" text below. */
spinlock_t irq_lock;
struct mutex mutex;
 
@@ -689,6 +701,7 @@
struct bcm43xx_sprominfo sprom;
 #define BCM43xx_NR_LEDS4
struct bcm43xx_led leds[BCM43xx_NR_LEDS];
+   spinlock_t leds_lock;
 
/* The currently active core. */
struct bcm43xx_coreinfo *current_core;
@@ -759,55 +772,6 @@
 };
 
 
-/**** THEORY OF LOCKING ***
- *
- * We have two different locks in the bcm43xx driver.
- * => bcm->mutex:General sleeping mutex. Protects struct bcm43xx_private
- *   and the device registers.
- * => bcm->irq_lock: IRQ spinlock. Protects against IRQ handler concurrency.
- *
- * We have three types of helper function pairs to utilize these locks.
- * (Always use the helper functions.)
- * 1) bcm43xx_{un}lock_noirq():
- * Takes bcm->mutex. Does _not_ protect against IRQ concurrency,
- * so it is almost always unsafe, if device IRQs are enabled.
- * So only use this, if device IRQs are masked.
- * Locking may sleep.
- * You can sleep within the critical section.
- * 2) bcm43xx_{un}lock_irqonly():
- * Takes bcm->irq_lock. Does _not_ protect against
- * bcm43xx_lock_noirq() critical sections.
- * Does only protect against the IRQ handler path and other
- * irqonly() critical sections.
- * Locking does not sleep.
- * You must not sleep within the critical section.
- * 3) bcm43xx_{un}lock_irqsafe():
- * This is the cummulative lock and takes both, mutex and irq_lock.
- * Protects against noirq() and irqonly() critical sections (and
- * the IRQ handler path).
- * Locking may sleep.
- * You must not sleep within the critical section.
- */
-
-/* Lock type 1 */
-#define bcm43xx_lock_noirq(bcm)mutex_lock(&(bcm)->mutex)
-#define bcm43xx_unlock_noirq(bcm)  mutex_unlock(&(bcm)->mutex)
-/* Lock type 2 */
-#define bcm43xx_lock_irqonly(bcm, flags)   \
-   spin_lock_irqsave(&(bcm)->irq_lock, flags)
-#define bcm43xx_unlock_irqonly(bcm, flags) \
-   spin_unlock_irqrestore(&(bcm)->irq_lock, flags)
-/* Lock type 3 */
-#define bcm43xx_lock_irqsafe(bcm, flags) do {  \
-   bcm43xx_lock_noirq(bcm);\
-   bcm43xx_lock_irqonly(bcm, flags);   \
-   } while (0)
-#define bcm43xx_unlock_irqsafe(bcm, flags) do {\
-   bcm43xx_unlock_irqonly(bcm, flags); \
-   bcm43xx_unlock_noirq(bcm);  \
-   } while (0)
-
-
 static inline
 struct bcm43xx_private * bcm43xx_priv(struct net_device *dev)
 {
Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_debugfs.c
===
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_debugfs.c 2006-06-24 22:13:44.0 +0200
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_debugfs.c 2006-06-27 20:44:27.0 +0200
@@ -77,7 +77,8 @@
 
down(&big_buffer_sem);
 
-   bcm43xx_lock_irqsafe(bcm, flags);
+   mutex_lock(&bcm->mutex);
+   spin_lock_irqsave(&bcm->irq_lock, flags);
if (bcm43xx_status(bcm) != BCM43xx_STAT_INITIALIZED) {
fappend("Board not initialized.\n");
goto out;
@@ -121,7 +122,8 @@
fappend("\n");
 
 out:
-   bcm43xx_unlock_irqsafe(bcm, flags);
+   spin_unlock_irqrestore(&bcm->irq_lock, flags);
+   mutex_unlock(&bcm->mutex);
res = simple_read_from_buffer(userbuf, count, ppos, buf, pos);
up(&big_buffer_sem);
return res;
@@ -1
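
The open-coded pattern in the hunks above (sleeping mutex first, then the IRQ lock, released in reverse order) can be sketched in user space roughly as follows.  This is only an analogy under stated assumptions: struct dev_state and dev_update() are hypothetical, and a second pthread mutex stands in for the driver's spinlock, since user space has no equivalent of spin_lock_irqsave().

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bcm43xx_private. */
struct dev_state {
    pthread_mutex_t mutex;     /* plays the role of bcm->mutex */
    pthread_mutex_t irq_lock;  /* plays the role of bcm->irq_lock (a spinlock in the driver) */
    int initialized;
    int counter;
};

/*
 * Take both locks in a fixed order: the sleeping lock first, then the
 * "IRQ" lock.  Release them in the reverse order on every exit path.
 * Returns 0 on success, -1 if the device is not initialized.
 */
static int dev_update(struct dev_state *d, int delta)
{
    int ret = -1;

    assert(d != NULL);
    pthread_mutex_lock(&d->mutex);
    pthread_mutex_lock(&d->irq_lock);
    if (!d->initialized)
        goto out;              /* same single-exit shape as the debugfs hunk */
    d->counter += delta;
    ret = 0;
out:
    pthread_mutex_unlock(&d->irq_lock);
    pthread_mutex_unlock(&d->mutex);
    return ret;
}
```

Keeping one acquisition order everywhere is what avoids lock-order inversions once the helper macros are gone and each call site spells out both locks.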

Re: tg3 driver and interrupt coalescence questions

2006-06-27 Thread Chris A. Icide
Rick Jones wrote:
>
> Are you looking to increase or decrease the settings?  I would think
> (initially at least) that for VOIP one might not want to increase them.
>
> rick jones
I'm looking to decrease the interrupt load on the system.  During the
test I mentioned above I had some interesting and confusing results. 
The changes from the default settings to the settings I posted resulted
in a 100% performance increase (counted by the number of VoIP audio
streams the tested server could support).  With default settings one of
the two CPUs in the system maxed out at 99% cpu usage handling
interrupts, while the second CPU was not maxed out, but we started to
drop packets and the VoIP call setups started showing retransmits (which
is the measurement for failure in this test) at about 300 streams.  With
the new settings we were able to hit 600 streams.

So I definitely saw a significant improvement.  However, I'd still
like to get more improvement.  At 600 streams and 20ms packets we are
looking at 30,000 pps.  The % of cpu (1 CPU as apparently the interrupts
can't be shared across multiple CPUs) used for interrupt handling at
this 600 stream limit was 88.0%.

Now what was interesting was the test generation side of things (exactly
the same hardware).  I was using the SIPP software to generate the VoIP
streams, and each blade in the blade server was only able to generate
~200 streams; with default settings in ethtool, one of the CPUs would
hit max usage for interrupt handling at that point.  So I modified the
ethtool settings to match those I listed above and there was no
discernible difference.  It was identical performance to the default
settings.

Michael's response clarified for me what the actual parameters in the -C
section of ethtool do, thanks Michael.  However, I'll be greatly
appreciative of any recommendations anyone might have for interrupt
mitigation settings for 100% UDP RTP traffic of 20ms packets (50 pps per
stream).

-Chris



Re: tg3 driver and interrupt coalescence questions

2006-06-27 Thread Rick Jones

Chris A. Icide wrote:
> I've been digging around trying to get some information on the
> current status of interrupt mitigation features for a Broadcom 5704 interface.
>
> Specifically I'm sending and receiving lots of VoIP packets (50 pps
> per stream, many streams).
>
> What I can't seem to determine is this:
>
> What version of the linux kernel & tg3 drivers are required to
> support both rx and tx mitigation?
> What do the ethtool coalescence settings actually do (I've not been

Delay interrupts and increase individual packet latency, with the
intention of decreasing CPU utilization and allowing a higher
aggregate packet per second limit.  I.e. bandwidth vs latency tradeoffs.

> able to find actual descriptions of the different parameters in the -C
> section)
> Is there anything special that needs to be done when compiling a
> kernel to enable this feature for both the kernel and the tg3 driver.

Are you looking to increase or decrease the settings?  I would think
(initially at least) that for VOIP one might not want to increase them.

rick jones


Re: [PATCH]: e1000: Janitor: Use #defined values for literals

2006-06-27 Thread Auke Kok

Linas Vepstas wrote:
> On Fri, Jun 23, 2006 at 01:07:21PM -0700, Auke Kok wrote:
>> Linas Vepstas wrote:
>>> Minor janitorial patch: use #defines for literal values.
>>> +   pci_enable_wake(pdev, PCI_D3hot, 0);
>>> +   pci_enable_wake(pdev, PCI_D3cold, 0);
>>
>> I Acked this but that's silly - the patches sent yesterday already change
>> the code above and this patch is no longer needed (thanks Jesse for
>> spotting this).
>>
>> This patch would conflict with them so please don't apply.
>
> Maybe there's a backlog in the queue, but I note this is not
> yet in 2.6.17-mm3

It's part of the submission for 2.6.18 I sent to jgarzik on Friday, which
cleans up this section along the way.


Auke


Re: Network namespaces a path to mergable code.

2006-06-27 Thread Andrey Savochkin
Eric,

On Tue, Jun 27, 2006 at 11:20:40AM -0600, Eric W. Biederman wrote:
> 
> Thinking about this I am going to suggest a slightly different direction
> to get a patchset we can merge.
> 
> First we concentrate on the fundamentals.
> - How we mark a device as belonging to a specific network namespace.
> - How we mark a socket as belonging to a specific network namespace.

I agree with the direction of your thoughts.
I was trying to do a similar thing, define clear steps in network
namespace merging.

My first patchset covers devices but not sockets.
The only difference from what you're suggesting is ipv4 routing.
For me, it is no less important than devices and sockets.  Maybe even
more important, since routing exposes design deficiencies that are less
obvious at the socket level.

> 
> As part of the fundamentals we add a patch to the generic socket code
> that by default will disable it for protocol families that do not indicate
> support for handling network namespaces, on a non-default network namespace.

Fine

Can you summarize your objections against my way of handling devices, please?
And what was the typo you referred to in your letter to Kirill Korotaev?

Regards
Andrey


Re: tg3 driver and interrupt coalescence questions

2006-06-27 Thread Michael Chan
On Tue, 2006-06-27 at 10:16 -0700, Chris A. Icide wrote:

> What version of the linux kernel & tg3 drivers are required to support both 
> rx and tx mitigation?

ethtool -C for tg3 was added around July of 2005. The version with this
change added was 3.33.

> What do the ethtool coalescence settings actually do (I've not been able to 
> find actual descriptions of the different parameters in the -C section)

They set the delay between the tx and rx events and the generation of
interrupts for those events.

These are the only parameters that are relevant for tg3:

rx-frames[-irq]
rx-usecs[-irq]
tx-frames[-irq]
tx-usecs[-irq]

The frames parameters specify how many packets are received/transmitted
before generating an interrupt.  The usecs parameters specify how many
microseconds after at least 1 packet is received/transmitted before
generating an interrupt.  The [-irq] parameters are the corresponding
delays in updating the status when the interrupt is disabled.

> Is there anything special that needs to be done when compiling a kernel to 
> enable this feature for both the kernel and the tg3 driver.

No.

> 05:01.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704S
> Gigabit Ethernet (rev 10)
> Subsystem: IBM: Unknown device 02e8
> Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 201
> Memory at dcfe (64-bit, non-prefetchable) [size=64K]
> Capabilities: [40] PCI-X non-bridge device.
> Capabilities: [48] Power Management version 2
> Capabilities: [50] Vital Product Data
> Capabilities: [58] Message Signalled Interrupts: 64bit+
> Queue=0/3 Enable-
> 
> Linux version 2.6.9-34.ELsmp ([EMAIL PROTECTED]) (gcc version
> 3.4.5 20051201 (Red Hat 3.4.5-2)) #1 SMP Thu Mar 9 06:23:23 GMT 2006
> 
> [EMAIL PROTECTED] ~]# ethtool -c eth1
> Coalesce parameters for eth1:
> Adaptive RX: off  TX: off
> stats-block-usecs: 100
> sample-interval: 0
> pkt-rate-low: 0
> pkt-rate-high: 0
> 
> rx-usecs: 500
> rx-frames: 30
> rx-usecs-irq: 500
> rx-frames-irq: 20
> 

This means that the first interrupt will be generated after 30 packets
are received or 500 microseconds after the nth packet is received (1 <=
n < 30). When irq is disabled, 20 packets instead of 30 before updating
status.

> tx-usecs: 400
> tx-frames: 53
> tx-usecs-irq: 490
> tx-frames-irq: 5

The first tx interrupt will be generated after 53 packets are
transmitted or 400 microseconds after the nth packet is transmitted (1
<= n < 53). When irq is disabled, 5 packets or 490 microseconds before
updating status.

If the condition for generating a tx or rx interrupt is met, you get all
the accumulated tx and rx status during the interrupt.
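
As a back-of-the-envelope illustration of the frames/usecs interaction described above, here is a small sketch.  It assumes a simplified model that is not taken from the tg3 driver or hardware documentation: packets arrive evenly spaced, the usecs timer starts at the first packet of a batch, and the interrupt fires at whichever limit is hit first.  The function names are hypothetical.

```c
#include <assert.h>

/*
 * Packets delivered per interrupt for evenly spaced traffic.
 * Assumed model: the usecs timer starts at the first packet of a batch and
 * the interrupt fires at whichever limit (frames or usecs) is reached first.
 */
static unsigned int pkts_per_irq(unsigned int pps, unsigned int frames,
                                 unsigned int usecs)
{
    assert(pps > 0 && frames > 0);

    /* Packet k (k = 0, 1, ...) arrives at k * 1e6 / pps microseconds, so the
     * timer window holds the first packet plus usecs * pps / 1e6 more. */
    unsigned long long in_window =
        1ULL + (unsigned long long)usecs * pps / 1000000ULL;

    return in_window < frames ? (unsigned int)in_window : frames;
}

/* Rough interrupts per second under the same model. */
static unsigned int irqs_per_sec(unsigned int pps, unsigned int frames,
                                 unsigned int usecs)
{
    return pps / pkts_per_irq(pps, frames, usecs);
}
```

Under this model, rx-frames 30 and rx-usecs 500 at 30,000 packets per second give about 16 packets per interrupt, so the timer rather than the frame count would be the limiting parameter.  Real hardware may restart or combine its timers differently, so treat the numbers as an estimate only.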

Hope this helps.



Re: [PATCH]: e1000: Janitor: Use #defined values for literals

2006-06-27 Thread Linas Vepstas
On Fri, Jun 23, 2006 at 01:07:21PM -0700, Auke Kok wrote:
> Linas Vepstas wrote:
> >Minor janitorial patch: use #defines for literal values.
> >+pci_enable_wake(pdev, PCI_D3hot, 0);
> >+pci_enable_wake(pdev, PCI_D3cold, 0);
> 
> I Acked this but that's silly - the patches sent yesterday already change 
> the code above and this patch is no longer needed (thanks Jesse for 
> spotting this).
> 
> This patch would conflict with them so please don't apply.

Maybe there's a backlog in the queue, but I note this is not 
yet in 2.6.17-mm3 

--linas



Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL

2006-06-27 Thread Patrick McHardy
Russell Stuart wrote:
> Without seeing your actual proposal it is difficult to
> judge whether this is a reasonable trade-off or not.
> Hopefully we will see your code soon.  Do you have any
> idea when?

Probably not today, I'll try to get it into shape by tomorrow.


tg3 driver and interrupt coalescence questions

2006-06-27 Thread Chris A. Icide
I've been digging around trying to get some information on the current status 
of interrupt mitigation features for a Broadcom 5704 interface.

Specifically I'm sending and receiving lots of VoIP packets (50 pps per stream, 
many streams).

What I can't seem to determine is this:

What version of the linux kernel & tg3 drivers are required to support both rx 
and tx mitigation?
What do the ethtool coalescence settings actually do (I've not been able to 
find actual descriptions of the different parameters in the -C section)
Is there anything special that needs to be done when compiling a kernel to 
enable this feature for both the kernel and the tg3 driver.

Just a warning, I'm not a C coder, so I've not had much luck digging around the 
code and looking for answers.

I've currently got a blade server with 10 blades.  I'm using 9 blades to generate 
this small packet high rate traffic to the 10th blade and trying to improve the 
ability of a blade to handle VoIP traffic.  I made some guesses at settings for 
the -C options in ethtool on both the test blade and the traffic generators.  
Interestingly it seems to have had a very good effect on the test blade (%cpu 
for interrupt down from 99.9% to ~20%), but the same settings on the traffic 
generation servers seems to have had no effect.

Hardware is identical, kernel is identical.

Any help is GREATLY appreciated.

-Chris

05:01.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704S
Gigabit Ethernet (rev 10)
Subsystem: IBM: Unknown device 02e8
Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 201
Memory at dcfe (64-bit, non-prefetchable) [size=64K]
Capabilities: [40] PCI-X non-bridge device.
Capabilities: [48] Power Management version 2
Capabilities: [50] Vital Product Data
Capabilities: [58] Message Signalled Interrupts: 64bit+
Queue=0/3 Enable-

Linux version 2.6.9-34.ELsmp ([EMAIL PROTECTED]) (gcc version
3.4.5 20051201 (Red Hat 3.4.5-2)) #1 SMP Thu Mar 9 06:23:23 GMT 2006

[EMAIL PROTECTED] ~]# ethtool -c eth1
Coalesce parameters for eth1:
Adaptive RX: off  TX: off
stats-block-usecs: 100
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 500
rx-frames: 30
rx-usecs-irq: 500
rx-frames-irq: 20

tx-usecs: 400
tx-frames: 53
tx-usecs-irq: 490
tx-frames-irq: 5

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

[EMAIL PROTECTED] ~]# ethtool -i eth1
driver: tg3
version: 3.43-rh
firmware-version:
bus-info: :05:01.1

[EMAIL PROTECTED] ~]# ethtool eth1
Settings for eth1:
Supported ports: [ FIBRE ]
Supported link modes:   1000baseT/Half 1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes:  1000baseT/Half 1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: g
Wake-on: d
Current message level: 0x00ff (255)
Link detected: yes




Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Ben Greear

Eric W. Biederman wrote:
> Herbert Poetzl <[EMAIL PROTECTED]> writes:
>
>> On Tue, Jun 27, 2006 at 05:52:52AM -0600, Eric W. Biederman wrote:
>>
>>> Inside the containers I want all network devices named eth0!
>>
>> huh? even if there are two of them? also tun?
>>
>> I think you meant, you want to be able to have eth0 in
>> _more_ than one guest where eth0 in a guest can also
>> be/use/relate to eth1 on the host, right?
>
> Right I want to have an eth0 in each guest where eth0 is
> its own network device and need have no relationship to
> eth0 on the host.

How does that help anything?  Do you envision programs
that make special decisions on whether the interface is
called eth0 v/s eth151?

Ben


--
Ben Greear <[EMAIL PROTECTED]>
Candela Technologies Inc  http://www.candelatech.com



[RFC] Network namespaces a path to mergable code.

2006-06-27 Thread Eric W. Biederman

Thinking about this I am going to suggest a slightly different direction
to get a patchset we can merge.

First we concentrate on the fundamentals.
- How we mark a device as belonging to a specific network namespace.
- How we mark a socket as belonging to a specific network namespace.

As part of the fundamentals we add a patch to the generic socket code
that by default will disable it for protocol families that do not indicate
support for handling network namespaces, on a non-default network namespace.

I think that gives us a path that will allow us to convert the network stack
one protocol family at a time instead of in one big lump.
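
To make the idea concrete, here is a minimal user-space sketch of the default-deny check.  All names in it (struct net_ns, struct proto_family, netns_aware, sock_create_in_ns()) are hypothetical and are not taken from any patch; it only illustrates marking a socket with its namespace and refusing unconverted families outside the default namespace.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; none are taken from the patchset. */
struct net_ns {
    int id;                        /* id 0 = default namespace */
};

struct proto_family {
    int family;                    /* e.g. an AF_* number */
    int netns_aware;               /* set once the family has been converted */
};

struct sock_stub {
    const struct proto_family *pf;
    const struct net_ns *ns;       /* the namespace the socket is marked with */
};

/* Returns 0 and fills *sk on success, -1 if the family may not be used in ns. */
static int sock_create_in_ns(const struct proto_family *pf,
                             const struct net_ns *ns, struct sock_stub *sk)
{
    assert(pf != NULL && ns != NULL && sk != NULL);

    /* Unconverted families stay usable only in the default namespace. */
    if (ns->id != 0 && !pf->netns_aware)
        return -1;

    sk->pf = pf;
    sk->ns = ns;
    return 0;
}
```

With a check like this in the generic path, converting a protocol family amounts to doing the per-family work and then setting its flag, which is what would allow the conversion to proceed one family at a time.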

Stubbing off the sysfs and sysctl interfaces in the first round for the
non-default namespaces as you have done should be good enough.

The reason for the suggestion is that most of the work for the protocol
stacks ipv4 ipv6 af_packet af_unix is largely noise, and simple
replacement without real design work happening.  Mostly it is just
tweaking the code to remove global variables, and doing a couple
lookups.

Eric


Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Alexey Kuznetsov
On Tue, Jun 27, 2006 at 06:02:42PM +0200, Herbert Poetzl wrote:

>  - loopback traffic inside a guest is insignificantly
>    slower than on a normal system
> 
>  - loopback traffic on the host is insignificantly
>    slower than on a normal system
> 
>  - inter guest traffic is faster than on-wire traffic,
>    and should be within a small tolerance of the
>    loopback case (as it really isn't different)

I do not follow what you people are arguing about.

Intra-guest, guest-guest and host-guest paths have _no_ differences
from host-host loopback. Only the device is different:
* virtual loopback for intra-guest
* virtual interface for guest-guest and host-guest

But the work is exactly the same, only the place where packets are
looped back is different. How could this be an issue to break a lance over? :-)

Alexey


PS. The only thing I can imagine is the "optimized" out ip_route_input()
in the case of loopback. But this optimization was an obvious design mistake
(mine, sorry) and apparently will die together with the removal of the current
deficiencies of the routing cache. Actually, it is one of the deficiencies.


Re: [RFC][patch 1/4] Network namespaces: cleanup of dev_base list use

2006-06-27 Thread Eric W. Biederman
Kirill Korotaev <[EMAIL PROTECTED]> writes:

> This doesn't support anything. e.g. I caught quite a lot of bugs after Ingo
> Molnar, but this doesn't make his code "poor". People are people.
> Anyway, I would be happy to see the typo.

Look up thread.  You replied to the message where I commented on it.

There are two ways to argue this.
- It is the linux kernel development style to do small, simple,
  obviously correct patches that copy the maintainer of the code you are
  changing.
- Explain why that is the style.

The basic idea is that on a simple patch that is well described, it is
trivial to check and trivial to verify.

Eric


Re: Please pull 'upstream' branch of wireless-2.6

2006-06-27 Thread Joseph Jezak
> No, that totally avoids my point.  Your "otherwise idle machine" test is
> probably nowhere near worst case in the field, for loops that can
> potentially lock the CPU for a long time upon hardware fault.  And then
> there are the huge delays in specific functions that I pointed out...
> 
> Jeff

The problem is that these are the delays used in the original driver
that we've been writing the specs from.  We don't know what they're
for or why they're so long.  We don't know if reducing the delay
will cause issues on some hardware and work fine on others.  Without
the actual specs from Broadcom, it's hard to say what's excessive
and what's not and whether changing it will break the driver.

-Joe


Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Eric W. Biederman
Herbert Poetzl <[EMAIL PROTECTED]> writes:

> On Tue, Jun 27, 2006 at 05:52:52AM -0600, Eric W. Biederman wrote:
>> 
>> Inside the containers I want all network devices named eth0!
>
> huh? even if there are two of them? also tun?
>
> I think you meant, you want to be able to have eth0 in
> _more_ than one guest where eth0 in a guest can also
> be/use/relate to eth1 on the host, right?

Right I want to have an eth0 in each guest where eth0 is
its own network device and need have no relationship to
eth0 on the host.
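
A tiny sketch of what that implies for name lookup, assuming devices are simply tagged with a namespace id.  The structure and function names are hypothetical and only loosely modelled on the kernel's dev_get_by_name(); the point is that a name only has to be unique within one namespace.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical device record: the name is unique only within its namespace. */
struct net_device_stub {
    int ns_id;              /* namespace the device is marked as belonging to */
    const char *name;
};

/* Look a device up by (namespace, name); returns NULL if there is no match. */
static const struct net_device_stub *
dev_get_by_name_ns(const struct net_device_stub *devs, size_t n,
                   int ns_id, const char *name)
{
    assert(devs != NULL && name != NULL);

    for (size_t i = 0; i < n; i++)
        if (devs[i].ns_id == ns_id && strcmp(devs[i].name, name) == 0)
            return &devs[i];
    return NULL;
}
```

With the lookup keyed this way, every guest can have its own eth0 that is a distinct device from the host's eth0.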

>> We need a clean abstraction that optimizes well.
>> 
>> However local communication between containers is not what we
>> should benchmark. That can always be improved later. So long as
>> the performance is reasonable. What needs to be benchmarked is the
>> overhead of namespaces when connected to physical networking devices
>> and on their own local loopback, and comparing that to a kernel
>> without namespace support.
>
> well, for me (obviously advocating the lightweight case)
> it seems important that the following conditions are met:
>
>  - loopback traffic inside a guest is insignificantly
>    slower than on a normal system
>
>  - loopback traffic on the host is insignificantly
>    slower than on a normal system
>
>  - inter guest traffic is faster than on-wire traffic,
>    and should be within a small tolerance of the
>    loopback case (as it really isn't different)
>
>  - network (on-wire) traffic should be as fast as without
>    the namespace (i.e. within 1% or so, better not really
>    measurable)
>
>  - all this should be true in a setup with a significant
>    number of guests, when only one guest is active, but
>    all other guests are ready/configured
>
>  - all this should scale well with a few hundred guests

Ultimately I agree.  However, only host performance should be
a merge blocker, which allows us to go back later and reclaim the few
percentage points we lost.

>> If we don't hurt that core case we have an implementation we can
>> merge.  There are a lot of optimization opportunities for local
>> communications and we can do that after we have a correct and accepted
>> implementation.  Anything else is optimizing too soon, and will
>> just be muddying the waters.
>
> what I fear is that once something is in, the kernel will
> just become slower (as it already did in some areas) and
> nobody will care/be-able to fix that later on ...

If nobody cares it doesn't matter.

If no one can fix it that is a problem.  Which is why we need
high standards and clean code, not early optimizations.

But on that front each step of the way must be justified on
its own merits.  Not because it will give us some holy grail.

The way to keep the inter guest performance from degrading is
to measure it and complain.  But the linux network stack is too
big to get in one pass.

Eric


Re: [Patch 1/1] AF_UNIX Datagram getpeersec [Updated #2]

2006-06-27 Thread Xiaolan Zhang
Some more fixes:
 
> diff -purN -X dontdiff linux-2.6.o/net/unix/af_unix.c linux-2.6.
> w/net/unix/af_unix.c
> --- linux-2.6.o/net/unix/af_unix.c   2006-06-21 00:02:30.0 -0400
> +++ linux-2.6.w/net/unix/af_unix.c   2006-06-27 09:30:12.0 -0400
> @@ -128,6 +128,28 @@ static atomic_t unix_nr_socks = ATOMIC_I
> 
>  #define UNIX_ABSTRACT(sk)   (unix_sk(sk)->addr->hash != UNIX_HASH_SIZE)
> 
> +#ifdef CONFIG_SECURITY_NETWORK
> +static void unix_get_peersec_dgram(struct sk_buff *skb)
> +{

add int err;

> +   err = security_socket_getpeersec_dgram(skb, UNIXSECDATA(skb),
> +  UNIXSECLEN(skb));
> +   if (err)
> +  *(UNIXSEC(skb)) = NULL;

change to *(UNIXSECDATA(skb)) = NULL;

> +}
> +
> +static inline void unix_set_secdata(struct scm_cookie *scm, struct 
> sk_buff *skb)
> +{
> +   scm->secdata = *UNIXSECDATA(skb);
> +   scm->seclen = UNIXSECLEN(skb);

change to scm->seclen = *UNIXSECLEN(skb);

> +}
> +#else
> +static void unix_get_peersec_dgram(struct sk_buff *skb)
> +{ }
> +
> +static inline void unix_set_secdata(struct scm_cookie *scm, struct 
> sk_buff *skb)
> +{ }
> +#endif /* CONFIG_SECURITY_NETWORKING */
> +
>  /*
>   *  SMP locking strategy:
>   *hash table is protected with spinlock unix_table_lock
> @@ -1291,6 +1313,8 @@ static int unix_dgram_sendmsg(struct kio
> if (siocb->scm->fp)
>unix_attach_fds(siocb->scm, skb);
> 
> +   unix_get_peersec_dgram(skb);
> +
> skb->h.raw = skb->data;
> err = memcpy_fromiovec(skb_put(skb,len), msg->msg_iov, len);
> if (err)
> @@ -1570,6 +1594,7 @@ static int unix_dgram_recvmsg(struct kio
>memset(&tmp_scm, 0, sizeof(tmp_scm));
> }
> siocb->scm->creds = *UNIXCREDS(skb);
> +   unix_set_secdata(siocb->scm, skb);
> 
> if (!(flags & MSG_PEEK))
> {
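
Both fixes above come down to the accessors yielding pointers that must be dereferenced.  The sketch below shows that shape with hypothetical stand-ins: struct skb_stub, struct scm_stub, SECDATA() and SECLEN() are illustrative only and are not the definitions from the patch or from af_unix.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-packet storage; the real patch keeps this elsewhere. */
struct skb_stub {
    char *secdata;
    unsigned int seclen;
};

/* Both accessors return POINTERS to the stored fields (assumed layout). */
#define SECDATA(skb) (&(skb)->secdata)   /* type: char ** */
#define SECLEN(skb)  (&(skb)->seclen)    /* type: unsigned int * */

struct scm_stub {
    char *secdata;
    unsigned int seclen;
};

/* Copy the security data out of the packet into the control-message cookie. */
static void set_secdata(struct scm_stub *scm, struct skb_stub *skb)
{
    assert(scm != NULL && skb != NULL);

    scm->secdata = *SECDATA(skb);
    scm->seclen = *SECLEN(skb);   /* without '*' this would assign a pointer */
}

/* Error path: clear the stored pointer itself, hence the dereference. */
static void clear_secdata(struct skb_stub *skb)
{
    *(SECDATA(skb)) = NULL;
}
```

Because the accessors hand back addresses, they can also be passed directly as output parameters, which appears to be how the quoted patch calls security_socket_getpeersec_dgram().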



Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Eric W. Biederman
Herbert Poetzl <[EMAIL PROTECTED]> writes:

> On Tue, Jun 27, 2006 at 01:09:11PM +0400, Andrey Savochkin wrote:
>> 
>> I'd like to caution about over-optimizing communications between
>> different network namespaces. Many optimizations of local traffic
>> (such as high MTU) don't look so appealing when you start to think
>> about live migration of namespaces.
>
> I think the 'optimization' (or to be precise: desire
> not to sacrifice local/loopback traffic for some use
> case as you describe it) does not interfere with live
> migration at all, we still will have 'local' and 'remote'
> traffic, and personally I doubt that the live migration
> is a feature for the masses ...

Several things.
- The linux loopback device is not strongly optimized; it is a compatibility
  layer.
- Traffic between guests is an implementation detail.
  There is nothing fundamental in our semantics that says the traffic
  has to be slow for any workload (except for the limits imposed by using
  actual on-the-wire protocols).  The lo device shares the same problem.

Worrying about this case now, when it has clearly been shown that there are
several possible ways to optimize this and get back any lost local
performance, is optimizing way too early.

Criticize the per-namespace performance all you want.  That is pretty
much a merge blocker.  Unless we do worse than a 1-5% penalty, the communication
across namespaces is really a non-issue.

Even with your large communications flows between guests 1-5% is nothing.

Eric


Re: Please pull 'upstream' branch of wireless-2.6

2006-06-27 Thread Michael Buesch
On Tuesday 27 June 2006 18:12, Jeff Garzik wrote:
> Michael Buesch wrote:
> > So, I will submit a patch to lower the udelay(10) to udelay(1)
> > and we can close the discussion? ;)
> 
> No, that totally avoids my point.  Your "otherwise idle machine" test is 
> probably nowhere near worst case in the field, for loops that can 
> potentially lock the CPU for a long time upon hardware fault.  And then 
> there are the huge delays in specific functions that I pointed out...

wtf are you requesting from me?
1) I proved to you that the loop only spins _once_ or even _less_.
2) If the hardware is faulty, the user must replace it.
   Because, if the hardware is faulty, it can crash the whole
   machine anyway, obviously.

3) There is no "huge delay". I proved it with my logs.
   -> No CPU hog => Nothing to fix.

-- 
Greetings Michael.


Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Eric W. Biederman
Herbert Poetzl <[EMAIL PROTECTED]> writes:

> On Tue, Jun 27, 2006 at 01:54:51PM +0400, Kirill Korotaev wrote:
>> >>My point is that if you make namespace tagging at routing time, and
>> >>your packets are being routed only once, you lose the ability
>> >>to have separate routing tables in each namespace.
>> >
>> >
>> >Right. What is the advantage of having separate the routing tables ?
>
>> it is impossible to have bridged networking, tun/tap and many other 
>> features without it. I even doubt that it is possible to introduce 
>> private netfilter rules w/o virtualization of routing.
>
> why? iptables work quite fine on a typical linux
> system when you 'delegate' certain functionality
> to certain chains (i.e. doesn't require access to
> _all_ of them)
>
>> The question is do we want to have fully featured namespaces which
>> allow to create isolated virtual environments with semantics and
>> behaviour of standalone linux box or do we want to introduce some
>> hacks with new rules/restrictions to meet ones goals only?
>
> well, sometimes 'hacks' are not only simpler but also 
> a much better solution for a given problem than the
> straight forward approach ... 

Well I would like to see a hack that qualifies.  I watched the
linux-vserver irc channel for a while and almost every network problem
was caused by the change in semantics vserver provides.

In this case, when you allow a guest more than one IP, your hack, while
easy to maintain, becomes much more complex.  Especially as you address
each case people care about one at a time.

In one shot this goes the entire way.  Given how many people miss
that you do the work at layer 2 rather than at layer 3, I would not call
this the straight forward approach.  The straight forward implementation,
yes, but not the straight forward approach.
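
To make the quoted point about separate routing tables concrete, here is a
toy userspace model in C. It only assumes that each namespace owns its own
table, so the same destination can resolve to different devices depending on
which namespace does the lookup. The names (struct toy_netns,
toy_route_lookup, plen_mask) are invented for illustration and are not taken
from any of the patches discussed here.

```c
#include <stddef.h>
#include <stdint.h>

/* One routing entry; addresses are in host byte order for simplicity. */
struct route {
	uint32_t prefix;	/* network address */
	uint8_t plen;		/* prefix length in bits, 0..32 */
	int ifindex;		/* outgoing device (illustrative value) */
};

/* Hypothetical namespace: nothing more than a private routing table. */
struct toy_netns {
	const struct route *routes;
	size_t nroutes;
};

/* Netmask for a prefix length; plen 0 is special-cased to avoid a shift by 32. */
uint32_t plen_mask(uint8_t plen)
{
	return plen == 0 ? 0 : ~(uint32_t)0 << (32 - plen);
}

/* Longest-prefix match within one namespace; returns ifindex, or -1 if no route. */
int toy_route_lookup(const struct toy_netns *ns, uint32_t dst)
{
	int best = -1;
	int best_len = -1;

	for (size_t i = 0; i < ns->nroutes; i++) {
		const struct route *r = &ns->routes[i];

		if ((dst & plen_mask(r->plen)) == r->prefix && r->plen > best_len) {
			best = r->ifindex;
			best_len = r->plen;
		}
	}
	return best;
}
```

With one shared table there is a single answer per destination; with a table
per namespace, the lookup result depends on which toy_netns is passed in.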

> for example, you won't have multiple routing tables
> in a kernel where this feature is disabled, no?
> so why should it affect a guest, or require modified
> apps inside a guest when we would decide to provide
> only a single routing table?
>
>> From my POV, fully virtualized namespaces are the future. 
>
> the future is already there, it's called Xen or UML, or QEMU :)

Yep.  And now we need it to run fast.

>> It is what makes virtualization solution usable (w/o apps
>> modifications), provides all the features and doesn't require much
>> efforts from people to be used.
>
> and what if they want to use virtualization inside
> their guests? where do you draw the line?

The implementation doesn't have any problems with guests inside
of guests.

The only reason to restrict guests inside of guests is because
we aren't certain which permissions make sense.

Eric


Re: Please pull 'upstream' branch of wireless-2.6

2006-06-27 Thread Michael Buesch
On Tuesday 27 June 2006 18:10, Jeff Garzik wrote:
> Michael Buesch wrote:
> > On Tuesday 27 June 2006 16:11, Jeff Garzik wrote:
> >> Overall, bcm43xx is _really really bad_ about this sort of thing.  Just 
> >> grepping for udelay in bcm43xx_radio.c shows some of the worst 
> >> offenders.  bcm43xx_radio_init2060() and bcm43xx_radio_selectchannel() 
> >> both look like candidates for using msleep() rather than udelay().
> > 
> > This is _all_ at initialization time.
> > select_channel: How often do you select a channel?
> 
> That question is irrelevant, because you have no idea what -else- is 
> going on in the system, at the point when bcm43xx chooses to spin the 
> CPU heavily.
> 
> Initialization time means you are definitely not in a hot path, and can 
> therefore sleep.

Ok, again:
If you are running a preemptible kernel (I am doing a patch for the
non-preemptible case), everything is _already_ fine.
We are not spinning long times with locks held or IRQs disabled.
I already fixed that.

And no, I don't really care for initialization time.
I am not going to potentially break the driver to remove
1ms of wasted CPU on ifconfig up.
In fact, initialization is and always was done lockless.
So we should be fine there, too, actually.

We don't know why these delays are there at all. And we never will.
But as these are all measuring and calibration routines,
they surely have some purpose. We don't know if longer delays
in some places may have ill effects. Making the whole thing
preemptible (as I am doing / have done) surely has its potential
to break the driver.

I prefer correct operation over an unnoticeable 1ms CPU hog.

> > I recently reworked the periodically executed workhandlers,
> > so that they are preemptible.
> 
> Major classes of users run their kernels without preempt.  Please don't 
> depend on that to avoid bad behavior.

I am doing a patch atm.
I will add voluntary preemption points, if the kernel is not preemptible.
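
For readers following the udelay() versus msleep() argument, here is a rough
userspace analogy in C. It is not driver code. spin_delay_us() keeps the CPU
for the whole delay the way udelay() does, sleep_delay_ms() hands the CPU
back to the scheduler the way msleep() does, and calibrate_steps() puts
sched_yield() between short spins as a stand-in for a voluntary preemption
point. All function names are made up for illustration and none of this is
taken from bcm43xx.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sched.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy-wait: the caller holds the CPU for the whole delay (udelay-like). */
void spin_delay_us(unsigned int usecs)
{
	uint64_t end = now_ns() + (uint64_t)usecs * 1000ull;

	while (now_ns() < end)
		;
}

/* Sleeping delay: the CPU goes back to the scheduler (msleep-like). */
void sleep_delay_ms(unsigned int msecs)
{
	struct timespec req = {
		.tv_sec = msecs / 1000,
		.tv_nsec = (long)(msecs % 1000) * 1000000L,
	};

	/* If a signal interrupts the sleep, resume with the remaining time. */
	while (nanosleep(&req, &req) == -1 && errno == EINTR)
		;
}

/*
 * Calibration-style loop: every step still spins for its short delay,
 * but other runnable work gets a chance between steps.
 * Returns the number of steps performed.
 */
unsigned int calibrate_steps(unsigned int steps, unsigned int usecs_per_step)
{
	unsigned int done = 0;

	for (unsigned int i = 0; i < steps; i++) {
		spin_delay_us(usecs_per_step);
		sched_yield();	/* stand-in for a kernel preemption point */
		done++;
	}
	return done;
}
```

The third variant leaves the individual delays untouched and only adds points
where other work may run, which is roughly the trade-off argued for above.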

-- 
Greetings Michael.


Re: [Patch 1/1] AF_UNIX Datagram getpeersec [Updated #2]

2006-06-27 Thread Xiaolan Zhang
Hi,

Thanks for the updates.  I am testing the code now.  Some minor fixes (so 
far):

changed all

#ifdef CONFIG_SECURITY_NETWORKING

to

#ifdef CONFIG_SECURITY_NETWORK

cheers,
Catherine


James Morris <[EMAIL PROTECTED]> wrote on 06/27/2006 09:57:15 AM:

> On Tue, 27 Jun 2006, Stephen Smalley wrote:
> 
> > What about saving the u32 seclen with the secdata, and using it later
> > rather than recomputing strlen(secdata)?  That also avoids encoding an
> > assumption in the af_unix code about the content of the data (i.e.
> > NUL-terminated string), leaving that to the security module.
> 
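
A small userspace sketch of the suggestion quoted just above: keep the length
next to the data when it is produced and use that stored length later,
instead of calling strlen() on it. The struct and function names (sec_blob,
sec_blob_set, sec_blob_copy, sec_blob_free) are hypothetical and only
illustrate the idea; this is not the code from the patch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for per-message security data. */
struct sec_blob {
	char *secdata;		/* opaque bytes; the holder makes no NUL assumption */
	uint32_t seclen;	/* length recorded when the data was produced */
};

/* Store a private copy of data with its length; 0 on success, -1 on allocation failure. */
int sec_blob_set(struct sec_blob *b, const void *data, uint32_t len)
{
	char *copy = malloc(len ? len : 1);

	if (!copy)
		return -1;
	memcpy(copy, data, len);
	b->secdata = copy;
	b->seclen = len;
	return 0;
}

/* Copy out at most dstlen bytes using the stored length; returns bytes copied. */
uint32_t sec_blob_copy(const struct sec_blob *b, void *dst, uint32_t dstlen)
{
	uint32_t n = b->seclen < dstlen ? b->seclen : dstlen;

	memcpy(dst, b->secdata, n);
	return n;
}

/* Release the copy and reset the fields. */
void sec_blob_free(struct sec_blob *b)
{
	free(b->secdata);
	b->secdata = NULL;
	b->seclen = 0;
}
```

If the data happens to contain an embedded NUL byte, strlen() would report a
shorter length, while the stored seclen still covers every byte.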
> Ok, this and other issues are addressed in the patch below, which is now
> back to a single patch.
> 
> I also #ifdef'd the security fields in struct unix_skb_parms.
> 
> Please review and test.
> 
> ---
> 
>  include/asm-alpha/socket.h   |    1 +
>  include/asm-arm/socket.h     |    1 +
>  include/asm-arm26/socket.h   |    1 +
>  include/asm-cris/socket.h    |    1 +
>  include/asm-frv/socket.h     |    1 +
>  include/asm-h8300/socket.h   |    1 +
>  include/asm-i386/socket.h    |    1 +
>  include/asm-ia64/socket.h    |    1 +
>  include/asm-m32r/socket.h    |    1 +
>  include/asm-m68k/socket.h    |    1 +
>  include/asm-mips/socket.h    |    1 +
>  include/asm-parisc/socket.h  |    1 +
>  include/asm-powerpc/socket.h |    1 +
>  include/asm-s390/socket.h    |    1 +
>  include/asm-sh/socket.h      |    1 +
>  include/asm-sparc/socket.h   |    1 +
>  include/asm-sparc64/socket.h |    1 +
>  include/asm-v850/socket.h    |    1 +
>  include/asm-x86_64/socket.h  |    1 +
>  include/asm-xtensa/socket.h  |    1 +
>  include/linux/net.h          |    1 +
>  include/linux/selinux.h      |   15 +++
>  include/net/af_unix.h        |    7 +++
>  include/net/scm.h            |   17 +
>  net/core/sock.c              |   11 +++
>  net/unix/af_unix.c           |   25 +
>  security/selinux/exports.c   |   11 +++
>  security/selinux/hooks.c     |    8 +++-
>  28 files changed, 114 insertions(+), 1 deletion(-)
> 
> diff -purN -X dontdiff linux-2.6.o/include/asm-alpha/socket.h linux-2.6.w/include/asm-alpha/socket.h
> --- linux-2.6.o/include/asm-alpha/socket.h   2006-06-21 00:02:08.0 -0400
> +++ linux-2.6.w/include/asm-alpha/socket.h   2006-06-27 02:08:49.0 -0400
> @@ -51,6 +51,7 @@
>  #define SCM_TIMESTAMP  SO_TIMESTAMP
> 
>  #define SO_PEERSEC  30
> +#define SO_PASSSEC  34
> 
>  /* Security levels - as per NRL IPv6 - don't actually do anything */
>  #define SO_SECURITY_AUTHENTICATION  19
> diff -purN -X dontdiff linux-2.6.o/include/asm-arm/socket.h linux-2.6.w/include/asm-arm/socket.h
> --- linux-2.6.o/include/asm-arm/socket.h   2006-06-21 00:02:10.0 -0400
> +++ linux-2.6.w/include/asm-arm/socket.h   2006-06-27 02:08:49.0 -0400
> @@ -48,5 +48,6 @@
>  #define SO_ACCEPTCONN  30
> 
>  #define SO_PEERSEC  31
> +#define SO_PASSSEC  34
> 
>  #endif /* _ASM_SOCKET_H */
> diff -purN -X dontdiff linux-2.6.o/include/asm-arm26/socket.h linux-2.6.w/include/asm-arm26/socket.h
> --- linux-2.6.o/include/asm-arm26/socket.h   2006-06-21 00:02:10.0 -0400
> +++ linux-2.6.w/include/asm-arm26/socket.h   2006-06-27 02:08:49.0 -0400
> @@ -48,5 +48,6 @@
>  #define SO_ACCEPTCONN  30
> 
>  #define SO_PEERSEC  31
> +#define SO_PASSSEC  34
> 
>  #endif /* _ASM_SOCKET_H */
> diff -purN -X dontdiff linux-2.6.o/include/asm-cris/socket.h linux-2.6.w/include/asm-cris/socket.h
> --- linux-2.6.o/include/asm-cris/socket.h   2006-06-21 00:02:11.0 -0400
> +++ linux-2.6.w/include/asm-cris/socket.h   2006-06-27 02:08:49.0 -0400
> @@ -50,6 +50,7 @@
>  #define SO_ACCEPTCONN  30
> 
>  #define SO_PEERSEC 31
> +#define SO_PASSSEC  34
> 
>  #endif /* _ASM_SOCKET_H */
> 
> diff -purN -X dontdiff linux-2.6.o/include/asm-frv/socket.h linux-2.6.w/include/asm-frv/socket.h
> --- linux-2.6.o/include/asm-frv/socket.h   2006-06-21 00:02:11.0 -0400
> +++ linux-2.6.w/include/asm-frv/socket.h   2006-06-27 02:08:49.0 -0400
> @@ -48,6 +48,7 @@
>  #define SO_ACCEPTCONN  30
> 
>  #define SO_PEERSEC  31
> +#define SO_PASSSEC  34
> 
>  #endif /* _ASM_SOCKET_H */
> 
> diff -purN -X dontdiff linux-2.6.o/include/asm-h8300/socket.h linux-2.6.w/include/asm-h8300/socket.h
> --- linux-2.6.o/include/asm-h8300/socket.h   2006-06-21 00:02:11.0 -0400
> +++ linux-2.6.w/include/asm-h8300/socket.h   2006-06-27 02:08:49.0 -0400
> @@ -48,5 +48,6 @@
>  #define SO_ACCEPTCONN  30
> 
>  #define SO_PEERSEC  31
> +#define SO_PASSSEC  34
> 
>  #endif /* _ASM_SOCKET_H */
> diff -purN -X dontdiff linux-2.6.o/include/asm-i386/socket.h linux-2.6.w/include/asm-i386/socket.h
> --- linux-2.6.o/include/asm-i386/socket.h   2006-06-21 00:02:12.0 -0400
> +++ linux-2.6.w/include/asm-i386/socket.h   2006-06-27 02:08:

Re: [PATCH 17/21] e1000: add ich8lan core functions

2006-06-27 Thread Auke Kok

Jeff Garzik wrote:
> Kok, Auke wrote:
>> This implements the core new functions needed for ich8's internal
>> NIC. This includes:
>>
>> * ich8 specific read/write code
>> * flash/nvm access code
>> * software semaphore flag functions
>> * 10/100 PHY (fe - no gigabit speed) support for low-end versions
>> * A workaround for a powerdown sequence problem discovered that
>>   affects a small number of motherboards.
>>
>> Signed-off-by: Jesse Brandeburg <[EMAIL PROTECTED]>
>> Signed-off-by: Auke Kok <[EMAIL PROTECTED]>
>> ---
>>
>>  drivers/net/e1000/e1000_hw.c    | 1000 +++
>>  drivers/net/e1000/e1000_hw.h    |  386 +++
>>  drivers/net/e1000/e1000_osdep.h |   13 +
>>  3 files changed, 1392 insertions(+), 7 deletions(-)
>
> If it takes this much code to support ICH8, it seems like an e1000-ich8.c
> would be warranted...


That's work in progress - Jeb Cramer has been working on this for a while now
but unfortunately it's not ready, and getting ich8 supported in a way that we
know doesn't introduce new bugs is more important. This patch adds tested
and validated support for these chipsets that has been hammered by our test team.


We are planning (working) on cleaning it all up (including whitespace!) - but 
getting the ich8 support out is more important - people can buy the hardware 
today.


Cheers,

Auke

