date:20070125

Re: [PATCH] net: decnet handle a failure in neigh_parms_alloc (take 2)

2007-01-25 Thread Steven Whitehouse

Hi,

On Wed, Jan 24, 2007 at 09:55:45PM -0700, Eric W. Biederman wrote:
 
 While enhancing the neighbour code to handle multiple network
 namespaces I noticed that decnet is assuming neigh_parms_alloc
 will always succeed, which is clearly wrong.  So handle the
 failure.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
Acked-by: Steven Whitehouse [EMAIL PROTECTED]

Also you should cc Patrick as he is now the maintainer,

Steve.
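
The unwind order in the patch can be sketched in userspace. This is only a minimal model under stated assumptions: device_state, parms, tracked_alloc and tracked_free are hypothetical stand-ins, not the decnet or neighbour APIs. It shows each failure path releasing exactly what was acquired before it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel objects; not the real decnet types. */
struct parms { int refs; };
struct device_state { struct parms *neigh_parms; };

/* Counts live allocations so a caller can check that nothing leaks. */
static int live_allocs;

static void *tracked_alloc(size_t size, int fail)
{
	void *p = fail ? NULL : calloc(1, size);

	if (p)
		live_allocs++;
	return p;
}

static void tracked_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/*
 * Same order as the patch: allocate the state, allocate the parms
 * (may fail), then run the "up" hook (may fail).
 */
static struct device_state *device_create(int parms_fail, int up_fail)
{
	struct device_state *db = tracked_alloc(sizeof(*db), 0);

	if (!db)
		return NULL;

	db->neigh_parms = tracked_alloc(sizeof(*db->neigh_parms), parms_fail);
	if (!db->neigh_parms)
		goto err_free_db;

	if (up_fail)
		goto err_release_parms;

	return db;

err_release_parms:
	tracked_free(db->neigh_parms);
err_free_db:
	tracked_free(db);
	return NULL;
}

static void device_destroy(struct device_state *db)
{
	assert(db != NULL);
	tracked_free(db->neigh_parms);
	tracked_free(db);
}
```

Calling device_create(1, 0) or device_create(0, 1) should return NULL and leave live_allocs at zero, which is the property the patch restores for the real code.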


 ---
  net/decnet/dn_dev.c |   11 +++++++++--
  1 files changed, 9 insertions(+), 2 deletions(-)
 
 diff --git a/net/decnet/dn_dev.c b/net/decnet/dn_dev.c
 index 324eb47..913e25a 100644
 --- a/net/decnet/dn_dev.c
 +++ b/net/decnet/dn_dev.c
 @@ -1140,16 +1140,23 @@ struct dn_dev *dn_dev_create(struct net_device *dev, int *err)
  	init_timer(&dn_db->timer);
  
  	dn_db->uptime = jiffies;
 +
 +	dn_db->neigh_parms = neigh_parms_alloc(dev, &dn_neigh_table);
 +	if (!dn_db->neigh_parms) {
 +		dev->dn_ptr = NULL;
 +		kfree(dn_db);
 +		return NULL;
 +	}
 +
  	if (dn_db->parms.up) {
  		if (dn_db->parms.up(dev) < 0) {
 +			neigh_parms_release(&dn_neigh_table, dn_db->neigh_parms);
  			dev->dn_ptr = NULL;
  			kfree(dn_db);
  			return NULL;
  		}
  	}
  
 -	dn_db->neigh_parms = neigh_parms_alloc(dev, &dn_neigh_table);
 -
  	dn_dev_sysctl_register(dev, &dn_db->parms);
  
  	dn_dev_set_timer(dev);
 -- 
 1.4.4.1.g278f
 
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection

2007-01-25 Thread Neil Horman

On Wed, Jan 24, 2007 at 05:54:47PM -0800, Sridhar Samudrala wrote:
 Sec 2.1 of RFC 4429 says
 
Unless noted otherwise, components of the IPv6 protocol stack should
treat addresses in the Optimistic state equivalently to those in the
Deprecated state, indicating that the address is available for use
but should not be used if another suitable address is available.  For
example, Default Address Selection [RFC3484] uses the address state
to decide which source address to use for an outgoing packet.
Implementations should treat an address in state Optimistic as if it
were in state Deprecated.  If address states are recorded as
individual flags, this can easily be achieved by also setting
'Deprecated' when 'Optimistic' is set.
 
 So I think the DEPRECATED flag should also be set when we mark an address
 as OPTIMISTIC so that we don't use it as a source address for new
 connections if another address is available until DAD is completed.
 
 Thanks
 Sridhar
 

Oh, good catch.  Thank you Sri.  However, I'm worried about the next paragraph:

It is important to note that the address lifetime rules of [RFC2462]
   still apply, and so an address may be Deprecated as well as
   Optimistic.  When DAD completes without incident, the address becomes
   either a Preferred or a Deprecated address, as per RFC 2462

Given that, it seems to me that addresses which are flagged as Deprecated may
enter and exit that state independently of the DAD process, which I think gives
rise to the possibility of a race.  I.e. if an address becomes deprecated right
before DAD completes, and then addrconf_dad_complete clears the IFA_F_DEPRECATED
flag, that seems wrong.  Instead I think it would be better if we tested for the
OPTIMISTIC flag in ipv6_dev_get_saddr in parallel with the DEPRECATED flag.  I
may be wrong about this, but I'm going to err on the side of safety.  If you can
ensure that this race is not possible, please let me know, and I'll happily
just set the flag.  I'll repost a new patch soon.

Thanks & Regards
Neil
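
The race described above can be illustrated with a small flag model. This is a sketch, not the addrconf code: the three function names are hypothetical, and only the flag values are taken from the if_addr.h hunk of the patch. It contrasts a DAD-completion step that clears DEPRECATED with one that leaves it alone while source selection tests both flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as they appear in the patch's if_addr.h hunk. */
#define IFA_F_OPTIMISTIC 0x04
#define IFA_F_DEPRECATED 0x20
#define IFA_F_TENTATIVE  0x40

static_assert((IFA_F_OPTIMISTIC & IFA_F_DEPRECATED) == 0,
	      "the two states must be separate bits");

/*
 * Variant A: DAD completion also clears DEPRECATED. This is only safe
 * if DEPRECATED was set purely as a shadow of OPTIMISTIC.
 */
static unsigned dad_complete_clearing_deprecated(unsigned flags)
{
	return flags & ~(unsigned)(IFA_F_TENTATIVE | IFA_F_OPTIMISTIC |
				   IFA_F_DEPRECATED);
}

/* Variant B: DAD completion touches only the DAD-related flags. */
static unsigned dad_complete_keeping_deprecated(unsigned flags)
{
	return flags & ~(unsigned)(IFA_F_TENTATIVE | IFA_F_OPTIMISTIC);
}

/* Source selection treats OPTIMISTIC like DEPRECATED without merging them. */
static bool avoid_as_source(unsigned flags)
{
	return (flags & (IFA_F_DEPRECATED | IFA_F_OPTIMISTIC)) != 0;
}
```

If the lifetime rules deprecate an address while DAD is still running, variant A hands back an address that looks preferred, whereas variant B keeps it deprecated.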

 
 On Tue, 2007-01-23 at 15:51 -0500, Neil Horman wrote:
  On Tue, Jan 23, 2007 at 09:18:20AM +0900, YOSHIFUJI Hideaki / 吉藤英明 wrote:
   Hello.
  <snip>
  
  New patch attached, incorporating Yoshifuji's and Vlad's latest comments.  I didn't
  follow guidance on the ndisc_recv_ns comment, Yoshifuji, since Vlad had already
  suggested an alternate solution in a previous post, but from looking at them
  both, they should be equivalent.
  
  Thanks & Regards
  Neil
  
  Signed-off-by: Neil Horman [EMAIL PROTECTED]
  
  
   include/linux/if_addr.h |    1
   include/linux/ipv6.h    |    2 +
   include/linux/sysctl.h  |    1
   include/net/addrconf.h  |    4 +-
   net/ipv6/addrconf.c     |   56
   net/ipv6/mcast.c        |    4 +-
   net/ipv6/ndisc.c        |   82
   7 files changed, 117 insertions(+), 33 deletions(-)
  
  
  diff --git a/include/linux/if_addr.h b/include/linux/if_addr.h
  index d557e4c..43f3bed 100644
  --- a/include/linux/if_addr.h
  +++ b/include/linux/if_addr.h
  @@ -39,6 +39,7 @@ enum
   #define IFA_F_TEMPORARY	IFA_F_SECONDARY
  
   #define	IFA_F_NODAD		0x02
  +#define IFA_F_OPTIMISTIC	0x04
   #define	IFA_F_HOMEADDRESS	0x10
   #define IFA_F_DEPRECATED	0x20
   #define IFA_F_TENTATIVE		0x40
  diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
  index f824113..5d37abf 100644
  --- a/include/linux/ipv6.h
  +++ b/include/linux/ipv6.h
  @@ -177,6 +177,7 @@ struct ipv6_devconf {
   #endif
   #endif
   	__s32		proxy_ndp;
  +	__s32		optimistic_dad;
   	void		*sysctl;
   };
  
  @@ -205,6 +206,7 @@ enum {
   	DEVCONF_RTR_PROBE_INTERVAL,
   	DEVCONF_ACCEPT_RA_RT_INFO_MAX_PLEN,
   	DEVCONF_PROXY_NDP,
  +	DEVCONF_OPTIMISTIC_DAD,
   	DEVCONF_MAX
   };
  
  diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
  index 81480e6..972a33a 100644
  --- a/include/linux/sysctl.h
  +++ b/include/linux/sysctl.h
  @@ -570,6 +570,7 @@ enum {
   	NET_IPV6_RTR_PROBE_INTERVAL=21,
   	NET_IPV6_ACCEPT_RA_RT_INFO_MAX_PLEN=22,
   	NET_IPV6_PROXY_NDP=23,
  +	NET_IPV6_OPTIMISTIC_DAD=24,
   	__NET_IPV6_MAX
   };
  
  diff --git a/include/net/addrconf.h b/include/net/addrconf.h
  index 88df8fc..d248a19 100644
  --- a/include/net/addrconf.h
  +++ b/include/net/addrconf.h
  @@ -73,7 +73,9 @@ extern int			ipv6_get_saddr(struct dst_entry *dst,
   extern int			ipv6_dev_get_saddr(struct net_device *dev,
   						   struct in6_addr *daddr,
   						   struct in6_addr *saddr);
  -extern int			ipv6_get_lladdr(struct net_device *dev, struct in6_addr *);
  +extern int			ipv6_get_lladdr(struct net_device *dev,
  +						struct in6_addr *,
  +

Re: [BUG] problem with BPF in PF_PACKET sockets, introduced in linux-2.6.19

2007-01-25 Thread Alexey Kuznetsov

Hello!

 So this whole idea to make run_filter() return signed integers
 and fail on negative is entirely flawed, it simply cannot work
 and retain the expected semantics which have been there forever.

Actually, it can. The return value was used only as a sign of error,
so the mistake was to return the original unsigned result cast to int.

An alternative fix is enclosed. To be honest, it is no better than
yours: duplication of a couple of lines of code versus passing the return
value by pointer.

Alexey
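
The point about the cast can be shown in isolation. This is a userspace sketch with hypothetical function names, where the filter result is passed in as a parameter instead of coming from sk_run_filter(). It assumes a 32-bit unsigned int and the usual two's-complement conversion to int, which C leaves implementation-defined.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

static_assert(UINT_MAX == 0xffffffffu, "sketch assumes 32-bit unsigned int");

/*
 * Problematic shape: the unsigned filter result travels through the int
 * that doubles as the error code, so a large "accept everything" value
 * such as 0xffffffff looks like a negative errno to the caller.
 */
static int run_filter_cast(unsigned int filter_result, unsigned int *snaplen)
{
	int err = (int)filter_result;

	if (!err)
		err = -EPERM;
	else if (*snaplen > (unsigned int)err)
		*snaplen = (unsigned int)err;
	return err;		/* caller treats err < 0 as failure */
}

/*
 * Shape of the alternative fix: keep the result unsigned and let err
 * carry only success or -EPERM.
 */
static int run_filter_separate(unsigned int filter_result, unsigned int *snaplen)
{
	int err = 0;
	unsigned int res = filter_result;

	if (!res)
		err = -EPERM;
	else if (*snaplen > res)
		*snaplen = res;
	return err;
}
```

With a snap length of 1500 and a filter result of 0xffffffff, the first variant reports a failure while the second returns 0 and leaves the snap length untouched.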


diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index da73e8a..51e5537 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -437,11 +437,13 @@ static inline int run_filter(struct sk_b
 	rcu_read_lock_bh();
 	filter = rcu_dereference(sk->sk_filter);
 	if (filter != NULL) {
-		err = sk_run_filter(skb, filter->insns, filter->len);
-		if (!err)
+		unsigned int res;
+
+		res = sk_run_filter(skb, filter->insns, filter->len);
+		if (!res)
 			err = -EPERM;
-		else if (*snaplen > err)
-			*snaplen = err;
+		else if (*snaplen > res)
+			*snaplen = res;
 	}
 	rcu_read_unlock_bh();
 

RE: bonding: bug in balance-alb mode (incorrect update-ARP-replies)

2007-01-25 Thread JUNG, Christian

 Jay Vosburgh [EMAIL PROTECTED] wrote:
 
   Is your test occurring on an isolated network, and is there other
 concurrent network traffic that might be affecting things?

The problem still persists as long as the box is connected to our Ciscos.

I tried to simulate it with a dumb switch with only my two boxes connected,
but there were no unsolicited ARP-replies anymore.

On the Ciscos I sometimes see ARP-replies with a destination MAC of
00:00:00:00:00:00 (!) from some Linux-boxes which are using bonding.
Currently I don't have a clue from where they're coming...

The boxes are receiving around 200 ARP-replies a minute - so yes, there's
concurrent network traffic :-)

Marvell Libertas 8388 802.11 USB - added to Orbit

2007-01-25 Thread Luis R. Rodriguez


I've slapped the two Marvell Libertas 8388 802.11 USB cards onto
Winlab's Orbit testbed on sandbox 8. This allows anyone willing to
help hack on the driver with access to a node with the wireless card.

http://www.orbit-lab.org/wiki/Documentation/Developers

 Luis

Re: Marvell Libertas 8388 802.11 USB - added to Orbit

2007-01-25 Thread John W. Linville

On Thu, Jan 25, 2007 at 10:24:40AM -0500, Luis R. Rodriguez wrote:
 I've slapped the two Marvell Libertas 8388 802.11 USB cards onto
 Winlab's Orbit testbed on sandbox 8. This allows anyone willing to
 help hack on the driver with access to a node with the wireless card.
 
 http://www.orbit-lab.org/wiki/Documentation/Developers

Very cool...thanks, Luis!

-- 
John W. Linville
[EMAIL PROTECTED]

Re: [Lksctp-developers] Fw: Intermittent SCTP multihoming breakage

2007-01-25 Thread Vlad Yasevich

Hi Steve

Steve Hill wrote:
 On Wed, 10 Jan 2007, Sridhar Samudrala wrote:
 
 So looks like there may be an issue with PR-SCTP(partial reliability)
 support and packet loss. I will take a look into this.

 Do you still see this problem even if you don't set timetolive?
 
 No, the problem seems to go away if the timetolive is set to 0, so this is
 what I have now done since I had not intended to set the timetolive in the
 first place (but I thought it was still worth posting details of the
 problem since it does appear to be a bug).
 

I think I found this bug.  It was rather interesting to figure out.  The problem
appears to be that data messages time out within the RTO.  As a result, they
move to the abandoned list and are never retransmitted.  This clears the retransmit
list and the retransmit timer; however, the data is still charged as in-flight against
the association.  This in turn causes new data not to be sent, since we are 'supposedly'
utilizing our congestion window.
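
The accounting problem can be modelled in a few lines. This is a heavily simplified sketch under stated assumptions: the chunk and assoc structures, and the rule that sending is allowed only while flight_size is below cwnd, are hypothetical stand-ins and not the SCTP implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified model; not the kernel's structures. */
struct chunk {
	size_t size;
	bool abandoned;		/* PR-SCTP lifetime expired */
	bool gap_acked;		/* peer already acknowledged it */
	bool in_flight;		/* still charged against the window */
};

struct assoc {
	size_t cwnd;
	size_t flight_size;
};

/* Assumed rule: new data may be sent only while flight_size < cwnd. */
static bool can_send(const struct assoc *a)
{
	return a->flight_size < a->cwnd;
}

static void transmit(struct assoc *a, struct chunk *c)
{
	assert(!c->in_flight);
	c->in_flight = true;
	a->flight_size += c->size;
}

/*
 * Marks every chunk abandoned. With discount == false the bytes stay
 * charged (the behaviour described as the bug); with discount == true
 * unacked bytes are removed from the in-flight count.
 */
static void abandon_all(struct assoc *a, struct chunk *chunks, size_t n,
			bool discount)
{
	for (size_t i = 0; i < n; i++) {
		struct chunk *c = &chunks[i];

		c->abandoned = true;
		if (discount && c->in_flight && !c->gap_acked) {
			a->flight_size -= c->size;
			c->in_flight = false;
		}
	}
}
```

Filling the window with two chunks and abandoning them without the discount leaves can_send() false forever, which matches the hang described above; with the discount the window opens again.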

Can you try the attached patch and let me know if the problem is fixed?  You can
try reducing rto_max or path_max_retrans to get the failover to happen a little faster.

Regards
-vlad
[SCTP]: Fix connection hang with PR-SCTP

The problem that this patch corrects happens when all
of the following conditions are satisfied:
1.  PR-SCTP is used and the timeout on the chunks is
set below RTO.Max.
2.  One of the paths on a multihomed association
is brought down.

In this scenario, data will expire within the rto of the
initial transmission and will never be retransmitted.  However
this data still fills the send buffer and is counted against
the association as outstanding data.  This causes any new
data to not be sent and retransmission to not happen.

The fix is to discount the abandoned data from the outstanding
count and peer's rwnd estimation.  This allows new data to
be sent and a retransmission timer restarted.  Even though
this new data will most likely expire within the rto, the
timer still counts as a strike against the transport and forces
the FORWARD-TSN chunk to be retransmitted as well.

Signed-off-by: Vlad Yasevich [EMAIL PROTECTED]
---
 net/sctp/outqueue.c |   27 ++++++++++++++++++++++-----
 1 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index fba567a..54d1b7f 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -396,6 +396,19 @@ void sctp_retransmit_mark(struct sctp_outq *q,
 		if (sctp_chunk_abandoned(chunk)) {
 			list_del_init(lchunk);
 			sctp_insert_list(&q->abandoned, lchunk);
+
+			/* If this chunk has not been previousely acked,
+			 * stop considering it 'outstanding'.  Our peer
+			 * will most likely never see it since it will
+			 * not be retransmitted
+			 */
+			if (!chunk->tsn_gap_acked) {
+				chunk->transport->flight_size -=
+						sctp_data_size(chunk);
+				q->outstanding_bytes -= sctp_data_size(chunk);
+				q->asoc->peer.rwnd += (sctp_data_size(chunk) +
+							sizeof(struct sk_buff));
+			}
 			continue;
 		}
 
@@ -1244,6 +1257,15 @@ static void sctp_check_transmitted(struct sctp_outq *q,
 		if (sctp_chunk_abandoned(tchunk)) {
 			/* Move the chunk to abandoned list. */
 			sctp_insert_list(&q->abandoned, lchunk);
+
+			/* If this chunk has not been acked, stop
+			 * considering it as 'outstanding'.
+			 */
+			if (!tchunk->tsn_gap_acked) {
+				tchunk->transport->flight_size -=
+						sctp_data_size(tchunk);
+				q->outstanding_bytes -= sctp_data_size(tchunk);
+			}
 			continue;
 		}
 
@@ -1695,11 +1717,6 @@ static void sctp_generate_fwdtsn(struct sctp_outq *q, __u32 ctsn)
 		 */
 		if (TSN_lte(tsn, ctsn)) {
 			list_del_init(lchunk);
-			if (!chunk->tsn_gap_acked) {
-				chunk->transport->flight_size -=
-					sctp_data_size(chunk);
-				q->outstanding_bytes -= sctp_data_size(chunk);
-			}
 			sctp_chunk_free(chunk);
 		} else {
 			if (TSN_lte(tsn, asoc->adv_peer_ack_point+1)) {
-- 
1.4.4.2.g8336

Re: [Lksctp-developers] Fw: Intermittent SCTP multihoming breakage

2007-01-25 Thread Vlad Yasevich

BTW, if anyone needs a reproducer, I can provide one.

-vlad

Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection

2007-01-25 Thread Vlad Yasevich

Neil Horman wrote:
 On Wed, Jan 24, 2007 at 05:54:47PM -0800, Sridhar Samudrala wrote:
 Sec 2.1 of RFC 4429 says

Unless noted otherwise, components of the IPv6 protocol stack should
treat addresses in the Optimistic state equivalently to those in the
Deprecated state, indicating that the address is available for use
but should not be used if another suitable address is available.  For
example, Default Address Selection [RFC3484] uses the address state
to decide which source address to use for an outgoing packet.
Implementations should treat an address in state Optimistic as if it
were in state Deprecated.  If address states are recorded as
individual flags, this can easily be achieved by also setting
'Deprecated' when 'Optimistic' is set.

 So I think the DEPRECATED flag should also be set when we mark an address
 as OPTIMISTIC so that we don't use it as a source address for new
 connections if another address is available until DAD is completed.

 Thanks
 Sridhar

 
 Oh, good catch.  Thank you Sri.  However, I'm worried about the next paragraph:
 
 It is important to note that the address lifetime rules of [RFC2462]
still apply, and so an address may be Deprecated as well as
Optimistic.  When DAD completes without incident, the address becomes
either a Preferred or a Deprecated address, as per RFC 2462
 
 Given that, it seems to me that addresses which are flagged as Deprecated may
 enter and exit that state independently of the DAD process, which I think gives
 rise to the possibility of a race.  I.e. if an address becomes deprecated right
 before DAD completes, and then addrconf_dad_complete clears the IFA_F_DEPRECATED
 flag, that seems wrong.  Instead I think it would be better if we tested for the
 OPTIMISTIC flag in ipv6_dev_get_saddr in parallel with the DEPRECATED flag.  I
 may be wrong about this, but I'm going to err on the side of safety.  If you can
 ensure that this race is not possible, please let me know, and I'll happily
 just set the flag.  I'll repost a new patch soon.

I tend to agree with Neil here.  Marking optimistic addresses as deprecated doesn't
buy as much since the address can transition in and out of the deprecated state
regardless of DAD.

However, there is a problem with the current implementation in that an OPTIMISTIC
address will never be chosen as a source because it's always TENTATIVE and OPTIMISTIC
at the same time.  What needs to happen is for ipv6_dev_get_saddr() to not ignore
OPTIMISTIC addresses and to treat them the same as DEPRECATED.

-vlad


Re: [PATCH] Fix sorting of SACK blocks

2007-01-25 Thread Stephen Hemminger

On Thu, 25 Jan 2007 20:29:03 +0200
Baruch Even [EMAIL PROTECTED] wrote:

 The sorting of SACK blocks actually munges them rather than sort, causing the
 TCP stack to ignore some SACK information and breaking the assumption of
 ordered SACK blocks after sorting.
 
 The sort takes the data from a second buffer which isn't moved causing
 subsequent data moves to occur from the wrong location. The fix is to
 use a temporary buffer as a normal sort does.
 
 Signed-Off-By: Baruch Even [EMAIL PROTECTED]
 
 diff -X 2.6-rc6/Documentation/dontdiff -ur 2.6-rc6/net/ipv4/tcp_input.c 2.6-mod/net/ipv4/tcp_input.c
 --- 2.6-rc6/net/ipv4/tcp_input.c  2007-01-25 19:04:20.0 +0200
 +++ 2.6-mod/net/ipv4/tcp_input.c  2007-01-25 19:52:04.0 +0200
 @@ -1011,10 +1011,11 @@
  		for (j = 0; j < i; j++){
  			if (after(ntohl(sp[j].start_seq),
  				  ntohl(sp[j+1].start_seq))){
 -				sp[j].start_seq = htonl(tp->recv_sack_cache[j+1].start_seq);
 -				sp[j].end_seq = htonl(tp->recv_sack_cache[j+1].end_seq);
 -				sp[j+1].start_seq = htonl(tp->recv_sack_cache[j].start_seq);
 -				sp[j+1].end_seq = htonl(tp->recv_sack_cache[j].end_seq);
 +				struct tcp_sack_block_wire tmp;
 +
 +				tmp = sp[j];
 +				sp[j] = sp[j+1];
 +				sp[j+1] = tmp;
  			}
  
  		}

This looks okay, but is there a test case that can be run?


-- 
Stephen Hemminger [EMAIL PROTECTED]

[PATCH] Fix sorting of SACK blocks

2007-01-25 Thread Baruch Even

The sorting of SACK blocks actually munges them rather than sort, causing the
TCP stack to ignore some SACK information and breaking the assumption of
ordered SACK blocks after sorting.

The sort takes the data from a second buffer which isn't moved causing
subsequent data moves to occur from the wrong location. The fix is to
use a temporary buffer as a normal sort does.

Signed-Off-By: Baruch Even [EMAIL PROTECTED]

diff -X 2.6-rc6/Documentation/dontdiff -ur 2.6-rc6/net/ipv4/tcp_input.c 2.6-mod/net/ipv4/tcp_input.c
--- 2.6-rc6/net/ipv4/tcp_input.c	2007-01-25 19:04:20.0 +0200
+++ 2.6-mod/net/ipv4/tcp_input.c	2007-01-25 19:52:04.0 +0200
@@ -1011,10 +1011,11 @@
 		for (j = 0; j < i; j++){
 			if (after(ntohl(sp[j].start_seq),
 				  ntohl(sp[j+1].start_seq))){
-				sp[j].start_seq = htonl(tp->recv_sack_cache[j+1].start_seq);
-				sp[j].end_seq = htonl(tp->recv_sack_cache[j+1].end_seq);
-				sp[j+1].start_seq = htonl(tp->recv_sack_cache[j].start_seq);
-				sp[j+1].end_seq = htonl(tp->recv_sack_cache[j].end_seq);
+				struct tcp_sack_block_wire tmp;
+
+				tmp = sp[j];
+				sp[j] = sp[j+1];
+				sp[j+1] = tmp;
 			}
 
 		}

Re: [PATCH] Fix sorting of SACK blocks

2007-01-25 Thread Baruch Even

* Stephen Hemminger [EMAIL PROTECTED] [070125 20:47]:
 On Thu, 25 Jan 2007 20:29:03 +0200
 Baruch Even [EMAIL PROTECTED] wrote:
 
  The sorting of SACK blocks actually munges them rather than sort, causing 
  the
  TCP stack to ignore some SACK information and breaking the assumption of
  ordered SACK blocks after sorting.
  
  The sort takes the data from a second buffer which isn't moved causing
  subsequent data moves to occur from the wrong location. The fix is to
  use a temporary buffer as a normal sort does.
  
  Signed-Off-By: Baruch Even [EMAIL PROTECTED]
  
  diff -X 2.6-rc6/Documentation/dontdiff -ur 2.6-rc6/net/ipv4/tcp_input.c 2.6-mod/net/ipv4/tcp_input.c
  --- 2.6-rc6/net/ipv4/tcp_input.c	2007-01-25 19:04:20.0 +0200
  +++ 2.6-mod/net/ipv4/tcp_input.c	2007-01-25 19:52:04.0 +0200
  @@ -1011,10 +1011,11 @@
   		for (j = 0; j < i; j++){
   			if (after(ntohl(sp[j].start_seq),
   				  ntohl(sp[j+1].start_seq))){
  -				sp[j].start_seq = htonl(tp->recv_sack_cache[j+1].start_seq);
  -				sp[j].end_seq = htonl(tp->recv_sack_cache[j+1].end_seq);
  -				sp[j+1].start_seq = htonl(tp->recv_sack_cache[j].start_seq);
  -				sp[j+1].end_seq = htonl(tp->recv_sack_cache[j].end_seq);
  +				struct tcp_sack_block_wire tmp;
  +
  +				tmp = sp[j];
  +				sp[j] = sp[j+1];
  +				sp[j+1] = tmp;
   			}
   
   		}
 
 This looks okay, but is there a test case that can be run?

There is nothing visible that shows the problem; the only option is to
add some code to print the SACK blocks after sorting and run it over a
large BDP connection that can be saturated. You'll obviously need to
have several holes. I believe the bug will be visible when you have
ACK packets with three SACK blocks where the first block is the highest,
which should be the normal case.

Cheers,
Baruch
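
For anyone who wants to see the effect without a saturated link, the two sorting strategies can be compared in userspace. This is only a model under stated assumptions: sack_block, sort_from_snapshot and sort_in_place are hypothetical names, byte-order conversion is left out, and the snapshot array stands in for the second buffer that is read but never updated.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_SACKS 4

struct sack_block { uint32_t start_seq, end_seq; };

/* Wrap-safe "a is after b" comparison on 32-bit sequence numbers. */
static int seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/*
 * Models the pre-patch logic: swaps read from a snapshot that is never
 * updated, so later swaps copy stale entries.
 */
static void sort_from_snapshot(struct sack_block *sp, int n)
{
	struct sack_block cache[MAX_SACKS];

	assert(n >= 0 && n <= MAX_SACKS);
	memcpy(cache, sp, (size_t)n * sizeof(*sp));
	for (int i = n - 1; i > 0; i--)
		for (int j = 0; j < i; j++)
			if (seq_after(sp[j].start_seq, sp[j + 1].start_seq)) {
				sp[j] = cache[j + 1];
				sp[j + 1] = cache[j];
			}
}

/* Models the patched logic: an ordinary bubble sort with a temporary. */
static void sort_in_place(struct sack_block *sp, int n)
{
	for (int i = n - 1; i > 0; i--)
		for (int j = 0; j < i; j++)
			if (seq_after(sp[j].start_seq, sp[j + 1].start_seq)) {
				struct sack_block tmp = sp[j];

				sp[j] = sp[j + 1];
				sp[j + 1] = tmp;
			}
}
```

With three blocks starting at 300, 100 and 200 (the first block is the highest), the snapshot variant ends up with start sequences 100, 200, 100, so the block at 300 is lost and the one at 100 is duplicated. The in-place variant yields 100, 200, 300.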

Possible bugs in SACK processing

2007-01-25 Thread Baruch Even

In addition to the patch I've provided, there are two more issues that I
believe are bugs in the SACK processing code. I'm not certain, and I
don't have the time to look into them, so I'd like to raise them for other
folks to look at.

The first issue is the check for the applicability of the fast path. The
SACK blocks are compared directly, but there is no comparison of the
number of SACK blocks. If the former SACK had two blocks and now
we have three, we will compare the third SACK block against old
or uninitialised data. The chance of anything really bad happening might
not be high, but it seems to be bad behaviour.
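
A count check of the kind suggested here could look roughly like the following. This is a sketch only: sack_cache and sack_fast_path_ok are hypothetical, and exact equality of the blocks is a simplification of whatever criterion the real fast path uses. The point is just that the number of blocks is compared before any block contents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SACKS 4

struct sack_block { uint32_t start_seq, end_seq; };

/* Hypothetical cache of the previous ACK's SACK option. */
struct sack_cache {
	int num;
	struct sack_block blocks[MAX_SACKS];
};

/*
 * Fast path is allowed only if the block count matches and every block
 * is unchanged. Without the count check, slot [num..] of the cache would
 * be compared even though it holds stale or never-written data.
 */
static bool sack_fast_path_ok(const struct sack_cache *prev,
			      const struct sack_block *cur, int num)
{
	assert(num >= 0 && num <= MAX_SACKS);
	if (prev->num != num)
		return false;
	for (int i = 0; i < num; i++)
		if (cur[i].start_seq != prev->blocks[i].start_seq ||
		    cur[i].end_seq != prev->blocks[i].end_seq)
			return false;
	return true;
}
```

If the cache holds two blocks and a zeroed third slot, a new ACK carrying three blocks whose third block happens to be all zeroes would match slot by slot; the count comparison is what rejects it.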

The second issue is that there is no check that the fast path is
actually behind the hint. Consider a scenario where we have three SACK
blocks and the first SACK update is about an old location. Then
comes another SACK packet with only an update to the old location. The
result will be that after the former SACK block the hint is in the
latest location it can be, and when the next SACK packet arrives we
detect it's an increase only, but the fast path hint is too far along
and we do no updating at all.

Baruch

Re: [PATCH RFC 31/31] net: Add etun driver

2007-01-25 Thread Ben Greear


Eric W. Biederman wrote:

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

etun is a simple two headed tunnel driver that at the link layer
looks like ethernet.  Its target audience is communicating
between network namespaces but it is general enough it may
have other uses as well.



This looks almost identical to my redir-dev module.  Which is
fine..I don't really care which gets into the kernel so long as
one of them does...

Comments and questions are inline below.


+/*
+ * The higher levels take care of making this non-reentrant (it's
+ * called with bh's disabled).
+ */
+static int etun_xmit(struct sk_buff *skb, struct net_device *tx_dev)
+{
+	struct etun_info *tx_info = tx_dev->priv;
+	struct net_device *rx_dev = tx_info->rx_dev;
+	struct etun_info *rx_info = rx_dev->priv;
+
+	tx_info->stats.tx_packets++;
+	tx_info->stats.tx_bytes += skb->len;
+
+	/* Drop the skb state that was needed to get here */
+	skb_orphan(skb);
+	if (skb->dst)
+		skb->dst = dst_pop(skb->dst); /* Allow for smart routing */


I ended up setting dst to NULL.  What does the dst_pop() accomplish?


+   
+	/* Switch to the receiving device */
+	skb->pkt_type = PACKET_HOST;
+	skb->protocol = eth_type_trans(skb, rx_dev);
+	skb->dev = rx_dev;
+	skb->ip_summed = CHECKSUM_NONE;
+
+	/* If both halves agree no checksum is needed */
+	if (tx_dev->features & NETIF_F_NO_CSUM)
+		skb->ip_summed = rx_info->ip_summed;
+
+	rx_dev->last_rx = jiffies;


Do you need to set tx_dev->trans_start to jiffies as well?


+	rx_info->stats.rx_packets++;
+	rx_info->stats.rx_bytes += skb->len;


I think you need to zero out skb->tstamp as well.  This lets it
be re-calculated when the receive logic of the other device is called.

Otherwise this fails:

rx skb on eth1, delay skb for network emulation, bridge onto etun0, rx on etun1
(time-stamp is still what it was when rx'd on eth1, which is too old.)



+   netif_rx(skb);
+
+   return 0;
+}
+



+static int etun_open(struct net_device *tx_dev)
+{
+	struct etun_info *tx_info = tx_dev->priv;
+	struct net_device *rx_dev = tx_info->rx_dev;
+	if (rx_dev->flags & IFF_UP) {
+		netif_carrier_on(tx_dev);
+		netif_carrier_on(rx_dev);
+	}
+	netif_start_queue(tx_dev);


Does this carrier logic keep etun0 from transmitting to
etun1 if etun0 is UP but etun1 is not UP yet?


+   return 0;
+}
+
+static int etun_stop(struct net_device *tx_dev)
+{
+	struct etun_info *tx_info = tx_dev->priv;
+	struct net_device *rx_dev = tx_info->rx_dev;
+	netif_stop_queue(tx_dev);
+	if (netif_carrier_ok(tx_dev)) {
+		netif_carrier_off(tx_dev);
+		netif_carrier_off(rx_dev);
+	}
+	return 0;
+}
+
+static void etun_set_multicast_list(struct net_device *dev)
+{
+   /* Nothing sane I can do here */
+   return;
+}
+
+static int etun_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
+{
+   return -EOPNOTSUPP;
+}
+
+/* Only allow letters and numbers in an etun device name */
+static int is_valid_name(const char *name)
+{
+   const char *ptr;
+   for (ptr = name; *ptr; ptr++) {
+   if (!isalnum(*ptr))
+   return 0;
+   }
+   return 1;
+}
+
+static struct net_device *etun_alloc(net_t net, const char *name)
+{
+   struct net_device *dev;
+   struct etun_info *info;
+   int err;
+
+   if (!name || !is_valid_name(name))
+   return ERR_PTR(-EINVAL);
+
+   dev = alloc_netdev(sizeof(struct etun_info), name, ether_setup);
+   if (!dev)
+   return ERR_PTR(-ENOMEM);
+   
+	info = dev->priv;
+	info->dev = dev;
+	dev->nd_net = net;
+
+	random_ether_addr(dev->dev_addr);
+	dev->tx_queue_len	= 0; /* A queue is silly for a loopback device */
+	dev->hard_start_xmit	= etun_xmit;
+	dev->get_stats		= etun_get_stats;
+	dev->open		= etun_open;
+	dev->stop		= etun_stop;
+	dev->set_multicast_list	= etun_set_multicast_list;
+	dev->do_ioctl		= etun_ioctl;
+	dev->features		= NETIF_F_FRAGLIST
+				  | NETIF_F_HIGHDMA
+				  | NETIF_F_LLTX;
+	dev->flags		= IFF_BROADCAST | IFF_MULTICAST |IFF_PROMISC;
+	dev->ethtool_ops	= &etun_ethtool_ops;
+	dev->destructor		= free_netdev;


You should add ability to change MTU.  I believe it is as trivial as this:

int redirdev_change_mtu(struct net_device *dev, int new_mtu) {
	dev->mtu = new_mtu;
	return 0;
}



+   err = register_netdev(dev);
+   if (err) {
+   free_netdev(dev);
+   dev = ERR_PTR(err);
+   goto out;
+   }
+   netif_carrier_off(dev);
+out:
+   return dev;
+}
+

Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection

2007-01-25 Thread Neil Horman

On Thu, Jan 25, 2007 at 12:16:59PM -0500, Vlad Yasevich wrote:
<snip>
 I tend to agree with Neil here.  Marking optimistic addresses as deprecated doesn't
 buy as much since the address can transition in and out of the deprecated state
 regardless of DAD.
 
 However, there is a problem with the current implementation in that an OPTIMISTIC
 address will never be chosen as a source because it's always TENTATIVE and OPTIMISTIC
 at the same time.  What needs to happen is for ipv6_dev_get_saddr() to not ignore
 OPTIMISTIC addresses and to treat them the same as DEPRECATED.
 
 -vlad


Here's an updated patch.  Same as the previous patch but it adds three
modifications to ipv6_dev_get_saddr, which do the following:

a) Adds logic to not remove addresses that are both tentative and optimistic
from the set of considered addresses

b) Treats optimistic addresses and deprecated addresses in the same fashion by
checking for both flags appropriately during source address selection.

Thoughts welcome.

Thanks & Regards
Neil

Signed-off-by: Neil Horman [EMAIL PROTECTED]


 include/linux/if_addr.h |    1
 include/linux/ipv6.h    |    2 +
 include/linux/sysctl.h  |    1
 include/net/addrconf.h  |    4 +-
 net/ipv6/addrconf.c     |   69
 net/ipv6/mcast.c        |    4 +-
 net/ipv6/ndisc.c        |   82
 7 files changed, 125 insertions(+), 38 deletions(-)



diff --git a/include/linux/if_addr.h b/include/linux/if_addr.h
index d557e4c..43f3bed 100644
--- a/include/linux/if_addr.h
+++ b/include/linux/if_addr.h
@@ -39,6 +39,7 @@ enum
 #define IFA_F_TEMPORARY	IFA_F_SECONDARY
 
 #define	IFA_F_NODAD		0x02
+#define IFA_F_OPTIMISTIC	0x04
 #define	IFA_F_HOMEADDRESS	0x10
 #define IFA_F_DEPRECATED	0x20
 #define IFA_F_TENTATIVE		0x40
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index f824113..5d37abf 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -177,6 +177,7 @@ struct ipv6_devconf {
 #endif
 #endif
 	__s32		proxy_ndp;
+	__s32		optimistic_dad;
 	void		*sysctl;
 };
 
@@ -205,6 +206,7 @@ enum {
 	DEVCONF_RTR_PROBE_INTERVAL,
 	DEVCONF_ACCEPT_RA_RT_INFO_MAX_PLEN,
 	DEVCONF_PROXY_NDP,
+	DEVCONF_OPTIMISTIC_DAD,
 	DEVCONF_MAX
 };
 
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 81480e6..972a33a 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -570,6 +570,7 @@ enum {
 	NET_IPV6_RTR_PROBE_INTERVAL=21,
 	NET_IPV6_ACCEPT_RA_RT_INFO_MAX_PLEN=22,
 	NET_IPV6_PROXY_NDP=23,
+	NET_IPV6_OPTIMISTIC_DAD=24,
 	__NET_IPV6_MAX
 };
 
diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 88df8fc..d248a19 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -73,7 +73,9 @@ extern int			ipv6_get_saddr(struct dst_entry *dst,
 extern int			ipv6_dev_get_saddr(struct net_device *dev,
 						   struct in6_addr *daddr,
 						   struct in6_addr *saddr);
-extern int			ipv6_get_lladdr(struct net_device *dev, struct in6_addr *);
+extern int			ipv6_get_lladdr(struct net_device *dev,
+						struct in6_addr *,
+						unsigned char banned_flags);
 extern int			ipv6_rcv_saddr_equal(const struct sock *sk,
 						      const struct sock *sk2);
 extern void			addrconf_join_solict(struct net_device *dev,
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 2a7e461..46f91ee 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -830,7 +830,8 @@ retry:
 	ift = !max_addresses ||
 	      ipv6_count_addresses(idev) < max_addresses ?
 		ipv6_add_addr(idev, &addr, tmp_plen,
-			      ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK, IFA_F_TEMPORARY) : NULL;
+			      ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK,
+			      IFA_F_TEMPORARY|IFA_F_OPTIMISTIC) : NULL;
 	if (!ift || IS_ERR(ift)) {
 		in6_ifa_put(ifp);
 		in6_dev_put(idev);
@@ -962,13 +963,14 @@ int ipv6_dev_get_saddr(struct net_device *daddr_dev,
 * - Tentative Address (RFC2462 section 5.4)
 *  - A tentative address is not considered
 *assigned to an interface in the traditional
-*sense.
+*sense, unless it is also flagged as optimistic.
 * - Candidate Source Address (section 4)
 *  - In any case, anycast addresses, multicast
 *addresses, and the

Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection

2007-01-25 Thread Vlad Yasevich

Hi Neil

 @@ -1027,15 +1029,17 @@ int ipv6_dev_get_saddr(struct net_device *daddr_dev,
   }
   }
  
 - /* Rule 3: Avoid deprecated address */
 + /* Rule 3: Avoid deprecated and optimistic address */
   if (hiscore.rule < 3) {
   if (ipv6_saddr_preferred(hiscore.addr_type) ||
 - !(ifa_result->flags & IFA_F_DEPRECATED))
 + ((!(ifa_result->flags & IFA_F_DEPRECATED)) &&
 + (!(ifa_result->flags & IFA_F_OPTIMISTIC

One style comment.  Looks like some extra parentheses that I don't think are
needed.  I think you can say

+   (!(ifa_result->flags & IFA_F_DEPRECATED)) &&
+!(ifa_result->flags & IFA_F_OPTIMISTIC


   hiscore.attrs |= IPV6_SADDR_SCORE_PREFERRED;
   hiscore.rule++;
   }
   if (ipv6_saddr_preferred(score.addr_type) ||
 - !(ifa->flags & IFA_F_DEPRECATED)) {
 + ((!(ifa->flags & IFA_F_DEPRECATED)) &&
 + (!(ifa_result->flags & IFA_F_OPTIMISTIC {

same here.

   score.attrs |= IPV6_SADDR_SCORE_PREFERRED;
   if (!(hiscore.attrs & IPV6_SADDR_SCORE_PREFERRED)) {
   score.rule = 3;

-vlad


Re: [PATCH RFC 31/31] net: Add etun driver

2007-01-25 Thread Eric W. Biederman

Ben Greear [EMAIL PROTECTED] writes:

 Eric W. Biederman wrote:
 From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

 etun is a simple two headed tunnel driver that at the link layer
 looks like ethernet.  Its target audience is communicating
 between network namespaces but it is general enough it may
 have other uses as well.


 This looks almost identical to my redir-dev module.  Which is
 fine..I don't really care which gets into the kernel so long as
 one of them does...

 Comments and questions are inline below.

If it is, I don't really care much either.

 +/*
 + * The higher levels take care of making this non-reentrant (it's
 + * called with bh's disabled).
 + */
 +static int etun_xmit(struct sk_buff *skb, struct net_device *tx_dev)
 +{
 +struct etun_info *tx_info = tx_dev->priv;
 +struct net_device *rx_dev = tx_info->rx_dev;
 +struct etun_info *rx_info = rx_dev->priv;
 +
 +tx_info->stats.tx_packets++;
 +tx_info->stats.tx_bytes += skb->len;
 +
 +/* Drop the skb state that was needed to get here */
 +skb_orphan(skb);
 +if (skb->dst)
 +skb->dst = dst_pop(skb->dst);   /* Allow for smart routing */

 I ended up setting dst to NULL.  What does the dst_pop() accomplish?

It allows an ambitious routing program to realize all of the routing
is on one machine and compute a route through multiple network
stack traversals.

I don't know if it ever makes sense to really use that, but since
in the normal case this just sets dst to NULL, I figured I would
leave it in, in case that ever looks useful.

 +
 +/* Switch to the receiving device */
 +skb->pkt_type = PACKET_HOST;
 +skb->protocol = eth_type_trans(skb, rx_dev);
 +skb->dev = rx_dev;
 +skb->ip_summed = CHECKSUM_NONE;
 +
 +/* If both halves agree no checksum is needed */
 +if (tx_dev->features & NETIF_F_NO_CSUM)
 +skb->ip_summed = rx_info->ip_summed;
 +
 +rx_dev->last_rx = jiffies;

 Do you need to set tx_dev->trans_start to jiffies as well?

Could be.  I haven't had any problems with it but I may have missed
a trick or two.

 +rx_info->stats.rx_packets++;
 +rx_info->stats.rx_bytes += skb->len;

 I think you need to zero out the skb->tstamp as well.  This lets it
 be re-calculated when the receive logic of the other device is called.

 Otherwise this fails:

 rx skb on eth1, delay skb for network emulation, bridge onto etun0, rx on 
 etun1
 (time-stamp is still what it was when rx'd on eth1, which is too old.)

Quite possibly.  I wouldn't be at all surprised if I missed something like that.

 +static int etun_open(struct net_device *tx_dev)
 +{
 +struct etun_info *tx_info = tx_dev->priv;
 +struct net_device *rx_dev = tx_info->rx_dev;
 +if (rx_dev->flags & IFF_UP) {
 +netif_carrier_on(tx_dev);
 +netif_carrier_on(rx_dev);
 +}
 +netif_start_queue(tx_dev);

 Does this carrier logic keep etun0 from transmitting to
 etun1 if etun0 is UP but etun1 is not UP yet?

A little bit.  It also allows user space to see that there really
is not a connection.  I think I was just having fun when I implemented
that bit.

 +
 +random_ether_addr(dev->dev_addr);
 +dev->tx_queue_len = 0; /* A queue is silly for a loopback device */
 +dev->hard_start_xmit = etun_xmit;
 +dev->get_stats = etun_get_stats;
 +dev->open = etun_open;
 +dev->stop = etun_stop;
 +dev->set_multicast_list = etun_set_multicast_list;
 +dev->do_ioctl = etun_ioctl;
 +dev->features = NETIF_F_FRAGLIST
 +  | NETIF_F_HIGHDMA
 +  | NETIF_F_LLTX;
 +dev->flags = IFF_BROADCAST | IFF_MULTICAST | IFF_PROMISC;
 +dev->ethtool_ops = &etun_ethtool_ops;
 +dev->destructor = free_netdev;

 You should add ability to change MTU.  I believe it is as trivial as this:

 int redirdev_change_mtu(struct net_device *dev, int new_mtu) {
   dev->mtu = new_mtu;
   return 0;
 }

It should be.  If I missed that it was an oversight.
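
A slightly more defensive variant of the snippet above could reject
out-of-range values before storing them.  This is only a sketch:
struct mock_netdev, mock_change_mtu() and the MOCK_MAX_MTU limit are
stand-ins invented for illustration (the real struct net_device lives in
the kernel), and 68 is the minimum MTU that IPv4 allows.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct net_device. */
struct mock_netdev {
	int mtu;
};

#define MOCK_MIN_MTU 68     /* smallest MTU IPv4 allows */
#define MOCK_MAX_MTU 65535  /* assumed upper bound for a software device */

/* Store new_mtu if it is within range, otherwise leave dev untouched. */
static int mock_change_mtu(struct mock_netdev *dev, int new_mtu)
{
	assert(dev != NULL);
	if (new_mtu < MOCK_MIN_MTU || new_mtu > MOCK_MAX_MTU)
		return -EINVAL;
	dev->mtu = new_mtu;
	return 0;
}
```

Whether a tunnel pair should also keep both ends at the same MTU is a
separate design question that this sketch does not address.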


 +dev_hold(dev0);
 +dev_hold(dev1);
 +info0->rx_dev = dev1;
 +info1->rx_dev = dev0;

 Can this race such that someone could manage to tx on one of these
 devices before you assign the rx_dev?  Maybe register-netdev after
 this assignment here, instead of in the alloc_etun method above?

Good paranoid thought.

 +
 +/* Only place one member of the pair on the list
 + * so I don't confuse list_for_each_entry_safe,
 + * by deleting two list entries at once.
 + */
 +rtnl_lock();
 +list_add(&info0->list, &etun_list);
 +INIT_LIST_HEAD(&info1->list);
 +rtnl_unlock();
 +
 +return 0;
 +}
 +
 +static int etun_unregister_pair(struct net_device *dev0)
 +{
 +struct etun_info *info0, *info1;
 +struct net_device *dev1;
 +
 +ASSERT_RTNL();
 +
 +if (!dev0)
 +return -ENODEV;
 +
 +info0 = dev0->priv;
 +dev1  = info0->rx_dev;
 +info1 =

Re: [PATCH RFC 2/31] net: Implement a place holder network namespace

2007-01-25 Thread Eric W. Biederman

Stephen Hemminger [EMAIL PROTECTED] writes:
 +
 +#define __per_net_start ((char *)0)
 +#define __per_net_end   ((char *)0)

 Don't use these, use NULL

NULL has the wrong data type.  These are compiled-out character arrays
normally generated by the linker script.  I'm not even certain I need
the above, but it allows for compile time and not link time optimization,
so it is probably better that way.  The fact that these happen to be
equal to NULL is their least interesting property.  The fact that
you can subtract them and get 0 is much more interesting.

 +
 +static inline int copy_net(int flags, struct task_struct *tsk) { return 0; }
 +
 +/* Don't let the list of network namespaces change */
 +static inline void net_lock(void) {}
 +static inline void net_unlock(void) {}

 Don't make all one line, or use #define instead.

Why?

Anyway I appreciate the picking of the nits, and it should lead
to better code.

I guess this implies you are in favor of the general idea of
where this is going?

Eric

Re: owner-Match in 2.6.20-rc5 (fwd)

2007-01-25 Thread David Miller

From: Jozsef Kadlecsik [EMAIL PROTECTED]
Date: Thu, 25 Jan 2007 21:31:56 +0100 (CET)

 The report below was posted on the netfilter user list. Isn't there any 
 ill side effect by reverting the change?

Performance regression :-(

This optimization saves a whole handful of heavy atomic operations in
the packet transmit path of TCP.

As I understand it, the owner-Match is not in the upstream tree, and
it's the only thing that cares, so I see no reason to cater for it.

Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection

2007-01-25 Thread Neil Horman

On Thu, Jan 25, 2007 at 03:18:59PM -0500, Vlad Yasevich wrote:
 Hi Neil
 
  @@ -1027,15 +1029,17 @@ int ipv6_dev_get_saddr(struct net_device *daddr_dev,
  }
  }
   
  -   /* Rule 3: Avoid deprecated address */
  +   /* Rule 3: Avoid deprecated and optimistic address */
  if (hiscore.rule < 3) {
  if (ipv6_saddr_preferred(hiscore.addr_type) ||
  -   !(ifa_result->flags & IFA_F_DEPRECATED))
  +   ((!(ifa_result->flags & IFA_F_DEPRECATED)) &&
  +   (!(ifa_result->flags & IFA_F_OPTIMISTIC
 
 One style comment.  Looks like some extra parentheses that I don't think are
 needed.  I think you can say
 
 + (!(ifa_result->flags & IFA_F_DEPRECATED)) &&
 +  !(ifa_result->flags & IFA_F_OPTIMISTIC
 
 
  hiscore.attrs |= IPV6_SADDR_SCORE_PREFERRED;
  hiscore.rule++;
  }
  if (ipv6_saddr_preferred(score.addr_type) ||
  -   !(ifa->flags & IFA_F_DEPRECATED)) {
  +   ((!(ifa->flags & IFA_F_DEPRECATED)) &&
  +   (!(ifa_result->flags & IFA_F_OPTIMISTIC {
 
 same here.
 
  score.attrs |= IPV6_SADDR_SCORE_PREFERRED;
  if (!(hiscore.attrs & IPV6_SADDR_SCORE_PREFERRED)) {
  score.rule = 3;
 
 -vlad


I prefer to be more explicit in my order of operations, but that does seem more
consistent with the prevailing style.  New patch attached.

Thanks & Regards
Neil

Signed-off-by: Neil Horman [EMAIL PROTECTED]


 include/linux/if_addr.h |1 
 include/linux/ipv6.h|2 +
 include/linux/sysctl.h  |1 
 include/net/addrconf.h  |4 +-
 net/ipv6/addrconf.c |   69 
 net/ipv6/mcast.c|4 +-
 net/ipv6/ndisc.c|   82 +++-
 7 files changed, 125 insertions(+), 38 deletions(-)



diff --git a/include/linux/if_addr.h b/include/linux/if_addr.h
index d557e4c..43f3bed 100644
--- a/include/linux/if_addr.h
+++ b/include/linux/if_addr.h
@@ -39,6 +39,7 @@ enum
 #define IFA_F_TEMPORARY    IFA_F_SECONDARY
 
 #define IFA_F_NODAD        0x02
+#define IFA_F_OPTIMISTIC   0x04
 #define IFA_F_HOMEADDRESS  0x10
 #define IFA_F_DEPRECATED   0x20
 #define IFA_F_TENTATIVE    0x40
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index f824113..5d37abf 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -177,6 +177,7 @@ struct ipv6_devconf {
 #endif
 #endif
__s32   proxy_ndp;
+   __s32   optimistic_dad;
void*sysctl;
 };
 
@@ -205,6 +206,7 @@ enum {
DEVCONF_RTR_PROBE_INTERVAL,
DEVCONF_ACCEPT_RA_RT_INFO_MAX_PLEN,
DEVCONF_PROXY_NDP,
+   DEVCONF_OPTIMISTIC_DAD,
DEVCONF_MAX
 };
 
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 81480e6..972a33a 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -570,6 +570,7 @@ enum {
NET_IPV6_RTR_PROBE_INTERVAL=21,
NET_IPV6_ACCEPT_RA_RT_INFO_MAX_PLEN=22,
NET_IPV6_PROXY_NDP=23,
+   NET_IPV6_OPTIMISTIC_DAD=24,
__NET_IPV6_MAX
 };
 
diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 88df8fc..d248a19 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -73,7 +73,9 @@ extern int ipv6_get_saddr(struct dst_entry *dst,
 extern int ipv6_dev_get_saddr(struct net_device *dev, 
   struct in6_addr *daddr,
   struct in6_addr *saddr);
-extern int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *);
+extern int ipv6_get_lladdr(struct net_device *dev, 
+   struct in6_addr *,
+   unsigned char banned_flags);
 extern int ipv6_rcv_saddr_equal(const struct sock *sk, 
  const struct sock *sk2);
 extern void addrconf_join_solict(struct net_device *dev,
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 2a7e461..057a260 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -830,7 +830,8 @@ retry:
ift = !max_addresses ||
  ipv6_count_addresses(idev) < max_addresses ? 
ipv6_add_addr(idev, addr, tmp_plen,
- ipv6_addr_type(addr)&IPV6_ADDR_SCOPE_MASK, IFA_F_TEMPORARY) : NULL;
+ ipv6_addr_type(addr)&IPV6_ADDR_SCOPE_MASK,

Re: [PATCH] Fix sorting of SACK blocks

2007-01-25 Thread David Miller

From: Baruch Even [EMAIL PROTECTED]
Date: Thu, 25 Jan 2007 20:29:03 +0200

 The sorting of SACK blocks actually munges them rather than sort, causing the
 TCP stack to ignore some SACK information and breaking the assumption of
 ordered SACK blocks after sorting.

 The sort takes the data from a second buffer which isn't moved causing
 subsequent data moves to occur from the wrong location. The fix is to
 use a temporary buffer as a normal sort does.

 Signed-Off-By: Baruch Even [EMAIL PROTECTED]

Thanks for finding this bug Baruch.

It probably explains some weird TCP traces I've seen over
the years :-)

I'll review this and apply it later today, thanks again.
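
As a rough illustration of the fix described in the quoted changelog
(swap through a temporary, as a normal sort does), here is a standalone
sketch.  It is not the kernel code: struct sack_block, seq_after() and
sort_sack_blocks() are names made up for this example, and the
wraparound-aware comparison is an assumption about how the blocks
should be ordered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical SACK block: a range of sequence numbers. */
struct sack_block {
	uint32_t start_seq;
	uint32_t end_seq;
};

/* True if sequence number a comes after b, allowing for 32-bit wraparound. */
static int seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Bubble sort by start_seq; every swap goes through a temporary copy. */
static void sort_sack_blocks(struct sack_block *sp, int n)
{
	assert(sp != NULL || n == 0);
	for (int i = n - 1; i > 0; i--) {
		for (int j = 0; j < i; j++) {
			if (seq_after(sp[j].start_seq, sp[j + 1].start_seq)) {
				struct sack_block tmp = sp[j];

				sp[j] = sp[j + 1];
				sp[j + 1] = tmp;
			}
		}
	}
}
```

Because both halves of each swap read and write the same array, no
element is ever copied from a stale location.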

Re: owner-Match in 2.6.20-rc5 (fwd)

2007-01-25 Thread David Miller

From: Jan Engelhardt [EMAIL PROTECTED]
Date: Thu, 25 Jan 2007 22:07:07 +0100 (MET)

  The report below was posted on the netfilter user list. Isn't there any 
  ill side effect by reverting the change?

 Performance regression :-(

 This optimization saves a whole handful of heavy atomic operations in
 the packet transmit path of TCP.

 As I understand it, the owner-Match is not in the upstream tree, and
 it's the only thing that cares, so I see no reason to cater for it.

 For me, it's there.
 -rw-r--r-- 1 jengelh users 2247 Jan 25 21:37
 /erk/kernel/linux-2.6.20-rc6/net/ipv4/netfilter/ipt_owner.c

Ok, I'll see what I can do about this :-)

Re: owner-Match in 2.6.20-rc5 (fwd)

2007-01-25 Thread Jan Engelhardt


  The report below was posted on the netfilter user list. Isn't there any 
  ill side effect by reverting the change?
 
 Performance regression :-(
 
 This optimization saves a whole handful of heavy atomic operations in
 the packet transmit path of TCP.
 
 As I understand it, the owner-Match is not in the upstream tree, and
 it's the only thing that cares, so I see no reason to cater for it.
 
 For me, it's there.
 -rw-r--r-- 1 jengelh users 2247 Jan 25 21:37
 /erk/kernel/linux-2.6.20-rc6/net/ipv4/netfilter/ipt_owner.c

Ok, I'll see what I can do about this :-)


People really depend on this. Much more than the pid/comm/smpunsafe stuff.
For example, a web server [cgi enabled, etc.] which also runs squid,
to force all webtraffic through it:

-A OUTPUT -p tcp --dport 80 -m owner ! --uid-owner
  squid -j REDIRECT --to-ports 3128


-`J'
-- 

Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection

2007-01-25 Thread Vlad Yasevich

Hi Neil

I went through the RFC again, and it seems like the following is missing:

Section 3.3:

 * (modifies section 5.4.2) The host MUST join the all-nodes multicast
address and the solicited-node multicast address of the
Tentative address.  The host SHOULD NOT delay before sending
Neighbor Solicitation messages.

For this, addrconf_dad_kick() should pass 0 to addrconf_mod_timer
when the address is optimistic.  Otherwise, we'll delay DAD and some of
the purpose of optimistic addresses is lost.

-vlad
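
A minimal sketch of the suggestion, outside the kernel: the helper name
dad_start_delay() and its parameters are hypothetical, rand_num stands
in for whatever random value the real code draws, and max_delay for the
configured solicitation delay.  Only the IFA_F_OPTIMISTIC value is taken
from the patch.

```c
#define IFA_F_OPTIMISTIC 0x04  /* value from the patch under review */

/*
 * Hypothetical helper: choose the delay before the first DAD probe.
 * Optimistic addresses start probing at once; all others keep a
 * random delay below max_delay.
 */
static unsigned long dad_start_delay(unsigned int flags,
				     unsigned long rand_num,
				     unsigned long max_delay)
{
	if (flags & IFA_F_OPTIMISTIC)
		return 0;  /* RFC 4429 section 3.3: SHOULD NOT delay */
	return rand_num % (max_delay ? max_delay : 1);
}
```

The returned value is what would be handed to the timer in place of the
unconditional random delay.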

[PATCH] RFC: Broadcom PHY forcing fix

2007-01-25 Thread Kumar Gala

Maciej,

I've got a BCM5461 that requires this fix to be able to force the speeds 
on the PHY.  Not sure if it's needed on the other variants or not.  The 
problem is that genphy_config_aneg resets the PHY when forcing the speed, 
and once we reset the BCM5461 it doesn't remember any of its settings.

Let me know if this works for you or not.

- k

diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index 29666c8..bf752f4 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -99,6 +99,61 @@ static int bcm54xx_config_intr(struct ph
return err;
 }
 
+/* bcm_setup_forced
+ *
+ * description: Configures MII_BMCR to force speed/duplex
+ *   to the values in phydev. Assumes that the values are valid.
+ *   Please see phy_sanitize_settings() */
+static int bcm54xx_setup_forced(struct phy_device *phydev)
+{
+   int ctl = 0;
+   phydev->pause = phydev->asym_pause = 0;
+
+   if (SPEED_100 == phydev->speed)
+       ctl |= BMCR_SPEED100;
+
+   if (DUPLEX_FULL == phydev->duplex)
+       ctl |= BMCR_FULLDPLX;
+
+   ctl = phy_write(phydev, MII_BMCR, ctl);
+
+   if (ctl < 0)
+       return ctl;
+
+   return ctl;
+}
+
+int bcm54xx_config_aneg(struct phy_device *phydev)
+{
+   int err = 0;
+
+   if (AUTONEG_ENABLE == phydev->autoneg) {
+       err = genphy_config_advert(phydev);
+
+       if (err < 0)
+           return err;
+
+       err = genphy_restart_aneg(phydev);
+   } else {
+       if (SPEED_1000 == phydev->speed) {
+           int adv;
+           adv = phy_read(phydev, MII_ADVERTISE);
+           adv &= ~(ADVERTISE_ALL | ADVERTISE_100BASE4);
+
+           err = phy_write(phydev, MII_ADVERTISE, adv);
+
+           if (err < 0)
+               return err;
+
+           err = genphy_restart_aneg(phydev);
+       } else {
+           err = bcm54xx_setup_forced(phydev);
+       }
+   }
+
+   return err;
+}
+
 static struct phy_driver bcm5411_driver = {
.phy_id = 0x00206070,
.phy_id_mask= 0xfff0,
@@ -106,7 +161,7 @@ static struct phy_driver bcm5411_driver 
.features   = PHY_GBIT_FEATURES,
.flags  = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
.config_init= bcm54xx_config_init,
-   .config_aneg= genphy_config_aneg,
+   .config_aneg= bcm54xx_config_aneg,
.read_status= genphy_read_status,
.ack_interrupt  = bcm54xx_ack_interrupt,
.config_intr= bcm54xx_config_intr,
@@ -120,7 +175,7 @@ static struct phy_driver bcm5421_driver 
.features   = PHY_GBIT_FEATURES,
.flags  = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
.config_init= bcm54xx_config_init,
-   .config_aneg= genphy_config_aneg,
+   .config_aneg= bcm54xx_config_aneg,
.read_status= genphy_read_status,
.ack_interrupt  = bcm54xx_ack_interrupt,
.config_intr= bcm54xx_config_intr,
@@ -134,7 +189,7 @@ static struct phy_driver bcm5461_driver 
.features   = PHY_GBIT_FEATURES,
.flags  = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT,
.config_init= bcm54xx_config_init,
-   .config_aneg= genphy_config_aneg,
+   .config_aneg= bcm54xx_config_aneg,
.read_status= genphy_read_status,
.ack_interrupt  = bcm54xx_ack_interrupt,
.config_intr= bcm54xx_config_intr,
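
To make the forced-mode path easier to follow, the register value that
bcm54xx_setup_forced() builds can be sketched in isolation.  The helper
forced_bmcr() is hypothetical and nothing here touches hardware; the
BMCR bit values are the standard MII ones and the SPEED_/DUPLEX_
constants mirror the ethtool definitions.

```c
#include <stdint.h>

/* Standard MII Basic Mode Control Register bits. */
#define BMCR_FULLDPLX 0x0100
#define BMCR_SPEED100 0x2000

/* Speed and duplex values as used by ethtool. */
#define SPEED_10    10
#define SPEED_100   100
#define DUPLEX_HALF 0
#define DUPLEX_FULL 1

/* Hypothetical helper: BMCR value that forces speed/duplex, autoneg off. */
static uint16_t forced_bmcr(int speed, int duplex)
{
	uint16_t ctl = 0;  /* reset and autoneg-enable bits are left clear */

	if (speed == SPEED_100)
		ctl |= BMCR_SPEED100;
	if (duplex == DUPLEX_FULL)
		ctl |= BMCR_FULLDPLX;
	return ctl;
}
```

For 100 Mbit/s full duplex this yields 0x2100.  The point of the patch
is that the value is written without also setting the reset bit.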

Re: [BNX2]: Fix 2nd port's MAC address.

2007-01-25 Thread David Miller

From: Michael Chan [EMAIL PROTECTED]
Date: Wed, 24 Jan 2007 21:35:45 -0800

 [BNX2]: Fix 2nd port's MAC address.

 On the 5709, we need to add the proper offset to calculate the shared
 memory base address of the 2nd port correctly.  Otherwise, the 2nd
 port's MAC address and other information will be the same as the 1st
 port.

 Update version to 1.5.4.

 Signed-off-by: Michael Chan [EMAIL PROTECTED]

Applied, thanks Michael.

Re: [PATCH] net: decnet handle a failure in neigh_parms_alloc (take 2)

2007-01-25 Thread David Miller

From: Steven Whitehouse [EMAIL PROTECTED]
Date: Thu, 25 Jan 2007 11:43:18 +

 Hi,

 On Wed, Jan 24, 2007 at 09:55:45PM -0700, Eric W. Biederman wrote:

  While enhancing the neighbour code to handle multiple network
  namespaces I noticed that decnet is assuming neigh_parms_alloc
  will allways succeed, which is clearly wrong.  So handle the
  failure.

  Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
 Acked-by: Steven Whitehouse [EMAIL PROTECTED]

Applied, thanks everyone.

 Also you should cc Patrick as he is now the maintainer,

Yep, would be a good idea in the future.

Re: [BUG] problem with BPF in PF_PACKET sockets, introduced in linux-2.6.19

2007-01-25 Thread David Miller

From: Alexey Kuznetsov [EMAIL PROTECTED]
Date: Thu, 25 Jan 2007 16:22:20 +0300

 Actually, it can. The return value was used only as a sign of error,
 so the mistake was to return the original unsigned result cast to int.

 Alternative fix is enclosed. To be honest, it is not better than
 yours: duplication of couple lines of code against passing return
 value by pointer.

Yes, this version of a fix would work as well.
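
The class of bug being discussed can be shown with a small standalone
sketch.  None of this is the af_packet code: run_filter_stub(),
filter_buggy() and filter_fixed() are invented for illustration, and the
negative result of the cast assumes the usual two's-complement
behaviour of common compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical filter: returns how many bytes of the packet to keep. */
static unsigned int run_filter_stub(unsigned int verdict)
{
	return verdict;
}

/* Buggy shape: the unsigned result doubles as a signed error code. */
static int filter_buggy(unsigned int verdict)
{
	return (int)run_filter_stub(verdict);
}

/* One possible fix: return only a status, pass the result by pointer. */
static int filter_fixed(unsigned int verdict, unsigned int *snaplen)
{
	assert(snaplen != NULL);
	*snaplen = run_filter_stub(verdict);
	return 0;
}
```

A filter that returns 0xffffffff ("keep everything") looks like an error
to a caller of filter_buggy() that tests for a negative value, while
filter_fixed() keeps status and length apart.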
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Fix sorting of SACK blocks

2007-01-25 Thread David Miller

From: Baruch Even [EMAIL PROTECTED]
Date: Thu, 25 Jan 2007 20:29:03 +0200

 The sorting of SACK blocks actually munges them rather than sort, causing the
 TCP stack to ignore some SACK information and breaking the assumption of
 ordered SACK blocks after sorting.

 The sort takes the data from a second buffer which isn't moved causing
 subsequent data moves to occur from the wrong location. The fix is to
 use a temporary buffer as a normal sort does.

 Signed-Off-By: Baruch Even [EMAIL PROTECTED]

BTW, in reviewing this I note that there is now only one remaining
use of tp->recv_sack_cache[] and that is the code earlier in this
function which is trying to detect if all we are doing is extending
the leading edge of a SACK block.

It would be nice to be able to clear out that usage as well, and
remove recv_sack_cache[] and thus make tcp_sock smaller.

Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection

2007-01-25 Thread YOSHIFUJI Hideaki / 吉藤英明

In article [EMAIL PROTECTED] (at Thu, 25 Jan 2007 14:45:00 -0500), Neil 
Horman [EMAIL PROTECTED] says:

 diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
 index 2a7e461..46f91ee 100644
 --- a/net/ipv6/addrconf.c
 +++ b/net/ipv6/addrconf.c
 @@ -830,7 +830,8 @@ retry:
   ift = !max_addresses ||
 ipv6_count_addresses(idev) < max_addresses ? 
   ipv6_add_addr(idev, addr, tmp_plen,
 -   ipv6_addr_type(addr)&IPV6_ADDR_SCOPE_MASK, IFA_F_TEMPORARY) : NULL;
 +   ipv6_addr_type(addr)&IPV6_ADDR_SCOPE_MASK, 
 +   IFA_F_TEMPORARY|IFA_F_OPTIMISTIC) : NULL;
   if (!ift || IS_ERR(ift)) {
   in6_ifa_put(ifp);
   in6_dev_put(idev);

If optimistic_dad is disabled, flags should be IFA_F_TEMPORARY,
not IFA_F_TEMPORARY|IFA_F_OPTIMISTIC.

Another idea is to use IFA_F_OPTIMISTIC not
IFA_F_OPTIMISTIC|IFA_F_TENTATIVE until the DAD has been finished.

 @@ -1027,15 +1029,17 @@ int ipv6_dev_get_saddr(struct net_device *daddr_dev,
:
 + /* Rule 3: Avoid deprecated and optimistic address */
   if (hiscore.rule < 3) {
   if (ipv6_saddr_preferred(hiscore.addr_type) ||
 - !(ifa_result->flags & IFA_F_DEPRECATED))
 + ((!(ifa_result->flags & IFA_F_DEPRECATED)) &&
 + (!(ifa_result->flags & IFA_F_OPTIMISTIC
   hiscore.attrs |= IPV6_SADDR_SCORE_PREFERRED;
   hiscore.rule++;

((ifa_result->flags & (IFA_F_DEPRECATED|IFA_F_OPTIMISTIC)) == 0)

   }
   if (ipv6_saddr_preferred(score.addr_type) ||
 - !(ifa->flags & IFA_F_DEPRECATED)) {
 + ((!(ifa->flags & IFA_F_DEPRECATED)) &&
 + (!(ifa_result->flags & IFA_F_OPTIMISTIC {
   score.attrs |= IPV6_SADDR_SCORE_PREFERRED;
   if (!(hiscore.attrs & IPV6_SADDR_SCORE_PREFERRED)) {
   score.rule = 3;

ditto.
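
The combined mask test suggested above is equivalent to the two separate
tests in the patch.  A small standalone check, assuming the flag values
from the if_addr.h hunk posted in this thread (the function names are
made up for illustration):

```c
/* Flag values as they appear in the thread's if_addr.h hunk. */
#define IFA_F_OPTIMISTIC 0x04
#define IFA_F_DEPRECATED 0x20

/* Two separate tests, as in the patch. */
static int preferred_two_tests(unsigned int flags)
{
	return !(flags & IFA_F_DEPRECATED) && !(flags & IFA_F_OPTIMISTIC);
}

/* One combined mask test, as suggested in the review. */
static int preferred_one_mask(unsigned int flags)
{
	return (flags & (IFA_F_DEPRECATED | IFA_F_OPTIMISTIC)) == 0;
}

/* Returns 1 if both forms agree for every 8-bit flag value. */
static int forms_agree(void)
{
	for (unsigned int flags = 0; flags <= 0xff; flags++)
		if (preferred_two_tests(flags) != preferred_one_mask(flags))
			return 0;
	return 1;
}
```

The single-mask form does one AND and one compare, and it extends
naturally if further "avoid" flags are added later.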

 @@ -2123,7 +2133,8 @@ static void addrconf_add_linklocal(struct inet6_dev 
 *idev, struct in6_addr *addr
  {
   struct inet6_ifaddr * ifp;
  
 - ifp = ipv6_add_addr(idev, addr, 64, IFA_LINK, IFA_F_PERMANENT);
 + ifp = ipv6_add_addr(idev, addr, 64, IFA_LINK, 
 + IFA_F_PERMANENT|IFA_F_OPTIMISTIC);
   if (!IS_ERR(ifp)) {
   addrconf_dad_start(ifp, 0);
   in6_ifa_put(ifp);

Please do not always put IFA_F_OPTIMISTIC.

  
 + /*
 +  * Optimistic nodes need to join the anycast address
 +  * right away
 +  */
 + if (ifp->flags & IFA_F_OPTIMISTIC)
 + addrconf_join_anycast(ifp);
 +
   if (ifp->prefix_len != 128 && (ifp->flags&IFA_F_PERMANENT))
   addrconf_prefix_route(&ifp->addr, ifp->prefix_len, dev, 0,
   flags);

Should we join anycast even if the node is a host (not a router)?!

When you add a call to addrconf_join_anycast(), 
you must consider when to leave this.


 @@ -2573,6 +2594,18 @@ static void addrconf_dad_start(struct inet6_ifaddr 
 *ifp, u32 flags)
   addrconf_dad_stop(ifp);
   return;
   }
 +
 + /*
 +  * Forwarding devices (routers) should not use
 +  * optimistic addresses
 +  * Nor should interfaces that don't know the 
 +  * Source address for their default gateway
 +  * RFC 4429 Sec 3.3
 +  */
 + if ((ipv6_devconf.forwarding) ||
 +(ifp->rt == NULL))
 + ifp->flags &= ~IFA_F_OPTIMISTIC;
 +
   addrconf_dad_kick(ifp);
   spin_unlock_bh(&ifp->lock);
  out:

Please test this condition when you are adding the
address.

BTW, you have not implemented the latter condition,
right?  The default gateway is not tested.

 index 6a9f616..fcd22e3 100644
 --- a/net/ipv6/ndisc.c
 +++ b/net/ipv6/ndisc.c
 @@ -498,7 +498,21 @@ static void ndisc_send_na(struct net_device *dev, struct 
 neighbour *neigh,
  msg->icmph.icmp6_unused = 0;
  msg->icmph.icmp6_router= router;
  msg->icmph.icmp6_solicited = solicited;
 -msg->icmph.icmp6_override  = override;
 + if (!ifp || !(ifp->flags & IFA_F_OPTIMISTIC))
 + msg->icmph.icmp6_override  = override;
 + else {
 + /*
 +  * We must clear the override flag on all
 +  * neighbor advertisements from source 
 +  * addresses that are OPTIMISTIC - RFC 4429
 +  * section 2.2
 +  */
 + if (override)
 + printk(KERN_WARNING
 + "Disallowing override flag for OPTIMISTIC addr\n");
 + msg->icmph.icmp6_override = 0;
 + }
 +

Ifp is already put.  Please clear override in the code
where we try getting

[PATCH] d80211: configure hardware when the interface is brought up

2007-01-25 Thread Pavel Roskin

ieee80211_hw_config() is called from scanning functions and ioctl
handlers, but not when the interface is brought up.  This is
unreasonable.  Since the config function is provided by hardware drivers
to d80211, the latter should be responsible for calling it in all
situations when the hardware needs to be reconfigured.

Without this patch, bcm43xx_d80211 needs the channel to be set again
after the interface goes down and up.  Similar problems are reported for
rt2x00 drivers.

Failure in ieee80211_hw_config() leads to the interface staying down.

Signed-off-by: Pavel Roskin [EMAIL PROTECTED]
---

 net/d80211/ieee80211.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/net/d80211/ieee80211.c b/net/d80211/ieee80211.c
index 2f1dce5..7219416 100644
--- a/net/d80211/ieee80211.c
+++ b/net/d80211/ieee80211.c
@@ -2239,6 +2239,8 @@ static int ieee80211_open(struct net_device *dev)
res = 0;
if (local->ops->open)
res = local->ops->open(local_to_hw(local));
+   if (res == 0)
+   res = ieee80211_hw_config(local);
if (res == 0) {
res = dev_open(local->mdev);
if (res) {


-- 
Regards,
Pavel Roskin



[RFC PATCH 0/31] An introduction and A path for merging network namespace work

2007-01-25 Thread Eric W. Biederman


The idea of a network namespace is fundamentally quite simple.  We create
a mechanism that from the user's perspective allows creation of separate
instances of the network stack.  When combined with mechanisms like chroot
this results in a much more complete isolation.  When seen in the context
of application migration this allows for taking your IP address and other
global identifiers with you.

What does this mean in the context of the networking stack?  The basic
idea is to tag processes with a network namespace that is used when
they create new sockets or otherwise initiate a fresh communication
with the networking stack.  The idea is to tag all sockets with a
network namespace they will always be in and all operations on them
will be relative to.  The idea is to tag all network devices with
a network namespace they are a member of, but that may be changed during
the lifetime of a device.

Mostly a network namespace at its most basic level is about names.
It is about creating a view of the networking stack where you can
name the network devices that are members anything you want.  Likewise
for iptables rules and all of the rest of the state.  It is a lot
like creating a new directory in a filesystem.  The underlying data
structures don't really change, just the user's view of those data
structures, and we continue to have a single network stack.




My goal today is that even if we can't agree on a specific set of
patches that we come to an agreement on roughly what those patches
should accomplish, and what process we should go through to get
them merged.




For implementing a network namespace the core problem is that there is
a lot of networking code, and it is continually evolving.  This means
that the task of implementing a network namespace is not a small one:
a lot of code must be read, touched and updated, while hoping
someone doesn't change something important before you get your changes
in.  To do this sanely means we need an incremental path to our goal
that allows small pieces to be reviewed and merged as they are ready.


The path I am recommending today is to first lay down some basic
infrastructure.  Then, one layer at a time, modify the existing code
to handle multiple simultaneous network namespaces, but modify
each component of that layer to refuse to operate in the context
of anything but the initial network namespace, thus protecting
code that has not yet been updated from situations it does not
know how to deal with.
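
As a rough picture of what "refuse to operate" could look like for a
component that has not been converted yet, here is a standalone sketch.
struct mock_net, mock_init_net and legacy_proto_create() are hypothetical
names, and the error code is chosen only for illustration; none of this
is taken from the posted patches.

```c
#include <errno.h>

/* Hypothetical, minimal stand-in for a network namespace. */
struct mock_net {
	int id;
};

static struct mock_net mock_init_net = { 0 };

/*
 * A component that has not been converted yet: it accepts a namespace
 * argument but only works in the initial one.
 */
static int legacy_proto_create(const struct mock_net *net)
{
	if (net != &mock_init_net)
		return -EAFNOSUPPORT;  /* error code chosen for illustration */
	return 0;  /* the existing, unconverted code would run here */
}
```

Converting the component later then amounts to removing the check and
making the code behind it namespace aware.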

Eventually this will get down to the real meat of the problem and
practical things like ipv4 sockets will work.

This should allow for a network stack that compiles, builds and works
at each step of the way.  Not too far into the process support
for multiple network namespaces that works should be available with
the limitation that except for the initial network namespace all of
the rest will look like a kernel with most parts of the networking
stack compiled out, but within those parts that are present it
should be fully useable.





To make my thinking clear I have provided an initial patchset that
makes quite a bit of progress, especially in laying the groundwork.
My goal is to answer the question: does this basic path make sense?

To that end I have omitted posting some of the prerequisite cleanup
and infrastructure patches (like my sysctl work), that are just noise
in this context, and I have failed to rebase my patchset against Dave
Miller's latest networking tree.  Those are important details but
they are not important to this conversation.


If my basic path and the basic patches look like they are heading
in the right direction we can start moving towards what needs to
happen to ensure a review of the patches, and what we need to do
to start merging them.  If the basic path does not appear reasonable
well that would be good to know as well.





There are essentially two different approaches to modifying networking
code to handle multiple network namespaces.  Either all of the global
variables can be replicated once for each network namespace and we
build up parallel namespace specific data structures, or the data
elements in the data structure are tagged with the namespace they
belong to and we filter them.  It depends on the context which
is most appropriate and easier.  As a general rule large hash tables
call for filtering and a small global variable set calls for simply
having multiple instances of the data structure.
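
The two approaches can be sketched side by side.  Everything below is
illustrative only: MAX_NS, default_ttl, struct entry and lookup() are
invented for this example, and namespaces are reduced to small integers.

```c
#include <stddef.h>

#define MAX_NS 4  /* arbitrary limit for this sketch */

/* Approach 1: replicate a small global, one instance per namespace. */
static int default_ttl[MAX_NS] = { 64, 64, 64, 64 };

static int get_default_ttl(int ns)
{
	return default_ttl[ns];
}

/* Approach 2: keep one shared table and tag every entry. */
struct entry {
	int ns;     /* namespace tag */
	int key;
	int value;
};

/* Look up key, ignoring entries that belong to other namespaces. */
static const struct entry *lookup(const struct entry *tbl, size_t n,
				  int ns, int key)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].ns == ns && tbl[i].key == key)
			return &tbl[i];
	return NULL;
}
```

Replication costs memory per namespace but leaves lookups untouched;
tagging keeps one table at the price of an extra comparison per entry.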

The biggest intrusion I expect to see in the logic of the networking
stack is initialization and tear down, as we need to initialize
and clean up all of those per network namespace variables when
we create and destroy a network namespace.



A git tree with all of my patches against 2.6.20-rc5 is available at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-netns.git

In addition to what I have posted here and all of its prerequisites
the tree includes further patches that get the basics of ipv4 and
iptables

[PATCH RFC 3/31] net: Add a network namespace parameter to tasks

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

This is the network namespace from which all sockets
and anything else under user control ultimately get their network
namespace parameters.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/nsproxy.h |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 0b9f0dc..cc76610 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -3,6 +3,7 @@
 
 #include <linux/spinlock.h>
 #include <linux/sched.h>
+#include <linux/net_namespace_type.h>
 
 struct mnt_namespace;
 struct uts_namespace;
@@ -28,6 +29,7 @@ struct nsproxy {
struct ipc_namespace *ipc_ns;
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns;
+   net_t net_ns;
 };
 extern struct nsproxy init_nsproxy;
 
-- 
1.4.4.1.g278f


[PATCH RFC 8/31] net: Make /sys/class/net handle multiple network namespaces

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

In combination with the sysfs support I am in the process of merging
with gregkh, this creates a separate instance of the /sys/class/net directory
for each network namespace so two devices with the same name do not conflict.
Then a network namespace sensitive follow link method on the /sys/class/net
directory ensures that you see the directory instance for your current network
namespace.

This ensures all existing applications continue to see what is currently
present in sysfs.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/core/net-sysfs.c |   53 +-
 1 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 5d08cc9..b08c1be 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -11,12 +11,14 @@
 
 #include <linux/capability.h>
 #include <linux/kernel.h>
+#include <linux/sysfs.h>
 #include <linux/netdevice.h>
 #include <linux/if_arp.h>
 #include <net/sock.h>
 #include <linux/rtnetlink.h>
 #include <linux/wireless.h>
 #include <net/iw_handler.h>
+#include <net/net_namespace.h>
 
 #define to_class_dev(obj) container_of(obj,struct class_device,kobj)
 #define to_net_dev(class) container_of(class, struct net_device, class_dev)
@@ -431,6 +433,24 @@ static void netdev_release(struct class_device *cd)
kfree((char *)dev - dev->padded);
 }
 
+static DEFINE_PER_NET(struct dentry *, net_shadow) = NULL;
+
+static struct dentry *net_class_device_dparent(struct class_device *cd)
+{
+   struct net_device *dev
+   = container_of(cd, struct net_device, class_dev);
+   net_t net = dev->nd_net;
+
+   return per_net(net_shadow, net);
+}
+
+static void *class_net_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+   dput(nd->dentry);
+   nd->dentry = dget(per_net(net_shadow, current->nsproxy->net_ns));
+   return NULL;
+}
+
 static struct class net_class = {
.name = "net",
.release = netdev_release,
@@ -438,6 +458,8 @@ static struct class net_class = {
 #ifdef CONFIG_HOTPLUG
.uevent = netdev_uevent,
 #endif
+   .class_device_dparent   = net_class_device_dparent,
+   .class_follow_link  = class_net_follow_link,
 };
 
 void netdev_unregister_sysfs(struct net_device * dev)
@@ -470,7 +492,36 @@ int netdev_register_sysfs(struct net_device *dev)
return class_device_add(class_dev);
 }
 
+static int netdev_sysfs_net_init(net_t net)
+{
+   struct dentry *shadow;
+   int error = 0;
+   shadow = sysfs_create_shadow_dir(&net_class.subsys.kset.kobj);
+   if (IS_ERR(shadow))
+   error = PTR_ERR(shadow);
+   else
+   per_net(net_shadow, net) = shadow;
+   return error;
+}
+
+static void netdev_sysfs_net_exit(net_t net)
+{
+   sysfs_remove_shadow_dir(per_net(net_shadow, net));
+   per_net(net_shadow, net) = NULL;
+}
+
+static struct pernet_operations netdev_sysfs_ops = {
+   .init = netdev_sysfs_net_init,
+   .exit = netdev_sysfs_net_exit,
+};
+ 
 int netdev_sysfs_init(void)
 {
-   return class_register(&net_class);
+   int rc;
+   if ((rc = class_register(&net_class)))
+   goto out;
+   if ((rc = register_pernet_subsys(&netdev_sysfs_ops)))
+   goto out;
+out:
+   return rc;
 }
-- 
1.4.4.1.g278f


[PATCH RFC 1/31] net: Add net_namespace_type.h to allow for per network namespace variables.

2007-01-25 Thread Eric W. Biederman

The problem:
   To properly implement a ``level 2'' network namespace we need to
move many of the networking stack global variables into the network
namespace.  We want to keep it explicit that the code is accessing a
variable in a network namespace.  We want to be able to completely
compile out the network namespace support so we can do comparative
performance testing, and so as not to penalize users who don't need
network namespace support.  Because the network stack is a moving
target we want something simple that allows for the bulk of the
changes to be merged before we enable network namespace support.

My biggest challenge when looking into this was to find an approach
that would allow the code to compile out, in a way that does not yield
any performance overhead and does not make the code ugly.  While
playing with the different possibilities I discovered that gcc will
not pass 0 byte structures that are arguments to functions and instead
will simply optimize them away.  This appears to be true on i386 all of
the way back to gcc-2.95 and I verified that it also works with gcc
4.1 on x86_64.  Since this is part of the ABI I never expect it to
change.  Hopefully gcc uses this nice optimization on all
architectures; I suspect so as C++ allows passing function arguments
of type void in certain circumstances.

Using this observation I was able to come up with a network namespace
implementation that allows the changes to
completely compile out when we don't build the kernel with network
namespace support.
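
To make the observation concrete, here is a small sketch of the idea,
assuming a GNU C compiler (empty structs are a GNU extension; gcc and
clang give them size 0).  It mirrors the shape of the dummy net_t in the
patch, but the accessor ns_max_dgram_qlen() is a hypothetical stand-in
and this is not the patch's code.

```c
/* Empty structs are a GNU C extension; gcc and clang give them size 0. */
typedef struct {} net_t;

/* With namespaces compiled out there is exactly one copy of each variable. */
static int max_dgram_qlen = 10;

/* The net arguments carry no data, so every namespace compares equal. */
static inline int net_eq(net_t a, net_t b)
{
	return 1;
}

/* Hypothetical accessor: the argument documents intent but selects nothing. */
static inline int *ns_max_dgram_qlen(net_t net)
{
	return &max_dgram_qlen;
}
```

Callers can be written against the namespace-aware signatures today, and
the extra argument costs nothing if the compiler drops zero-size
arguments as described above.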

This patch implements my dummy network namespace support that should
completely compile out.  Further patches will add the real version.
Starting with the dummy gives a quick hint of where I am going and
allows for dependencies to be overcome.

When doing my proof of concept implementation one of the other
problems I had was that as the network stack comes in so many modular
pieces figuring out how to get their global variables into the network
namespace structure was a challenge.  The basic technique used by our
per cpu variables for having the linker build and dynamically change
structures for us appears applicable here and a lot less nuisance than
what I did before so I am implementing a tailored version of that
technique as well, and again this makes it very simple to compile the
code out.
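
A userspace approximation of that per cpu style technique, under the
assumption that all per-namespace variables live in one template block
that is copied for each namespace.  In the kernel the linker would
collect the variables into a section; here a struct (pernet_vars) stands
in for that section, and net_create()/net_destroy() are hypothetical
helpers.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the section the linker would collect these variables into. */
struct pernet_vars {
	int max_dgram_qlen;
	int other_setting;
};

/* Initial values that every new namespace starts from. */
static const struct pernet_vars pernet_template = { 10, 42 };

/* A namespace reference is the base address of that namespace's copy. */
typedef struct { char *base; } net_t;

/* Look up a variable: base of the copy plus the variable's offset. */
#define per_net(name, net) \
	(*(__typeof__(((struct pernet_vars *)0)->name) *) \
	  ((net).base + offsetof(struct pernet_vars, name)))

/* Allocate a fresh copy of the template; base is NULL on failure. */
static net_t net_create(void)
{
	net_t net = { malloc(sizeof(pernet_template)) };

	if (net.base)
		memcpy(net.base, &pernet_template, sizeof(pernet_template));
	return net;
}

static void net_destroy(net_t net)
{
	free(net.base);
}
```

Code that says per_net(max_dgram_qlen, net) stays explicit about touching
namespace state, and the macro can collapse to a plain variable access
when namespace support is compiled out.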

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/net_namespace_type.h |   52 
 1 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/include/linux/net_namespace_type.h 
b/include/linux/net_namespace_type.h
new file mode 100644
index 000..8173f59
--- /dev/null
+++ b/include/linux/net_namespace_type.h
@@ -0,0 +1,52 @@
+/* 
+ * Definition of the network namespace reference type
+ * And operations upon it.
+ */
+#ifndef __LINUX_NET_NAMESPACE_TYPE_H
+#define __LINUX_NET_NAMESPACE_TYPE_H
+
+#define __pernetname(name) per_net__##name
+
+typedef struct {} net_t;
+
+#define __data_pernet 
+
+/* Look up a per network namespace variable */
+static inline unsigned long __per_net_offset(net_t net) { return 0; }
+
+/* Like per_net but returns a pseudo variable address that must be moved
+ * __per_net_offset() bytes before it will point to a real variable.
+ * Useful for static initializers.
+ */
+#define __per_net_base(name)   __pernetname(name)
+
+/* Get the network namespace reference from a per_net variable address */
+#define net_of(ptr, name) ({ net_t net; ptr; net; })
+
+/* Look up a per network namespace variable */
+#define per_net(name, net) \
+   (*(__per_net_offset(net), __per_net_base(name)))
+ 
+/* Are the two network namespaces the same */
+static inline int net_eq(net_t a, net_t b) { return 1; }
+/* Get an unsigned value appropriate for hashing the network namespace */
+static inline unsigned int net_hval(net_t net) { return 0; }
+
+/* Convert to and from to and from void pointers */
+static inline void *net_to_voidp(net_t net) { return NULL; }
+static inline net_t net_from_voidp(void *ptr) { net_t net; return net; }
+
+static inline int null_net(net_t net) { return 0; }
+
+#define DEFINE_PER_NET(type, name) \
+   __data_pernet __typeof__(type) __pernetname(name)
+
+#define DECLARE_PER_NET(type, name) \
+   extern __typeof__(type) __pernetname(name)
+
+#define EXPORT_PER_NET_SYMBOL(var) \
+   EXPORT_SYMBOL(__pernetname(var))
+#define EXPORT_PER_NET_SYMBOL_GPL(var) \
+   EXPORT_SYMBOL_GPL(__pernetname(var))
+
+#endif /* __LINUX_NET_NAMESPACE_TYPE_H */
-- 
1.4.4.1.g278f


[PATCH RFC 30/31] net: Make AF_UNIX per network namespace safe.

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Because of the global nature of garbage collection, and
because of the cost of per namespace hash tables
unix_socket_table has been kept global.  With a filter
added on lookups so we don't see sockets from the wrong
namespace.

Currently I don't fold the namespace into the hash so
multiple namespaces using the same socket name will be
guaranteed a hash collision.
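
For comparison, a sketch of what folding the namespace into the hash
could look like.  This is not part of the patch: name_hash(), the
multiplier, and the value of UNIX_HASH_SIZE used here are assumptions,
and ns_hval stands for whatever net_hval() would return for the
namespace.

```c
#include <stddef.h>

#define UNIX_HASH_SIZE 256   /* assumed value, for illustration */

static unsigned int name_hash(const char *name, size_t len)
{
	unsigned int h = 0;
	size_t i;

	for (i = 0; i < len; i++)
		h = h * 31 + (unsigned char)name[i];
	return h;
}

/* Roughly the current situation: the bucket depends on the name only. */
static unsigned int bucket_by_name(const char *name, size_t len)
{
	return name_hash(name, len) & (UNIX_HASH_SIZE - 1);
}

/* Hypothetical alternative: mix a per-namespace value into the hash. */
static unsigned int bucket_by_name_and_ns(unsigned int ns_hval,
					  const char *name, size_t len)
{
	return (name_hash(name, len) ^ (ns_hval * 2654435761u))
		& (UNIX_HASH_SIZE - 1);
}
```

The filter on lookup is still needed either way; folding only spreads
same-named sockets from different namespaces over different chains.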

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/net/af_unix.h  |   10 ++--
 net/unix/af_unix.c |  116 
 net/unix/sysctl_net_unix.c |   24 +
 3 files changed, 103 insertions(+), 47 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index c0398f5..1f40dd2 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -89,12 +89,12 @@ struct unix_sock {
 #define unix_sk(__sk) ((struct unix_sock *)__sk)
 
 #ifdef CONFIG_SYSCTL
-extern int sysctl_unix_max_dgram_qlen;
-extern void unix_sysctl_register(void);
-extern void unix_sysctl_unregister(void);
+DECLARE_PER_NET(int, sysctl_unix_max_dgram_qlen);
+extern void unix_sysctl_register(net_t net);
+extern void unix_sysctl_unregister(net_t net);
 #else
-static inline void unix_sysctl_register(void) {}
-static inline void unix_sysctl_unregister(void) {}
+static inline void unix_sysctl_register(net_t net) {}
+static inline void unix_sysctl_unregister(net_t net) {}
 #endif
 #endif
 #endif
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 8015a03..3f57cb2 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -118,7 +118,7 @@
 #include <linux/security.h>
 #include <net/net_namespace.h>
 
-int sysctl_unix_max_dgram_qlen __read_mostly = 10;
+DEFINE_PER_NET(int, sysctl_unix_max_dgram_qlen) = 10;
 
 struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1];
 DEFINE_SPINLOCK(unix_table_lock);
@@ -245,7 +245,8 @@ static inline void unix_insert_socket(struct hlist_head 
*list, struct sock *sk)
spin_unlock(&unix_table_lock);
 }
 
-static struct sock *__unix_find_socket_byname(struct sockaddr_un *sunname,
+static struct sock *__unix_find_socket_byname(net_t net,
+ struct sockaddr_un *sunname,
  int len, int type, unsigned hash)
 {
struct sock *s;
@@ -254,6 +255,9 @@ static struct sock *__unix_find_socket_byname(struct 
sockaddr_un *sunname,
sk_for_each(s, node, &unix_socket_table[hash ^ type]) {
struct unix_sock *u = unix_sk(s);
 
+   if (!net_eq(s->sk_net, net))
+   continue;
+
if (u->addr->len == len &&
!memcmp(u->addr->name, sunname, len))
goto found;
@@ -263,21 +267,22 @@ found:
return s;
 }
 
-static inline struct sock *unix_find_socket_byname(struct sockaddr_un *sunname,
+static inline struct sock *unix_find_socket_byname(net_t net,
+  struct sockaddr_un *sunname,
   int len, int type,
   unsigned hash)
 {
struct sock *s;
 
spin_lock(&unix_table_lock);
-   s = __unix_find_socket_byname(sunname, len, type, hash);
+   s = __unix_find_socket_byname(net, sunname, len, type, hash);
if (s)
sock_hold(s);
spin_unlock(&unix_table_lock);
return s;
 }
 
-static struct sock *unix_find_socket_byinode(struct inode *i)
+static struct sock *unix_find_socket_byinode(net_t net, struct inode *i)
 {
struct sock *s;
struct hlist_node *node;
@@ -287,6 +292,9 @@ static struct sock *unix_find_socket_byinode(struct inode 
*i)
&unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) {
struct dentry *dentry = unix_sk(s)->dentry;
 
+   if (!net_eq(s->sk_net, net))
+   continue;
+
if(dentry && dentry->d_inode == i)
{
sock_hold(s);
@@ -588,7 +596,7 @@ static struct sock * unix_create1(net_t net, struct socket 
*sock)
&af_unix_sk_receive_queue_lock_key);
 
sk->sk_write_space  = unix_write_space;
-   sk->sk_max_ack_backlog  = sysctl_unix_max_dgram_qlen;
+   sk->sk_max_ack_backlog  = per_net(sysctl_unix_max_dgram_qlen, net);
sk->sk_destruct = unix_sock_destructor;
u = unix_sk(sk);
u->dentry = NULL;
@@ -604,9 +612,6 @@ out:
 
 static int unix_create(net_t net, struct socket *sock, int protocol)
 {
-   if (!net_eq(net, init_net()))
-   return -EAFNOSUPPORT;
-
if (protocol && protocol != PF_UNIX)
return -EPROTONOSUPPORT;
 
@@ -650,6 +655,7 @@ static int unix_release(struct socket *sock)
 static int unix_autobind(struct socket *sock)
 {
struct sock *sk = sock->sk;
+   net_t net = sk->sk_net;
struct

[PATCH RFC 31/31] net: Add etun driver

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

etun is a simple two headed tunnel driver that at the link layer
looks like ethernet.  Its target audience is communicating
between network namespaces but it is general enough it may
have other uses as well.
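
The two headed behaviour can be modelled in a few lines of userspace C.
This is only a toy model of the idea that whatever one end transmits the
other end receives; struct pair_dev, pair_link() and pair_xmit() are
hypothetical names and none of this is the driver's code.

```c
#include <stddef.h>

struct pair_dev {
	struct pair_dev *peer;              /* the other head of the tunnel */
	unsigned long tx_packets, tx_bytes;
	unsigned long rx_packets, rx_bytes;
};

/* Join two devices so that each one's peer is the other. */
static void pair_link(struct pair_dev *a, struct pair_dev *b)
{
	a->peer = b;
	b->peer = a;
}

/* Transmit on one head: account it as sent here and received on the peer. */
static int pair_xmit(struct pair_dev *tx, size_t len)
{
	struct pair_dev *rx = tx->peer;

	if (!rx)
		return -1;   /* not linked, nothing to deliver to */
	tx->tx_packets++;
	tx->tx_bytes += len;
	rx->rx_packets++;
	rx->rx_bytes += len;
	return 0;
}
```

Placing the two heads in different network namespaces is what would turn
such a pair into a link between the namespaces.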

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 drivers/net/Kconfig  |   14 ++
 drivers/net/Makefile |1 +
 drivers/net/etun.c   |  470 ++
 3 files changed, 485 insertions(+), 0 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 8aa8dd0..969d3df 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -119,6 +119,20 @@ config TUN
 
  If you don't know what to use this for, you don't need it.
 
+config ETUN
+   tristate "Ethernet tunnel device driver support"
+   depends on SYSFS
+   ---help---
+ ETUN provides a pair of network devices that can be used for
+ configuring interesting topologies.  What one device transmits
+ the other receives and vice versa.  The link level framing
+ is ethernet for wide compatibility with network stacks.
+
+ To compile this driver as a module, choose M here: the module
+ will be called etun.
+
+ If you don't know what to use this for, you don't need it.
+
 config NET_SB1000
tristate General Instruments Surfboard 1000
depends on PNP
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 4c0d4e5..396af4f 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -185,6 +185,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o
 obj-$(CONFIG_MACMACE) += macmace.o
 obj-$(CONFIG_MAC89x0) += mac89x0.o
 obj-$(CONFIG_TUN) += tun.o
+obj-$(CONFIG_ETUN) += etun.o
 obj-$(CONFIG_NET_NETX) += netx-eth.o
 obj-$(CONFIG_DL2K) += dl2k.o
 obj-$(CONFIG_R8169) += r8169.o
diff --git a/drivers/net/etun.c b/drivers/net/etun.c
new file mode 100644
index 000..1dd8cd8
--- /dev/null
+++ b/drivers/net/etun.c
@@ -0,0 +1,470 @@
+/*
+ *  ETUN - Universal ETUN device driver.
+ *  Copyright (C) 2006 Linux Networx
+ *
+ */
+
+#define DRV_NAME   "etun"
+#define DRV_VERSION    "1.0"
+#define DRV_DESCRIPTION    "Ethernet pseudo tunnel device driver"
+#define DRV_COPYRIGHT  "(C) 2007 Linux Networx"
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_ether.h>
+#include <linux/ctype.h>
+#include <net/net_namespace.h>
+#include <net/dst.h>
+
+
+/* Device checksum strategy.
+ *
+ * etun is designed to be a pair of virtual devices
+ * connecting two network stack instances.
+ * 
+ * Typically it will either be used with ethernet bridging or
+ * it will be used to route packets between the two stacks.
+ * 
+ * The only checksum offloading I can do is to completely
+ * skip the checksumming step all together.
+ *
+ * When used for ethernet bridging I don't believe any
+ * checksum off loading is safe.  
+ * - If my source is an external interface the checksum may be
+ *   invalid so I don't want to report I have already checked it.
+ * - If my destination is an external interface I don't want to put
+ *   a packet on the wire without someone computing the checksum.
+ *
+ * When used for routing between two stacks checksums should
+ * be as unnecessary as they are on the loopback device.
+ *
+ * So by default I am safe and disable checksumming and
+ * other advanced features like SG and TSO.
+ *
+ * However because I think these features could be useful
+ * I provide the ethtool functions to enable/disable
+ * them at runtime.
+ *
+ * If you think you can correctly enable these go ahead.
+ * For checksums both the transmitter and the receiver must
+ * agree before they are actually disabled.
+ */
+
+#define ETUN_NUM_STATS 1
+static struct {
+   const char string[ETH_GSTRING_LEN];
+} ethtool_stats_keys[ETUN_NUM_STATS] = {
+   { "partner_ifindex" },
+};
+
+struct etun_info {
+   struct net_device   *rx_dev;
+   unsigned            ip_summed;
+   struct net_device_stats stats;
+   struct list_head    list;
+   struct net_device   *dev;
+};
+
+/*
+ * I have to hold the rtnl_lock during device delete.
+ * So I use the rtnl_lock to protect my list manipulations
+ * as well.  Crude but simple.
+ */
+static LIST_HEAD(etun_list);
+
+/*
+ * The higher levels take care of making this non-reentrant (it's
+ * called with bh's disabled).
+ */
+static int etun_xmit(struct sk_buff *skb, struct net_device *tx_dev)
+{
+   struct etun_info *tx_info = tx_dev->priv;
+   struct net_device *rx_dev = tx_info->rx_dev;
+   struct etun_info *rx_info = rx_dev->priv;
+
+   tx_info->stats.tx_packets++;
+   tx_info->stats.tx_bytes += skb->len;
+
+   /* Drop the skb state that was needed to get here */
+   skb_orphan(skb);
+

[PATCH RFC 23/31] net: Modify all rtnetlink methods to only work in the initial namespace

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Before I can enable rtnetlink to work in all network namespaces
I need to be certain that nothing will break.  So this
patch deliberately restricts all of the methods to the initial network
namespace; as they are audited this extra check can be removed.
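
The guard added to each method follows one pattern, sketched below in
userspace C.  The net_t here is a hypothetical non-empty version (an
integer id) so that the comparison can actually fail, and demo_doit() is
a made-up handler, not one of the rtnetlink methods.

```c
#include <errno.h>

typedef struct { int id; } net_t;   /* hypothetical non-empty net_t */

static net_t init_net(void)
{
	net_t net = { 0 };
	return net;
}

static int net_eq(net_t a, net_t b)
{
	return a.id == b.id;
}

static int requests_handled;

/* A handler that has not been audited yet: refuse other namespaces. */
static int demo_doit(net_t net)
{
	if (!net_eq(net, init_net()))
		return -EINVAL;

	requests_handled++;   /* the unmodified handler body would run here */
	return 0;
}
```

Once a handler has been audited, deleting the two guard lines is the
whole change needed to enable it in every namespace.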

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/bridge/br_netlink.c |9 +
 net/core/fib_rules.c|7 +++
 net/core/neighbour.c|   18 ++
 net/core/rtnetlink.c|   13 +
 net/decnet/dn_dev.c |   12 
 net/decnet/dn_fib.c |8 
 net/decnet/dn_route.c   |8 
 net/decnet/dn_rules.c   |5 +
 net/decnet/dn_table.c   |4 
 net/ipv4/devinet.c  |   12 
 net/ipv4/fib_frontend.c |   12 
 net/ipv4/fib_rules.c|5 +
 net/ipv6/addrconf.c |   31 +++
 net/ipv6/fib6_rules.c   |5 +
 net/ipv6/ip6_fib.c  |4 
 net/ipv6/route.c|   12 
 net/sched/act_api.c |8 
 net/sched/cls_api.c |8 
 net/sched/sch_api.c |   20 
 19 files changed, 201 insertions(+), 0 deletions(-)

diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 119b97d..85165a1 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -14,6 +14,7 @@
 #include <linux/rtnetlink.h>
 #include <net/netlink.h>
 #include <net/net_namespace.h>
+#include <net/sock.h>
 #include "br_private.h"
 
 static inline size_t br_nlmsg_size(void)
@@ -104,9 +105,13 @@ errout:
  */
 static int br_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   net_t net = skb->sk->sk_net;
struct net_device *dev;
int idx;
 
+   if (!net_eq(net, init_net()))
+   return 0;
+
read_lock(&per_net(dev_base_lock, init_net()));
for (dev = per_net(dev_base, init_net()), idx = 0; dev; dev = dev->next) {
/* not a bridge port */
@@ -133,12 +138,16 @@ skip:
  */
 static int br_rtm_setlink(struct sk_buff *skb,  struct nlmsghdr *nlh, void 
*arg)
 {
+   net_t net = skb->sk->sk_net;
struct ifinfomsg *ifm;
struct nlattr *protinfo;
struct net_device *dev;
struct net_bridge_port *p;
u8 new_state;
 
+   if (!net_eq(net, init_net()))
+   return -EINVAL;
+
if (nlmsg_len(nlh) < sizeof(*ifm))
return -EINVAL;
 
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 2fa2708..00b4148 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -163,6 +163,9 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* 
nlh, void *arg)
struct nlattr *tb[FRA_MAX+1];
int err = -EINVAL;
 
+   if (!net_eq(net, init_net()))
+   return -EINVAL;
+
if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
goto errout;
 
@@ -244,12 +247,16 @@ errout:
 
 int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
 {
+   net_t net = skb->sk->sk_net;
struct fib_rule_hdr *frh = nlmsg_data(nlh);
struct fib_rules_ops *ops = NULL;
struct fib_rule *rule;
struct nlattr *tb[FRA_MAX+1];
int err = -EINVAL;
 
+   if (!net_eq(net, init_net()))
+   return -EINVAL;
+
if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
goto errout;
 
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index f5d4f92..d89c6fe 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1445,6 +1445,9 @@ int neigh_delete(struct sk_buff *skb, struct nlmsghdr 
*nlh, void *arg)
struct net_device *dev = NULL;
int err = -EINVAL;
 
+   if (!net_eq(net, init_net()))
+   return -EINVAL;
+
if (nlmsg_len(nlh) < sizeof(*ndm))
goto out;
 
@@ -1511,6 +1514,9 @@ int neigh_add(struct sk_buff *skb, struct nlmsghdr *nlh, 
void *arg)
struct net_device *dev = NULL;
int err;
 
+   if (!net_eq(net, init_net()))
+   return -EINVAL;
+
err = nlmsg_parse(nlh, sizeof(*ndm), tb, NDA_MAX, NULL);
if (err < 0)
goto out;
@@ -1783,11 +1789,15 @@ static struct nla_policy 
nl_ntbl_parm_policy[NDTPA_MAX+1] __read_mostly = {
 
 int neightbl_set(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 {
+   net_t net = skb->sk->sk_net;
struct neigh_table *tbl;
struct ndtmsg *ndtmsg;
struct nlattr *tb[NDTA_MAX+1];
int err;
 
+   if (!net_eq(net, init_net()))
+   return -EINVAL;
+
err = nlmsg_parse(nlh, sizeof(*ndtmsg), tb, NDTA_MAX,
  nl_neightbl_policy);
if (err < 0)
@@ -1907,11 +1917,15 @@ errout:
 
 int neightbl_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   net_t net = skb->sk->sk_net;
int family, tidx, nidx = 0;
int tbl_skip = cb->args[0];
int neigh_skip = cb->args[1];

[PATCH RFC 15/31] net: Make the loopback device per network namespace

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

This patch makes the loopback_dev per network namespace.
The loopback device registers itself as a pernet_device so
we can register the new loopback_dev instance when we add
a new network namespace and so we can unregister the
loopback device when we destroy the network namespace.

Currently the loopback device statistics are kept across
all loopback devices, a minor glitch that will not affect
correct operation but something we may want to fix.

This patch modifies all users of loopback_dev so they
access it as per_net(loopback_dev, init_net()), keeping all of the
code compiling and working.  A later pass will be needed to
update the users to use something other than the initial network
namespace.

The only non-trivial modification was the ipv6 code in route.c as the
loopback_dev can no longer be used in static initializers, and
even that change was very simple.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 drivers/net/loopback.c   |   24 
 include/linux/netdevice.h|2 +-
 net/core/dst.c   |8 
 net/decnet/dn_dev.c  |4 ++--
 net/decnet/dn_route.c|   14 +++---
 net/ipv4/devinet.c   |4 ++--
 net/ipv4/ipconfig.c  |8 +---
 net/ipv4/ipvs/ip_vs_core.c   |2 +-
 net/ipv4/route.c |   18 +-
 net/ipv4/xfrm4_policy.c  |2 +-
 net/ipv6/addrconf.c  |8 
 net/ipv6/netfilter/ip6t_REJECT.c |2 +-
 net/ipv6/route.c |   24 +++-
 net/ipv6/xfrm6_policy.c  |2 +-
 net/xfrm/xfrm_policy.c   |4 ++--
 15 files changed, 75 insertions(+), 51 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 22b672d..e9abf3f 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -57,6 +57,7 @@
 #include <linux/ip.h>
 #include <linux/tcp.h>
 #include <linux/percpu.h>
+#include <net/net_namespace.h>
 
 struct pcpu_lstats {
unsigned long packets;
@@ -204,7 +205,7 @@ static const struct ethtool_ops loopback_ethtool_ops = {
  * The loopback device is special. There is only one instance and
  * it is statically allocated. Don't do this for other devices.
  */
-struct net_device loopback_dev = {
+DEFINE_PER_NET(struct net_device, loopback_dev) = {
.name   = "lo",
.get_stats  = get_stats,
.priv   = &loopback_stats,
@@ -228,13 +229,28 @@ struct net_device loopback_dev = {
.ethtool_ops= &loopback_ethtool_ops,
 };
 
+static int loopback_net_init(net_t net)
+{
+   per_net(loopback_dev, net).nd_net = net;
+   return register_netdev(&per_net(loopback_dev, net));
+}
+
+static void loopback_net_exit(net_t net)
+{
+   unregister_netdev(&per_net(loopback_dev, net));
+}
+
+static struct pernet_operations loopback_net_ops = {
+   .init = loopback_net_init,
+   .exit = loopback_net_exit,
+};
+
 /* Setup and register the loopback device. */
 static int __init loopback_init(void)
 {
-   loopback_dev.nd_net = init_net();
-   return register_netdev(&loopback_dev);
+   return register_pernet_device(&loopback_net_ops);
 };
 
 module_init(loopback_init);
 
-EXPORT_SYMBOL(loopback_dev);
+EXPORT_PER_NET_SYMBOL(loopback_dev);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9e28671..73931a0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -570,7 +570,7 @@ struct packet_type {
 #include <linux/interrupt.h>
 #include <linux/notifier.h>
 
-extern struct net_device   loopback_dev;   /* The loopback */
+DECLARE_PER_NET(struct net_device, loopback_dev);  /* The loopback */
 extern struct net_device   *dev_base;  /* All devices */
 extern rwlock_t    dev_base_lock;  /* Device list lock */
 
diff --git a/net/core/dst.c b/net/core/dst.c
index 8c4a272..3435771 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -241,13 +241,13 @@ static inline void dst_ifdown(struct dst_entry *dst, 
struct net_device *dev,
dst->input = dst_discard_in;
dst->output = dst_discard_out;
} else {
-   dst->dev = &loopback_dev;
-   dev_hold(&loopback_dev);
+   dst->dev = &per_net(loopback_dev, init_net());
+   dev_hold(dst->dev);
dev_put(dev);
if (dst->neighbour && dst->neighbour->dev == dev) {
-   dst->neighbour->dev = &loopback_dev;
+   dst->neighbour->dev = &per_net(loopback_dev, init_net());
dev_put(dev);
-   dev_hold(&loopback_dev);
+   dev_hold(dst->neighbour->dev);
}
}
 }
diff --git a/net/decnet/dn_dev.c b/net/decnet/dn_dev.c
index 19b1469..dbaf001 100644

[PATCH RFC 19/31] net: sysfs interface support for moving devices between network namespaces.

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

I haven't a clue if this interface will meet with widespread approval but
at this point it is simple, and very useful.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/core/net-sysfs.c |   35 +++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 1be6f94..f8a5c6b 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -188,6 +188,40 @@ static ssize_t store_mtu(struct class_device *cd, const 
char *buf, size_t len)
return netdev_store(cd, buf, len, change_mtu);
 }
 
+static ssize_t show_new_ns_pid(struct class_device *cd, char *buf)
+{
+   return -EPERM;
+}
+static int change_new_ns_pid(struct net_device *dev, unsigned long new_ns_pid)
+{
+   struct task_struct *tsk;
+   int err;
+   net_t net;
+   /* Look up the network namespace */
+   err = -ESRCH;
+   rcu_read_lock();
+   tsk = find_task_by_pid(new_ns_pid);
+   if (tsk) {
+   task_lock(tsk);
+   if (tsk->nsproxy) {
+   err = 0;
+   net = get_net(tsk->nsproxy->net_ns);
+   }
+   task_unlock(tsk);
+   }
+   rcu_read_unlock();
+   /* If I found a network namespace move the device */
+   if (!err) {
+   err = dev_change_net_namespace(dev, net, NULL);
+   put_net(net);
+   }
+   return err;
+}
+static ssize_t store_new_ns_pid(struct class_device *cd, const char *buf, 
size_t len)
+{
+   return netdev_store(cd, buf, len, change_new_ns_pid);
+}
+
 NETDEVICE_SHOW(flags, fmt_hex);
 
 static int change_flags(struct net_device *dev, unsigned long new_flags)
@@ -243,6 +277,7 @@ static struct class_device_attribute net_class_attributes[] 
= {
__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
   store_tx_queue_len),
__ATTR(weight, S_IRUGO | S_IWUSR, show_weight, store_weight),
+   __ATTR(new_ns_pid, S_IWUSR, show_new_ns_pid, store_new_ns_pid),
{}
 };
 
-- 
1.4.4.1.g278f


[PATCH RFC 4/31] net: Add a network namespace tag to struct net_device

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Please note that network devices do not increase the
count on the network namespace.  They are inside the network
namespace, so the network namespace tag is in the nature
of a back pointer and getting and putting the network namespace
is unnecessary.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/netdevice.h |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4cb8b39..6a1579d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -38,6 +38,7 @@
 #include <linux/device.h>
 #include <linux/percpu.h>
 #include <linux/dmaengine.h>
+#include <linux/net_namespace_type.h>
 
 struct vlan_group;
 struct ethtool_ops;
@@ -525,6 +526,9 @@ struct net_device
void    (*poll_controller)(struct net_device *dev);
 #endif
 
+   /* Network namespace this network device is inside */
+   net_t   nd_net;
+
/* bridge stuff */
struct net_bridge_port  *br_port;
 
-- 
1.4.4.1.g278f


[PATCH RFC 5/31] net: Add a network namespace parameter to struct sock

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Sockets need to get a reference to their network namespace,
or possibly a simple hold if someone registers on the network
namespace notifier and will free the sockets when the namespace
is going to be destroyed.
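
The difference between a counted reference (sockets) and an uncounted
back pointer (network devices, as in the previous patch) can be sketched
as follows.  This is a userspace toy: struct netns, struct demo_dev and
struct demo_sock are hypothetical, and get_net()/put_net() here are
simplified stand-ins operating on a plain pointer rather than net_t.

```c
#include <stddef.h>

struct netns {
	int refcount;
	int dead;          /* set once the last reference is dropped */
};

/* Take a counted reference on the namespace. */
static struct netns *get_net(struct netns *ns)
{
	ns->refcount++;
	return ns;
}

/* Drop a counted reference; the last put marks the namespace dead. */
static void put_net(struct netns *ns)
{
	if (--ns->refcount == 0)
		ns->dead = 1;
}

/* A device lives inside its namespace: back pointer only, no count. */
struct demo_dev  { struct netns *nd_net; };

/* A socket keeps its namespace alive: counted reference. */
struct demo_sock { struct netns *sk_net; };

static void dev_attach(struct demo_dev *dev, struct netns *ns)
{
	dev->nd_net = ns;
}

static void sock_attach(struct demo_sock *sk, struct netns *ns)
{
	sk->sk_net = get_net(ns);
}

static void sock_release(struct demo_sock *sk)
{
	put_net(sk->sk_net);
	sk->sk_net = NULL;
}
```

The alternative mentioned above, a simple hold plus a notifier that frees
the sockets, would look like dev_attach() with extra cleanup at namespace
destruction instead of a count.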

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/net/inet_timewait_sock.h |1 +
 include/net/sock.h   |3 +++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index f7be1ac..162c2b9 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -115,6 +115,7 @@ struct inet_timewait_sock {
 #define tw_refcnt  __tw_common.skc_refcnt
 #define tw_hash    __tw_common.skc_hash
 #define tw_prot    __tw_common.skc_prot
+#define tw_net     __tw_common.skc_net
volatile unsigned char  tw_substate;
/* 3 bits hole, try to pack */
unsigned char   tw_rcv_wscale;
diff --git a/include/net/sock.h b/include/net/sock.h
index 03684e7..5bf6bb5 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -105,6 +105,7 @@ struct proto;
  * @skc_refcnt: reference count
  * @skc_hash: hash value used with various protocol lookup tables
  * @skc_prot: protocol handlers inside a network family
+ * @skc_net: reference to the network namespace of this socket
  *
  * This is the minimal network layer representation of sockets, the header
  * for struct sock and struct inet_timewait_sock.
@@ -119,6 +120,7 @@ struct sock_common {
atomic_t skc_refcnt;
unsigned int skc_hash;
struct proto *skc_prot;
+   net_t   skc_net;
 };
 
 /**
@@ -195,6 +197,7 @@ struct sock {
 #define sk_refcnt  __sk_common.skc_refcnt
 #define sk_hash    __sk_common.skc_hash
 #define sk_prot    __sk_common.skc_prot
+#define sk_net     __sk_common.skc_net
unsigned char   sk_shutdown : 2,
sk_no_check : 2,
sk_userlocks : 4;
-- 
1.4.4.1.g278f

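The commit message of 5/31 distinguishes a counted reference to the namespace from a "simple hold". As a rough user-space illustration only (struct net_ns, struct sock_model and all functions below are hypothetical stand-ins, not the kernel's net_t, get_net or hold_net), the difference might be modelled like this: a counted reference keeps the namespace alive, a hold is merely tracked and relies on a notifier to clean the socket up before destruction.

```c
#include <stddef.h>

/* Hypothetical stand-in for a network namespace. */
struct net_ns {
    int refcount;   /* counted references: keep the namespace alive */
    int holds;      /* weak holds: tracked, but do not keep it alive */
};

/* Hypothetical stand-in for a socket; 'net' plays the role of sk_net. */
struct sock_model {
    struct net_ns *net;
    int counted;    /* 1 if this socket took a counted reference */
};

/* Attach a socket to a namespace with either a reference or a hold. */
static void sock_attach(struct sock_model *sk, struct net_ns *net, int counted)
{
    sk->net = net;
    sk->counted = counted;
    if (counted)
        net->refcount++;
    else
        net->holds++;
}

/* Drop whatever the socket took when it was attached. */
static void sock_detach(struct sock_model *sk)
{
    if (sk->net == NULL)
        return;
    if (sk->counted)
        sk->net->refcount--;
    else
        sk->net->holds--;
    sk->net = NULL;
}

/* Only counted references block destruction in this model. */
static int net_can_be_destroyed(const struct net_ns *net)
{
    return net->refcount == 0;
}
```

In this model a socket that only holds the namespace must be freed by whoever tears the namespace down, which is the notifier case the commit message mentions.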

[PATCH RFC 6/31] net: Add a helper to get a reference to the initial network namespace.

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

The initial network namespace is special and we need to use it for various
things.  Probably the biggest initial use will be to ensure code that
can't cope with multiple namespaces only sees the initial network namespace.

For that reason, and because getting at the initial network namespace is just
a little clumsy, add a helper function.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/net/net_namespace.h |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 06a9ba1..9208e2e 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -27,6 +27,12 @@ struct net_namespace_head {
struct work_struct work;
 };
 
+/* Get the initial network namespace */
+static inline net_t init_net(void)
+{
+   return init_nsproxy.net_ns;
+}
+
 static inline net_t get_net(net_t net) { return net; }
 static inline void put_net(net_t net) {}
 static inline net_t hold_net(net_t net) { return net; }
-- 
1.4.4.1.g278f

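The use described in 6/31, making code that cannot cope with multiple namespaces see only the initial one, can be sketched in user space as follows. This is an illustration under assumptions: net_t is modelled as a plain pointer, and init_net_model, net_eq_model and legacy_op are hypothetical names, not the kernel helpers.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model: a namespace handle is just a pointer here. */
struct net_ns { int id; };
typedef struct net_ns *net_t;

static struct net_ns initial_ns = { 0 };

/* Model of the helper: one obvious way to reach the initial namespace. */
static inline net_t init_net_model(void)
{
    return &initial_ns;
}

/* Model of a namespace equality test. */
static inline bool net_eq_model(net_t a, net_t b)
{
    return a == b;
}

/* A code path that has not been converted yet refuses other namespaces. */
static int legacy_op(net_t net)
{
    if (!net_eq_model(net, init_net_model()))
        return -EINVAL;
    return 0;
}
```

The same guard shape appears in the later patches of the series, where it is removed again once a subsystem has been converted.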

[PATCH RFC 26/31] net: Make the netlink methods in rtnetlink handle multiple network namespaces

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

It turns out after a quick audit that except for removing the checks
there is really nothing to do here.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/core/rtnetlink.c |   21 +++--
 1 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 29a81bf..0a42258 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -409,9 +409,6 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
int s_idx = cb->args[0];
struct net_device *dev;
 
-   if (!net_eq(net, init_net()))
-   return 0;
-
read_lock(&per_net(dev_base_lock, net));
for (dev=per_net(dev_base, net), idx=0; dev; dev = dev->next, idx++) {
if (idx < s_idx)
@@ -446,9 +443,6 @@ static int rtnl_setlink(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
struct nlattr *tb[IFLA_MAX+1];
char ifname[IFNAMSIZ];
 
-   if (!net_eq(net, init_net()))
-   return -EINVAL;
-
err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy);
if (err < 0)
goto errout;
@@ -622,9 +616,6 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
int iw_buf_len = 0;
int err;
 
-   if (!net_eq(net, init_net()))
-   return -EINVAL;
-
err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy);
if (err < 0)
return err;
@@ -673,13 +664,9 @@ errout:
 
 static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
 {
-   net_t net = skb->sk->sk_net;
int idx;
int s_idx = cb->family;
 
-   if (!net_eq(net, init_net()))
-   return 0;
-
if (s_idx == 0)
s_idx = 1;
for (idx=1; idx<NPROTO; idx++) {
@@ -701,6 +688,7 @@ static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
 
 void rtmsg_ifinfo(int type, struct net_device *dev, unsigned change)
 {
+   net_t net = dev->nd_net;
struct sk_buff *skb;
int err = -ENOBUFS;
 
@@ -712,10 +700,10 @@ void rtmsg_ifinfo(int type, struct net_device *dev, unsigned change)
/* failure implies BUG in if_nlmsg_size() */
BUG_ON(err < 0);
 
-   err = rtnl_notify(skb, init_net(), 0, RTNLGRP_LINK, NULL, GFP_KERNEL);
+   err = rtnl_notify(skb, net, 0, RTNLGRP_LINK, NULL, GFP_KERNEL);
 errout:
if (err < 0)
-   rtnl_set_sk_err(init_net(), RTNLGRP_LINK, err);
+   rtnl_set_sk_err(net, RTNLGRP_LINK, err);
 }
 
 /* Protected by RTNL sempahore.  */
@@ -862,9 +850,6 @@ static int rtnetlink_event(struct notifier_block *this, unsigned long event, voi
 {
struct net_device *dev = ptr;
 
-   if (!net_eq(dev->nd_net, init_net()))
-   return NOTIFY_DONE;
-
switch (event) {
case NETDEV_UNREGISTER:
rtmsg_ifinfo(RTM_DELLINK, dev, ~0U);
-- 
1.4.4.1.g278f


[PATCH RFC 22/31] net: Add network namespace clone support.

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

This patch allows you to create a new network namespace
using sys_clone(...).

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/sched.h|1 +
 kernel/nsproxy.c |   11 +++
 net/core/net_namespace.c |   38 ++
 3 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4463735..9e0f91a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -26,6 +26,7 @@
 #define CLONE_STOPPED  0x0200  /* Start in stopped state */
 #define CLONE_NEWUTS   0x0400  /* New utsname group? */
 #define CLONE_NEWIPC   0x0800  /* New ipcs */
+#define CLONE_NEWNET   0x2000  /* New network namespace */
 
 /*
  * Scheduling policies
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 4f3c95a..7861c4c 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -20,6 +20,7 @@
 #include <linux/mnt_namespace.h>
 #include <linux/utsname.h>
 #include <linux/pid_namespace.h>
+#include <net/net_namespace.h>
 
 struct nsproxy init_nsproxy = INIT_NSPROXY(init_nsproxy);
 EXPORT_SYMBOL_GPL(init_nsproxy);
@@ -70,6 +71,7 @@ struct nsproxy *dup_namespaces(struct nsproxy *orig)
get_ipc_ns(ns->ipc_ns);
if (ns->pid_ns)
get_pid_ns(ns->pid_ns);
+   get_net(ns->net_ns);
}
 
return ns;
@@ -117,10 +119,18 @@ int copy_namespaces(int flags, struct task_struct *tsk)
if (err)
goto out_pid;
 
+   err = copy_net(flags, tsk);
+   if (err)
+   goto out_net;
+
 out:
put_nsproxy(old_ns);
return err;
 
+out_net:
+   if (new_ns->pid_ns)
+   put_pid_ns(new_ns->pid_ns);
+
 out_pid:
if (new_ns->ipc_ns)
put_ipc_ns(new_ns->ipc_ns);
@@ -146,5 +156,6 @@ void free_nsproxy(struct nsproxy *ns)
put_ipc_ns(ns->ipc_ns);
if (ns->pid_ns)
put_pid_ns(ns->pid_ns);
+   put_net(ns->net_ns);
kfree(ns);
 }
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 93e3879..cc56105 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -175,6 +175,44 @@ out_undo:
goto out;
 }
 
+int copy_net(int flags, struct task_struct *tsk)
+{
+   net_t old_net = tsk->nsproxy->net_ns;
+   net_t new_net;
+   int err;
+
+   get_net(old_net);
+
+   if (!(flags & CLONE_NEWNET))
+   return 0;
+
+   err = -EPERM;
+   if (!capable(CAP_SYS_ADMIN))
+   goto out;
+
+   err = -ENOMEM;
+   new_net = net_alloc();
+   if (null_net(new_net))
+   goto out;
+
+   mutex_lock(&net_mutex);
+   err = setup_net(new_net);
+   if (err)
+   goto out_unlock;
+
+   net_lock();
+   net_list_append(new_net);
+   net_unlock();
+
+   tsk->nsproxy->net_ns = new_net;
+
+out_unlock:
+   mutex_unlock(&net_mutex);
+out:
+   put_net(old_net);
+   return err;
+}
+
 void pernet_modcopy(void *pnetdst, const void *src, unsigned long size)
 {
net_t net;
-- 
1.4.4.1.g278f

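The control flow of copy_net in 22/31 (take a reference on the old namespace, share it unless the clone flag is set, otherwise check privilege, allocate, switch over and drop the old reference) can be followed in a small user-space model. Everything below is an assumption for illustration: MODEL_CLONE_NEWNET is a made-up flag value, is_admin stands in for the capable(CAP_SYS_ADMIN) check, and calloc replaces net_alloc plus setup_net; locking and the namespace list are left out.

```c
#include <errno.h>
#include <stdlib.h>

#define MODEL_CLONE_NEWNET 0x1   /* hypothetical flag value for the sketch */

struct net_ns { int refcount; };

struct task_model {
    struct net_ns *net;   /* plays the role of tsk->nsproxy->net_ns */
    int is_admin;         /* stands in for the capability check */
};

static int copy_net_model(int flags, struct task_model *tsk)
{
    struct net_ns *old_net = tsk->net;
    struct net_ns *new_net;
    int err;

    old_net->refcount++;             /* reference for the new task */

    if (!(flags & MODEL_CLONE_NEWNET))
        return 0;                    /* share the namespace, keep the reference */

    err = -EPERM;
    if (!tsk->is_admin)
        goto out;

    err = -ENOMEM;
    new_net = calloc(1, sizeof(*new_net));
    if (new_net == NULL)
        goto out;

    new_net->refcount = 1;
    tsk->net = new_net;              /* the task now lives in the new namespace */
    err = 0;
out:
    old_net->refcount--;             /* old namespace is not used by this task */
    return err;
}
```

The single exit label mirrors the goto-based cleanup in the patch: every path that does not end up sharing the old namespace gives its reference back.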

[PATCH RFC 27/31] net: Make the xfrm sysctls per network namespace.

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

In particular I moved:
/proc/sys/net/core/xfrm_aevent_etime
/proc/sys/net/core/xfrm_aevent_rseqth

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/net/xfrm.h |4 ++--
 net/core/sysctl_net_core.c |   37 ++---
 net/xfrm/xfrm_state.c  |8 
 net/xfrm/xfrm_user.c   |   10 ++
 4 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index e476541..9b2e727 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -24,8 +24,8 @@
MODULE_ALIAS(xfrm-mode- __stringify(family) - __stringify(encap))
 
 extern struct sock *xfrm_nl;
-extern u32 sysctl_xfrm_aevent_etime;
-extern u32 sysctl_xfrm_aevent_rseqth;
+DECLARE_PER_NET(u32, sysctl_xfrm_aevent_etime);
+DECLARE_PER_NET(u32, sysctl_xfrm_aevent_rseqth);
 
 extern struct mutex xfrm_cfg_mutex;
 
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 76f7a29..90f2a39 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -88,24 +88,6 @@ ctl_table core_table[] = {
.mode   = 0644,
.proc_handler   = proc_dointvec
},
-#ifdef CONFIG_XFRM
-   {
-   .ctl_name   = NET_CORE_AEVENT_ETIME,
-   .procname   = xfrm_aevent_etime,
-   .data   = sysctl_xfrm_aevent_etime,
-   .maxlen = sizeof(u32),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
-   {
-   .ctl_name   = NET_CORE_AEVENT_RSEQTH,
-   .procname   = xfrm_aevent_rseqth,
-   .data   = sysctl_xfrm_aevent_rseqth,
-   .maxlen = sizeof(u32),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
-#endif /* CONFIG_XFRM */
 #endif /* CONFIG_NET */
{
.ctl_name   = NET_CORE_SOMAXCONN,
@@ -127,6 +109,23 @@ ctl_table core_table[] = {
 };
 
 DEFINE_PER_NET(struct ctl_table, multi_core_table[]) = {
-   /* Stub for holding per network namespace sysctls */
+#ifdef CONFIG_XFRM
+   {
+   .ctl_name   = NET_CORE_AEVENT_ETIME,
+   .procname   = xfrm_aevent_etime,
+   .data   = __per_net_base(sysctl_xfrm_aevent_etime),
+   .maxlen = sizeof(u32),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
+   {
+   .ctl_name   = NET_CORE_AEVENT_RSEQTH,
+   .procname   = xfrm_aevent_rseqth,
+   .data   = __per_net_base(sysctl_xfrm_aevent_rseqth),
+   .maxlen = sizeof(u32),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
+#endif /* CONFIG_XFRM */
{}
 };
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index fdb08d9..3304a2d 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -27,11 +27,11 @@
 struct sock *xfrm_nl;
 EXPORT_SYMBOL(xfrm_nl);
 
-u32 sysctl_xfrm_aevent_etime = XFRM_AE_ETIME;
-EXPORT_SYMBOL(sysctl_xfrm_aevent_etime);
+DEFINE_PER_NET(u32, sysctl_xfrm_aevent_etime) = XFRM_AE_ETIME;
+EXPORT_PER_NET_SYMBOL(sysctl_xfrm_aevent_etime);
 
-u32 sysctl_xfrm_aevent_rseqth = XFRM_AE_SEQT_SIZE;
-EXPORT_SYMBOL(sysctl_xfrm_aevent_rseqth);
+DEFINE_PER_NET(u32, sysctl_xfrm_aevent_rseqth) = XFRM_AE_SEQT_SIZE;
+EXPORT_PER_NET_SYMBOL(sysctl_xfrm_aevent_rseqth);
 
 /* Each xfrm_state may be linked to two tables:
 
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 55affa7..15e962b 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -375,7 +375,8 @@ error:
return err;
 }
 
-static struct xfrm_state *xfrm_state_construct(struct xfrm_usersa_info *p,
+static struct xfrm_state *xfrm_state_construct(net_t net,
+  struct xfrm_usersa_info *p,
   struct rtattr **xfrma,
   int *errp)
 {
@@ -411,9 +412,9 @@ static struct xfrm_state *xfrm_state_construct(struct xfrm_usersa_info *p,
goto error;
 
x->km.seq = p->seq;
-   x->replay_maxdiff = sysctl_xfrm_aevent_rseqth;
+   x->replay_maxdiff = per_net(sysctl_xfrm_aevent_rseqth, net);
/* sysctl_xfrm_aevent_etime is in 100ms units */
-   x->replay_maxage = (sysctl_xfrm_aevent_etime*HZ)/XFRM_AE_ETH_M;
+   x->replay_maxage = (per_net(sysctl_xfrm_aevent_etime, net)*HZ)/XFRM_AE_ETH_M;
x->preplay.bitmap = 0;
x->preplay.seq = x->replay.seq+x->replay_maxdiff;
x->preplay.oseq = x->replay.oseq +x->replay_maxdiff;
@@ -437,6 +438,7 @@ error_no_put:
 static int xfrm_add_sa(struct sk_buff *skb, struct nlmsghdr *nlh,
struct rtattr **xfrma)
 {
+   net_t net = skb->sk->sk_net;

[PATCH RFC 28/31] net: Make the SOMAXCONN sysctl per network namespace

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/socket.h |3 ++-
 net/core/sysctl_net_core.c |   16 
 net/socket.c   |7 ---
 3 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 92cd38e..aa159ea 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -23,8 +23,9 @@ struct __kernel_sockaddr_storage {
 #include <linux/uio.h> /* iovec support*/
 #include <linux/types.h>   /* pid_t*/
 #include <linux/compiler.h>/* __user   */
+#include <linux/net_namespace_type.h>
 
-extern int sysctl_somaxconn;
+DECLARE_PER_NET(int, sysctl_somaxconn);
 #ifdef CONFIG_PROC_FS
 struct seq_file;
 extern void socket_seq_show(struct seq_file *seq);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 90f2a39..14eca68 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -90,14 +90,6 @@ ctl_table core_table[] = {
},
 #endif /* CONFIG_NET */
{
-   .ctl_name   = NET_CORE_SOMAXCONN,
-   .procname   = somaxconn,
-   .data   = sysctl_somaxconn,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
-   {
.ctl_name   = NET_CORE_BUDGET,
.procname   = netdev_budget,
.data   = netdev_budget,
@@ -127,5 +119,13 @@ DEFINE_PER_NET(struct ctl_table, multi_core_table[]) = {
.proc_handler   = proc_dointvec
},
 #endif /* CONFIG_XFRM */
+   {
+   .ctl_name   = NET_CORE_SOMAXCONN,
+   .procname   = somaxconn,
+   .data   = __per_net_base(sysctl_somaxconn),
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
{}
 };
diff --git a/net/socket.c b/net/socket.c
index 7371654..ab2aeea 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1305,7 +1305,7 @@ asmlinkage long sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen)
  * ready for listening.
  */
 
-int sysctl_somaxconn __read_mostly = SOMAXCONN;
+DEFINE_PER_NET(int, sysctl_somaxconn)= SOMAXCONN;
 
 asmlinkage long sys_listen(int fd, int backlog)
 {
@@ -1314,8 +1314,9 @@ asmlinkage long sys_listen(int fd, int backlog)
 
sock = sockfd_lookup_light(fd, &err, &fput_needed);
if (sock) {
-   if ((unsigned)backlog > sysctl_somaxconn)
-   backlog = sysctl_somaxconn;
+   net_t net = sock->sk->sk_net;
+   if ((unsigned)backlog > per_net(sysctl_somaxconn, net))
+   backlog = per_net(sysctl_somaxconn, net);
 
err = security_socket_listen(sock, backlog);
if (!err)
-- 
1.4.4.1.g278f

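The effect of 28/31 on listen() is that the backlog is clamped against the limit of the socket's own namespace rather than a single global. A minimal user-space sketch of that clamp, assuming a hypothetical struct net_ns that carries the per-namespace limit (the name clamp_backlog is also made up):

```c
/* Hypothetical per-namespace state; 'somaxconn' models the per-net sysctl. */
struct net_ns {
    int somaxconn;
};

/*
 * Clamp a requested backlog to the namespace limit. The comparison is
 * done on unsigned values, as in sys_listen, so a negative backlog is
 * treated as a very large request and is clamped as well.
 */
static int clamp_backlog(const struct net_ns *net, int backlog)
{
    if ((unsigned)backlog > (unsigned)net->somaxconn)
        backlog = net->somaxconn;
    return backlog;
}
```

Two namespaces with different limits then give different results for the same request, which is the visible behaviour change of the patch.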

[PATCH RFC 11/31] net: Initialize the network namespace of network devices.

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Except for carefully selected pseudo devices all network
interfaces should start out in the initial network namespace.
Ultimately it will be register_netdev that examines what
dev->nd_net is set to and places a device in a network namespace.

This patch modifies alloc_netdev to initialize the network
namespace a device is in with the initial network namespace.
This gets it right for the vast majority of devices so their
drivers need not be modified and for those few pseudo devices
that need something different they can change this parameter
before calling register_netdevice.

The network namespace parameter on a network device is not
reference counted as the devices are inside of a network namespace
and cannot remain in that namespace past the lifetime of the
network namespace.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 drivers/net/loopback.c |1 +
 net/core/dev.c |1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 2b739fd..22b672d 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -231,6 +231,7 @@ struct net_device loopback_dev = {
 /* Setup and register the loopback device. */
 static int __init loopback_init(void)
 {
+   loopback_dev.nd_net = init_net();
return register_netdev(&loopback_dev);
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 90e4c0e..a3ee150 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3192,6 +3192,7 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
dev = (struct net_device *)
(((long)p + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST);
dev->padded = (char *)dev - (char *)p;
+   dev->nd_net = init_net();
 
if (sizeof_priv)
dev->priv = netdev_priv(dev);
-- 
1.4.4.1.g278f

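The division of labour described in 11/31 (the allocator sets a default namespace, a pseudo device may override it before registration, and registration acts on whatever is set) might look like this in a user-space model. All names here are hypothetical stand-ins; the real alloc_netdev and register_netdevice do far more.

```c
#include <stdlib.h>

struct net_ns { int id; };

/* Hypothetical device model; nd_net mirrors the field added in the series. */
struct netdev_model {
    struct net_ns *nd_net;
    int registered;
};

static struct net_ns initial_ns = { 0 };

/* Allocation puts every device in the initial namespace by default. */
static struct netdev_model *alloc_netdev_model(void)
{
    struct netdev_model *dev = calloc(1, sizeof(*dev));

    if (dev != NULL)
        dev->nd_net = &initial_ns;
    return dev;
}

/* Registration uses whatever nd_net holds at the time of the call. */
static struct net_ns *register_netdev_model(struct netdev_model *dev)
{
    dev->registered = 1;
    return dev->nd_net;
}
```

An ordinary driver never touches nd_net and ends up in the initial namespace; a pseudo device assigns nd_net between the two calls.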

[PATCH RFC 24/31] net: Make rtnetlink network namespace aware

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

After this patch none of the netlink callbacks support anything
except the initial network namespace but the rtnetlink infrastructure
now handles multiple network namespaces.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/rtnetlink.h |8 ++--
 net/bridge/br_netlink.c   |4 +-
 net/core/fib_rules.c  |4 +-
 net/core/neighbour.c  |4 +-
 net/core/rtnetlink.c  |   74 +++-
 net/core/wireless.c   |5 ++-
 net/decnet/dn_dev.c   |4 +-
 net/decnet/dn_route.c |2 +-
 net/decnet/dn_table.c |4 +-
 net/ipv4/devinet.c|4 +-
 net/ipv4/fib_semantics.c  |4 +-
 net/ipv4/ipmr.c   |4 +-
 net/ipv4/route.c  |2 +-
 net/ipv6/addrconf.c   |   14 
 net/ipv6/route.c  |6 ++--
 net/sched/cls_api.c   |2 +-
 net/sched/sch_api.c   |4 +-
 17 files changed, 98 insertions(+), 51 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 4a629ea..6c8281d 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -581,11 +581,11 @@ struct rtnetlink_link
 };
 
 extern struct rtnetlink_link * rtnetlink_links[NPROTO];
-extern int rtnetlink_send(struct sk_buff *skb, u32 pid, u32 group, int echo);
-extern int rtnl_unicast(struct sk_buff *skb, u32 pid);
-extern int rtnl_notify(struct sk_buff *skb, u32 pid, u32 group,
+extern int rtnetlink_send(struct sk_buff *skb, net_t net, u32 pid, u32 group, int echo);
+extern int rtnl_unicast(struct sk_buff *skb, net_t net, u32 pid);
+extern int rtnl_notify(struct sk_buff *skb, net_t net, u32 pid, u32 group,
   struct nlmsghdr *nlh, gfp_t flags);
-extern void rtnl_set_sk_err(u32 group, int error);
+extern void rtnl_set_sk_err(net_t net, u32 group, int error);
 extern int rtnetlink_put_metrics(struct sk_buff *skb, u32 *metrics);
 extern int rtnl_put_cacheinfo(struct sk_buff *skb, struct dst_entry *dst,
  u32 id, u32 ts, u32 tsage, long expires,
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 85165a1..372fb18 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -94,10 +94,10 @@ void br_ifinfo_notify(int event, struct net_bridge_port *port)
/* failure implies BUG in br_nlmsg_size() */
BUG_ON(err < 0);
 
-   err = rtnl_notify(skb, 0, RTNLGRP_LINK, NULL, GFP_ATOMIC);
+   err = rtnl_notify(skb, init_net(), 0, RTNLGRP_LINK, NULL, GFP_ATOMIC);
 errout:
if (err < 0)
-   rtnl_set_sk_err(RTNLGRP_LINK, err);
+   rtnl_set_sk_err(init_net(), RTNLGRP_LINK, err);
 }
 
 /*
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 00b4148..5f65973 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -418,10 +418,10 @@ static void notify_rule_change(int event, struct fib_rule *rule,
/* failure implies BUG in fib_rule_nlmsg_size() */
BUG_ON(err < 0);
 
-   err = rtnl_notify(skb, pid, ops->nlgroup, nlh, GFP_KERNEL);
+   err = rtnl_notify(skb, init_net(), pid, ops->nlgroup, nlh, GFP_KERNEL);
 errout:
if (err < 0)
-   rtnl_set_sk_err(ops->nlgroup, err);
+   rtnl_set_sk_err(init_net(), ops->nlgroup, err);
 }
 
 static void attach_rules(struct list_head *rules, struct net_device *dev)
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index d89c6fe..6f61207 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2453,10 +2453,10 @@ static void __neigh_notify(struct neighbour *n, int type, int flags)
/* failure implies BUG in neigh_nlmsg_size() */
BUG_ON(err < 0);
 
-   err = rtnl_notify(skb, 0, RTNLGRP_NEIGH, NULL, GFP_ATOMIC);
+   err = rtnl_notify(skb, init_net(), 0, RTNLGRP_NEIGH, NULL, GFP_ATOMIC);
 errout:
if (err < 0)
-   rtnl_set_sk_err(RTNLGRP_NEIGH, err);
+   rtnl_set_sk_err(init_net(), RTNLGRP_NEIGH, err);
 }
 
 void neigh_app_ns(struct neighbour *n)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 9be586c..29a81bf 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -58,7 +58,7 @@
 #endif /* CONFIG_NET_WIRELESS_RTNETLINK */
 
 static DEFINE_MUTEX(rtnl_mutex);
-static struct sock *rtnl;
+static DEFINE_PER_NET(struct sock *, rtnl);
 
 void rtnl_lock(void)
 {
@@ -72,9 +72,17 @@ void __rtnl_unlock(void)
 
 void rtnl_unlock(void)
 {
+   net_t net;
mutex_unlock(&rtnl_mutex);
-   if (rtnl && rtnl->sk_receive_queue.qlen)
-   rtnl->sk_data_ready(rtnl, 0);
+   
+   net_lock();
+   for_each_net(net) {
+   struct sock *rtnl = per_net(rtnl, net);
+   if (rtnl && rtnl->sk_receive_queue.qlen)
+   rtnl->sk_data_ready(rtnl, 0);
+   }
+   net_unlock();
+
netdev_run_todo();
 }
 
@@ -151,8 +159,9 @@ size_t rtattr_strlcpy(char *dest, const struct rtattr *rta, 
size_t

[PATCH RFC 17/31] net: Factor out __dev_alloc_name from dev_alloc_name

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

When forcibly changing the network namespace of a device
I need something that can generate a name for the device
in the new namespace without overwriting the old name.

__dev_alloc_name provides me that functionality.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/core/dev.c |   44 +---
 1 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 32fe905..fc0d2af 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -655,9 +655,10 @@ int dev_valid_name(const char *name)
 }
 
 /**
- * dev_alloc_name - allocate a name for a device
- * @dev: device
+ * __dev_alloc_name - allocate a name for a device
+ * @net: network namespace to allocate the device name in
  * @name: name format string
+ * @buf:  scratch buffer and result name string
  *
  * Passed a format string - eg "lt%d" it will try and find a suitable
  * id. It scans list of devices to build up a free map, then chooses
@@ -668,18 +669,13 @@ int dev_valid_name(const char *name)
  * Returns the number of the unit assigned or a negative errno code.
  */
 
-int dev_alloc_name(struct net_device *dev, const char *name)
+static int __dev_alloc_name(net_t net, const char *name, char buf[IFNAMSIZ])
 {
int i = 0;
-   char buf[IFNAMSIZ];
const char *p;
const int max_netdevices = 8*PAGE_SIZE;
long *inuse;
struct net_device *d;
-   net_t net;
-
-   BUG_ON(null_net(dev->nd_net));
-   net = dev->nd_net;
 
p = strnchr(name, IFNAMSIZ-1, '%');
if (p) {
@@ -713,10 +709,8 @@ int dev_alloc_name(struct net_device *dev, const char *name)
}
 
snprintf(buf, sizeof(buf), name, i);
-   if (!__dev_get_by_name(net, buf)) {
-   strlcpy(dev->name, buf, IFNAMSIZ);
+   if (!__dev_get_by_name(net, buf))
return i;
-   }
 
/* It is possible to run out of possible slots
 * when the name is long and there isn't enough space left
@@ -725,6 +719,34 @@ int dev_alloc_name(struct net_device *dev, const char *name)
return -ENFILE;
 }
 
+/**
+ * dev_alloc_name - allocate a name for a device
+ * @dev: device
+ * @name: name format string
+ *
+ * Passed a format string - eg "lt%d" it will try and find a suitable
+ * id. It scans list of devices to build up a free map, then chooses
+ * the first empty slot. The caller must hold the dev_base or rtnl lock
+ * while allocating the name and adding the device in order to avoid
+ * duplicates.
+ * Limited to bits_per_byte * page size devices (ie 32K on most platforms).
+ * Returns the number of the unit assigned or a negative errno code.
+ */
+
+int dev_alloc_name(struct net_device *dev, const char *name)
+{
+   char buf[IFNAMSIZ];
+   net_t net;
+   int ret;
+
+   BUG_ON(null_net(dev->nd_net));
+   net = dev->nd_net;
+   ret = __dev_alloc_name(net, name, buf);
+   if (ret >= 0)
+   strlcpy(dev->name, buf, IFNAMSIZ);
+   return ret;
+}
+
 
 /**
  * dev_change_name - change name of a device
-- 
1.4.4.1.g278f

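The point of the split in 17/31 is that the name search writes into a caller-supplied buffer instead of the device, so the caller decides whether to commit the result. A user-space sketch of that idea follows. It is a simplification under stated assumptions: it takes a plain prefix such as "eth" rather than a printf-style format, scans an array of names instead of the device list, and NAME_SIZE, MAX_UNITS and alloc_name_model are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

#define NAME_SIZE 16   /* stands in for IFNAMSIZ */
#define MAX_UNITS 64   /* hypothetical small limit for the sketch */

/*
 * Find the lowest unused unit number for 'prefix' among 'existing' and
 * write the resulting name into 'buf'. Returns the unit number, or -1
 * if every unit below MAX_UNITS is taken. The list is never modified.
 */
static int alloc_name_model(const char *const *existing, size_t n_existing,
                            const char *prefix, char *buf, size_t buf_size)
{
    unsigned char inuse[MAX_UNITS] = { 0 };
    char candidate[NAME_SIZE];
    size_t plen = strlen(prefix);
    size_t i;
    int unit;

    /* Build the free map from the names that match prefix + number. */
    for (i = 0; i < n_existing; i++) {
        if (strncmp(existing[i], prefix, plen) != 0)
            continue;
        if (sscanf(existing[i] + plen, "%d", &unit) != 1)
            continue;
        if (unit < 0 || unit >= MAX_UNITS)
            continue;
        /* Re-print and compare to reject names like "eth01" or "eth1x". */
        snprintf(candidate, sizeof(candidate), "%s%d", prefix, unit);
        if (strcmp(candidate, existing[i]) == 0)
            inuse[unit] = 1;
    }

    /* Choose the first empty slot. */
    for (unit = 0; unit < MAX_UNITS; unit++) {
        if (!inuse[unit]) {
            snprintf(buf, buf_size, "%s%d", prefix, unit);
            return unit;
        }
    }
    return -1;
}
```

The sketch passes the buffer size explicitly. In C a parameter declared as char buf[IFNAMSIZ] is adjusted to a pointer, so sizeof(buf) inside such a function yields the pointer size rather than IFNAMSIZ; that may be worth checking for the snprintf call that stays in __dev_alloc_name.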

[PATCH RFC 20/31] net: Implement CONFIG_NET_NS

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Add the config option to enable multiple network namespaces.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/Kconfig |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/net/Kconfig b/net/Kconfig
index 7dfc949..4671398 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -27,6 +27,13 @@ if NET
 
 menu "Networking options"
 
+config NET_NS
+   bool "Network namespace support"
+   depends on EXPERIMENTAL
+   help
+ Support what appear to user space as multiple instances of the 
+ network stack.
+
 config NETDEBUG
bool "Network packet debugging"
help
-- 
1.4.4.1.g278f


[PATCH RFC 25/31] net: Make wireless netlink event generation handle multiple network namespaces

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/core/wireless.c |   15 ++-
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/net/core/wireless.c b/net/core/wireless.c
index 9036359..d534617 100644
--- a/net/core/wireless.c
+++ b/net/core/wireless.c
@@ -1934,8 +1934,13 @@ static void wireless_nlevent_process(unsigned long data)
 {
struct sk_buff *skb;
 
-   while ((skb = skb_dequeue(&wireless_nlevent_queue)))
-   rtnl_notify(skb, init_net(), 0, RTNLGRP_LINK, NULL, GFP_ATOMIC);
+   while ((skb = skb_dequeue(&wireless_nlevent_queue))) {
+   struct net_device *dev = skb->dev;
+   net_t net = dev->nd_net;
+   skb->dev = NULL;
+   rtnl_notify(skb, net, 0, RTNLGRP_LINK, NULL, GFP_ATOMIC);
+   dev_put(dev);
+   }
 }
 
 static DECLARE_TASKLET(wireless_nlevent_tasklet, wireless_nlevent_process, 0);
@@ -1992,9 +1997,6 @@ static inline void rtmsg_iwinfo(struct net_device *   dev,
struct sk_buff *skb;
int size = NLMSG_GOODSIZE;
 
-   if (!net_eq(dev->nd_net, init_net()))
-   return;
-
skb = alloc_skb(size, GFP_ATOMIC);
if (!skb)
return;
@@ -2004,6 +2006,9 @@ static inline void rtmsg_iwinfo(struct net_device *   dev,
kfree_skb(skb);
return;
}
+   /* Remember the device until we are in process context */
+   dev_hold(dev);
+   skb->dev = dev;
NETLINK_CB(skb).dst_group = RTNLGRP_LINK;
skb_queue_tail(&wireless_nlevent_queue, skb);
tasklet_schedule(&wireless_nlevent_tasklet);
-- 
1.4.4.1.g278f

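Patch 25/31 keeps a reference on the device while its event sits in the queue, so the namespace can still be read from the device when the event is finally delivered. A user-space model of that hand-off, with hypothetical types and names throughout (no sk_buff, tasklet or netlink involved):

```c
#include <stddef.h>

struct net_ns { int delivered; };      /* counts events seen per namespace */

struct dev_model {
    int refcount;
    struct net_ns *nd_net;
};

struct event_model {
    struct dev_model *dev;             /* remembered until delivery */
    struct event_model *next;
};

struct queue_model {
    struct event_model *head;
    struct event_model *tail;
};

/* Producer side: pin the device, remember it in the event, enqueue. */
static void queue_event(struct queue_model *q, struct event_model *ev,
                        struct dev_model *dev)
{
    dev->refcount++;                   /* models dev_hold() */
    ev->dev = dev;
    ev->next = NULL;
    if (q->tail != NULL)
        q->tail->next = ev;
    else
        q->head = ev;
    q->tail = ev;
}

/* Consumer side: deliver into the device's namespace, then unpin. */
static int process_events(struct queue_model *q)
{
    struct event_model *ev;
    int n = 0;

    while ((ev = q->head) != NULL) {
        struct dev_model *dev = ev->dev;

        q->head = ev->next;
        if (q->head == NULL)
            q->tail = NULL;
        ev->dev = NULL;
        ev->next = NULL;
        dev->nd_net->delivered++;      /* models rtnl_notify(skb, net, ...) */
        dev->refcount--;               /* models dev_put() */
        n++;
    }
    return n;
}
```

Without the reference taken at enqueue time, the device could go away between the two functions and the consumer would read a stale pointer.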

[PATCH RFC 29/31] net: Make AF_PACKET handle multiple network namespaces

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

This is done by making all of the relevant global variables
per network namespace.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/packet/af_packet.c |  125 +++-
 1 files changed, 81 insertions(+), 44 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 4ac9f9f..c772491 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -152,8 +152,8 @@ dev->hard_header == NULL (ll header is added by device, we cannot control it)
  */
 
 /* List of all packet sockets. */
-static HLIST_HEAD(packet_sklist);
-static DEFINE_RWLOCK(packet_sklist_lock);
+static DEFINE_PER_NET(rwlock_t, packet_sklist_lock);
+static DEFINE_PER_NET(struct hlist_head, packet_sklist);
 
 static atomic_t packet_socks_nr;
 
@@ -264,9 +264,6 @@ static int packet_rcv_spkt(struct sk_buff *skb, struct packet_type *pt, struct n
struct sock *sk;
struct sockaddr_pkt *spkt;
 
-   if (!net_eq(dev->nd_net, init_net()))
-   goto out;
-
/*
 *  When we registered the protocol we saved the socket in the data
 *  field for just this event.
@@ -288,6 +285,9 @@ static int packet_rcv_spkt(struct sk_buff *skb, struct packet_type *pt, struct n
if (skb->pkt_type == PACKET_LOOPBACK)
goto out;
 
+   if (!net_eq(dev->nd_net, sk->sk_net))
+   goto out;
+
if ((skb = skb_share_check(skb, GFP_ATOMIC)) == NULL)
goto oom;
 
@@ -359,7 +359,7 @@ static int packet_sendmsg_spkt(struct kiocb *iocb, struct socket *sock,
 */
 
saddr->spkt_device[13] = 0;
-   dev = dev_get_by_name(init_net(), saddr->spkt_device);
+   dev = dev_get_by_name(sk->sk_net, saddr->spkt_device);
err = -ENODEV;
if (dev == NULL)
goto out_unlock;
@@ -475,15 +475,15 @@ static int packet_rcv(struct sk_buff *skb, struct packet_type *pt, struct net_de
int skb_len = skb->len;
unsigned snaplen;
 
-   if (!net_eq(dev->nd_net, init_net()))
-   goto drop;
-
if (skb->pkt_type == PACKET_LOOPBACK)
goto drop;
 
sk = pt->af_packet_priv;
po = pkt_sk(sk);
 
+   if (!net_eq(dev->nd_net, sk->sk_net))
+   goto drop;
+
skb->dev = dev;
 
if (dev->hard_header) {
@@ -583,15 +583,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct packet_type *pt, struct net_d
unsigned short macoff, netoff;
struct sk_buff *copy_skb = NULL;
 
-   if (!net_eq(dev->nd_net, init_net()))
-   goto drop;
-
if (skb->pkt_type == PACKET_LOOPBACK)
goto drop;
 
sk = pt->af_packet_priv;
po = pkt_sk(sk);
 
+   if (!net_eq(dev->nd_net, sk->sk_net))
+   goto drop;
+
if (dev->hard_header) {
if (sk->sk_type != SOCK_DGRAM)
skb_push(skb, skb->data - skb->mac.raw);
@@ -744,7 +744,7 @@ static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
}
 
 
-   dev = dev_get_by_index(init_net(), ifindex);
+   dev = dev_get_by_index(sk->sk_net, ifindex);
err = -ENXIO;
if (dev == NULL)
goto out_unlock;
@@ -817,15 +817,17 @@ static int packet_release(struct socket *sock)
 {
struct sock *sk = sock->sk;
struct packet_sock *po;
+   net_t net;
 
if (!sk)
return 0;
 
+   net = sk->sk_net;
po = pkt_sk(sk);
 
-   write_lock_bh(&packet_sklist_lock);
+   write_lock_bh(&per_net(packet_sklist_lock, net));
sk_del_node_init(sk);
-   write_unlock_bh(&packet_sklist_lock);
+   write_unlock_bh(&per_net(packet_sklist_lock, net));
 
/*
 *  Unhook packet receive handler.
@@ -943,7 +945,7 @@ static int packet_bind_spkt(struct socket *sock, struct 
sockaddr *uaddr, int add
return -EINVAL;
strlcpy(name,uaddr-sa_data,sizeof(name));
 
-   dev = dev_get_by_name(init_net(), name);
+   dev = dev_get_by_name(sk-sk_net, name);
if (dev) {
err = packet_do_bind(sk, dev, pkt_sk(sk)-num);
dev_put(dev);
@@ -971,7 +973,7 @@ static int packet_bind(struct socket *sock, struct sockaddr 
*uaddr, int addr_len
 
if (sll-sll_ifindex) {
err = -ENODEV;
-   dev = dev_get_by_index(init_net(), sll-sll_ifindex);
+   dev = dev_get_by_index(sk-sk_net, sll-sll_ifindex);
if (dev == NULL)
goto out;
}
@@ -1000,9 +1002,6 @@ static int packet_create(net_t net, struct socket *sock, 
int protocol)
__be16 proto = (__force __be16)protocol; /* weird, but documented */
int err;
 
-   if (!net_eq(net, init_net()))
-   return -EAFNOSUPPORT;
-
if (!capable(CAP_NET_RAW))
return -EPERM;
if (sock->type != SOCK_DGRAM && sock->type !=

[PATCH RFC 9/31] net: Implement the per network namespace sysctl infrastructure

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

The user interface is register_net_sysctl_table and
unregister_net_sysctl_table, very much like the current
interface except that there is a network namespace parameter.

With this, any sysctl in the net_root_table and its
subdirectories that is registered with register_net_sysctl_table
shows up only to tasks in the same network namespace.

All other sysctls continue to be globally visible.
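
A minimal user-space sketch of the visibility rule described above, not code from this patch: global entries are always listed, and per-namespace entries are listed only for a task in the matching namespace. The names model_sysctl, MODEL_GLOBAL and model_visible are hypothetical, and the integer namespace ids are an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_GLOBAL (-1)   /* hypothetical marker: entry is visible everywhere */

struct model_sysctl {
    const char *name;
    int net_id;             /* MODEL_GLOBAL, or the owning namespace id */
};

/* Collect the names a task in namespace task_net would see.
 * Returns the number of names stored in out (at most out_len). */
static size_t model_visible(const struct model_sysctl *tbl, size_t n,
                            int task_net, const char **out, size_t out_len)
{
    size_t i, seen = 0;

    for (i = 0; i < n && seen < out_len; i++) {
        if (tbl[i].net_id == MODEL_GLOBAL || tbl[i].net_id == task_net)
            out[seen++] = tbl[i].name;
    }
    return seen;
}
```

In the patch itself the same effect comes from sysctl_head_next walking the global list and then the list of the current task's network namespace; the flat table above only models the outcome.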

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/sysctl.h |7 
 include/net/sock.h |1 +
 kernel/sysctl.c|   71 ++-
 net/core/sysctl_net_core.c |5 +++
 net/sysctl_net.c   |   20 
 5 files changed, 102 insertions(+), 2 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 8eba2d2..286e723 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -1044,6 +1044,13 @@ struct ctl_table_header * 
register_sysctl_table(ctl_table * table);
 
 void unregister_sysctl_table(struct ctl_table_header * table);
 
+#ifdef CONFIG_NET
+#include <linux/net_namespace_type.h>
+extern struct ctl_table_header *register_net_sysctl_table(net_t net, struct 
ctl_table *table);
+extern void unregister_net_sysctl_table(struct ctl_table_header *header);
+DECLARE_PER_NET(struct ctl_table, net_root_table[]);
+#endif
+
 #else /* __KERNEL__ */
 
 #endif /* __KERNEL__ */
diff --git a/include/net/sock.h b/include/net/sock.h
index 5bf6bb5..01a2781 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1414,6 +1414,7 @@ extern void sk_init(void);
 
 #ifdef CONFIG_SYSCTL
 extern struct ctl_table core_table[];
+DECLARE_PER_NET(struct ctl_table, multi_core_table[]);
 #endif
 
 extern int sysctl_optmem_max;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7da313e..ae6a424 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -45,6 +45,7 @@
 #include <linux/syscalls.h>
 #include <linux/nfs_fs.h>
 #include <linux/acpi.h>
+#include <net/net_namespace.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -135,6 +136,10 @@ static int proc_do_cad_pid(ctl_table *table, int write, 
struct file *filp,
  void __user *buffer, size_t *lenp, loff_t *ppos);
 #endif
 
+#ifdef CONFIG_NET
+static DEFINE_PER_NET(struct ctl_table_header, net_table_header);
+#endif
+
 static ctl_table root_table[];
 static struct ctl_table_header root_table_header =
{ root_table, LIST_HEAD_INIT(root_table_header.ctl_entry) };
@@ -1059,6 +1064,7 @@ struct ctl_table_header *sysctl_head_next(struct 
ctl_table_header *prev)
 {
struct ctl_table_header *head;
struct list_head *tmp;
+   net_t net = current->nsproxy->net_ns;
spin_lock(&sysctl_lock);
if (prev) {
tmp = &prev->ctl_entry;
@@ -1076,6 +1082,10 @@ struct ctl_table_header *sysctl_head_next(struct 
ctl_table_header *prev)
next:
tmp = tmp->next;
if (tmp == &root_table_header.ctl_entry)
+#ifdef CONFIG_NET
+   tmp = &per_net(net_table_header, net).ctl_entry;
+   else if (tmp == &per_net(net_table_header, net).ctl_entry)
+#endif
break;
}
spin_unlock(&sysctl_lock);
@@ -1290,7 +1300,8 @@ int do_sysctl_strategy (ctl_table *table,
  * This routine returns %NULL on a failure to register, and a pointer
  * to the table header on success.
  */
-struct ctl_table_header *register_sysctl_table(ctl_table * table)
+static struct ctl_table_header *__register_sysctl_table(
+   struct ctl_table_header *root, ctl_table * table)
 {
struct ctl_table_header *tmp;
tmp = kmalloc(sizeof(struct ctl_table_header), GFP_KERNEL);
@@ -1301,11 +1312,16 @@ struct ctl_table_header 
*register_sysctl_table(ctl_table * table)
tmp->used = 0;
tmp->unregistering = NULL;
spin_lock(&sysctl_lock);
-   list_add_tail(&tmp->ctl_entry, &root_table_header.ctl_entry);
+   list_add_tail(&tmp->ctl_entry, &root->ctl_entry);
spin_unlock(&sysctl_lock);
return tmp;
 }
 
+struct ctl_table_header *register_sysctl_table(ctl_table *table)
+{
+   return __register_sysctl_table(&root_table_header, table);
+}
+
 /**
  * unregister_sysctl_table - unregister a sysctl table hierarchy
  * @header: the header returned from register_sysctl_table
@@ -1322,6 +1338,57 @@ void unregister_sysctl_table(struct ctl_table_header * 
header)
kfree(header);
 }
 
+#ifdef CONFIG_NET
+
+static void *fixup_per_net_addr(net_t net, void *addr)
+{
+   char *ptr = addr;
+   if ((ptr >= __per_net_start) && (ptr < __per_net_end))
+   ptr += __per_net_offset(net);
+   return ptr;
+}
+
+static void sysctl_net_table_fixup(net_t net, struct ctl_table *table)
+{
+   for (; table->ctl_name || table->procname; table++) {
+   table->child  = fixup_per_net_addr(net, table->child);
+   table->data   = fixup_per_net_addr(net, table->data);
+

[PATCH RFC 18/31] net: Implement network device movement between namespaces

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

This patch introduces NETIF_F_NETNS_LOCAL, a flag indicating that
a network device is local to a single network namespace and
should never be moved.  It is useful for pseudo devices that need
an instance in each network namespace (like the loopback
device), and for any device we find that cannot handle multiple
network namespaces, so that we can trap it in the initial network
namespace.

This patch introduces dev_change_net_namespace, a function
used to move a network device from one network
namespace to another.  To the network device nothing
special appears to happen; to the components of the network
stack it appears as if the network device was unregistered
in the network namespace it was in, and a new device
was registered in the network namespace the device
was moved to.

This patch sets up a namespace device destructor that,
upon the exit of a network namespace, moves all of the
movable network devices to the initial network namespace
so they are not lost.
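
As a rough illustration of the rules described above (not code from this patch), the following user-space sketch models a device that carries a namespace id and a "namespace local" flag. All names here (model_dev, MODEL_F_NETNS_LOCAL, model_change_net, model_net_exit) are hypothetical, and returning -EINVAL for a refused move is an assumption; the sketch also leaves pinned devices untouched on namespace exit, which the description above does not specify.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_INIT_NET      0     /* id of the initial namespace in this model */
#define MODEL_F_NETNS_LOCAL 0x1u  /* stand-in for NETIF_F_NETNS_LOCAL */

struct model_dev {
    const char *name;
    unsigned int features;
    int net_id;
};

/* Move one device; refuse if it is pinned to its namespace. */
static int model_change_net(struct model_dev *dev, int net_id)
{
    if (dev->features & MODEL_F_NETNS_LOCAL)
        return -EINVAL;           /* assumed error code */
    dev->net_id = net_id;
    return 0;
}

/* On namespace exit, push every movable device of the dying namespace
 * back to the initial namespace.  Returns how many devices were moved. */
static size_t model_net_exit(struct model_dev *devs, size_t n, int dying)
{
    size_t i, moved = 0;

    for (i = 0; i < n; i++) {
        if (devs[i].net_id != dying)
            continue;
        if (model_change_net(&devs[i], MODEL_INIT_NET) == 0)
            moved++;
    }
    return moved;
}
```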

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 drivers/net/loopback.c|3 +-
 include/linux/netdevice.h |3 +
 net/core/dev.c|  222 +++-
 3 files changed, 201 insertions(+), 27 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index e9abf3f..7d15de0 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -225,7 +225,8 @@ DEFINE_PER_NET(struct net_device, loopback_dev) = {
  | NETIF_F_TSO
 #endif
  | NETIF_F_NO_CSUM | NETIF_F_HIGHDMA
- | NETIF_F_LLTX,
+ | NETIF_F_LLTX
+ | NETIF_F_NETNS_LOCAL,
.ethtool_ops= &loopback_ethtool_ops,
 };
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0b4a4dc..3fcaf60 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -324,6 +324,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED1024/* Device cannot handle VLAN 
packets */
 #define NETIF_F_GSO2048/* Enable software GSO. */
 #define NETIF_F_LLTX   4096/* LockLess TX */
+#define NETIF_F_NETNS_LOCAL8192/* Does not change network namespaces */
 
/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT  16
@@ -710,6 +711,8 @@ extern int  dev_ethtool(net_t net, struct ifreq *);
 extern unsigneddev_get_flags(const struct net_device *);
 extern int dev_change_flags(struct net_device *, unsigned);
 extern int dev_change_name(struct net_device *, char *);
+extern int dev_change_net_namespace(struct net_device *, net_t,
+const char *);
 extern int dev_set_mtu(struct net_device *, int);
 extern int dev_set_mac_address(struct net_device *,
struct sockaddr *);
diff --git a/net/core/dev.c b/net/core/dev.c
index fc0d2af..52994e4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -198,6 +198,52 @@ static inline struct hlist_head *dev_index_hash(net_t net, 
int ifindex)
return &per_net(dev_index_head, net)[ifindex & 
((1<<NETDEV_HASHBITS)-1)];
 }
 
+/* Device list insertion */
+static int list_netdevice(struct net_device *dev)
+{
+   net_t net = dev->nd_net;
+
+   ASSERT_RTNL();
+
+   dev->next = NULL;
+   write_lock_bh(&per_net(dev_base_lock, net));
+   *per_net(dev_tail, net) = dev;
+   per_net(dev_tail, net) = &dev->next;
+   hlist_add_head(&dev->name_hlist, dev_name_hash(net, dev->name));
+   hlist_add_head(&dev->index_hlist, dev_index_hash(net, dev->ifindex));
+   write_unlock_bh(&per_net(dev_base_lock, net));
+   return 0;
+}
+
+/* Device list removal */
+static int unlist_netdevice(struct net_device *dev)
+{
+   struct net_device *d, **dp;
+   net_t net = dev->nd_net;
+
+   ASSERT_RTNL();
+
+   /* Unlink dev from the device chain */
+   for (dp = &per_net(dev_base, net); (d = *dp) != NULL; dp = &d->next) {
+   if (d == dev) {
+   write_lock_bh(&per_net(dev_base_lock, net));
+   hlist_del(&dev->name_hlist);
+   hlist_del(&dev->index_hlist);
+   if (per_net(dev_tail, net) == &dev->next)
+   per_net(dev_tail, net) = dp;
+   *dp = d->next;
+   write_unlock_bh(&per_net(dev_base_lock, net));
+   break;
+   }
+   }
+   if (!d) {
+   printk(KERN_ERR "unlist net_device: '%s' not found\n",
+  dev->name);
+   return -ENODEV;
+   }
+   return 0;
+}
+
 /*
  * Our notifier list
  */
@@ -3054,15 +3100,9 @@ int register_netdevice(struct net_device *dev)

[PATCH RFC 2/31] net: Implement a place holder network namespace

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Many of the changes to the network stack will simply be adding a
network namespace parameter to function calls or moving variables
from globals to being per network namespace.  When those variables
have initializers that cannot statically compute the proper value,
a function that runs at the creation and destruction of network
namespaces will need to be registered, and the logic will need to
be changed to accommodate that.

Adding unconditional support for these functions ensures that even when
everything else is compiled out the modified network stack logic will
continue to run correctly.

This patch adds struct pernet_operations that has an init (constructor)
and an exit (destructor) method.  When registered the init method
is called for every existing namespace, and when unregistered the
exit method is called for every existing namespace.  When a new
network namespace is created all of the init methods are called
in the order in which they were registered, and when a network namespace
is destroyed the exit methods are called in the reverse order in
which they were registered.

There are two distinct types of pernet_operations recognized: subsys and
device.  At creation all subsys init functions are called before device
init functions, and at destruction all device exit functions are called
before subsys exit functions.  For other ordering the preservation
of the order of registration combined with the various kinds of
kernel initcalls should be sufficient.
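
The ordering rules above can be sketched in user space as follows. This is only a model under stated assumptions, not the code in net/core/net_namespace.c: the fixed-size array, the model_* names and the string-based trace are all hypothetical. Subsys entries are kept ahead of device entries, creation walks the list forwards, and destruction walks it backwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_OPS 8

struct model_ops {
    const char *name;
    int is_device;                  /* 0 = subsys, 1 = device */
};

static struct model_ops model_list[MODEL_MAX_OPS];
static size_t model_count;          /* number of registered entries */
static size_t model_first_dev;      /* index of the first device entry */

/* Subsys entries go just before the first device entry, device entries
 * go at the tail, so registration order is kept within each class. */
static int model_register(const char *name, int is_device)
{
    size_t pos, i;

    if (model_count == MODEL_MAX_OPS)
        return -1;
    pos = is_device ? model_count : model_first_dev;
    for (i = model_count; i > pos; i--)
        model_list[i] = model_list[i - 1];
    model_list[pos].name = name;
    model_list[pos].is_device = is_device;
    model_count++;
    if (!is_device)
        model_first_dev++;
    return 0;
}

/* Append "name " to buf without overflowing it. */
static void model_append(char *buf, size_t len, const char *name)
{
    strncat(buf, name, len - strlen(buf) - 1);
    strncat(buf, " ", len - strlen(buf) - 1);
}

/* Order in which init methods would run for a new namespace. */
static void model_create_order(char *buf, size_t len)
{
    size_t i;

    buf[0] = '\0';
    for (i = 0; i < model_count; i++)
        model_append(buf, len, model_list[i].name);
}

/* Order in which exit methods would run for a dying namespace. */
static void model_destroy_order(char *buf, size_t len)
{
    size_t i;

    buf[0] = '\0';
    for (i = model_count; i > 0; i--)
        model_append(buf, len, model_list[i - 1].name);
}
```

Keeping a single list with a marker for the first device entry means that one forward walk and one backward walk give both orderings.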

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/net/net_namespace.h |   62 ++
 net/core/Makefile   |2 +-
 net/core/net_namespace.c|  149 +++
 3 files changed, 212 insertions(+), 1 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
new file mode 100644
index 000..06a9ba1
--- /dev/null
+++ b/include/net/net_namespace.h
@@ -0,0 +1,62 @@
+/* 
+ * Operations on the network namespace
+ */
+#ifndef __NET_NET_NAMESPACE_H
+#define __NET_NET_NAMESPACE_H
+
+#include <asm/atomic.h>
+#include <linux/workqueue.h>
+#include <linux/nsproxy.h>
+#include <linux/net_namespace_type.h>
+
+/* How many bytes in each network namespace should we allocate
+ * for use by modules when they are loaded.
+ */
+#ifdef CONFIG_MODULES
+# define PER_NET_MODULE_RESERVE 2048
+#else
+# define PER_NET_MODULE_RESERVE 0
+#endif
+
+struct net_namespace_head {
+   atomic_t count; /* To decide when the network namespace
+* should go 
+*/
+   atomic_t use_count; /* For references we destroy on demand */
+   struct list_head list;
+   struct work_struct work;
+};
+
+static inline net_t get_net(net_t net) { return net; }
+static inline void put_net(net_t net) {}
+static inline net_t hold_net(net_t net) { return net; }
+static inline void release_net(net_t net) {} 
+
+#define __per_net_start((char *)0)
+#define __per_net_end  ((char *)0)
+
+static inline int copy_net(int flags, struct task_struct *tsk) { return 0; }
+
+/* Don't let the list of network namespaces change */
+static inline void net_lock(void) {}
+static inline void net_unlock(void) {}
+
+#define for_each_net(VAR) if (1)
+
+extern net_t net_template;
+
+#define NET_CREATE 0x0001  /* A network namespace has been created */
+#define NET_DESTROY0x0002  /* A network namespace is being destroyed */
+
+struct pernet_operations {
+   struct list_head list;
+   int (*init)(net_t net);
+   void (*exit)(net_t net);
+};
+
+extern int register_pernet_subsys(struct pernet_operations *);
+extern void unregister_pernet_subsys(struct pernet_operations *);
+extern int register_pernet_device(struct pernet_operations *);
+extern void unregister_pernet_device(struct pernet_operations *);
+
+#endif /* __NET_NET_NAMESPACE_H */
diff --git a/net/core/Makefile b/net/core/Makefile
index 73272d5..554dbdc 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -3,7 +3,7 @@
 #
 
 obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o stream.o scm.o \
-gen_stats.o gen_estimator.o
+gen_stats.o gen_estimator.o net_namespace.o
 
 obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
new file mode 100644
index 000..4ae266d
--- /dev/null
+++ b/net/core/net_namespace.c
@@ -0,0 +1,149 @@
+#include <linux/rtnetlink.h>
+#include <net/net_namespace.h>
+
+/*
+ * Our network namespace constructor/destructor lists
+ */
+
+static LIST_HEAD(pernet_list);
+static struct list_head *first_device = &pernet_list;
+static DEFINE_MUTEX(net_mutex);
+net_t net_template;
+
+static int register_pernet_operations(struct list_head *list,
+ struct pernet_operations *ops)
+{
+   net_t net, undo_net;
+   int error;
+
+   error = 0;
+   list_add_tail(&ops->list, list);
+

[PATCH RFC 21/31] net: Implement the guts of the network namespace infrastructure

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Support is added for the .data.pernet section where all of
the variables that have a single instance in each network
namespace will live.  Every architecture's linker script
is modified, so it should work.

Summarizing the functions:
net_ns_init creates a slab and allocates the template and
the initial network namespace.

pernet_modcopy keeps the network namespaces in sync with
the loaded modules, initializing new data variables as
they are added.

Because the last reference can be dropped from interrupt context,
network namespace destruction queues itself for later with
schedule_work.  Then we alert everyone that the network namespace
is disappearing.  If a buggy user is still holding a reference
to the network namespace we print a nasty message and leak
the network namespace.

The rest are just light-weight wrapper functions to make things
more convenient.

A little should probably be said about net_head, the variable
at the start of my network namespace structure.  It is the only
variable with a location decided by the C code instead of the linker,
and I string them together in a linked list so I can iterate.

Probably more interesting is that it looks like it is saner not
to directly use a pointer to my network namespace but instead to
use an offset.  All of the references to data in my network namespace
are coming from per_net(...) which takes the address of the variable
in the .data.pernet section and then adds my magic offset.  If I
used a pointer I would have to subtract an additional value and
export an extra symbol.  Not good for performance or maintenance :)

The expected usage of network namespace variables is to replace
sequences like:  loopback_dev with per_net(loopback_dev, net)
where net is some network namespace reference.  In my preliminary
tests only a single additional addition is inserted, so it
appears to be an efficient idiom.  Hopefully it is also easy to
comprehend and use.
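
The offset idiom can be sketched in user space as follows. This is only an illustration under stated assumptions, not the implementation in this patch: an array of structs stands in for the copies of the .data.pernet section, the namespace reference is a byte offset from the first copy, and every name (model_section, model_net_t, model_net_create, model_per_net) is hypothetical. To keep the macro simple, all fields are assumed to be int.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the per-namespace data section; element 0 plays the
 * role of the template that new namespaces are copied from. */
struct model_section {
    int mtu;
    int dev_count;
};

#define MODEL_MAX_NET 3
static struct model_section model_sections[MODEL_MAX_NET] = { { 1500, 0 } };

/* A namespace reference is just a byte offset from the template copy. */
typedef struct { size_t offset; } model_net_t;

/* Create the reference for slot index, initializing it from the template. */
static model_net_t model_net_create(size_t index)
{
    model_net_t net;

    net.offset = index * sizeof(struct model_section);
    if (index != 0)
        memcpy(&model_sections[index], &model_sections[0],
               sizeof(struct model_section));
    return net;
}

/* Address of the variable in the template copy plus the namespace
 * offset: a single addition per access. */
#define model_per_net(field, net)                               \
    (*(int *)((char *)model_sections +                          \
              offsetof(struct model_section, field) + (net).offset))
```

Because the result is an lvalue, `model_per_net(dev_count, net) = 3;` writes to that namespace's copy only, which mirrors how per_net(loopback_dev, net) is meant to replace a plain global.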

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 arch/alpha/kernel/vmlinux.lds.S|2 +
 arch/arm/kernel/vmlinux.lds.S  |3 +
 arch/arm26/kernel/vmlinux-arm26-xip.lds.in |3 +
 arch/arm26/kernel/vmlinux-arm26.lds.in |3 +
 arch/avr32/kernel/vmlinux.lds.c|3 +
 arch/cris/arch-v10/vmlinux.lds.S   |2 +
 arch/cris/arch-v32/vmlinux.lds.S   |2 +
 arch/frv/kernel/vmlinux.lds.S  |2 +
 arch/h8300/kernel/vmlinux.lds.S|3 +
 arch/i386/kernel/vmlinux.lds.S |3 +
 arch/ia64/kernel/vmlinux.lds.S |2 +
 arch/m32r/kernel/vmlinux.lds.S |3 +
 arch/m68k/kernel/vmlinux-std.lds   |3 +
 arch/m68k/kernel/vmlinux-sun3.lds  |3 +
 arch/m68knommu/kernel/vmlinux.lds.S|3 +
 arch/mips/kernel/vmlinux.lds.S |3 +
 arch/parisc/kernel/vmlinux.lds.S   |3 +
 arch/powerpc/kernel/vmlinux.lds.S  |2 +
 arch/ppc/kernel/vmlinux.lds.S  |2 +
 arch/s390/kernel/vmlinux.lds.S |3 +
 arch/sh/kernel/vmlinux.lds.S   |3 +
 arch/sh64/kernel/vmlinux.lds.S |3 +
 arch/sparc/kernel/vmlinux.lds.S|3 +
 arch/sparc64/kernel/vmlinux.lds.S  |3 +
 arch/v850/kernel/vmlinux.lds.S |6 +-
 arch/x86_64/kernel/vmlinux.lds.S   |3 +
 arch/xtensa/kernel/vmlinux.lds.S   |2 +
 include/asm-generic/vmlinux.lds.h  |8 +
 include/asm-um/common.lds.S|4 +-
 include/linux/module.h |3 +
 include/linux/net_namespace_type.h |   63 -
 include/net/net_namespace.h|   49 ++-
 kernel/module.c|  211 -
 net/core/net_namespace.c   |  232 
 34 files changed, 631 insertions(+), 15 deletions(-)

diff --git a/arch/alpha/kernel/vmlinux.lds.S b/arch/alpha/kernel/vmlinux.lds.S
index 76bf071..ad20077 100644
--- a/arch/alpha/kernel/vmlinux.lds.S
+++ b/arch/alpha/kernel/vmlinux.lds.S
@@ -72,6 +72,8 @@ SECTIONS
   .data.percpu : { *(.data.percpu) }
   __per_cpu_end = .;
 
+  DATA_PER_NET
+
   . = ALIGN(2*8192);
   __init_end = .;
   /* Freed after init ends here */
diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index a8fa75e..5b003f9 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -61,6 +61,9 @@ SECTIONS
__per_cpu_start = .;
*(.data.percpu)
__per_cpu_end = .;
+
+   DATA_PER_NET
+
 #ifndef CONFIG_XIP_KERNEL
__init_begin = _stext;
*(.init.data)
diff --git a/arch/arm26/kernel/vmlinux-arm26-xip.lds.in 
b/arch/arm26/kernel/vmlinux-arm26-xip.lds.in
index ca61ec8..69d5772 100644
--- a/arch/arm26/kernel/vmlinux-arm26-xip.lds.in
+++

[PATCH RFC 14/31] net: Support multiple network namespaces with netlink

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Each netlink socket will live in exactly one network namespace;
this includes the controlling kernel sockets.

This patch updates all of the existing netlink protocols
to only support the initial network namespace.  Requests
by clients in other namespaces will get -ECONNREFUSED,
as they would if the kernel did not have the support for
that netlink protocol compiled in.

As each netlink protocol is updated to be multiple network
namespace safe it can register multiple kernel sockets
to acquire a presence in the rest of the network namespaces.

The implementation in af_netlink is a simple filter implementation
at hash table insertion and hash table look up time.
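
A minimal sketch of what such a filter could look like, assuming a flat table instead of the real hash table in af_netlink. Every name here (model_nl_sock, model_nl_lookup, model_nl_insert) is hypothetical, integer ids stand in for net_t, and the error codes are assumptions. The point is only that the namespace is compared alongside the pid both when inserting and when looking up.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_SLOTS 16

struct model_nl_sock {
    int in_use;
    int net_id;             /* namespace the socket lives in */
    unsigned int pid;       /* netlink port id */
};

static struct model_nl_sock model_table[MODEL_SLOTS];

/* Lookup matches only sockets in the caller's namespace. */
static struct model_nl_sock *model_nl_lookup(int net_id, unsigned int pid)
{
    size_t i;

    for (i = 0; i < MODEL_SLOTS; i++) {
        if (model_table[i].in_use &&
            model_table[i].net_id == net_id &&
            model_table[i].pid == pid)
            return &model_table[i];
    }
    return NULL;
}

/* Insertion treats a pid as taken only within the same namespace. */
static int model_nl_insert(int net_id, unsigned int pid)
{
    size_t i;

    if (model_nl_lookup(net_id, pid))
        return -EADDRINUSE;         /* assumed error code */
    for (i = 0; i < MODEL_SLOTS; i++) {
        if (!model_table[i].in_use) {
            model_table[i].in_use = 1;
            model_table[i].net_id = net_id;
            model_table[i].pid = pid;
            return 0;
        }
    }
    return -ENOMEM;                 /* table full in this model */
}
```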

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 drivers/scsi/scsi_netlink.c |2 +-
 drivers/scsi/scsi_transport_iscsi.c |2 +-
 include/linux/netlink.h |3 +-
 kernel/audit.c  |4 +-
 lib/kobject_uevent.c|4 +-
 net/bridge/netfilter/ebt_ulog.c |5 +-
 net/core/rtnetlink.c|4 +-
 net/decnet/netfilter/dn_rtmsg.c |3 +-
 net/ipv4/fib_frontend.c |3 +-
 net/ipv4/inet_diag.c|4 +-
 net/ipv4/netfilter/ip_queue.c   |6 +-
 net/ipv4/netfilter/ipt_ULOG.c   |4 +-
 net/ipv6/netfilter/ip6_queue.c  |4 +-
 net/netfilter/nfnetlink.c   |2 +-
 net/netfilter/nfnetlink_log.c   |3 +-
 net/netfilter/nfnetlink_queue.c |3 +-
 net/netlink/af_netlink.c|  104 ++-
 net/netlink/genetlink.c |4 +-
 net/xfrm/xfrm_user.c|2 +-
 19 files changed, 112 insertions(+), 54 deletions(-)

diff --git a/drivers/scsi/scsi_netlink.c b/drivers/scsi/scsi_netlink.c
index 1b59b27..02c2c1e 100644
--- a/drivers/scsi/scsi_netlink.c
+++ b/drivers/scsi/scsi_netlink.c
@@ -167,7 +167,7 @@ scsi_netlink_init(void)
return;
}
 
-   scsi_nl_sock = netlink_kernel_create(NETLINK_SCSITRANSPORT,
+   scsi_nl_sock = netlink_kernel_create(init_net(), NETLINK_SCSITRANSPORT,
SCSI_NL_GRP_CNT, scsi_nl_rcv, THIS_MODULE);
if (!scsi_nl_sock) {
printk(KERN_ERR "%s: register of recieve handler failed\n",
diff --git a/drivers/scsi/scsi_transport_iscsi.c 
b/drivers/scsi/scsi_transport_iscsi.c
index 9c22f13..1ad22c2 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -1435,7 +1435,7 @@ static __init int iscsi_transport_init(void)
if (err)
goto unregister_conn_class;
 
-   nls = netlink_kernel_create(NETLINK_ISCSI, 1, iscsi_if_rx,
+   nls = netlink_kernel_create(init_net(), NETLINK_ISCSI, 1, iscsi_if_rx,
THIS_MODULE);
if (!nls) {
err = -ENOBUFS;
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index b3b9b60..9dacd00 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -151,7 +151,7 @@ struct netlink_skb_parms
 #define NETLINK_CREDS(skb) (NETLINK_CB((skb)).creds)
 
 
-extern struct sock *netlink_kernel_create(int unit, unsigned int groups, void 
(*input)(struct sock *sk, int len), struct module *module);
+extern struct sock *netlink_kernel_create(net_t net, int unit, unsigned int 
groups, void (*input)(struct sock *sk, int len), struct module *module);
 extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err);
 extern int netlink_has_listeners(struct sock *sk, unsigned int group);
 extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 pid, 
int nonblock);
@@ -188,6 +188,7 @@ struct netlink_callback
 
 struct netlink_notify
 {
+   net_t net;
int pid;
int protocol;
 };
diff --git a/kernel/audit.c b/kernel/audit.c
index d9b690a..b0c5c61 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -696,8 +696,8 @@ static int __init audit_init(void)
 
printk(KERN_INFO "audit: initializing netlink socket (%s)\n",
   audit_default ? "enabled" : "disabled");
-   audit_sock = netlink_kernel_create(NETLINK_AUDIT, 0, audit_receive,
-  THIS_MODULE);
+   audit_sock = netlink_kernel_create(init_net(), NETLINK_AUDIT, 0,
+  audit_receive, THIS_MODULE);
if (!audit_sock)
audit_panic("cannot initialize netlink socket");
else
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 84272ed..9a5d4ca 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -292,8 +292,8 @@ EXPORT_SYMBOL_GPL(add_uevent_var);
 #if defined(CONFIG_NET)
 static int __init kobject_uevent_init(void)
 {
-   uevent_sock = netlink_kernel_create(NETLINK_KOBJECT_UEVENT, 1, NULL,
-   THIS_MODULE);
+   uevent_sock = netlink_kernel_create(init_net(), NETLINK_KOBJECT_UEVENT, 
1,
+

[PATCH RFC 10/31] net: Make socket creation namespace safe.

2007-01-25 Thread Eric W. Biederman

From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

This patch passes in the namespace a new socket should be created in
and has the socket code do the appropriate reference counting.  By
virtue of this all socket create methods are touched.  In addition
the socket create methods are modified so that they will fail if
you attempt to create a socket in a non-default network namespace.

Failing if we attempt to create a socket outside of the default
network namespace ensures that as we incrementally make the network stack
network namespace aware we will not export functionality that someone
has not audited and made certain is network namespace safe.
This allows us to partially enable network namespaces before all of the
exotic protocols are supported.

Any protocol layers I have missed will fail to compile because I now
pass an extra parameter into the socket creation code.
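
The gate added to each create method can be modelled in user space roughly as below. This is a sketch, not the patch code: model_net, model_init_net and model_create are hypothetical, and the audited flag is an illustrative stand-in for "this protocol has been converted", which in the patches is expressed by removing the check from that protocol's create method.

```c
#include <assert.h>
#include <errno.h>

struct model_net {
    int id;
};

static struct model_net model_init_net = { 0 };  /* the initial namespace */

static int model_net_eq(const struct model_net *a, const struct model_net *b)
{
    return a == b;
}

/* A protocol that has not been audited only works in the initial
 * namespace; everywhere else the family appears unsupported. */
static int model_create(const struct model_net *net, int audited)
{
    if (!audited && !model_net_eq(net, &model_init_net))
        return -EAFNOSUPPORT;
    /* ... allocate the socket in namespace net here ... */
    return 0;
}
```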

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 drivers/net/pppoe.c  |4 ++--
 drivers/net/pppox.c  |7 +--
 include/linux/if_pppox.h |2 +-
 include/linux/net.h  |3 ++-
 include/net/llc_conn.h   |2 +-
 include/net/sock.h   |4 +++-
 net/appletalk/ddp.c  |7 +--
 net/atm/common.c |4 ++--
 net/atm/common.h |2 +-
 net/atm/pvc.c|7 +--
 net/atm/svc.c|   11 +++
 net/ax25/af_ax25.c   |9 ++---
 net/bluetooth/af_bluetooth.c |7 +--
 net/bluetooth/bnep/sock.c|4 ++--
 net/bluetooth/cmtp/sock.c|4 ++--
 net/bluetooth/hci_sock.c |4 ++--
 net/bluetooth/hidp/sock.c|4 ++--
 net/bluetooth/l2cap.c|   10 +-
 net/bluetooth/rfcomm/sock.c  |   10 +-
 net/bluetooth/sco.c  |   10 +-
 net/core/sock.c  |6 --
 net/decnet/af_decnet.c   |   13 -
 net/econet/af_econet.c   |7 +--
 net/ipv4/af_inet.c   |7 +--
 net/ipv6/af_inet6.c  |7 +--
 net/ipx/af_ipx.c |7 +--
 net/irda/af_irda.c   |   11 +++
 net/key/af_key.c |7 +--
 net/llc/af_llc.c |7 +--
 net/llc/llc_conn.c   |6 +++---
 net/netlink/af_netlink.c |   13 -
 net/netrom/af_netrom.c   |9 ++---
 net/packet/af_packet.c   |7 +--
 net/rose/af_rose.c   |9 ++---
 net/sctp/ipv6.c  |2 +-
 net/sctp/protocol.c  |2 +-
 net/socket.c |8 
 net/tipc/socket.c|9 ++---
 net/unix/af_unix.c   |   13 -
 net/wanrouter/af_wanpipe.c   |   15 +--
 net/x25/af_x25.c |   13 -
 41 files changed, 182 insertions(+), 111 deletions(-)

diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
index d34fe16..d09334d 100644
--- a/drivers/net/pppoe.c
+++ b/drivers/net/pppoe.c
@@ -475,12 +475,12 @@ static struct proto pppoe_sk_proto = {
  * Initialize a new struct sock.
  *
  **/
-static int pppoe_create(struct socket *sock)
+static int pppoe_create(net_t net, struct socket *sock)
 {
int error = -ENOMEM;
struct sock *sk;
 
-   sk = sk_alloc(PF_PPPOX, GFP_KERNEL, &pppoe_sk_proto, 1);
+   sk = sk_alloc(net, PF_PPPOX, GFP_KERNEL, &pppoe_sk_proto, 1);
if (!sk)
goto out;
 
diff --git a/drivers/net/pppox.c b/drivers/net/pppox.c
index 9315046..0d5c7bc 100644
--- a/drivers/net/pppox.c
+++ b/drivers/net/pppox.c
@@ -106,10 +106,13 @@ int pppox_ioctl(struct socket *sock, unsigned int cmd, 
unsigned long arg)
 
 EXPORT_SYMBOL(pppox_ioctl);
 
-static int pppox_create(struct socket *sock, int protocol)
+static int pppox_create(net_t net, struct socket *sock, int protocol)
 {
int rc = -EPROTOTYPE;
 
+   if (!net_eq(net, init_net()))
+   return -EAFNOSUPPORT;
+
if (protocol < 0 || protocol > PX_MAX_PROTO)
goto out;
 
@@ -118,7 +121,7 @@ static int pppox_create(struct socket *sock, int protocol)
!try_module_get(pppox_protos[protocol]->owner))
goto out;
 
-   rc = pppox_protos[protocol]->create(sock);
+   rc = pppox_protos[protocol]->create(net, sock);
 
module_put(pppox_protos[protocol]->owner);
 out:
diff --git a/include/linux/if_pppox.h b/include/linux/if_pppox.h
index 4fab3d0..f6ffd83 100644
--- a/include/linux/if_pppox.h
+++ b/include/linux/if_pppox.h
@@ -148,7 +148,7 @@ static inline struct sock *sk_pppox(struct pppox_sock *po)
 struct module;
 
 struct pppox_proto {
-   int (*create)(struct socket *sock);
+   int (*create)(net_t net, struct socket *sock);
int (*ioctl)(struct socket *sock, unsigned int cmd,
 unsigned long arg);
struct module   *owner;
diff --git

Re: [PATCH RFC 2/31] net: Implement a place holder network namespace

2007-01-25 Thread Stephen Hemminger

On Thu, 25 Jan 2007 12:00:04 -0700
Eric W. Biederman [EMAIL PROTECTED] wrote:

 From: Eric W. Biederman [EMAIL PROTECTED] - unquoted
 
 Many of the changes to the network stack will simply be adding a
 network namespace parameter to function calls or moving variables
 from globals to being per network namespace.  When those variables
 have initializers that cannot statically compute the proper value,
 a function that runs at the creation and destruction of network
 namespaces will need to be registered, and the logic will need to
 be changed to accommodate that.
 
 Adding unconditional support for these functions ensures that even when
 everything else is compiled out the modified network stack logic will
 continue to run correctly.
 
 This patch adds struct pernet_operations that has an init (constructor)
 and an exit (destructor) method.  When registered the init method
 is called for every existing namespace, and when unregistered the
 exit method is called for every existing namespace.  When a new
 network namespace is created all of the init methods are called
 in the order in which they were registered, and when a network namespace
 is destroyed the exit methods are called in the reverse order in
 which they were registered.
 
 There are two distinct types of pernet_operations recognized: subsys and
 device.  At creation all subsys init functions are called before device
 init functions, and at destruction all device exit functions are called
 before subsys exit functions.  For other ordering the preservation
 of the order of registration combined with the various kinds of
 kernel initcalls should be sufficient.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]

 +
 +static inline net_t get_net(net_t net) { return net; }
 +static inline void put_net(net_t net) {}
 +static inline net_t hold_net(net_t net) { return net; }
 +static inline void release_net(net_t net) {} 
 +
 +#define __per_net_start  ((char *)0)
 +#define __per_net_end((char *)0)

Don't use these; use NULL.

 +
 +static inline int copy_net(int flags, struct task_struct *tsk) { return 0; }
 +
 +/* Don't let the list of network namespaces change */
 +static inline void net_lock(void) {}
 +static inline void net_unlock(void) {}

Don't put these all on one line, or use #define instead.


 +
 +#define for_each_net(VAR) if (1)
 +
 +extern net_t net_template;
 +
 +#define NET_CREATE   0x0001  /* A network namespace has been created */
 +#define NET_DESTROY  0x0002  /* A network namespace is being destroyed */
 +
 +struct pernet_operations {
 + struct list_head list;
 + int (*init)(net_t net);
 + void (*exit)(net_t net);
 +};
 +
 +extern int register_pernet_subsys(struct pernet_operations *);
 +extern void unregister_pernet_subsys(struct pernet_operations *);
 +extern int register_pernet_device(struct pernet_operations *);
 +extern void unregister_pernet_device(struct pernet_operations *);
 +
 +#endif /* __NET_NET_NAMESPACE_H */
 diff --git a/net/core/Makefile b/net/core/Makefile
 index 73272d5..554dbdc 100644
 --- a/net/core/Makefile
 +++ b/net/core/Makefile
 @@ -3,7 +3,7 @@
  #
  
  obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o stream.o scm.o \
 -  gen_stats.o gen_estimator.o
 +  gen_stats.o gen_estimator.o net_namespace.o
  
  obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
  
 diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
 new file mode 100644
 index 000..4ae266d
 --- /dev/null
 +++ b/net/core/net_namespace.c
 @@ -0,0 +1,149 @@
 +#include <linux/rtnetlink.h>
 +#include <net/net_namespace.h>
 +
 +/*
 + *   Our network namespace constructor/destructor lists
 + */
 +
 +static LIST_HEAD(pernet_list);
 +static struct list_head *first_device = &pernet_list;
 +static DEFINE_MUTEX(net_mutex);
 +net_t net_template;
 +
 +static int register_pernet_operations(struct list_head *list,
 +   struct pernet_operations *ops)
 +{
 + net_t net, undo_net;
 + int error;
 +
 + error = 0;
 + list_add_tail(&ops->list, list);
 + for_each_net(net) {
 + if (ops->init) {
 + error = ops->init(net);
 + if (error)
 + goto out_undo;
 + }
 + }
 +out:
 + return error;
 +
 +out_undo:
 + /* If I have an error cleanup all namespaces I initialized */
 + list_del(&ops->list);
 + for_each_net(undo_net) {
 + if (net_eq(undo_net, net))
 + goto undone;
 + if (ops->exit)
 + ops->exit(undo_net);
 + }
 +undone:
 + goto out;
 +}
 +
 +static void unregister_pernet_operations(struct pernet_operations *ops)
 +{
 + net_t net;
 +
 + list_del(&ops->list);
 + for_each_net(net)
 + if (ops->exit)
 + ops->exit(net);
 +}
 +


You should use RCU for this because registering/unregistering network
namespaces is obviously a much rarer occurrence than referencing them.
-- 
Stephen Hemminger

Re: [PATCH RFC 1/31] net: Add net_namespace_type.h to allow for per network namespace variables.

2007-01-25 Thread Stephen Hemminger

Can all this be a nop if a CONFIG option is not selected?





 diff --git a/include/linux/net_namespace_type.h 
 b/include/linux/net_namespace_type.h
 new file mode 100644
 index 000..8173f59
 --- /dev/null
 +++ b/include/linux/net_namespace_type.h
 @@ -0,0 +1,52 @@
 +/* 
 + * Definition of the network namespace reference type
 + * And operations upon it.
 + */
 +#ifndef __LINUX_NET_NAMESPACE_TYPE_H
 +#define __LINUX_NET_NAMESPACE_TYPE_H
 +
 +#define __pernetname(name) per_net__##name

Code obfuscation, please don't do that

 +typedef struct {} net_t;

No typedef for this please.

 +
 +#define __data_pernet 
 +
 +/* Look up a per network namespace variable */
 +static inline unsigned long __per_net_offset(net_t net) { return 0; }
 +
 +/* Like per_net but returns a pseudo variable address that must be moved
 + * __per_net_offset() bytes before it will point to a real variable.
 + * Useful for static initializers.
 + */
 +#define __per_net_base(name)   __pernetname(name)
 +
 +/* Get the network namespace reference from a per_net variable address */
 +#define net_of(ptr, name) ({ net_t net; ptr; net; })
 +
 +/* Look up a per network namespace variable */
 +#define per_net(name, net) \
 +	(*(__per_net_offset(net), &__per_net_base(name)))
 + 
 +/* Are the two network namespaces the same */
 +static inline int net_eq(net_t a, net_t b) { return 1; }
 +/* Get an unsigned value appropriate for hashing the network namespace */
 +static inline unsigned int net_hval(net_t net) { return 0; }
 +
 +/* Convert to and from void pointers */
 +static inline void *net_to_voidp(net_t net) { return NULL; }
 +static inline net_t net_from_voidp(void *ptr) { net_t net; return net; }
 +
 +static inline int null_net(net_t net) { return 0; }
 +
 +#define DEFINE_PER_NET(type, name)   \
 + __data_pernet __typeof__(type) __pernetname(name)
 +
 +#define DECLARE_PER_NET(type, name) \
 + extern __typeof__(type) __pernetname(name)
 +
 +#define EXPORT_PER_NET_SYMBOL(var)   \
 + EXPORT_SYMBOL(__pernetname(var))
 +#define EXPORT_PER_NET_SYMBOL_GPL(var)   \
 + EXPORT_SYMBOL_GPL(__pernetname(var))
 +
 +#endif /* __LINUX_NET_NAMESPACE_TYPE_H */


-- 
Stephen Hemminger [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC 1/31] net: Add net_namespace_type.h to allow for per network namespace variables.

2007-01-25 Thread Eric W. Biederman

Stephen Hemminger [EMAIL PROTECTED] writes:

 Can all this be a nop if a CONFIG option is not selected?

That is exactly what this infrastructure supports.
What you see is the version that comes into effect when
the CONFIG option is not selected.

Everything from using an empty structure in place of a pointer, so that
passing it around is a NOP, to most of the rest below serves that purpose.
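
A minimal userspace sketch of that compiled-out case follows. It assumes a GNU C compiler, since the empty struct is a GNU extension, and `count_matches()` is a hypothetical caller invented for illustration, not something from the patch series. The idea it shows is that code written against `net_t` still compiles, while the namespace test folds to a constant.

```c
#include <assert.h>
#include <stddef.h>

/* Compiled-out flavour: the namespace reference carries no data at all.
 * An empty struct is a GNU C extension, as used in the patch under review. */
typedef struct {} net_t;

/* With a single namespace, every reference compares equal. */
static inline int net_eq(net_t a, net_t b) { (void)a; (void)b; return 1; }

/* Nothing to hash on, so every reference hashes to the same value. */
static inline unsigned int net_hval(net_t net) { (void)net; return 0; }

/* Hypothetical caller: counts positive ids that belong to "our" namespace.
 * The net_eq() test is a compile-time constant here. */
static size_t count_matches(const int *ids, size_t n, net_t mine, net_t theirs)
{
	size_t i, hits = 0;

	assert(ids != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (ids[i] > 0 && net_eq(mine, theirs))
			hits++;
	return hits;
}
```

Because `net_eq()` always returns 1 in this flavour, an optimizing compiler is free to drop the comparison entirely, which is the sense in which the infrastructure becomes a NOP.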


 diff --git a/include/linux/net_namespace_type.h b/include/linux/net_namespace_type.h
 new file mode 100644
 index 000..8173f59
 --- /dev/null
 +++ b/include/linux/net_namespace_type.h
 @@ -0,0 +1,52 @@
 +/* 
 + * Definition of the network namespace reference type
 + * And operations upon it.
 + */
 +#ifndef __LINUX_NET_NAMESPACE_TYPE_H
 +#define __LINUX_NET_NAMESPACE_TYPE_H
 +
 +#define __pernetname(name) per_net__##name

 Code obfuscation, please don't do that

It gives a single point where the naming rule is defined, which is
easier to maintain.  The basic point is that you should not directly
access variables that come through this path.  Tweaking the name
enforces that even in the compiled-out state.
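
The effect of the name tweak can be sketched in userspace as below. This is a simplified illustration, not the macros from the patch: `per_net()` here drops the offset computation, and `loopback_mtu`, `read_mtu()` and `set_mtu()` are hypothetical names. It assumes GNU C (`__typeof__`, empty struct).

```c
#include <assert.h>

/* Single place that decides how per-namespace variables are named. */
#define __pernetname(name) per_net__##name

#define DEFINE_PER_NET(type, name) __typeof__(type) __pernetname(name)

/* Simplified accessor for the compiled-out case: the namespace argument
 * is evaluated (so it is type checked) and then ignored. */
#define per_net(name, net) (*((void)(net), &__pernetname(name)))

typedef struct {} net_t;

/* Hypothetical per-namespace variable; its real symbol is per_net__loopback_mtu. */
static DEFINE_PER_NET(int, loopback_mtu) = 16436;

static int read_mtu(net_t net)
{
	return per_net(loopback_mtu, net);
}

static void set_mtu(net_t net, int mtu)
{
	assert(mtu > 0);
	per_net(loopback_mtu, net) = mtu;
}
```

A plain reference to `loopback_mtu` does not compile, because no symbol of that name exists; callers are pushed through `per_net()` whether or not namespaces are configured in.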

 +typedef struct {} net_t;

 No typedef for this please.

Why?  That is conventionally how we do opaque types in Linux
when someone is doing something sophisticated.


You probably want to look down to patch 21 to see what the compiled-in
version of these looks like.

Eric

wireless-dev updated 2007-01-25

2007-01-25 Thread John W. Linville

The following changes since commit a4893aa0bb61c7bbced8fcdea874cb8d0e1d3a8d:
  John W. Linville (1):
Merge branch 'from-linus'

are found in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-dev.git

Gertjan van Wingerde (1):
  d80211: Select CRYPTO_ECB when enabler d80211.

Ivo van Doorn (4):
  eeprom_93cx6
  rt2x00 should use generic eeprom_93cx6
  crc-itu-t
  rt2x00 should use generic crc-itu-t

Jan Kiszka (1):
  d80211: Fix inconsistent sta_lock usage

John W. Linville (2):
  Merge http://bu3sch.de/git/wireless-dev
  Merge git://git.kernel.org/.../jbenc/dscape

Michael Buesch (47):
  bcm43xx-d80211: Add some PHY register definitions.
  bcm43xx-d80211: Move ILT stuff to OFDM table stuff
  bcm43xx-d80211: Remove PHY OFDM routing bit, if we are on A-PHY.
  bcm43xx-d80211: Merge new LO-control code.
  bcm43xx-d80211: Fix compilation: Missing files for LO and VSTACK.
  bcm43xx-d80211: Rename struct bcm43xx_phyinfo to struct bcm43xx_phy
  bcm43xx-d80211: merge struct bcm43xx_radioinfo into struct bcm43xx_phy
  bcm43xx-d80211: Merge all radio stuff into phy.c
  Merge branch 'master' of git://kernel.org/.../linville/wireless-dev
  bcm43xx-d80211: Drain TXstatus queue before enabling IRQs.
  bcm43xx-d80211: Fix antenna selection for TX and RX.
  bcm43xx-d80211: Fix bogus LO validation failure.
  bcm43xx-d80211: Remove netpoll and ethtool stuff.
  Merge branch 'master' of git://kernel.org/.../linville/wireless-dev
  Merge branch 'master' of git://kernel.org/.../linville/wireless-dev
  Merge branch 'master' of git://kernel.org/.../linville/wireless-dev
  Merge branch 'master' of git://zeus2.kernel.org/.../linville/wireless-dev
  Remove obsolete SSB driver library.
  Implement new SSB subsystem.
  bcm43xx-d80211: Port driver to the new SSB subsystem.
  bcm43xx-d80211: Fix LO feedthrough measurement.
  ssb: Fix dependencies. MIPS core must depend on MIPS platform.
  ssb: Fix typo. SSB_PCICORE_SBTOPCI1_CFG1 does not exist.
  ssb, bcm43xx-d80211: Move DMA translation logic to ssb.
  bcm43xx-d80211: Remove bogus call to refresh_templates in add_interface.
  ssb, bcm43xx-d80211: Add function to set DMA mask on SSB.
  ssb: Fix busnumber assignment. Must assign it before scanning the bus.
  ssb: Allow disabling of all PCI related stuff.
  ssb: PCMCIA-hostbus support.
  bcm43xx-d80211: Support for PCMCIA devices.
  bcm43xx-d80211: re-add chipid printk
  bcm43xx-d80211: Remove leds_exit() call in detach stage.
  bcm43xx-d80211: Get rid of PHY-connected semantics.
  bcm43xx-d80211: Fix wrong register write in lo_measure_feedthrough().
  bcm43xx-d80211: Fix error return codes.
  bcm43xx-d80211: Various cleanups all over the code.
  bcm43xx-d80211: Fix semantical errors in LO measure setup.
  bcm43xx-d80211: gphy init: Some cleanups and some bugfixes.
  ssb: Add missing include to delay.h in ssb/pcmcia.c
  ssb, usb: Implement SSB based Broadcom USB OHCI driver.
  ssb: PCIcore hostmode fixes.
  ssb: export ssb_clockspeed()
  ssb: add PM config register definitions
  ssb: b44 related fixes.
  bcm43xx-d80211: Fix initial LO Calibration.
  bcm43xx-d80211: Fix loopback gain calculation.
  bcm43xx-d80211: Fix DMA TX skb doublefree

Michael Wu (2):
  d80211: Only free WEP crypto ciphers when they have been allocated correctly.
  d80211: Fix __ieee80211_if_del on live interfaces

Pavel Roskin (1):
  bcm43xx_d80211: Fix major memory corruption bug

 drivers/Kconfig|2 +
 drivers/Makefile   |1 +
 drivers/misc/Kconfig   |4 -
 drivers/misc/Makefile  |1 -
 drivers/misc/ssb.c | 1074 -
 drivers/net/wireless/d80211/bcm43xx/Kconfig|   40 +-
 drivers/net/wireless/d80211/bcm43xx/Makefile   |   11 +-
 drivers/net/wireless/d80211/bcm43xx/bcm43xx.h  |  441 +-
 .../net/wireless/d80211/bcm43xx/bcm43xx_debugfs.c  |   72 +-
 .../net/wireless/d80211/bcm43xx/bcm43xx_debugfs.h  |   16 +-
 drivers/net/wireless/d80211/bcm43xx/bcm43xx_dma.c  |  197 +-
 drivers/net/wireless/d80211/bcm43xx/bcm43xx_dma.h  |   48 +-
 drivers/net/wireless/d80211/bcm43xx/bcm43xx_ilt.c  |  337 --
 drivers/net/wireless/d80211/bcm43xx/bcm43xx_ilt.h  |   32 -
 drivers/net/wireless/d80211/bcm43xx/bcm43xx_leds.c |   89 +-
 drivers/net/wireless/d80211/bcm43xx/bcm43xx_leds.h |   12 +-
 drivers/net/wireless/d80211/bcm43xx/bcm43xx_lo.c   | 1060 +
 drivers/net/wireless/d80211/bcm43xx/bcm43xx_lo.h   |   91 +
 drivers/net/wireless/d80211/bcm43xx/bcm43xx_main.c | 3972 +++-
 drivers/net/wireless/d80211/bcm43xx/bcm43xx_main.h |   40 +-
 drivers/net/wireless/d80211/bcm43xx/bcm43xx_pci.c  |  147 +

Re: [PATCH 2.6.20-rc5] IPV6: skb is unexpectedly freed.

2007-01-25 Thread Masayuki Nakagawa

David, Yoshifuji-san, Herbert,

I appreciate your feedback.

I made another patch that simply replaces __kfree_skb() in the exit path with
kfree_skb(). I tested it overnight with a chat benchmark tool and
my test program, which can reproduce the original problem.

I didn't see any problems
(for example, neither an oops nor a memory leak occurred).

I will post the patch in a few moments. Please take a look at it.

Thanks,
Masa

David Miller wrote:
 From: YOSHIFUJI Hideaki [EMAIL PROTECTED]
 Date: Wed, 24 Jan 2007 13:37:25 +0900 (JST)

   
 In article [EMAIL PROTECTED] (at Wed, 24 Jan 2007 15:31:47 +1100), Herbert Xu [EMAIL PROTECTED] says:

 Masayuki Nakagawa [EMAIL PROTECTED] wrote:
 I suggest to use kfree_skb() instead of __kfree_skb().

 I agree.  In fact please do it for all paths in that function, i.e.,
 just change __kfree_skb to kfree_skb rather than adding a special case
 for this path.

 I do think so, too.
 

 So do I, but initially I want to push his basic patch in
 so that I can push the same exact thing into -stable to
 fix this bug.

 So if you make the subsequent change, please make it relative
 to the original patch.

 Thank you.

[PATCH] TCP: Replace __kfree_skb() with kfree_skb()

2007-01-25 Thread Masayuki Nakagawa

This patch simply replaces __kfree_skb() in the exit path with kfree_skb().
In tcp_rcv_state_process(), skbs should generally be destroyed only when
the ref count is zero.
That is the way things are supposed to be done in the kernel.

This change might reveal an skb memory leak.
If that happens, it would be because someone doesn't deal with the skb properly.

Signed-off-by: Masayuki Nakagawa [EMAIL PROTECTED]

--- linux-2.6/net/ipv4/tcp_input.c.orig 2007-01-25 07:04:35.0 -0800
+++ linux-2.6/net/ipv4/tcp_input.c  2007-01-25 07:05:05.0 -0800
@@ -4423,8 +4423,6 @@ int tcp_rcv_state_process(struct sock *s
 * in the interest of security over speed unless
 * it's still in use.
 */
-   kfree_skb(skb);
-   return 0;
}
goto discard;

@@ -4634,7 +4632,7 @@ int tcp_rcv_state_process(struct sock *s

if (!queued) {
 discard:
-   __kfree_skb(skb);
+   kfree_skb(skb);
}
return 0;
 }
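
For readers less familiar with the distinction the patch relies on, here is a small userspace sketch of the two release styles. It is an analogy only: `struct buf` and the `buf_*` helpers are hypothetical stand-ins, not kernel API. `buf_put()` plays the role of kfree_skb() (drop one reference, free at zero) and `buf_free_now()` the role of __kfree_skb() (free regardless of other holders).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff: only the reference count matters. */
struct buf {
	int users;	/* number of holders, like skb->users */
	int *freed;	/* set to 1 when the buffer is really released */
};

static struct buf *buf_alloc(int *freed)
{
	struct buf *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->users = 1;	/* the allocator holds the first reference */
	b->freed = freed;
	return b;
}

static void buf_get(struct buf *b)
{
	b->users++;
}

/* Unconditional release: unsafe if anyone else still holds a reference. */
static void buf_free_now(struct buf *b)
{
	if (b->freed)
		*b->freed = 1;
	free(b);
}

/* Refcount-aware release: frees only when the last holder lets go. */
static void buf_put(struct buf *b)
{
	if (!b)
		return;
	assert(b->users > 0);
	if (--b->users == 0)
		buf_free_now(b);
}
```

Calling `buf_free_now()` while a second holder exists would leave that holder with a dangling pointer, which is the kind of unexpected free the subject line of this thread refers to.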

Re: [PATCH] Fix sorting of SACK blocks

2007-01-25 Thread Baruch Even

* David Miller [EMAIL PROTECTED] [070126 01:55]:
 From: Baruch Even [EMAIL PROTECTED]
 Date: Thu, 25 Jan 2007 20:29:03 +0200

  The sorting of SACK blocks actually munges them rather than sorting them,
  causing the TCP stack to ignore some SACK information and breaking the
  assumption of ordered SACK blocks after sorting.

  The sort takes the data from a second buffer which isn't moved causing
  subsequent data moves to occur from the wrong location. The fix is to
  use a temporary buffer as a normal sort does.

  Signed-Off-By: Baruch Even [EMAIL PROTECTED]
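
To make the "temporary buffer" remark concrete, here is a hedged userspace sketch of an in-place bubble sort that swaps whole blocks through a temporary. It is not the kernel code: `struct sack_block`, `seq_after()` and `sort_sack_blocks()` are illustrative names, and host byte order is assumed throughout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative SACK block in host byte order. */
struct sack_block {
	uint32_t start_seq;
	uint32_t end_seq;
};

/* Wrap-safe "a comes after b" for 32-bit sequence numbers. */
static int seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Bubble sort by start_seq. Each swap goes through a temporary copy, so
 * both elements are taken from the array being sorted, never from a
 * second buffer that has not been reordered. */
static void sort_sack_blocks(struct sack_block *sp, int n)
{
	int i, j;

	assert(sp != NULL || n <= 0);
	for (i = n - 1; i > 0; i--) {
		for (j = 0; j < i; j++) {
			if (seq_after(sp[j].start_seq, sp[j + 1].start_seq)) {
				struct sack_block tmp = sp[j];

				sp[j] = sp[j + 1];
				sp[j + 1] = tmp;
			}
		}
	}
}
```

The bug described above corresponds to filling `sp[j]` from a separate, unsorted copy: after the first swap the two arrays disagree, and later moves duplicate some blocks and lose others.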

 BTW, in reviewing this I note that there is now only one remaining
 use of tp-recv_sack_cache[] and that is the code earlier in this
 function which is trying to detect if all we are doing is extending
 the leading edge of a SACK block.

 It would be nice to be able to clear out that usage as well, and
 remove recv_sack_cache[] and thus make tcp_sock smaller.

You actually need recv_sack_cache to detect whether you can use the fast
path. An alternative is to hash the values of the SACK blocks, but then
you rely on probability to detect correctly that the fast path can be
used. Hashing will save some space, but you can't get rid of the cache
completely unless you go back to the old and slow method of SACK
processing.

There were thoughts thrown around a while back about using a different
data structure; I think you said you started working on something like
that. If that comes to fruition the cache might go.

FWIW, my other mail about possible bugs actually says that you might
need to add another value to check, the number of sack blocks in the
cache.
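
A rough sketch of such a cache check, including the block count, might look like the following. This is an illustration under assumptions, not the kernel implementation: `struct sack_cache`, `sack_fast_path()` and `MAX_SACK_BLOCKS` are hypothetical names, and the rule encoded is only "same number of blocks, same start edges, and only the first block's end edge may have moved".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_SACK_BLOCKS 4	/* illustrative limit */

struct sack_block {
	uint32_t start_seq;
	uint32_t end_seq;
};

/* Copy of the blocks seen in the previous ACK, plus how many there were. */
struct sack_cache {
	int num;
	struct sack_block blocks[MAX_SACK_BLOCKS];
};

/* Returns 1 when the new blocks match the cached ones except that the
 * first block's end_seq may differ, 0 otherwise. The cache is refreshed
 * with the new blocks in both cases. */
static int sack_fast_path(struct sack_cache *cache,
			  const struct sack_block *sp, int num)
{
	int i, same = (num == cache->num);

	assert(num >= 0 && num <= MAX_SACK_BLOCKS);
	for (i = 0; same && i < num; i++) {
		if (sp[i].start_seq != cache->blocks[i].start_seq)
			same = 0;
		else if (i > 0 && sp[i].end_seq != cache->blocks[i].end_seq)
			same = 0;
	}
	if (num > 0)
		memcpy(cache->blocks, sp, (size_t)num * sizeof(*sp));
	cache->num = num;
	return same;
}
```

Without the `num == cache->num` test, an ACK carrying fewer blocks than the previous one could compare equal on the blocks it does carry and wrongly take the fast path.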

Baruch

Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up

2007-01-25 Thread Jiri Benc

On Thu, 25 Jan 2007 01:50:54 -0500, Pavel Roskin wrote:
 It turns out d80211 uses the config method of the hardware drivers
 very sparingly.  It's only used for scanning and in ioctl commands.  It
 is not called after the interface has been brought up with the open
 method.
 
 I don't know whose responsibility it should be to apply the
 configuration when the interface is brought up.  I'm not familiar with
 d80211 design principles.

I think it should be done in the stack (actually, it's on todo list for
quite some time). However, I don't consider this as a big problem -
just do it in a driver (like your patch does) for now.

 Jiri

-- 
Jiri Benc
SUSE Labs

Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up

2007-01-25 Thread Ivo Van Doorn


Hi,


I have discovered that while I can indeed associate without
wpa_supplicant using bcm43xx_d80211 driver, I have to set the channel
every time the interface is brought down and up.

It turns out d80211 uses the config method of the hardware drivers
very sparingly.  It's only used for scanning and in ioctl commands.  It
is not called after the interface has been brought up with the open
method.


Correct, similar problems have been detected in rt2x00. The temporary
solution in there is to demand a scanning operation after the interface
has been brought up.


I don't know whose responsibility it should be to apply the
configuration when the interface is brought up.  I'm not familiar with
d80211 design principles.


Well, my personal preference would be for the dscape stack to handle it,
unless the stack guarantees that the conf structure has been initialized
and contains valid data when the interface is being brought up.

Ivo

Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up

2007-01-25 Thread Jiri Benc

On Thu, 25 Jan 2007 12:47:08 +0100, Ivo Van Doorn wrote:
 Correct, similar problems have been detected in rt2x00. The temporary
 solution in there is to demand a scanning operation after the interface
 has been brought up.

Scanning? No no no, please! That would be a clear bug and misbehaviour.

 Jiri

-- 
Jiri Benc
SUSE Labs

Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up

2007-01-25 Thread Ivo Van Doorn


Hi,


 Correct, similar problems have been detected in rt2x00. The temporary
 solution in there is to demand a scanning operation after the interface
 has been brought up.

Scanning? No no no, please! That would be a clear bug and misbehaviour.


Hmm, I think I forgot to add one little thing in my comment.
The scanning operation is demanded in the rt2x00 README, so the driver
doesn't start the scanning automatically and just awaits the user commands.
The user is also free to change the channel to make the configuration active.
But a scanning command will also display if it has at least found some AP,
so without scanning results attempting to scan will very likely fail. ;)

Ivo

Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up

2007-01-25 Thread Jiri Benc

On Thu, 25 Jan 2007 09:05:32 -0500, Gene Heskett wrote:
 Oh?  I'm sitting here watching the tty0 screen of my lappy after x has 
 been started, and I have established a connection, but SoftMAC is still 
 logging its scan activity, starting with channel 1 and scanning 14 
 channels.  Its doing this at approximately 2 minute intervals.  So I 
 think we have your definition of a clear bug and misbehaviour.

Yes. As well as some other wireless drivers. But that's not worth
fixing.

Also, please note that softmac isn't going to do user space MLME so
it is not relevant at all.

 Jiri

-- 
Jiri Benc
SUSE Labs

Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up

2007-01-25 Thread Gene Heskett

On Thursday 25 January 2007 09:23, Jiri Benc wrote:
On Thu, 25 Jan 2007 09:05:32 -0500, Gene Heskett wrote:
 Oh?  I'm sitting here watching the tty0 screen of my lappy after x has
 been started, and I have established a connection, but SoftMAC is
 still logging its scan activity, starting with channel 1 and scanning
 14 channels.  Its doing this at approximately 2 minute intervals.  So
 I think we have your definition of a clear bug and misbehaviour.

Yes. As well as some other wireless drivers. But that's not worth
fixing.

Also, please note that softmac isn't going to do user space MLME so
it is not relevant at all.

MLME?  More acronyms I've not put in my wet dictionary.. :)

 Jiri

-- 
Cheers, Gene
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2007 by Maurice Eugene Heskett, all rights reserved.

Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up

2007-01-25 Thread Gene Heskett

On Thursday 25 January 2007 07:50, Jiri Benc wrote:
On Thu, 25 Jan 2007 12:47:08 +0100, Ivo Van Doorn wrote:
 Correct, similar problems have been detected in rt2x00. The temporary
 solution in there is to demand a scanning operation after the
 interface has been brought up.

Scanning? No no no, please! That would be a clear bug and misbehaviour.

 Jiri

Oh?  I'm sitting here watching the tty0 screen of my lappy after x has 
been started, and I have established a connection, but SoftMAC is still 
logging its scan activity, starting with channel 1 and scanning 14 
channels.  Its doing this at approximately 2 minute intervals.  So I 
think we have your definition of a clear bug and misbehaviour.

-- 
Cheers, Gene
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2007 by Maurice Eugene Heskett, all rights reserved.

Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up

2007-01-25 Thread Larry Finger

Gene Heskett wrote:
 On Thursday 25 January 2007 07:50, Jiri Benc wrote:
 On Thu, 25 Jan 2007 12:47:08 +0100, Ivo Van Doorn wrote:
 Correct, similar problems have been detected in rt2x00. The temporary
 solution in there is to demand a scanning operation after the
 interface has been brought up.
 Scanning? No no no, please! That would be a clear bug and misbehaviour.

 Jiri
 
 Oh?  I'm sitting here watching the tty0 screen of my lappy after x has 
 been started, and I have established a connection, but SoftMAC is still 
 logging its scan activity, starting with channel 1 and scanning 14 
 channels.  Its doing this at approximately 2 minute intervals.  So I 
 think we have your definition of a clear bug and misbehaviour.
 

Are you running NetworkManager? If so, that is the source of the scanning.

Larry


Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up

2007-01-25 Thread Dan Williams

On Thu, 2007-01-25 at 09:36 -0600, Larry Finger wrote:
 Gene Heskett wrote:
  On Thursday 25 January 2007 07:50, Jiri Benc wrote:
  On Thu, 25 Jan 2007 12:47:08 +0100, Ivo Van Doorn wrote:
  Correct, similar problems have been detected in rt2x00. The temporary
  solution in there is to demand a scanning operation after the
  interface has been brought up.
  Scanning? No no no, please! That would be a clear bug and misbehaviour.
 
  Jiri
  
  Oh?  I'm sitting here watching the tty0 screen of my lappy after x has 
  been started, and I have established a connection, but SoftMAC is still 
  logging its scan activity, starting with channel 1 and scanning 14 
  channels.  Its doing this at approximately 2 minute intervals.  So I 
  think we have your definition of a clear bug and misbehaviour.
  
 
 Are you running NetworkManager? If so, that is the source of the scanning.

Right; and NM scans at 2m intervals by default, unless you've clicked
the menu (or a few other instances) where it will jump up to 20s and
then back off to 2m again.

dan


 Larry
 

Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up

2007-01-25 Thread Michael Wu

On Thursday 25 January 2007 01:50, Pavel Roskin wrote:
 If the hardware drivers are supposed to do it, here's my patch.  It is
 working fine for me and ready to be applied.  The changelog is in the
 subject.
Let's fix this in the stack. This problem will be fixed for most users once 
auto channel selection is implemented, and fixing it for users manually 
setting the channel should be trivial.

-Michael Wu



Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up

2007-01-25 Thread John W. Linville

On Thu, Jan 25, 2007 at 11:51:27AM -0500, Michael Wu wrote:
 On Thursday 25 January 2007 01:50, Pavel Roskin wrote:
  If the hardware drivers are supposed to do it, here's my patch.  It is
  working fine for me and ready to be applied.  The changelog is in the
  subject.
 Let's fix this in the stack. This problem will be fixed for most users once 
 auto channel selection is implemented, and fixing it for users manually 
 setting the channel should be trivial.

ACK...fixing the stack makes the most sense.

-- 
John W. Linville
[EMAIL PROTECTED]

Re: [PATCH] bcm43xx-d80211: Interrogate hardware-enable switch and update LEDs

2007-01-25 Thread John W. Linville

On Sat, Dec 30, 2006 at 11:25:15PM -0600, Larry Finger wrote:
 The current bcm43xx driver ignores any wireless-enable switches on mini-PCI
 and mini-PCI-E cards. This patch implements a new routine to interrogate the
 radio hardware enabled bit in the interface, logs the initial state and any
 changes in the switch (if debugging enabled), activates the LED to show the
 state, and changes the periodic work handler to provide 1 second response
 to switch changes and to account for changes in the periodic work specs. It
 also incorporates changes in the LED state that were accepted into mainline
 some time ago.

Larry,

I wanted to merge this, but a slew of changes pulled from Michael
makes this patch fail to apply.  Would you mind refactoring the patch
and resubmitting?

Thanks,

John
-- 
John W. Linville
[EMAIL PROTECTED]

[take34 6/10] kevent: Pipe notifications.

2007-01-25 Thread Evgeniy Polyakov


Pipe notifications.
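
As a reading aid, the readiness rule that kevent_pipe_callback() in this patch appears to apply can be sketched in userspace as follows. The constants and `struct pipe_state` are hypothetical stand-ins (for KEVENT_SOCKET_RECV, KEVENT_SOCKET_SEND, PIPE_BUFFERS and the pipe inode info), and the reading of the return values as 1 for ready, -1 for "the other end is gone" and 0 for "not ready yet" is an interpretation of the code, not documented behaviour.

```c
#include <assert.h>

#define EV_RECV    0x1	/* stand-in for KEVENT_SOCKET_RECV */
#define EV_SEND    0x2	/* stand-in for KEVENT_SOCKET_SEND */
#define PIPE_SLOTS 16	/* stand-in for PIPE_BUFFERS */

/* The three fields of the pipe state that the callback looks at. */
struct pipe_state {
	int nrbufs;	/* buffers currently holding data */
	int readers;	/* open read ends */
	int writers;	/* open write ends */
};

/* Returns 1 if the requested event is ready, -1 if it is ready only
 * because the peer end has gone away, and 0 if the caller should wait. */
static int pipe_ready(const struct pipe_state *p, unsigned int events)
{
	assert(p->nrbufs >= 0 && p->nrbufs <= PIPE_SLOTS);

	if ((events & EV_RECV) && p->nrbufs > 0)
		return p->writers ? 1 : -1;

	if ((events & EV_SEND) && p->nrbufs < PIPE_SLOTS)
		return p->readers ? 1 : -1;

	return 0;
}
```

Note that a completely full pipe reports "not ready" for a send request even when no readers remain, because the free-space test comes first.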


diff --git a/fs/pipe.c b/fs/pipe.c
index 68090e8..0c75bf1 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -16,6 +16,7 @@
 #include <linux/uio.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/kevent.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -313,6 +314,7 @@ redo:
 			break;
 		}
 		if (do_wakeup) {
+			kevent_pipe_notify(inode, KEVENT_SOCKET_SEND);
 			wake_up_interruptible_sync(&pipe->wait);
 			kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 		}
@@ -322,6 +324,7 @@ redo:
 
 	/* Signal writers asynchronously that there is more room. */
 	if (do_wakeup) {
+		kevent_pipe_notify(inode, KEVENT_SOCKET_SEND);
 		wake_up_interruptible(&pipe->wait);
 		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 	}
@@ -484,6 +487,7 @@ redo2:
 			break;
 		}
 		if (do_wakeup) {
+			kevent_pipe_notify(inode, KEVENT_SOCKET_RECV);
 			wake_up_interruptible_sync(&pipe->wait);
 			kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 			do_wakeup = 0;
@@ -495,6 +499,7 @@ redo2:
 out:
 	mutex_unlock(&inode->i_mutex);
 	if (do_wakeup) {
+		kevent_pipe_notify(inode, KEVENT_SOCKET_RECV);
 		wake_up_interruptible(&pipe->wait);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 	}
@@ -590,6 +595,7 @@ pipe_release(struct inode *inode, int decr, int decw)
 		free_pipe_info(inode);
 	} else {
 		wake_up_interruptible(&pipe->wait);
+		kevent_pipe_notify(inode, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 	}
diff --git a/kernel/kevent/kevent_pipe.c b/kernel/kevent/kevent_pipe.c
new file mode 100644
index 000..91dc1eb
--- /dev/null
+++ b/kernel/kevent/kevent_pipe.c
@@ -0,0 +1,123 @@
+/*
+ * kevent_pipe.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kevent.h>
+#include <linux/pipe_fs_i.h>
+
+static int kevent_pipe_callback(struct kevent *k)
+{
+	struct inode *inode = k->st->origin;
+	struct pipe_inode_info *pipe = inode->i_pipe;
+	int nrbufs = pipe->nrbufs;
+
+	if (k->event.event & KEVENT_SOCKET_RECV && nrbufs > 0) {
+		if (!pipe->writers)
+			return -1;
+		return 1;
+	}
+
+	if (k->event.event & KEVENT_SOCKET_SEND && nrbufs < PIPE_BUFFERS) {
+		if (!pipe->readers)
+			return -1;
+		return 1;
+	}
+
+	return 0;
+}
+
+int kevent_pipe_enqueue(struct kevent *k)
+{
+	struct file *pipe;
+	int err = -EBADF;
+	struct inode *inode;
+
+	pipe = fget(k->event.id.raw[0]);
+	if (!pipe)
+		goto err_out_exit;
+
+	inode = igrab(pipe->f_dentry->d_inode);
+	if (!inode)
+		goto err_out_fput;
+
+	err = -EINVAL;
+	if (!S_ISFIFO(inode->i_mode))
+		goto err_out_iput;
+
+	err = kevent_storage_enqueue(&inode->st, k);
+	if (err)
+		goto err_out_iput;
+
+	if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) {
+		kevent_requeue(k);
+		err = 0;
+	} else {
+		err = k->callbacks.callback(k);
+		if (err)
+			goto err_out_dequeue;
+	}
+
+	fput(pipe);
+
+	return err;
+
+err_out_dequeue:
+	kevent_storage_dequeue(k->st, k);
+err_out_iput:
+	iput(inode);
+err_out_fput:
+	fput(pipe);
+err_out_exit:
+	return err;
+}
+
+int kevent_pipe_dequeue(struct kevent *k)
+{
+	struct inode *inode = k->st->origin;
+
+	kevent_storage_dequeue(k->st, k);
+	iput(inode);
+
+	return 0;
+}
+
+void kevent_pipe_notify(struct inode

[take34 4/10] kevent: Socket notifications.

2007-01-25 Thread Evgeniy Polyakov


Socket notifications.

This patch includes socket send/recv/accept notifications.
With a trivial web server based on kevent and these features
instead of epoll, its performance increased noticeably.
More details about various benchmarks and the server itself
(evserver_kevent.c) can be found on the project's homepage.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/fs/inode.c b/fs/inode.c
index bf21dc6..82817b1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -164,12 +165,18 @@ static struct inode *alloc_inode(struct super_block *sb)
 		}
 		inode->i_private = NULL;
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode) 
 {
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index 03684e7..d840399 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -49,6 +49,7 @@
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/mm.h>
 #include <linux/security.h>
+#include <linux/kevent.h>
 
 #include <linux/filter.h>
 
@@ -451,6 +452,21 @@ static inline int sk_stream_memory_free(struct sock *sk)
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+	struct socket socket;
+	struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
 	skb->sk = sk;
@@ -478,6 +494,7 @@ static inline void sk_add_backlog(struct sock *sk, struct sk_buff *skb)
 		sk->sk_backlog.tail = skb;
 	}
 	skb->next = NULL;
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)  \
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kiocb(struct sock_iocb *si)
 	return si->kiocb;
 }
 
-struct socket_alloc {
-   struct socket socket;
-   struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-   return container_of(inode, struct socket_alloc, vfs_inode)-socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-   return container_of(socket, struct socket_alloc, socket)-vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b7d8317..2763b30 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -864,6 +864,7 @@ static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
 			tp->ucopy.memory = 0;
 		} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
 			wake_up_interruptible(sk->sk_sleep);
+			kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 			if (!inet_csk_ack_scheduled(sk))
 				inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
 						          (3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 000..d1a2701
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,144 @@
+/*
+ * kevent_socket.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>

[take34 5/10] kevent: Timer notifications.

2007-01-25 Thread Evgeniy Polyakov


Timer notifications.

Timer notifications can be used for fine grained per-process time 
management, since interval timers are very inconvenient to use, 
and they are limited.

This subsystem uses high-resolution timers.
id.raw[0] is used as number of seconds
id.raw[1] is used as number of nanoseconds
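
As a rough illustration of the id encoding described above, the sketch below splits a period given in nanoseconds into the two words. `struct timer_id` and `timer_id_from_ns()` are hypothetical names used only for this example and are not part of the kevent API; the 32-bit width of each raw[] word is also an assumption. The only thing taken from the description is that raw[0] carries seconds and raw[1] carries nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for the id field of struct ukevent. */
struct timer_id {
	uint32_t raw[2];	/* raw[0]: seconds, raw[1]: nanoseconds */
};

/* Split a period in nanoseconds into the seconds/nanoseconds pair. */
static struct timer_id timer_id_from_ns(uint64_t period_ns)
{
	struct timer_id id;
	uint64_t sec = period_ns / SKETCH_NSEC_PER_SEC;

	assert(sec <= UINT32_MAX);	/* seconds must fit into one word */
	id.raw[0] = (uint32_t)sec;
	id.raw[1] = (uint32_t)(period_ns % SKETCH_NSEC_PER_SEC);
	return id;
}
```

A 2.5 second period would therefore be requested as raw[0] = 2 and raw[1] = 500000000.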

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 000..c21a155
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,114 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/hrtimer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+	struct hrtimer		ktimer;
+	struct kevent_storage	ktimer_storage;
+	struct kevent		*ktimer_event;
+};
+
+static int kevent_timer_func(struct hrtimer *timer)
+{
+	struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer);
+	struct kevent *k = t->ktimer_event;
+
+	kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL);
+	hrtimer_forward(timer, timer->base->softirq_time,
+			ktime_set(k->event.id.raw[0], k->event.id.raw[1]));
+	return HRTIMER_RESTART;
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+	int err;
+	struct kevent_timer *t;
+
+	t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL);
+	t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]);
+	t->ktimer.function = kevent_timer_func;
+	t->ktimer_event = k;
+
+	err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+	if (err)
+		goto err_out_free;
+	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+	err = kevent_storage_enqueue(&t->ktimer_storage, k);
+	if (err)
+		goto err_out_st_fini;
+
+	hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL);
+
+	return 0;
+
+err_out_st_fini:
+	kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+	kfree(t);
+
+	return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+	struct kevent_storage *st = k->st;
+	struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+	hrtimer_cancel(&t->ktimer);
+	kevent_storage_dequeue(st, k);
+	kfree(t);
+
+	return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+	k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+	return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = kevent_timer_callback,
+		.enqueue = kevent_timer_enqueue,
+		.dequeue = kevent_timer_dequeue,
+		.flags = 0,
+	};
+
+	return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);
+


[take34 7/10] kevent: Signal notifications.

2007-01-25 Thread Evgeniy Polyakov


Signal notifications.

This type of notification allows signals to be delivered through the kevent
queue. An example application, signal.c, can be found on the project homepage.

If the KEVENT_SIGNAL_NOMASK bit is set in the raw_u64 id, the signal is
delivered only through the queue; otherwise both delivery types are used: the
old one, through an update of the mask of pending signals, and the queue.

If a signal is delivered only through the kevent queue, the mask of pending
signals is not updated at all, which is equivalent to putting the signal into
the blocked mask, but with delivery of that signal through the kevent queue.
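
The routing rule above can be modelled in isolation roughly as follows. This is only a sketch of the decision, not the kernel code: `struct delivery`, `route_signal()` and the value of `SKETCH_SIGNAL_NOMASK` are made up for the example, and it assumes the signal number sits in the low 32 bits of the 64-bit id.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag value; the real KEVENT_SIGNAL_NOMASK is defined by kevent. */
#define SKETCH_SIGNAL_NOMASK (1ULL << 63)

struct delivery {
	bool via_queue;		/* event is posted to the kevent queue */
	bool via_pending_mask;	/* classic pending-signal mask is updated */
};

/* Decide how signal 'sig' is delivered for a kevent registered with 'id'. */
static struct delivery route_signal(uint64_t id, int sig)
{
	struct delivery d = { false, true };	/* default: old delivery only */
	int wanted = (int)(id & 0xffffffffu);	/* assumed: signal in low word */

	if (wanted != sig)
		return d;		/* no kevent registered for this signal */

	d.via_queue = true;
	if (id & SKETCH_SIGNAL_NOMASK)
		d.via_pending_mask = false;	/* queue-only delivery */
	return d;
}
```

With the flag clear the signal shows up in both places; with it set, only the queue sees it.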

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4463735..e7372f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -82,6 +82,7 @@ struct sched_param {
 #include <linux/resource.h>
 #include <linux/timer.h>
 #include <linux/hrtimer.h>
+#include <linux/kevent_storage.h>
 #include <linux/task_io_accounting.h>
 
 #include <asm/processor.h>
@@ -1048,6 +1049,10 @@ struct task_struct {
 #ifdef CONFIG_TASK_DELAY_ACCT
struct task_delay_info *delays;
 #endif
+#ifdef CONFIG_KEVENT_SIGNAL
+   struct kevent_storage st;
+   u32 kevent_signals;
+#endif
 #ifdef CONFIG_FAULT_INJECTION
int make_it_fail;
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index fc723e5..fd7c749 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -49,6 +49,7 @@
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
 #include <linux/random.h>
+#include <linux/kevent.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -118,6 +119,9 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(atomic_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+#ifdef CONFIG_KEVENT_SIGNAL
+	kevent_storage_fini(&tsk->st);
+#endif
 	security_task_free(tsk);
 	free_uid(tsk->user);
 	put_group_info(tsk->group_info);
@@ -1126,6 +1130,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
 
+#ifdef CONFIG_KEVENT_SIGNAL
+	kevent_storage_init(p, &p->st);
+#endif
+
 	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
 	/*
 	 * Clear TID on mm_release()?
diff --git a/kernel/kevent/kevent_signal.c b/kernel/kevent/kevent_signal.c
new file mode 100644
index 000..abe3972
--- /dev/null
+++ b/kernel/kevent/kevent_signal.c
@@ -0,0 +1,94 @@
+/*
+ * kevent_signal.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kevent.h>
+
+static int kevent_signal_callback(struct kevent *k)
+{
+	struct task_struct *tsk = k->st->origin;
+	int sig = k->event.id.raw[0];
+	int ret = 0;
+
+	if (sig == tsk->kevent_signals)
+		ret = 1;
+
+	if (ret && (k->event.id.raw_u64 & KEVENT_SIGNAL_NOMASK))
+		tsk->kevent_signals |= 0x8000;
+
+	return ret;
+}
+
+int kevent_signal_enqueue(struct kevent *k)
+{
+	int err;
+
+	err = kevent_storage_enqueue(&current->st, k);
+	if (err)
+		goto err_out_exit;
+
+	if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) {
+		kevent_requeue(k);
+		err = 0;
+	} else {
+		err = k->callbacks.callback(k);
+		if (err)
+			goto err_out_dequeue;
+	}
+
+	return err;
+
+err_out_dequeue:
+	kevent_storage_dequeue(k->st, k);
+err_out_exit:
+	return err;
+}
+
+int kevent_signal_dequeue(struct kevent *k)
+{
+	kevent_storage_dequeue(k->st, k);
+	return 0;
+}
+
+int kevent_signal_notify(struct task_struct *tsk, int sig)
+{
+	tsk->kevent_signals = sig;
+	kevent_storage_ready(&tsk->st, NULL, KEVENT_SIGNAL_DELIVERY);
+	return (tsk->kevent_signals & 0x8000);
+}
+
+static int __init kevent_init_signal(void)
+{
+	struct kevent_callbacks sc = {
+		.callback = kevent_signal_callback,
+		.enqueue = kevent_signal_enqueue,
+		.dequeue = kevent_signal_dequeue,
+		.flags = 0,
+

[take34 8/10] kevent: Kevent posix timer notifications.

2007-01-25 Thread Evgeniy Polyakov


Kevent posix timer notifications.

A simple extension to POSIX timers which allows
notification of timer expiration to be delivered
through the kevent queue.

The example application posix_timer.c can be found
in the archive on the project homepage.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 8786e01..3768746 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -235,6 +235,7 @@ typedef struct siginfo {
 #define SIGEV_NONE 1   /* other notification: meaningless */
 #define SIGEV_THREAD   2   /* deliver via thread creation */
 #define SIGEV_THREAD_ID 4  /* deliver to thread */
+#define SIGEV_KEVENT   8   /* deliver through kevent queue */
 
 /*
  * This works because the alignment is ok on all current architectures
@@ -260,6 +261,8 @@ typedef struct sigevent {
void (*_function)(sigval_t);
void *_attribute;   /* really pthread_attr_t */
} _sigev_thread;
+
+   int kevent_fd;
} _sigev_un;
 } sigevent_t;
 
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index a7dd38f..4b9deb4 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -4,6 +4,7 @@
 #include <linux/spinlock.h>
 #include <linux/list.h>
 #include <linux/sched.h>
+#include <linux/kevent_storage.h>
 
 union cpu_time_count {
cputime_t cpu;
@@ -49,6 +50,9 @@ struct k_itimer {
sigval_t it_sigev_value;/* value word of sigevent struct */
struct task_struct *it_process; /* process to send signal to */
struct sigqueue *sigq;  /* signal queue entry. */
+#ifdef CONFIG_KEVENT_TIMER
+   struct kevent_storage st;
+#endif
union {
struct {
struct hrtimer timer;
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 5fe87de..5ec805e 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -48,6 +48,8 @@
 #include <linux/wait.h>
 #include <linux/workqueue.h>
 #include <linux/module.h>
+#include <linux/kevent.h>
+#include <linux/file.h>
 
 /*
  * Management arrays for POSIX timers.  Timers are kept in slab memory
@@ -224,6 +226,100 @@ static int posix_ktime_get_ts(clockid_t which_clock, struct timespec *tp)
return 0;
 }
 
+#ifdef CONFIG_KEVENT_TIMER
+static int posix_kevent_enqueue(struct kevent *k)
+{
+	/*
+	 * It is not ugly - there is no pointer in the id field union,
+	 * but its size is 64bits, which is ok for any known pointer size.
+	 */
+	struct k_itimer *tmr = (struct k_itimer *)(unsigned long)k->event.id.raw_u64;
+	return kevent_storage_enqueue(&tmr->st, k);
+}
+static int posix_kevent_dequeue(struct kevent *k)
+{
+	struct k_itimer *tmr = (struct k_itimer *)(unsigned long)k->event.id.raw_u64;
+	kevent_storage_dequeue(&tmr->st, k);
+	return 0;
+}
+static int posix_kevent_callback(struct kevent *k)
+{
+	return 1;
+}
+static int posix_kevent_init(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = posix_kevent_callback,
+		.enqueue = posix_kevent_enqueue,
+		.dequeue = posix_kevent_dequeue,
+		.flags = KEVENT_CALLBACKS_KERNELONLY};
+
+	return kevent_add_callbacks(&tc, KEVENT_POSIX_TIMER);
+}
+
+extern struct file_operations kevent_user_fops;
+
+static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
+{
+	struct ukevent uk;
+	struct file *file;
+	struct kevent_user *u;
+	int err;
+
+	file = fget(fd);
+	if (!file) {
+		err = -EBADF;
+		goto err_out;
+	}
+
+	if (file->f_op != &kevent_user_fops) {
+		err = -EINVAL;
+		goto err_out_fput;
+	}
+
+	u = file->private_data;
+
+	memset(&uk, 0, sizeof(struct ukevent));
+
+	uk.event = KEVENT_MASK_ALL;
+	uk.type = KEVENT_POSIX_TIMER;
+	uk.id.raw_u64 = (unsigned long)(tmr); /* Just cast to something unique */
+	uk.req_flags = KEVENT_REQ_ONESHOT | KEVENT_REQ_ALWAYS_QUEUE;
+	uk.ptr = tmr->it_sigev_value.sival_ptr;
+
+	err = kevent_user_add_ukevent(&uk, u);
+	if (err)
+		goto err_out_fput;
+
+	fput(file);
+
+	return 0;
+
+err_out_fput:
+	fput(file);
+err_out:
+	return err;
+}
+
+static void posix_kevent_fini_timer(struct k_itimer *tmr)
+{
+	kevent_storage_fini(&tmr->st);
+}
+#else
+static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
+{
+	return -ENOSYS;
+}
+static int posix_kevent_init(void)
+{
+	return 0;
+}
+static void posix_kevent_fini_timer(struct k_itimer *tmr)
+{
+}
+#endif
+
+
 /*
  * Initialize everything, well, just everything in Posix clocks/timers ;)
  */
@@ -241,6 +337,11 @@ static __init int init_posix_timers(void)
register_posix_clock(CLOCK_REALTIME, clock_realtime);

[take34 3/10] kevent: poll/select() notifications.

2007-01-25 Thread Evgeniy Polyakov


poll/select() notifications.

This patch includes generic poll/select notifications.
kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller but through
process wakeup, there are a lot of allocations, and so on).

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/fs/file_table.c b/fs/file_table.c
index 4c17a18..46f458c 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -20,6 +20,7 @@
 #include <linux/cdev.h>
 #include <linux/fsnotify.h>
 #include <linux/sysctl.h>
+#include <linux/kevent.h>
 #include <linux/percpu_counter.h>
 
 #include <asm/atomic.h>
@@ -119,6 +120,7 @@ struct file *get_empty_filp(void)
 	f->f_uid = tsk->fsuid;
 	f->f_gid = tsk->fsgid;
 	eventpoll_init_file(f);
+	kevent_init_file(f);
 	/* f->f_version: 0 */
 	return f;
 
@@ -164,6 +166,7 @@ void fastcall __fput(struct file *file)
 	 * in the file cleanup chain.
 	 */
 	eventpoll_release(file);
+	kevent_cleanup_file(file);
 	locks_remove_flock(file);
 
 	if (file->f_op && file->f_op->release)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 186da81..59e6069 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -280,6 +280,7 @@ extern int dir_notify_enable;
 #include <linux/init.h>
 #include <linux/pid.h>
 #include <linux/mutex.h>
+#include <linux/kevent_storage.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -408,6 +409,8 @@ struct address_space_operations {
 
 	int (*readpages)(struct file *filp, struct address_space *mapping,
 			struct list_head *pages, unsigned nr_pages);
+	int (*aio_readpages)(struct file *filp, struct address_space *mapping,
+			struct list_head *pages, unsigned nr_pages, void *priv);
 
 	/*
 	 * ext3 requires that a successful prepare_write() call be followed
@@ -578,6 +581,10 @@ struct inode {
 	struct mutex		inotify_mutex;	/* protects the watches list */
 #endif
 
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+	struct kevent_storage	st;
+#endif
+
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
@@ -737,6 +744,9 @@ struct file {
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 000..58129fa
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,234 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static struct kmem_cache *kevent_poll_container_cache;
+static struct kmem_cache *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+	struct poll_table_struct	pt;
+	struct kevent			*k;
+};
+
+struct kevent_poll_wait_container
+{
+	struct list_head		container_entry;
+	wait_queue_head_t		*whead;
+	wait_queue_t			wait;
+	struct kevent			*k;
+};
+
+struct kevent_poll_private
+{
+	struct list_head		container_list;
+	spinlock_t			container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+		unsigned mode, int sync, void *key)
+{
+	struct kevent_poll_wait_container *cont =
+		container_of(wait, struct kevent_poll_wait_container, wait);
+	struct kevent *k = cont->k;
+
+	kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+	return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+		struct poll_table_struct *poll_table)
+{
+	struct kevent *k =
+		container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+	struct kevent_poll_private *priv = k->priv;
+

[take34 9/10] kevent: Private userspace notifications.

2007-01-25 Thread Evgeniy Polyakov


Private userspace notifications.

Allows notifications of arbitrary private userspace events to be
registered over kevent. Events can be marked as ready using the
kevent_ctl(KEVENT_READY) command.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/kernel/kevent/kevent_unotify.c b/kernel/kevent/kevent_unotify.c
new file mode 100644
index 000..618c09c
--- /dev/null
+++ b/kernel/kevent/kevent_unotify.c
@@ -0,0 +1,62 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/kevent.h>
+
+static int kevent_unotify_callback(struct kevent *k)
+{
+	return 1;
+}
+
+int kevent_unotify_enqueue(struct kevent *k)
+{
+	int err;
+
+	err = kevent_storage_enqueue(&k->user->st, k);
+	if (err)
+		goto err_out_exit;
+
+	if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE)
+		kevent_requeue(k);
+
+	return 0;
+
+err_out_exit:
+	return err;
+}
+
+int kevent_unotify_dequeue(struct kevent *k)
+{
+	kevent_storage_dequeue(k->st, k);
+	return 0;
+}
+
+static int __init kevent_init_unotify(void)
+{
+	struct kevent_callbacks sc = {
+		.callback = kevent_unotify_callback,
+		.enqueue = kevent_unotify_enqueue,
+		.dequeue = kevent_unotify_dequeue,
+		.flags = 0,
+	};
+
+	return kevent_add_callbacks(&sc, KEVENT_UNOTIFY);
+}
+module_init(kevent_init_unotify);


Re: [tipc-discussion] [RFC: 2.6 patch] net/tipc/: possible cleanups

2007-01-25 Thread Jon Maloy


Adrian Bunk wrote:


This patch contains the following possible cleanups:
- make needlessly global functions static
- #if 0 unused functions
 

Thanks. I think most of those were due for our next release, anyway. But
we'll get it in, one way or another.


- remove all EXPORT_SYMBOL's

My impression is that most of this might have users that are not yet 
submitted for inclusion in the kernel - one year after TIPC was merged.
 

Not quite. The exported symbols belong to a public API for driver
programmers. We know about several users of this API, and there will be
more, but I don't think any of them are aspiring to have their code be
included in the kernel.


If this is true, please submit the users for inclusion in the kernel.

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

---

include/net/tipc/tipc.h  |   60 
include/net/tipc/tipc_port.h |9 
net/tipc/addr.c  |2 
net/tipc/cluster.c   |2 
net/tipc/cluster.h   |1 
net/tipc/config.c|9 +++-

net/tipc/config.h|7 ---
net/tipc/core.c  |   74 ++-
net/tipc/core.h  |   14 --
net/tipc/dbg.c   |   15 +--
net/tipc/dbg.h   |3 -
net/tipc/discover.c  |4 +
net/tipc/discover.h  |1 
net/tipc/link.c  |   14 +++---

net/tipc/link.h  |4 -
net/tipc/name_table.c|3 -
net/tipc/node.c  |6 +-
net/tipc/node.h  |1 
net/tipc/port.c  |   59 +--
net/tipc/port.h  |2 
net/tipc/subscr.c|2 
net/tipc/zone.c  |2 
net/tipc/zone.h  |1 
23 files changed, 97 insertions(+), 198 deletions(-)


--- linux-2.6.20-rc4-mm1/include/net/tipc/tipc.h.old2007-01-24 
19:12:15.0 +0100
+++ linux-2.6.20-rc4-mm1/include/net/tipc/tipc.h2007-01-24 
20:40:58.0 +0100
@@ -50,8 +50,6 @@
 * TIPC operating mode routines
 */

-u32 tipc_get_addr(void);
-
#define TIPC_NOT_RUNNING  0
#define TIPC_NODE_MODE1
#define TIPC_NET_MODE 2
@@ -62,8 +60,6 @@

void tipc_detach(unsigned int userref);

-int tipc_get_mode(void);
-
/*
 * TIPC port manipulation routines
 */
@@ -153,12 +149,8 @@

int tipc_shutdown(u32 ref); /* Sends SHUTDOWN msg */

-int tipc_isconnected(u32 portref, int *isconnected);
-
int tipc_peer(u32 portref, struct tipc_portid *peer);

-int tipc_ref_valid(u32 portref); 
-

/*
 * TIPC messaging routines
 */
@@ -170,38 +162,12 @@
  unsigned int num_sect,
  struct iovec const *msg_sect);

-int tipc_send_buf(u32 portref,
- struct sk_buff *buf,
- unsigned int dsz);
-
int tipc_send2name(u32 portref, 
		   struct tipc_name const *name, 
		   u32 domain,	/* 0:own zone */

   unsigned int num_sect,
   struct iovec const *msg_sect);

-int tipc_send_buf2name(u32 portref,
-  struct tipc_name const *name,
-  u32 domain,
-  struct sk_buff *buf,
-  unsigned int dsz);
-
-int tipc_forward2name(u32 portref, 
-		  struct tipc_name const *name, 
-		  u32 domain,   /*0: own zone */

- unsigned int section_count,
- struct iovec const *msg_sect,
- struct tipc_portid const *origin,
- unsigned int importance);
-
-int tipc_forward_buf2name(u32 portref,
- struct tipc_name const *name,
- u32 domain,
- struct sk_buff *buf,
- unsigned int dsz,
- struct tipc_portid const *orig,
- unsigned int importance);
-
int tipc_send2port(u32 portref,
   struct tipc_portid const *dest,
   unsigned int num_sect,
@@ -212,20 +178,6 @@
   struct sk_buff *buf,
   unsigned int dsz);

-int tipc_forward2port(u32 portref,
- struct tipc_portid const *dest,
- unsigned int num_sect,
- struct iovec const *msg_sect,
- struct tipc_portid const *origin,
- unsigned int importance);
-
-int tipc_forward_buf2port(u32 portref,
- struct tipc_portid const *dest,
- struct sk_buff *buf,
- unsigned int dsz,
- struct tipc_portid const *orig,
- unsigned int importance);
-
int tipc_multicast(u32 portref, 
		   struct tipc_name_seq const *seq, 
		   u32 domain,	/* 0:own zone */

@@ -240,18 +192,6 @@
   unsigned int size);
#endif

-/*
- * TIPC subscription routines
- */
-
-int tipc_ispublished(struct tipc_name const *name);
-
-/*
- * Get number of available nodes

[ANNOUNCE] PRO/1000 PCI-e Software Developer Manual is now available

2007-01-25 Thread John Ronciak


The Software Developer Manual for the PRO/1000 PCI-e controllers is
now available via the http://e1000.sf.net/ web site.  The file is
OpenSDM_8257x-10.pdf.  I know it's been a long time coming but
sometimes that's just how it goes.  Enjoy.

--
Cheers,
John

Re: [ANNOUNCE] PRO/1000 PCI-e Software Developer Manual is now available

2007-01-25 Thread Jeff Garzik


John Ronciak wrote:

The Software Developer Manual for the PRO/1000 PCI-e controllers is
now available via the http://e1000.sf.net/ web site.  The file is
OpenSDM_8257x-10.pdf.  I know it's been a long time coming but
sometimes that's just how it goes.  Enjoy.


Nice, thanks for posting it!

Jeff




Re: [ANNOUNCE] PRO/1000 PCI-e Software Developer Manual is now available

2007-01-25 Thread John W. Linville

On Thu, Jan 25, 2007 at 01:40:23PM -0800, John Ronciak wrote:
 The Software Developer Manual for the PRO/1000 PCI-e controllers is
 now available via the http://e1000.sf.net/ web site.  The file is
 OpenSDM_8257x-10.pdf.  I know it's been a long time coming but
 sometimes that's just how it goes.  Enjoy.

I congratulate (and thank) you, sir!

Now, if we could only get such an announcement from the wireless side
of Intel's house... :-)

John
-- 
John W. Linville
[EMAIL PROTECTED]

Re: + oops-in-drivers-net-shaperc.patch added to -mm tree

2007-01-25 Thread David Miller

From: [EMAIL PROTECTED]
Date: Wed, 24 Jan 2007 19:54:51 -0800

 Hi,
 
 The following code:
 [...]
 
 Causes the following oops:
 
 ...
 [   66.355188]  [<c0396c74>] error_code+0x7c/0x84
 [   66.355192]  [<f8adaf03>] packet_sendmsg+0x147/0x201 [af_packet]
 [   66.355199]  [<c030e1c5>] sock_sendmsg+0xf9/0x116
 [   66.355204]  [<c030eb54>] sys_sendto+0xbf/0xe0
 [   66.355208]  [<c030f494>] sys_socketcall+0x1aa/0x277
 [   66.355212]  [<c01041ea>] sysenter_past_esp+0x5f/0x99
 [   66.355216]  ===
 [   66.355218] Code:  Bad EIP value.
 [   66.355223] EIP: [] 0x0 SS:ESP 0068:f6261d70
 
 shaper_header() should check for shaper->dev not being NULL (ie. the
 shaper was actually attached) as in the following patch.
 This happens in mainline too (tested 2.6.19.2).
 
 Signed-off-by: Frederik Deweerdt [EMAIL PROTECTED]
 Cc: David S. Miller [EMAIL PROTECTED]
 Cc: Stephen Hemminger [EMAIL PROTECTED]
 Signed-off-by: Andrew Morton [EMAIL PROTECTED]

Shaper is actually OK.  None of these hardware header callbacks
should be invoked if the device is down.  Yet, this is what is
accidentally being allowed in the AF_PACKET socket layer.

Shaper makes sure to fail ->open() if shaper->dev is NULL, in order
to prevent this.

But AF_PACKET does its check of device state too late, after the
dev->header() call.  That's the bug.

I'll fix it like this:

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 594c078..6dc01bd 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -359,6 +359,10 @@ static int packet_sendmsg_spkt(struct kiocb *iocb, struct socket *sock,
 	if (dev == NULL)
 		goto out_unlock;
 
+	err = -ENETDOWN;
+	if (!(dev->flags & IFF_UP))
+		goto out_unlock;
+
 	/*
 	 *	You may not queue a frame bigger than the mtu. This is the lowest level
 	 *	raw protocol and you must do your own fragmentation at this level.
@@ -407,10 +411,6 @@ static int packet_sendmsg_spkt(struct kiocb *iocb, struct socket *sock,
 	if (err)
 		goto out_free;
 
-	err = -ENETDOWN;
-	if (!(dev->flags & IFF_UP))
-		goto out_free;
-
 	/*
 	 *	Now send it
 	 */
@@ -738,6 +738,10 @@ static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
 	if (sock->type == SOCK_RAW)
 		reserve = dev->hard_header_len;
 
+	err = -ENETDOWN;
+	if (!(dev->flags & IFF_UP))
+		goto out_unlock;
+
 	err = -EMSGSIZE;
 	if (len > dev->mtu+reserve)
 		goto out_unlock;
@@ -770,10 +774,6 @@ static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
 	skb->dev = dev;
 	skb->priority = sk->sk_priority;
 
-	err = -ENETDOWN;
-	if (!(dev->flags & IFF_UP))
-		goto out_free;
-
 	/*
 	 *	Now send it
 	 */
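
For readers who want the ordering issue in isolation, here is a minimal userspace model of it. It is a sketch only: `struct sketch_dev`, `sketch_header()` and `sketch_sendmsg()` are invented for illustration and are not kernel interfaces, and `SKETCH_IFF_UP` merely stands in for IFF_UP. The point it shows is that the device-state check runs before the header callback is ever reached, so a callback that is only safe on a running device cannot be invoked on a device that is down.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_IFF_UP 0x1	/* stand-in for IFF_UP */

/* Hypothetical, heavily reduced stand-in for struct net_device. */
struct sketch_dev {
	unsigned int flags;
	int header_calls;	/* counts header callback invocations */
	int (*header)(struct sketch_dev *dev);
};

/* Example header callback: assumed to be safe only while the device is up. */
static int sketch_header(struct sketch_dev *dev)
{
	dev->header_calls++;
	return 0;
}

/* Send path with the device-state check placed before the callback. */
static int sketch_sendmsg(struct sketch_dev *dev)
{
	int err;

	if (dev == NULL)
		return -ENXIO;

	if (!(dev->flags & SKETCH_IFF_UP))
		return -ENETDOWN;	/* device down: never reach the callback */

	if (dev->header) {
		err = dev->header(dev);
		if (err < 0)
			return err;
	}

	return 0;	/* the frame would be queued for transmit here */
}
```

With the check in its old position (after the callback), the same model would bump header_calls even for a device that is down, which is the situation the oops above came from.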

[take34 1/10] kevent: Description.

2007-01-25 Thread Evgeniy Polyakov


Description.


diff --git a/Documentation/kevent.txt b/Documentation/kevent.txt
new file mode 100644
index 000..d6e126f
--- /dev/null
+++ b/Documentation/kevent.txt
@@ -0,0 +1,271 @@
+Description.
+
+int kevent_init(struct kevent_ring *ring, unsigned int ring_size, 
+   unsigned int flags);
+
+ring_size - size of the ring buffer in events
+ring - pointer to allocated ring buffer
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value: kevent control file descriptor or negative error value.
+
+ struct kevent_ring
+ {
+   unsigned int ring_kidx, ring_over;
+   struct ukevent event[0];
+ }
+
+ring_kidx - index in the ring buffer where the kernel will put new events
+	when kevent_wait() or kevent_get_events() is called
+ring_over - number of times ring_uidx has overflowed since the start.
+	The overflow counter prevents the situation where two threads
+	are about to free the same events, but one of them was scheduled
+	away for so long that the ring indexes wrapped; when that
+	thread is woken up, it would otherwise free events other than
+	the ones it was supposed to free.
+
+Example userspace code (ring_buffer.c) can be found on the project's homepage.
+
+Each kevent syscall can be a so-called cancellation point in glibc, i.e. when a
+thread has been cancelled in a kevent syscall, the thread can be safely removed
+and no events will be lost, since each syscall (kevent_wait() or
+kevent_get_events()) copies events into a special ring buffer, accessible
+from other threads or even processes (if shared memory is used).
+
+When a kevent is removed (not dequeued when it is ready, but just removed),
+even if it was ready, it is not copied into the ring buffer: if it is
+removed, no one cares about it (otherwise the user would wait until it becomes
+ready and get it the usual way using kevent_get_events() or kevent_wait()),
+and thus there is no need to copy it to the ring buffer.
+
+---
+
+
+int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg);
+
+fd - is the file descriptor referring to the kevent queue to manipulate. 
+It is created by opening /dev/kevent char device, which is created with 
+dynamic minor number and major number assigned for misc devices. 
+
+cmd - is the requested operation. It can be one of the following:
+KEVENT_CTL_ADD - add event notification 
+KEVENT_CTL_REMOVE - remove event notification 
+KEVENT_CTL_MODIFY - modify existing notification 
+KEVENT_CTL_READY - mark existing events as ready; if the number of events is zero,
+	it just wakes up a thread parked in the syscall
+
+num - number of struct ukevent in the array pointed to by arg 
+arg - array of struct ukevent
+
+Return value: 
+ number of events processed or negative error value.
+
+When called, kevent_ctl will carry out the operation specified in the 
+cmd parameter.
+---
+
+ int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr, 
+   struct timespec timeout, struct ukevent *buf, unsigned flags);
+
+ctl_fd - file descriptor referring to the kevent queue 
+min_nr - minimum number of completed events that kevent_get_events will block 
+waiting for 
+max_nr - number of struct ukevent in buf 
+timeout - time to wait before returning less than min_nr 
+ events. If this is -1, then wait forever. 
+buf - pointer to an array of struct ukevent. 
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value:
+ number of events copied or negative error value.
+
+kevent_get_events will wait up to timeout for at least min_nr completed
+events, copying completed struct ukevents to buf and deleting any
+KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many
+events as possible, but not more than max_nr. In blocking mode it waits until
+the timeout expires or at least min_nr events are ready.
+
+This function copies events into the ring buffer if it was initialized; if the ring buffer
+is full, the KEVENT_RET_COPY_FAILED flag is set in the ret_flags field.
+---
+
+ int kevent_wait(int ctl_fd, unsigned int num, unsigned int old_uidx, 
+   struct timespec timeout, unsigned int flags);
+
+ctl_fd - file descriptor referring to the kevent queue 
+num - number of processed kevents 
+old_uidx - the last index user is aware of
+timeout - time to wait until there is free space in kevent queue
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value:
+ number of events copied into ring buffer or negative error value.
+
+This syscall waits until either the timeout expires or at least one event becomes
+ready. It also copies events into the special ring buffer. If the ring buffer is full,
+it waits until there are ready events and then returns.
+If kevent is one-shot kevent it is
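
As a rough illustration of the ring_over description above (this is not part of the patch): the sketch below shows how a wrap counter lets a commit reject a thread whose view of the ring is stale, even when the index alone happens to match again after wrapping. All names here (ring_state, ring_snapshot, ring_observe, ring_commit, RING_SIZE) are hypothetical, the code is single-threaded, and a real implementation would need the compare-and-update step to be atomic.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 4u	/* hypothetical size, tiny so wrapping is easy to see */

/* Hypothetical user-side view of the ring: index plus wrap counter. */
struct ring_state {
	unsigned int uidx;	/* next slot userspace will free */
	unsigned int over;	/* how many times uidx has wrapped */
};

/* What a thread remembers before it decides to free events. */
struct ring_snapshot {
	unsigned int uidx;
	unsigned int over;
};

static struct ring_snapshot ring_observe(const struct ring_state *r)
{
	struct ring_snapshot s = { r->uidx, r->over };
	return s;
}

/*
 * Free num events, but only if the ring has not moved since the
 * snapshot was taken. Comparing uidx alone is not enough: after a
 * full wrap the index is the same again, only the counter differs.
 */
static bool ring_commit(struct ring_state *r, struct ring_snapshot seen,
			unsigned int num)
{
	assert(num <= RING_SIZE);

	if (seen.uidx != r->uidx || seen.over != r->over)
		return false;	/* stale view, refuse to free anything */

	r->uidx += num;
	if (r->uidx >= RING_SIZE) {
		r->uidx -= RING_SIZE;
		r->over++;
	}
	return true;
}
```

If two threads take the same snapshot and the first one frees a whole ring's worth of events, the index returns to its old value, but the second thread's commit is still refused because the wrap counter changed.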

[take34 0/10] kevent: Generic event handling mechanism.

2007-01-25 Thread Evgeniy Polyakov


Generic event handling mechanism.

Kevent is a generic subsystem for handling event notifications.
It supports both level- and edge-triggered events. It is similar to
poll/epoll in some cases, but it is more scalable, faster, and
works with essentially any kind of event.

Events are passed to the kernel through a control syscall and can be read
back through a ring buffer or using the usual syscalls.
Kevent updates (i.e. readiness switching) happen directly from the internals
of the appropriate state machine of the underlying subsystem (like
network, filesystem, timer or any other).

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Documentation page:
http://linux-net.osdl.org/index.php/Kevent

Consider for inclusion.

P.S. If you want to be removed from Cc: list just drop me a mail.

Changes from 'take33' patchset:
 * Added optional header pointer and its size into aio_sendfile_path(),
   which allows sending a header and a file in one syscall instead of
   send(header), open file, sendfile(file).

Changes from 'take32' patchset:
 * Updated documentation (aio_sendfile_path()).
 * Fixed typo in forward declaration.

Changes from 'take31' patchset:
 * Added aio_sendfile_path() - this syscall asynchronously transfers the
   file specified by the provided pathname to the destination socket.
   The opened file descriptor is returned.
 * Added a trivial scheduler which selects the execution thread. It allows
   a given thread to be specified by hand, but since kaio passes '-1', it uses
   round-robin to pick the processing thread. In theory it can be bound to
   scheduler statistics or gamma-ray receiver data.
 * A number of bug fixes in kevent based AIO mpage_readpages().

   A benchmark transferring 100 1MB files (files are in VFS already) using
   sync sendfile or this new version shows about a 10MB/sec performance win
   for aio_sendfile_path().

Changes from 'take30' patchset:
 * AIO state machine.
 * aio_sendfile() implementation.
 * moved kevent_user_get/kevent_user_put into header.
 * use *zalloc where needed.

Changes from 'take29' patchset:
 * new private userspace notifications - allows queueing any userspace private
   event and then marking it as ready using the kevent_ctl(KEVENT_READY) command
 * KEVENT_REQ_READY flag - if set, kevent will be marked as ready at enqueue time
 * port to 2.6.20-rc2 tree (54abb5fcdae74a811ed440ec6556cabc6b24f404 commit)
 * use struct kmem_cache instead of kmem_cache_t
 * added notification type into search key, this allows having the same id for
   different types of notifications

Changes from 'take28' patchset:
 * optimized af_unix to use socket notifications
 * changed ALWAYS_QUEUE behaviour with poll/select notifications - previously
   kevent was not queued into poll wait queue when ALWAYS_QUEUE flag
   is set
 * added KEVENT_POLL_POLLRDHUP definition into ukevent.h header
 * libevent-1.2 patch (Jamal, your request is completed, so I'm waiting two weeks
   before starting final countdown :)
   All regression tests passed successfully except test_evbuffer(), which
   crashed on my amd64 linux 2.6 test machine for all types of notifications;
   probably it was fixed in libevent-1.2a version, I did not check.
   Patch and README can be found at project homepage.

Changes from 'take27' patchset:
 * made kevent default yes in non embedded case.
 * added flags to callback structures - currently used to check if kevent
   can be requested from kernelspace only (posix timers) or
   userspace (all others)

Changes from 'take26' patchset:
 * made kevent visible in config only in case of embedded setup.
 * added comment about KEVENT_MAX number.
 * spell fix.

Changes from 'take25' patchset:
 * use timespec as timeout parameter.
 * added high-resolution timer to handle absolute timeouts.
 * added flags to waiting and initialization syscalls.
 * kevent_commit() has new_uidx parameter.
 * kevent_wait() has old_uidx parameter, which, if not equal to u->uidx,
   results in immediate wakeup (useful for the case when entries
   are added asynchronously from kernel (not supported for now)).
 * added interface to mark any event as ready.
 * event POSIX timers support.
 * return -ENOSYS if there is no registered event type.
 * provided file descriptor must be checked for fifo type (spotted by Eric Dumazet).
 * signal notifications.
 * documentation update.
 * lighttpd patch updated (the latest benchmarks with lighttpd patch can be
   found in blog).

Changes from 'take24' patchset:
 * new (old (new)) ring buffer implementation with kernel and user indexes.
 * added initialization syscall instead of opening /dev/kevent
 * kevent_commit() syscall to commit ring buffer entries
 * changed KEVENT_REQ_WAKEUP_ONE flag to KEVENT_REQ_WAKEUP_ALL, kevent wakes
   only first thread always if that flag is not set
 * KEVENT_REQ_ALWAYS_QUEUE flag. If set, kevent will be queued into ready queue
   instead of copying back
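
For readers unfamiliar with the 'take31' scheduler item above, here is a minimal sketch of the selection rule as described: an explicitly requested thread is used as-is, while '-1' falls back to round-robin. The names (kaio_sched, kaio_pick_thread, KAIO_THREAD_ANY) are made up for illustration and are not taken from the patchset; treating an out-of-range request like '-1' is likewise only an assumption.

```c
#include <assert.h>

#define KAIO_THREAD_ANY (-1)	/* hypothetical name for the '-1' value */

/* Hypothetical scheduler state: pool size and round-robin cursor. */
struct kaio_sched {
	unsigned int nr_threads;
	unsigned int next;
};

/*
 * Pick the thread that should process the next request.
 * A valid explicit request wins and does not move the cursor;
 * anything else is served round-robin.
 */
static unsigned int kaio_pick_thread(struct kaio_sched *s, int requested)
{
	unsigned int picked;

	assert(s->nr_threads > 0);

	if (requested >= 0 && (unsigned int)requested < s->nr_threads)
		return (unsigned int)requested;

	picked = s->next;
	s->next = (s->next + 1) % s->nr_threads;
	return picked;
}
```

With three threads, repeated calls with KAIO_THREAD_ANY cycle through 0, 1, 2, 0, and so on, while a call that names thread 2 returns 2 without disturbing that cycle.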

[take34 10/10] kevent: Kevent based AIO (aio_sendfile()/aio_sendfile_path()).

2007-01-25 Thread Evgeniy Polyakov


Kevent based AIO (aio_sendfile()/aio_sendfile_path()).

aio_sendfile()/aio_sendfile_path() consists of two major parts: the AIO
state machine and the page processing code.
The former is just a small subsystem which allows callbacks to be queued
for invocation in process context on behalf of a pool of kernel
threads. It allows caches of callbacks to be queued to the local thread
or to any other specified thread. Each cache of callbacks is processed for
as long as there are callbacks in it; callbacks can requeue themselves into
the same cache.
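
The processing rule above (run a cache until it is empty, with callbacks allowed to requeue themselves) can be sketched roughly as follows. This is a userspace toy with invented names (cb_cache, cb_queue, cb_process, countdown_cb, CACHE_MAX), not the code from the patch, and it ignores threads and locking entirely.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_MAX 8u	/* hypothetical capacity of one callback cache */

struct cb_cache;
typedef void (*aio_cb_t)(struct cb_cache *cache, void *data);

struct cb_entry {
	aio_cb_t fn;
	void *data;
};

/* A small FIFO of pending callbacks. */
struct cb_cache {
	struct cb_entry ring[CACHE_MAX];
	unsigned int head;
	unsigned int count;
};

/* Append a callback; returns 0 on success, -1 if the cache is full. */
static int cb_queue(struct cb_cache *c, aio_cb_t fn, void *data)
{
	unsigned int tail;

	if (c->count == CACHE_MAX)
		return -1;
	tail = (c->head + c->count) % CACHE_MAX;
	c->ring[tail].fn = fn;
	c->ring[tail].data = data;
	c->count++;
	return 0;
}

/* Run callbacks until the cache is empty; returns how many ran. */
static unsigned int cb_process(struct cb_cache *c)
{
	unsigned int ran = 0;

	while (c->count > 0) {
		struct cb_entry e = c->ring[c->head];

		c->head = (c->head + 1) % CACHE_MAX;
		c->count--;
		assert(e.fn != NULL);
		e.fn(c, e.data);	/* may call cb_queue() on the same cache */
		ran++;
	}
	return ran;
}

/* Example callback: counts down and requeues itself until it hits zero. */
static void countdown_cb(struct cb_cache *c, void *data)
{
	int *left = data;

	if (--*left > 0)
		cb_queue(c, countdown_cb, left);
}
```

Queueing countdown_cb once with a counter of 3 makes cb_process() invoke it three times before the cache drains, which is the "requeue into the same cache" behaviour in miniature.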

The real work is done in the page processing code, which populates
pages into the VFS cache and then sends them to the destination socket
via ->sendpage(). Unlike the previous aio_sendfile() implementation, the new
one does not require low-level filesystem-specific callbacks (->get_block())
at all; instead I extended struct address_space_operations with a new
member called ->aio_readpages(), which is exactly the same as ->readpage()
(read: mpage_readpages()) except for different BIO allocation and submission
routines. I changed mpage_readpages() to pass mpage_alloc() and
mpage_bio_submit() to a new function called __mpage_readpages(), which is
exactly the old mpage_readpages() with invocation of the provided callbacks
instead of direct use of the old functions. mpage_readpages_aio() provides
kevent-specific callbacks, which call the old functions but with different
destructor callbacks; these are essentially the same, except that they
reschedule AIO processing.
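
The refactoring described above is the usual "pass the allocator and submitter in as callbacks" pattern. A stripped-down sketch follows; every type and function name in it (fake_bio, read_pages_common, sync_alloc, aio_alloc, plain_submit, read_pages_sync, read_pages_aio) is a stand-in invented for this example, and only the idea of storing priv in the bio mirrors the bi_private assignment in the patch below.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct bio, reduced to what the sketch needs. */
struct fake_bio {
	void *private_data;
	int submitted;
};

typedef struct fake_bio *(*alloc_fn)(struct fake_bio *slot, void *priv);
typedef void (*submit_fn)(struct fake_bio *bio);

/* Common loop: identical for sync and AIO, only the callbacks differ. */
static int read_pages_common(struct fake_bio *slots, int nr,
			     alloc_fn alloc, submit_fn submit, void *priv)
{
	int i, done = 0;

	assert(alloc != NULL && submit != NULL);

	for (i = 0; i < nr; i++) {
		struct fake_bio *bio = alloc(&slots[i], priv);

		if (!bio)
			break;
		submit(bio);
		done++;
	}
	return done;
}

/* Old behaviour: no private data attached. */
static struct fake_bio *sync_alloc(struct fake_bio *slot, void *priv)
{
	(void)priv;
	slot->private_data = NULL;
	slot->submitted = 0;
	return slot;
}

/* AIO variant: reuse the old allocator, then attach the private pointer. */
static struct fake_bio *aio_alloc(struct fake_bio *slot, void *priv)
{
	struct fake_bio *bio = sync_alloc(slot, NULL);

	bio->private_data = priv;
	return bio;
}

static void plain_submit(struct fake_bio *bio)
{
	bio->submitted = 1;
}

/* The two thin entry points built on top of the common loop. */
static int read_pages_sync(struct fake_bio *slots, int nr)
{
	return read_pages_common(slots, nr, sync_alloc, plain_submit, NULL);
}

static int read_pages_aio(struct fake_bio *slots, int nr, void *priv)
{
	return read_pages_common(slots, nr, aio_alloc, plain_submit, priv);
}
```

The point of the split is that the shared loop never needs to know whether it is serving the synchronous or the AIO path.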

aio_sendfile_path() is essentially aio_sendfile(), except that it takes
the source filename as a parameter and returns the opened file descriptor.

A benchmark transferring 100 1MB files (files are in VFS already) using sync
sendfile() against aio_sendfile_path() shows about a 10MB/sec performance win
(78 MB/s vs 66-72 MB/s over 1 Gb network, sendfile sending server is one-way
AMD Athlon 64 3500+) for aio_sendfile_path().

The AIO state machine is a base for network AIO (which becomes
quite trivial), but I will not start the implementation until the
roadmap for kevent as a whole and for the AIO implementation becomes clearer.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/fs/bio.c b/fs/bio.c
index 7618bcb..291e7e8 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -120,7 +120,7 @@ void bio_free(struct bio *bio, struct bio_set *bio_set)
 /*
  * default destructor for a bio allocated with bio_alloc_bioset()
  */
-static void bio_fs_destructor(struct bio *bio)
+void bio_fs_destructor(struct bio *bio)
 {
bio_free(bio, fs_bio_set);
 }
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index beaf25f..f08c957 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1650,6 +1650,13 @@ ext3_readpages(struct file *file, struct address_space *mapping,
return mpage_readpages(mapping, pages, nr_pages, ext3_get_block);
 }
 
+static int
+ext3_readpages_aio(struct file *file, struct address_space *mapping,
+		struct list_head *pages, unsigned nr_pages, void *priv)
+{
+	return mpage_readpages_aio(mapping, pages, nr_pages, ext3_get_block, priv);
+}
+
 static void ext3_invalidatepage(struct page *page, unsigned long offset)
 {
 	journal_t *journal = EXT3_JOURNAL(page->mapping->host);
@@ -1768,6 +1775,7 @@ static int ext3_journalled_set_page_dirty(struct page *page)
 }
 
 static const struct address_space_operations ext3_ordered_aops = {
+   .aio_readpages  = ext3_readpages_aio,
.readpage   = ext3_readpage,
.readpages  = ext3_readpages,
.writepage  = ext3_ordered_writepage,
diff --git a/fs/mpage.c b/fs/mpage.c
index 692a3e5..e5ba44b 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -102,7 +102,7 @@ static struct bio *mpage_bio_submit(int rw, struct bio *bio)
 static struct bio *
 mpage_alloc(struct block_device *bdev,
sector_t first_sector, int nr_vecs,
-   gfp_t gfp_flags)
+   gfp_t gfp_flags, void *priv)
 {
struct bio *bio;
 
@@ -116,6 +116,7 @@ mpage_alloc(struct block_device *bdev,
if (bio) {
 		bio->bi_bdev = bdev;
 		bio->bi_sector = first_sector;
+		bio->bi_private = priv;
}
return bio;
 }
@@ -175,7 +176,10 @@ map_buffer_to_page(struct page *page, struct buffer_head *bh, int page_block)
 static struct bio *
 do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 		sector_t *last_block_in_bio, struct buffer_head *map_bh,
-		unsigned long *first_logical_block, get_block_t get_block)
+		unsigned long *first_logical_block, get_block_t get_block,
+		struct bio *(*alloc)(struct block_device *bdev, sector_t first_sector,
+			int nr_vecs, gfp_t gfp_flags, void *priv),
+		struct bio *(*submit)(int rw, struct bio *bio), void *priv)
 {
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
@@ -302,25 +306,25 @@ do_mpage_readpage(struct bio *bio, struct page *page,

Re: [take34 0/10] kevent: Generic event handling mechanism.

2007-01-25 Thread Evgeniy Polyakov

On Thu, Jan 25, 2007 at 04:48:30PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) 
wrote:
 Changes from 'take33' patchset:
  * Added optional header pointer and its size into aio_sendfile_path(), 
which allows to send header and file in one syscall instead of
send(header), open file, sendfile(file).

Btw, aio_sendfile and aio_sendfile_path use a naive and actually the
simplest approach to async IO: it just stupidly blocks on sending or
resends (a repeated-sending approach). I'm a bit lazy to use kevent
there, since a somewhat deeper analysis shows there is _no_ gain
(hint: there are multiple IO threads, some of them might block),
and network AIO does not exist (yet; kevent status is in hinged state,
and I was asked to postpone additional feature addons, which otherwise
could happen a bit more frequently than current kevent/kernel releases),
since the future of kevent is indeterminate...

-- 
Evgeniy Polyakov
