Re: [PATCH] net: decnet handle a failure in neigh_parms_alloc (take 2)
Hi,

On Wed, Jan 24, 2007 at 09:55:45PM -0700, Eric W. Biederman wrote:
> While enhancing the neighbour code to handle multiple network
> namespaces I noticed that decnet is assuming neigh_parms_alloc
> will always succeed, which is clearly wrong. So handle the
> failure.
>
> Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]

Acked-by: Steven Whitehouse [EMAIL PROTECTED]

Also you should cc Patrick as he is now the maintainer,

Steve.

> ---
>  net/decnet/dn_dev.c |   11 +++++++++--
>  1 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/net/decnet/dn_dev.c b/net/decnet/dn_dev.c
> index 324eb47..913e25a 100644
> --- a/net/decnet/dn_dev.c
> +++ b/net/decnet/dn_dev.c
> @@ -1140,16 +1140,23 @@ struct dn_dev *dn_dev_create(struct net_device *dev, int *err)
>  	init_timer(&dn_db->timer);
>  
>  	dn_db->uptime = jiffies;
> +
> +	dn_db->neigh_parms = neigh_parms_alloc(dev, &dn_neigh_table);
> +	if (!dn_db->neigh_parms) {
> +		dev->dn_ptr = NULL;
> +		kfree(dn_db);
> +		return NULL;
> +	}
> +
>  	if (dn_db->parms.up) {
>  		if (dn_db->parms.up(dev) < 0) {
> +			neigh_parms_release(&dn_neigh_table, dn_db->neigh_parms);
>  			dev->dn_ptr = NULL;
>  			kfree(dn_db);
>  			return NULL;
>  		}
>  	}
>  
> -	dn_db->neigh_parms = neigh_parms_alloc(dev, &dn_neigh_table);
> -
>  	dn_dev_sysctl_register(dev, &dn_db->parms);
>  
>  	dn_dev_set_timer(dev);
> --
> 1.4.4.1.g278f

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
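For readers following the thread, the shape of the fix (allocate early, check the result, and release it again if a later step fails) can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct parms`, `struct dev_state`, `parms_alloc()`, `parms_release()`, `dev_create()` and the `parms_alloc_should_fail` knob are hypothetical stand-ins, not the decnet or neighbour-table API.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel objects; not the real decnet types. */
struct parms { int unused; };
struct dev_state {
    struct parms *neigh_parms;
    int (*up)(void);
};

static int parms_alloc_should_fail;  /* knob to simulate allocation failure */
static int parms_live;               /* number of parms objects not yet released */

static struct parms *parms_alloc(void)
{
    struct parms *p;

    if (parms_alloc_should_fail)
        return NULL;
    p = calloc(1, sizeof(*p));
    if (p)
        parms_live++;
    return p;
}

static void parms_release(struct parms *p)
{
    parms_live--;
    free(p);
}

static struct dev_state *dev_create(int (*up)(void))
{
    struct dev_state *db = calloc(1, sizeof(*db));

    if (!db)
        return NULL;
    db->up = up;

    /* Allocate first and check the result before anything depends on it. */
    db->neigh_parms = parms_alloc();
    if (!db->neigh_parms) {
        free(db);
        return NULL;
    }

    /* If a later step fails, the earlier allocation has to be undone. */
    if (db->up && db->up() < 0) {
        parms_release(db->neigh_parms);
        free(db);
        return NULL;
    }
    return db;
}

static void dev_destroy(struct dev_state *db)
{
    parms_release(db->neigh_parms);
    free(db);
}

static int up_ok(void)   { return 0; }
static int up_fail(void) { return -1; }
```

Moving the allocation ahead of the `up()` call is what makes the extra release in the `up()` failure path necessary; with the old ordering that path had nothing to undo.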
Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection
On Wed, Jan 24, 2007 at 05:54:47PM -0800, Sridhar Samudrala wrote:
> Sec 2.1 of RFC 4429 says
>
>    Unless noted otherwise, components of the IPv6 protocol stack
>    should treat addresses in the Optimistic state equivalently to
>    those in the Deprecated state, indicating that the address is
>    available for use but should not be used if another suitable
>    address is available. For example, Default Address Selection
>    [RFC3484] uses the address state to decide which source address to
>    use for an outgoing packet. Implementations should treat an
>    address in state Optimistic as if it were in state Deprecated. If
>    address states are recorded as individual flags, this can easily
>    be achieved by also setting 'Deprecated' when 'Optimistic' is set.
>
> So i think DEPRECATED flag also should be set when we mark an address
> as OPTIMISTIC so that we don't use it as source address for new
> connections if another address is available until DAD is completed.
>
> Thanks
> Sridhar

Oh, good catch. Thank you Sri. However, I'm worried about the next paragraph:

   It is important to note that the address lifetime rules of
   [RFC2462] still apply, and so an address may be Deprecated as well
   as Optimistic. When DAD completes without incident, the address
   becomes either a Preferred or a Deprecated address, as per RFC 2462

Given that, it seems to me that addresses which are flagged as Deprecated may enter and exit that state independently of the DAD process, which I think gives rise to the possibility of a race. I.e. if an address becomes deprecated right before DAD completes, and then addrconf_dad_complete clears the IFA_F_DEPRECATED flag, that seems wrong. Instead I think it would be better if we tested for the OPTIMISTIC flag in ipv6_dev_get_saddr in parallel with the DEPRECATED flag.

I may be wrong about this, but I'm going to err on the side of safety. If you can ensure that this race is not possible, please let me know, and I'll happily just set the flag. I'll repost a new patch soon.

Thanks & Regards
Neil

> On Tue, 2007-01-23 at 15:51 -0500, Neil Horman wrote:
> > On Tue, Jan 23, 2007 at 09:18:20AM +0900, YOSHIFUJI Hideaki / 吉藤英明 wrote:
> > > Hello.
> > >
> > <snip>
> >
> > New patch attached, incorporating Yoshifuji and Vlad's latest
> > comments. I didn't follow guidance on the ndisc_recv_ns comment,
> > Yoshifuji, since Vlad had already suggested an alternate solution
> > in a previous post, but from looking at them both, they should be
> > equivalent.
> >
> > Thanks & Regards
> > Neil
> >
> > Signed-off-by: Neil Horman [EMAIL PROTECTED]
> >
> >  include/linux/if_addr.h |    1
> >  include/linux/ipv6.h    |    2 +
> >  include/linux/sysctl.h  |    1
> >  include/net/addrconf.h  |    4 +-
> >  net/ipv6/addrconf.c     |   56
> >  net/ipv6/mcast.c        |    4 +-
> >  net/ipv6/ndisc.c        |   82 +++-
> >  7 files changed, 117 insertions(+), 33 deletions(-)
> >
> > diff --git a/include/linux/if_addr.h b/include/linux/if_addr.h
> > index d557e4c..43f3bed 100644
> > --- a/include/linux/if_addr.h
> > +++ b/include/linux/if_addr.h
> > @@ -39,6 +39,7 @@ enum
> >  #define IFA_F_TEMPORARY	IFA_F_SECONDARY
> >  
> >  #define	IFA_F_NODAD		0x02
> > +#define IFA_F_OPTIMISTIC	0x04
> >  #define	IFA_F_HOMEADDRESS	0x10
> >  #define IFA_F_DEPRECATED	0x20
> >  #define IFA_F_TENTATIVE		0x40
> > diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> > index f824113..5d37abf 100644
> > --- a/include/linux/ipv6.h
> > +++ b/include/linux/ipv6.h
> > @@ -177,6 +177,7 @@ struct ipv6_devconf {
> >  #endif
> >  #endif
> >  	__s32		proxy_ndp;
> > +	__s32		optimistic_dad;
> >  	void		*sysctl;
> >  };
> >  
> > @@ -205,6 +206,7 @@ enum {
> >  	DEVCONF_RTR_PROBE_INTERVAL,
> >  	DEVCONF_ACCEPT_RA_RT_INFO_MAX_PLEN,
> >  	DEVCONF_PROXY_NDP,
> > +	DEVCONF_OPTIMISTIC_DAD,
> >  	DEVCONF_MAX
> >  };
> >  
> > diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> > index 81480e6..972a33a 100644
> > --- a/include/linux/sysctl.h
> > +++ b/include/linux/sysctl.h
> > @@ -570,6 +570,7 @@ enum {
> >  	NET_IPV6_RTR_PROBE_INTERVAL=21,
> >  	NET_IPV6_ACCEPT_RA_RT_INFO_MAX_PLEN=22,
> >  	NET_IPV6_PROXY_NDP=23,
> > +	NET_IPV6_OPTIMISTIC_DAD=24,
> >  	__NET_IPV6_MAX
> >  };
> >  
> > diff --git a/include/net/addrconf.h b/include/net/addrconf.h
> > index 88df8fc..d248a19 100644
> > --- a/include/net/addrconf.h
> > +++ b/include/net/addrconf.h
> > @@ -73,7 +73,9 @@ extern int			ipv6_get_saddr(struct dst_entry *dst,
> >  extern int			ipv6_dev_get_saddr(struct net_device *dev,
> >  						struct in6_addr *daddr,
> >  						struct in6_addr *saddr);
> > -extern int			ipv6_get_lladdr(struct net_device *dev, struct in6_addr *);
> > +extern int			ipv6_get_lladdr(struct net_device *dev,
> > +						struct in6_addr *,
> > +
Re: [BUG] problem with BPF in PF_PACKET sockets, introduced in linux-2.6.19
Hello!

> So this whole idea to make run_filter() return signed integers
> and fail on negative is entirely flawed, it simply cannot work
> and retain the expected semantics which have been there forever.

Actually, it can. The return value was used only as a sign of error, so the mistake was to return the original unsigned result cast to int.

Alternative fix is enclosed. To be honest, it is not better than yours: duplication of a couple of lines of code against passing the return value by pointer.

Alexey

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index da73e8a..51e5537 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -437,11 +437,13 @@ static inline int run_filter(struct sk_b
 	rcu_read_lock_bh();
 	filter = rcu_dereference(sk->sk_filter);
 	if (filter != NULL) {
-		err = sk_run_filter(skb, filter->insns, filter->len);
-		if (!err)
+		unsigned int res;
+
+		res = sk_run_filter(skb, filter->insns, filter->len);
+		if (!res)
 			err = -EPERM;
-		else if (*snaplen > err)
-			*snaplen = err;
+		else if (*snaplen > res)
+			*snaplen = res;
 	}
 	rcu_read_unlock_bh();
 
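To make the signedness problem concrete, here is a small userspace sketch of both shapes of the function. It assumes, as a simplification of the thread, that the caller treats any negative return as "drop" and that a filter may legitimately return 0xffffffff to mean "keep everything". The `filter_fn` callbacks and the `run_filter_buggy()`/`run_filter_fixed()` names are hypothetical, not kernel code.

```c
#include <errno.h>

/* Hypothetical stand-in for the filter: bytes to keep, or 0 to drop the packet. */
typedef unsigned int (*filter_fn)(void);

static unsigned int accept_all(void) { return 0xffffffffu; }
static unsigned int drop(void)       { return 0; }
static unsigned int keep_64(void)    { return 64; }

/* Pre-fix shape: the unsigned verdict is stored in the int that is also
 * returned as the error code. */
static int run_filter_buggy(filter_fn f, unsigned int *snaplen)
{
    /* 0xffffffff becomes -1 on the usual two's-complement targets. */
    int err = (int)f();

    if (!err)
        err = -EPERM;
    else if (*snaplen > (unsigned int)err)
        *snaplen = (unsigned int)err;
    return err;   /* a caller testing "< 0" now drops an accepted packet */
}

/* Shape of the enclosed fix: keep the verdict unsigned and return only
 * 0 or a negative errno. */
static int run_filter_fixed(filter_fn f, unsigned int *snaplen)
{
    int err = 0;
    unsigned int res = f();

    if (!res)
        err = -EPERM;
    else if (*snaplen > res)
        *snaplen = res;
    return err;
}
```

With `accept_all`, the buggy variant returns a negative value even though the filter accepted the packet, while the fixed variant returns 0 and leaves the snap length alone.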
RE: bonding: bug in balance-alb mode (incorrect update-ARP-replies)
Jay Vosburgh [EMAIL PROTECTED] wrote:
> Is your test occurring on an isolated network, and is there other
> concurrent network traffic that might be affecting things?

The problem still persists as long as the box is connected to our Ciscos. I tried to simulate it with a dumb switch with only my two boxes connected, but there were no unsolicited ARP-replies anymore.

On the Ciscos I sometimes see ARP-replies with a destination MAC of 00:00:00:00:00:00 (!) from some Linux-boxes which are using bonding. Currently I don't have a clue where they're coming from...

The boxes are receiving around 200 ARP-replies a minute - so yes, there's concurrent network traffic :-)
Marvell Libertas 8388 802.11 USB - added to Orbit
I've slapped the two Marvell Libertas 8388 802.11 USB cards onto Winlab's Orbit testbed on sandbox 8. This gives anyone willing to help hack on the driver access to a node with the wireless card.

http://www.orbit-lab.org/wiki/Documentation/Developers

Luis
Re: Marvell Libertas 8388 802.11 USB - added to Orbit
On Thu, Jan 25, 2007 at 10:24:40AM -0500, Luis R. Rodriguez wrote:
> I've slapped the two Marvell Libertas 8388 802.11 USB cards onto
> Winlab's Orbit testbed on sandbox 8. This allows anyone willing to
> help hack on the driver with access to a node with the wireless card.
>
> http://www.orbit-lab.org/wiki/Documentation/Developers

Very cool...thanks, Luis!

--
John W. Linville [EMAIL PROTECTED]
Re: [Lksctp-developers] Fw: Intermittent SCTP multihoming breakage
Hi Steve

Steve Hill wrote:
> On Wed, 10 Jan 2007, Sridhar Samudrala wrote:
>
>> So looks like there may be an issue with PR-SCTP(partial reliability)
>> support and packet loss. I will take a look into this.
>> Do you still see this problem even if you don't set timetolive?
>
> No, the problem seems to go away if the timetolive is set to 0, so this
> is what I have now done since I had not intended to set the timetolive
> in the first place (but I thought it was still worth posting details of
> the problem since it does appear to be a bug).

I think I found this bug. It was rather interesting to figure out.

The problem appears to be that data messages time out within the rto. As a result, they move to the abandoned list and are never retransmitted. This clears the retransmit list and the retransmit timer; however, the data is still charged as in-flight against the association. This in turn causes new data not to be sent, since we are 'supposedly' utilizing our congestion window.

Can you try the attached patch and let me know if the problem is fixed. You can try reducing rto_max or path_max_retrans to get the failover to happen a little faster.

Regards
-vlad

[SCTP]: Fix connection hang with PR-SCTP

The problem that this patch corrects happens when all of the following conditions are satisfied:
1. PR-SCTP is used and the timeout on the chunks is set below RTO.Max.
2. One of the paths on a multihomed association is brought down.

In this scenario, data will expire within the rto of the initial transmission and will never be retransmitted. However this data still fills the send buffer and is counted against the association as outstanding data. This causes any new data to not be sent and retransmission to not happen.

The fix is to discount the abandoned data from the outstanding count and the peer's rwnd estimation. This allows new data to be sent and a retransmission timer to be restarted. Even though this new data will most likely expire within the rto, the timer still counts as a strike against the transport and forces the FORWARD-TSN chunk to be retransmitted as well.

Signed-off-by: Vlad Yasevich [EMAIL PROTECTED]
---
 net/sctp/outqueue.c |   27 ++++++++++++++++++++++-----
 1 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index fba567a..54d1b7f 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -396,6 +396,19 @@ void sctp_retransmit_mark(struct sctp_outq *q,
 		if (sctp_chunk_abandoned(chunk)) {
 			list_del_init(lchunk);
 			sctp_insert_list(&q->abandoned, lchunk);
+
+			/* If this chunk has not been previousely acked,
+			 * stop considering it 'outstanding'.  Our peer
+			 * will most likely never see it since it will
+			 * not be retransmitted
+			 */
+			if (!chunk->tsn_gap_acked) {
+				chunk->transport->flight_size -=
+						sctp_data_size(chunk);
+				q->outstanding_bytes -= sctp_data_size(chunk);
+				q->asoc->peer.rwnd += (sctp_data_size(chunk) +
+							sizeof(struct sk_buff));
+			}
 			continue;
 		}
 
@@ -1244,6 +1257,15 @@ static void sctp_check_transmitted(struct sctp_outq *q,
 			if (sctp_chunk_abandoned(tchunk)) {
 				/* Move the chunk to abandoned list. */
 				sctp_insert_list(&q->abandoned, lchunk);
+
+				/* If this chunk has not been acked, stop
+				 * considering it as 'outstanding'.
+				 */
+				if (!tchunk->tsn_gap_acked) {
+					tchunk->transport->flight_size -=
+							sctp_data_size(tchunk);
+					q->outstanding_bytes -= sctp_data_size(tchunk);
+				}
 				continue;
 			}
 
@@ -1695,11 +1717,6 @@ static void sctp_generate_fwdtsn(struct sctp_outq *q, __u32 ctsn)
 		 */
 		if (TSN_lte(tsn, ctsn)) {
 			list_del_init(lchunk);
-			if (!chunk->tsn_gap_acked) {
-				chunk->transport->flight_size -=
-					sctp_data_size(chunk);
-				q->outstanding_bytes -= sctp_data_size(chunk);
-			}
 			sctp_chunk_free(chunk);
 		} else {
 			if (TSN_lte(tsn, asoc->adv_peer_ack_point+1)) {
--
1.4.4.2.g8336
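The accounting problem described above can be illustrated with a toy model. This is a deliberately simplified sketch, not the SCTP implementation: `struct chunk`, `struct outq`, `abandon_chunk()` and `can_send()` are hypothetical, and the send check is reduced to a single comparison against a congestion window.

```c
#include <stddef.h>

/* Toy model of an outqueue; the field names are illustrative only. */
struct chunk {
    size_t size;
    int abandoned;
    int gap_acked;
};

struct outq {
    size_t outstanding_bytes;  /* bytes counted as in flight */
    size_t cwnd;               /* congestion window */
};

/* Move a chunk to the "abandoned" state. Unless the peer has already
 * acked it, stop counting it as outstanding: it will not be retransmitted. */
static void abandon_chunk(struct outq *q, struct chunk *c)
{
    c->abandoned = 1;
    if (!c->gap_acked)
        q->outstanding_bytes -= c->size;
}

/* New data may go out only while the window is not fully used. */
static int can_send(const struct outq *q, size_t len)
{
    return q->outstanding_bytes + len <= q->cwnd;
}
```

If abandoned chunks stayed in `outstanding_bytes`, `can_send()` would keep returning 0 even though nothing is left to retransmit, which is the hang the patch description reports.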
Re: [Lksctp-developers] Fw: Intermittent SCTP multihoming breakage
BTW, if anyone needs a reproducer, I can provide one.

-vlad
Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection
Neil Horman wrote:
> On Wed, Jan 24, 2007 at 05:54:47PM -0800, Sridhar Samudrala wrote:
>> Sec 2.1 of RFC 4429 says
>>
>>    Unless noted otherwise, components of the IPv6 protocol stack
>>    should treat addresses in the Optimistic state equivalently to
>>    those in the Deprecated state, indicating that the address is
>>    available for use but should not be used if another suitable
>>    address is available. For example, Default Address Selection
>>    [RFC3484] uses the address state to decide which source address to
>>    use for an outgoing packet. Implementations should treat an
>>    address in state Optimistic as if it were in state Deprecated. If
>>    address states are recorded as individual flags, this can easily
>>    be achieved by also setting 'Deprecated' when 'Optimistic' is set.
>>
>> So i think DEPRECATED flag also should be set when we mark an address
>> as OPTIMISTIC so that we don't use it as source address for new
>> connections if another address is available until DAD is completed.
>>
>> Thanks
>> Sridhar
>
> Oh, good catch. Thank you Sri. However, I'm worried about the next
> paragraph:
>
>    It is important to note that the address lifetime rules of
>    [RFC2462] still apply, and so an address may be Deprecated as well
>    as Optimistic. When DAD completes without incident, the address
>    becomes either a Preferred or a Deprecated address, as per RFC 2462
>
> Given that, it seems to me that addresses which are flagged as
> Deprecated may enter and exit that state independently of the DAD
> process, which I think gives rise to the possibility of a race. I.e.
> if an address becomes deprecated right before DAD completes, and then
> addrconf_dad_complete clears the IFA_F_DEPRECATED flag, that seems
> wrong. Instead I think it would be better if we tested for the
> OPTIMISTIC flag in ipv6_dev_get_saddr in parallel with the DEPRECATED
> flag.
>
> I may be wrong about this, but I'm going to err on the side of safety.
> If you can ensure that this race is not possible, please let me know,
> and I'll happily just set the flag. I'll repost a new patch soon.

I tend to agree with Neil here. Marking optimistic addresses as deprecated doesn't buy as much since the address can transition in and out of deprecated state regardless of DAD.

However, there is a problem with the current implementation in that an OPTIMISTIC address will never be chosen as source because it's always TENTATIVE and OPTIMISTIC at the same time. What needs to happen is for ipv6_dev_get_saddr() to not ignore OPTIMISTIC addresses and to treat them the same as DEPRECATED.

-vlad
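The two checks being discussed can be written down as small predicates over the address flags. The flag values below are the ones that appear in the patch in this thread; the helper names `saddr_is_candidate()` and `saddr_is_preferred()` are hypothetical, and the real ipv6_dev_get_saddr() scoring also takes the address type into account, which this sketch leaves out.

```c
/* Flag values as they appear in the if_addr.h hunk of the patch. */
#define IFA_F_OPTIMISTIC 0x04
#define IFA_F_DEPRECATED 0x20
#define IFA_F_TENTATIVE  0x40

/* A tentative address is skipped as a source candidate, unless it is
 * also optimistic. */
static int saddr_is_candidate(unsigned int flags)
{
    if ((flags & IFA_F_TENTATIVE) && !(flags & IFA_F_OPTIMISTIC))
        return 0;
    return 1;
}

/* For the "avoid deprecated" rule, optimistic addresses are treated the
 * same as deprecated ones, without ever setting IFA_F_DEPRECATED on them. */
static int saddr_is_preferred(unsigned int flags)
{
    return !(flags & (IFA_F_DEPRECATED | IFA_F_OPTIMISTIC));
}
```

Testing both flags at selection time, rather than setting DEPRECATED alongside OPTIMISTIC, avoids having DAD completion clear a DEPRECATED flag that the lifetime rules set independently.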
Re: [PATCH] Fix sorting of SACK blocks
On Thu, 25 Jan 2007 20:29:03 +0200
Baruch Even [EMAIL PROTECTED] wrote:

> The sorting of SACK blocks actually munges them rather than sorting
> them, causing the TCP stack to ignore some SACK information and
> breaking the assumption of ordered SACK blocks after sorting.
>
> The sort takes the data from a second buffer which isn't moved, causing
> subsequent data moves to occur from the wrong location. The fix is to
> use a temporary buffer as a normal sort does.
>
> Signed-Off-By: Baruch Even [EMAIL PROTECTED]
>
> diff -X 2.6-rc6/Documentation/dontdiff -ur 2.6-rc6/net/ipv4/tcp_input.c 2.6-mod/net/ipv4/tcp_input.c
> --- 2.6-rc6/net/ipv4/tcp_input.c	2007-01-25 19:04:20.0 +0200
> +++ 2.6-mod/net/ipv4/tcp_input.c	2007-01-25 19:52:04.0 +0200
> @@ -1011,10 +1011,11 @@
>  			for (j = 0; j < i; j++){
>  				if (after(ntohl(sp[j].start_seq),
>  					  ntohl(sp[j+1].start_seq))){
> -					sp[j].start_seq = htonl(tp->recv_sack_cache[j+1].start_seq);
> -					sp[j].end_seq = htonl(tp->recv_sack_cache[j+1].end_seq);
> -					sp[j+1].start_seq = htonl(tp->recv_sack_cache[j].start_seq);
> -					sp[j+1].end_seq = htonl(tp->recv_sack_cache[j].end_seq);
> +					struct tcp_sack_block_wire tmp;
> +
> +					tmp = sp[j];
> +					sp[j] = sp[j+1];
> +					sp[j+1] = tmp;
>  				}
>  
>  			}

This looks okay, but is there a test case that can be run?

--
Stephen Hemminger [EMAIL PROTECTED]
[PATCH] Fix sorting of SACK blocks
The sorting of SACK blocks actually munges them rather than sorting them, causing the TCP stack to ignore some SACK information and breaking the assumption of ordered SACK blocks after sorting.

The sort takes the data from a second buffer which isn't moved, causing subsequent data moves to occur from the wrong location. The fix is to use a temporary buffer as a normal sort does.

Signed-Off-By: Baruch Even [EMAIL PROTECTED]

diff -X 2.6-rc6/Documentation/dontdiff -ur 2.6-rc6/net/ipv4/tcp_input.c 2.6-mod/net/ipv4/tcp_input.c
--- 2.6-rc6/net/ipv4/tcp_input.c	2007-01-25 19:04:20.0 +0200
+++ 2.6-mod/net/ipv4/tcp_input.c	2007-01-25 19:52:04.0 +0200
@@ -1011,10 +1011,11 @@
 			for (j = 0; j < i; j++){
 				if (after(ntohl(sp[j].start_seq),
 					  ntohl(sp[j+1].start_seq))){
-					sp[j].start_seq = htonl(tp->recv_sack_cache[j+1].start_seq);
-					sp[j].end_seq = htonl(tp->recv_sack_cache[j+1].end_seq);
-					sp[j+1].start_seq = htonl(tp->recv_sack_cache[j].start_seq);
-					sp[j+1].end_seq = htonl(tp->recv_sack_cache[j].end_seq);
+					struct tcp_sack_block_wire tmp;
+
+					tmp = sp[j];
+					sp[j] = sp[j+1];
+					sp[j+1] = tmp;
 				}
 
 			}
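As a standalone illustration of the fixed loop, the following sketch performs the same bubble sort with a temporary for the swap. It assumes host-byte-order sequence numbers (the kernel code works on wire-order blocks and converts with ntohl()), and `struct sack_block`, `seq_after()` and `sort_sack_blocks()` are hypothetical names, not the kernel's.

```c
#include <stdint.h>

/* Hypothetical host-order SACK block. */
struct sack_block {
    uint32_t start_seq;
    uint32_t end_seq;
};

/* Wrap-safe "a is after b" comparison on 32-bit sequence numbers. */
static int seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Bubble sort by start_seq, swapping whole blocks through a temporary. */
static void sort_sack_blocks(struct sack_block *sp, int n)
{
    for (int i = n - 1; i > 0; i--) {
        for (int j = 0; j < i; j++) {
            if (seq_after(sp[j].start_seq, sp[j + 1].start_seq)) {
                struct sack_block tmp = sp[j];

                sp[j] = sp[j + 1];
                sp[j + 1] = tmp;
            }
        }
    }
}
```

Because each swap reads from the array being sorted, every block survives the sort with its start and end kept together.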
Re: [PATCH] Fix sorting of SACK blocks
* Stephen Hemminger [EMAIL PROTECTED] [070125 20:47]:
> On Thu, 25 Jan 2007 20:29:03 +0200
> Baruch Even [EMAIL PROTECTED] wrote:
>
> > The sorting of SACK blocks actually munges them rather than sorting
> > them, causing the TCP stack to ignore some SACK information and
> > breaking the assumption of ordered SACK blocks after sorting.
> >
> > The sort takes the data from a second buffer which isn't moved,
> > causing subsequent data moves to occur from the wrong location. The
> > fix is to use a temporary buffer as a normal sort does.
> >
> > Signed-Off-By: Baruch Even [EMAIL PROTECTED]
> >
> > diff -X 2.6-rc6/Documentation/dontdiff -ur 2.6-rc6/net/ipv4/tcp_input.c 2.6-mod/net/ipv4/tcp_input.c
> > --- 2.6-rc6/net/ipv4/tcp_input.c	2007-01-25 19:04:20.0 +0200
> > +++ 2.6-mod/net/ipv4/tcp_input.c	2007-01-25 19:52:04.0 +0200
> > @@ -1011,10 +1011,11 @@
> >  			for (j = 0; j < i; j++){
> >  				if (after(ntohl(sp[j].start_seq),
> >  					  ntohl(sp[j+1].start_seq))){
> > -					sp[j].start_seq = htonl(tp->recv_sack_cache[j+1].start_seq);
> > -					sp[j].end_seq = htonl(tp->recv_sack_cache[j+1].end_seq);
> > -					sp[j+1].start_seq = htonl(tp->recv_sack_cache[j].start_seq);
> > -					sp[j+1].end_seq = htonl(tp->recv_sack_cache[j].end_seq);
> > +					struct tcp_sack_block_wire tmp;
> > +
> > +					tmp = sp[j];
> > +					sp[j] = sp[j+1];
> > +					sp[j+1] = tmp;
> >  				}
> >  
> >  			}
>
> This looks okay, but is there a test case that can be run?

There is nothing visible that shows the problem; the only option is to add some code to print the SACK blocks after sorting and run it over a large BDP connection that can be saturated. You'll obviously need to have several holes. I believe that the bug will be visible when you have ACK packets with three SACK blocks where the first block is the highest, which should be the normal case.

Cheers,
Baruch
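A userspace model can show the failure without a saturated link. The sketch below assumes the side buffer holds an unsorted copy of the same blocks and is never updated during the sort, which is how the patch description characterises the old code. `sort_from_stale_copy()`, `is_sorted()` and the host-order `struct sack_block` are hypothetical, for illustration only.

```c
#include <stdint.h>

/* Hypothetical host-order SACK block. */
struct sack_block {
    uint32_t start_seq;
    uint32_t end_seq;
};

/* Wrap-safe "a is after b" comparison on 32-bit sequence numbers. */
static int seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Pre-patch shape: "swap" by copying from a side buffer that still holds
 * the original, unsorted order. */
static void sort_from_stale_copy(struct sack_block *sp,
                                 const struct sack_block *cache, int n)
{
    for (int i = n - 1; i > 0; i--) {
        for (int j = 0; j < i; j++) {
            if (seq_after(sp[j].start_seq, sp[j + 1].start_seq)) {
                sp[j] = cache[j + 1];
                sp[j + 1] = cache[j];
            }
        }
    }
}

/* Returns 1 if the blocks are in ascending start_seq order. */
static int is_sorted(const struct sack_block *sp, int n)
{
    for (int j = 0; j + 1 < n; j++)
        if (seq_after(sp[j].start_seq, sp[j + 1].start_seq))
            return 0;
    return 1;
}
```

With three blocks starting at 5000, 1000 and 3000 (highest first, as in the case described above), the second "swap" copies from positions in the stale buffer that no longer match the array, so the 5000 block is lost and the 1000 block appears twice.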
Possible bugs in SACK processing
In addition to the patch I've provided, there are two more issues that I believe are bugs in the SACK processing code. I'm not certain, and I don't have the time to look into them, so I'd like to raise them for other folks to look at.

The first issue is the check for whether the fast path applies. The sack blocks are compared directly, but there is no comparison of the number of sack blocks. If in the former sack we had two blocks and now we have three, we will compare the third sack block from now against old or uninitialised data. The chance of anything really bad happening might not be high, but it seems to be bad behaviour.

The second issue is that there is no check that the fast path is actually behind the hint. Consider a scenario where we have three sack blocks and the first sack update is about an old location. Then comes another sack packet with only an update to the old location. The result will be that after the former sack block the hint is in the latest location it can be, and when the next sack packet arrives we detect it's an increase only, but the fast path hint is too far and we do no updating at all.

Baruch
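The first issue can be sketched as a guard on the block count before any per-block comparison. This is a hypothetical model of a fast-path test, not the tcp_input.c code: `struct sack_cache`, `fastpath_ok()` and the rule that only the first block's end may change are assumptions made for the illustration.

```c
#include <stdint.h>

#define MAX_SACKS 4

/* Hypothetical host-order SACK block and cache of the previous ACK's blocks. */
struct sack_block {
    uint32_t start_seq;
    uint32_t end_seq;
};

struct sack_cache {
    int num;
    struct sack_block blk[MAX_SACKS];
};

/* Returns 1 if the new blocks differ from the cached ones only in the end
 * of the first block. The block-count comparison comes first, so entries
 * beyond old->num (stale or uninitialised) are never read. */
static int fastpath_ok(const struct sack_cache *old,
                       const struct sack_block *now, int num_now)
{
    if (num_now != old->num)
        return 0;
    for (int i = 0; i < num_now; i++) {
        if (now[i].start_seq != old->blk[i].start_seq)
            return 0;
        if (i > 0 && now[i].end_seq != old->blk[i].end_seq)
            return 0;
    }
    return 1;
}
```

Without the count check, going from two blocks to three would compare the third new block against whatever happens to be in the unused cache slot.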
Re: [PATCH RFC 31/31] net: Add etun driver
Eric W. Biederman wrote:
> From: Eric W. Biederman [EMAIL PROTECTED] - unquoted
>
> etun is a simple two headed tunnel driver that at the link layer
> looks like ethernet. Its target audience is communicating
> between network namespaces but it is general enough it may have
> other uses as well.

This looks almost identical to my redir-dev module. Which is fine..I don't really care which gets into the kernel so long as one of them does...

Comments and questions are inline below.

> +/*
> + * The higher levels take care of making this non-reentrant (it's
> + * called with bh's disabled).
> + */
> +static int etun_xmit(struct sk_buff *skb, struct net_device *tx_dev)
> +{
> +	struct etun_info *tx_info = tx_dev->priv;
> +	struct net_device *rx_dev = tx_info->rx_dev;
> +	struct etun_info *rx_info = rx_dev->priv;
> +
> +	tx_info->stats.tx_packets++;
> +	tx_info->stats.tx_bytes += skb->len;
> +
> +	/* Drop the skb state that was needed to get here */
> +	skb_orphan(skb);
> +	if (skb->dst)
> +		skb->dst = dst_pop(skb->dst);	/* Allow for smart routing */

I ended up setting dst to NULL. What does the dst_pop() accomplish?

> +
> +	/* Switch to the receiving device */
> +	skb->pkt_type = PACKET_HOST;
> +	skb->protocol = eth_type_trans(skb, rx_dev);
> +	skb->dev = rx_dev;
> +	skb->ip_summed = CHECKSUM_NONE;
> +
> +	/* If both halves agree no checksum is needed */
> +	if (tx_dev->features & NETIF_F_NO_CSUM)
> +		skb->ip_summed = rx_info->ip_summed;
> +
> +	rx_dev->last_rx = jiffies;

Do you need to set tx_dev->trans_start to jiffies as well?

> +	rx_info->stats.rx_packets++;
> +	rx_info->stats.rx_bytes += skb->len;

I think you need to zero out the skb->tstamp as well. This lets it be re-calculated when the receive logic of the other device is called. Otherwise this fails: rx skb on eth1, delay skb for network emulation, bridge onto etun0, rx on etun1 (time-stamp is still what it was when rx'd on eth1, which is too old.)

> +	netif_rx(skb);
> +
> +	return 0;
> +}
> +
> +static int etun_open(struct net_device *tx_dev)
> +{
> +	struct etun_info *tx_info = tx_dev->priv;
> +	struct net_device *rx_dev = tx_info->rx_dev;
> +	if (rx_dev->flags & IFF_UP) {
> +		netif_carrier_on(tx_dev);
> +		netif_carrier_on(rx_dev);
> +	}
> +	netif_start_queue(tx_dev);

Does this carrier logic keep etun0 from transmitting to etun1 if etun0 is UP but etun1 is not UP yet?

> +	return 0;
> +}
> +
> +static int etun_stop(struct net_device *tx_dev)
> +{
> +	struct etun_info *tx_info = tx_dev->priv;
> +	struct net_device *rx_dev = tx_info->rx_dev;
> +	netif_stop_queue(tx_dev);
> +	if (netif_carrier_ok(tx_dev)) {
> +		netif_carrier_off(tx_dev);
> +		netif_carrier_off(rx_dev);
> +	}
> +	return 0;
> +}
> +
> +static void etun_set_multicast_list(struct net_device *dev)
> +{
> +	/* Nothing sane I can do here */
> +	return;
> +}
> +
> +static int etun_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +/* Only allow letters and numbers in an etun device name */
> +static int is_valid_name(const char *name)
> +{
> +	const char *ptr;
> +	for (ptr = name; *ptr; ptr++) {
> +		if (!isalnum(*ptr))
> +			return 0;
> +	}
> +	return 1;
> +}
> +
> +static struct net_device *etun_alloc(net_t net, const char *name)
> +{
> +	struct net_device *dev;
> +	struct etun_info *info;
> +	int err;
> +
> +	if (!name || !is_valid_name(name))
> +		return ERR_PTR(-EINVAL);
> +
> +	dev = alloc_netdev(sizeof(struct etun_info), name, ether_setup);
> +	if (!dev)
> +		return ERR_PTR(-ENOMEM);
> +
> +	info = dev->priv;
> +	info->dev = dev;
> +	dev->nd_net = net;
> +
> +	random_ether_addr(dev->dev_addr);
> +	dev->tx_queue_len	= 0; /* A queue is silly for a loopback device */
> +	dev->hard_start_xmit	= etun_xmit;
> +	dev->get_stats		= etun_get_stats;
> +	dev->open		= etun_open;
> +	dev->stop		= etun_stop;
> +	dev->set_multicast_list	= etun_set_multicast_list;
> +	dev->do_ioctl		= etun_ioctl;
> +	dev->features		= NETIF_F_FRAGLIST
> +				  | NETIF_F_HIGHDMA
> +				  | NETIF_F_LLTX;
> +	dev->flags		= IFF_BROADCAST | IFF_MULTICAST |IFF_PROMISC;
> +	dev->ethtool_ops	= &etun_ethtool_ops;
> +	dev->destructor		= free_netdev;

You should add ability to change MTU. I believe it is as trivial as this:

int redirdev_change_mtu(struct net_device *dev, int new_mtu) {
	dev->mtu = new_mtu;
	return 0;
}

> +	err = register_netdev(dev);
> +	if (err) {
> +		free_netdev(dev);
> +		dev = ERR_PTR(err);
> +		goto out;
> +	}
> +	netif_carrier_off(dev);
> +out:
> +	return dev;
> +}
> +
Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection
On Thu, Jan 25, 2007 at 12:16:59PM -0500, Vlad Yasevich wrote:
> <snip>
> I tend to agree with Neil here. Marking optimistic addresses as
> deprecated doesn't buy as much since the address can transition in and
> out of deprecated state regardless of DAD.
>
> However, there is a problem with the current implementation in that
> OPTIMISTIC address will never be chosen as source because it's always
> TENTATIVE and OPTIMISTIC at the same time. What needs to happen is for
> ipv6_dev_get_saddr() to not ignore OPTIMISTIC addresses and treat them
> same as DEPRECATED.
>
> -vlad

Here's an updated patch. Same as the previous patch but it adds three modifications to ipv6_dev_get_saddr, which do the following:

a) Adds logic to not remove addresses that are both tentative and optimistic from the set of considered addresses

b) Treats optimistic addresses and deprecated addresses in the same fashion by checking for both flags appropriately during source address selection.

Thoughts welcome.

Thanks & Regards
Neil

Signed-off-by: Neil Horman [EMAIL PROTECTED]

 include/linux/if_addr.h |    1
 include/linux/ipv6.h    |    2 +
 include/linux/sysctl.h  |    1
 include/net/addrconf.h  |    4 +-
 net/ipv6/addrconf.c     |   69
 net/ipv6/mcast.c        |    4 +-
 net/ipv6/ndisc.c        |   82 +++-
 7 files changed, 125 insertions(+), 38 deletions(-)

diff --git a/include/linux/if_addr.h b/include/linux/if_addr.h
index d557e4c..43f3bed 100644
--- a/include/linux/if_addr.h
+++ b/include/linux/if_addr.h
@@ -39,6 +39,7 @@ enum
 #define IFA_F_TEMPORARY	IFA_F_SECONDARY
 
 #define	IFA_F_NODAD		0x02
+#define IFA_F_OPTIMISTIC	0x04
 #define	IFA_F_HOMEADDRESS	0x10
 #define IFA_F_DEPRECATED	0x20
 #define IFA_F_TENTATIVE		0x40
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index f824113..5d37abf 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -177,6 +177,7 @@ struct ipv6_devconf {
 #endif
 #endif
 	__s32		proxy_ndp;
+	__s32		optimistic_dad;
 	void		*sysctl;
 };
 
@@ -205,6 +206,7 @@ enum {
 	DEVCONF_RTR_PROBE_INTERVAL,
 	DEVCONF_ACCEPT_RA_RT_INFO_MAX_PLEN,
 	DEVCONF_PROXY_NDP,
+	DEVCONF_OPTIMISTIC_DAD,
 	DEVCONF_MAX
 };
 
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 81480e6..972a33a 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -570,6 +570,7 @@ enum {
 	NET_IPV6_RTR_PROBE_INTERVAL=21,
 	NET_IPV6_ACCEPT_RA_RT_INFO_MAX_PLEN=22,
 	NET_IPV6_PROXY_NDP=23,
+	NET_IPV6_OPTIMISTIC_DAD=24,
 	__NET_IPV6_MAX
 };
 
diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 88df8fc..d248a19 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -73,7 +73,9 @@ extern int			ipv6_get_saddr(struct dst_entry *dst,
 extern int			ipv6_dev_get_saddr(struct net_device *dev,
 						struct in6_addr *daddr,
 						struct in6_addr *saddr);
-extern int			ipv6_get_lladdr(struct net_device *dev, struct in6_addr *);
+extern int			ipv6_get_lladdr(struct net_device *dev,
+						struct in6_addr *,
+						unsigned char banned_flags);
 extern int			ipv6_rcv_saddr_equal(const struct sock *sk,
 						      const struct sock *sk2);
 extern void			addrconf_join_solict(struct net_device *dev,
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 2a7e461..46f91ee 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -830,7 +830,8 @@ retry:
 	ift = !max_addresses ||
 	      ipv6_count_addresses(idev) < max_addresses ?
 		ipv6_add_addr(idev, &addr, tmp_plen,
-			      ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK, IFA_F_TEMPORARY) : NULL;
+			      ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK,
+			      IFA_F_TEMPORARY|IFA_F_OPTIMISTIC) : NULL;
 	if (!ift || IS_ERR(ift)) {
 		in6_ifa_put(ifp);
 		in6_dev_put(idev);
@@ -962,13 +963,14 @@ int ipv6_dev_get_saddr(struct net_device *daddr_dev,
 			 * - Tentative Address (RFC2462 section 5.4)
 			 *  - A tentative address is not considered
 			 *    assigned to an interface in the traditional
-			 *    sense.
+			 *    sense, unless it is also flagged as optimistic.
 			 * - Candidate Source Address (section 4)
 			 *  - In any case, anycast addresses, multicast
 			 *    addresses, and the
Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection
Hi Neil

> @@ -1027,15 +1029,17 @@ int ipv6_dev_get_saddr(struct net_device *daddr_dev,
>  				}
>  			}
>  
> -			/* Rule 3: Avoid deprecated address */
> +			/* Rule 3: Avoid deprecated and optimistic address */
>  			if (hiscore.rule < 3) {
>  				if (ipv6_saddr_preferred(hiscore.addr_type) ||
> -				    !(ifa_result->flags & IFA_F_DEPRECATED))
> +				    ((!(ifa_result->flags & IFA_F_DEPRECATED)) &&
> +				    (!(ifa_result->flags & IFA_F_OPTIMISTIC))))

One style comment. Looks like some extra parentheses that I don't think are needed. I think you can say

+				    (!(ifa_result->flags & IFA_F_DEPRECATED)) &&
+				     !(ifa_result->flags & IFA_F_OPTIMISTIC))

>  					hiscore.attrs |= IPV6_SADDR_SCORE_PREFERRED;
>  				hiscore.rule++;
>  			}
>  			if (ipv6_saddr_preferred(score.addr_type) ||
> -			    !(ifa->flags & IFA_F_DEPRECATED)) {
> +			    ((!(ifa->flags & IFA_F_DEPRECATED)) &&
> +			    (!(ifa_result->flags & IFA_F_OPTIMISTIC)))) {

same here.

>  				score.attrs |= IPV6_SADDR_SCORE_PREFERRED;
>  				if (!(hiscore.attrs & IPV6_SADDR_SCORE_PREFERRED)) {
>  					score.rule = 3;

-vlad
Re: [PATCH RFC 31/31] net: Add etun driver
Ben Greear [EMAIL PROTECTED] writes:

> Eric W. Biederman wrote:
> > From: Eric W. Biederman [EMAIL PROTECTED] - unquoted
> >
> > etun is a simple two headed tunnel driver that at the link layer
> > looks like ethernet. Its target audience is communicating
> > between network namespaces but it is general enough it may have
> > other uses as well.
>
> This looks almost identical to my redir-dev module. Which is fine..I
> don't really care which gets into the kernel so long as one of them
> does...
>
> Comments and questions are inline below.

If it is, I don't really care much either.

> > +/*
> > + * The higher levels take care of making this non-reentrant (it's
> > + * called with bh's disabled).
> > + */
> > +static int etun_xmit(struct sk_buff *skb, struct net_device *tx_dev)
> > +{
> > +	struct etun_info *tx_info = tx_dev->priv;
> > +	struct net_device *rx_dev = tx_info->rx_dev;
> > +	struct etun_info *rx_info = rx_dev->priv;
> > +
> > +	tx_info->stats.tx_packets++;
> > +	tx_info->stats.tx_bytes += skb->len;
> > +
> > +	/* Drop the skb state that was needed to get here */
> > +	skb_orphan(skb);
> > +	if (skb->dst)
> > +		skb->dst = dst_pop(skb->dst);	/* Allow for smart routing */
>
> I ended up setting dst to NULL. What does the dst_pop() accomplish?

It allows an ambitious routing program to realize all of the routing is on one machine and compute a route through multiple network stack traversals. I don't know if it ever makes sense to really use that, but since in the normal case this just sets dst to NULL, I figured I would leave it in, in case that ever looks useful.

> > +
> > +	/* Switch to the receiving device */
> > +	skb->pkt_type = PACKET_HOST;
> > +	skb->protocol = eth_type_trans(skb, rx_dev);
> > +	skb->dev = rx_dev;
> > +	skb->ip_summed = CHECKSUM_NONE;
> > +
> > +	/* If both halves agree no checksum is needed */
> > +	if (tx_dev->features & NETIF_F_NO_CSUM)
> > +		skb->ip_summed = rx_info->ip_summed;
> > +
> > +	rx_dev->last_rx = jiffies;
>
> Do you need to set tx_dev->trans_start to jiffies as well?

Could be. I haven't had any problems with it but I may have missed a trick or two.

> > +	rx_info->stats.rx_packets++;
> > +	rx_info->stats.rx_bytes += skb->len;
>
> I think you need to zero out the skb->tstamp as well. This lets it be
> re-calculated when the receive logic of the other device is called.
> Otherwise this fails: rx skb on eth1, delay skb for network emulation,
> bridge onto etun0, rx on etun1 (time-stamp is still what it was when
> rx'd on eth1, which is too old.)

Quite possibly. I wouldn't be at all surprised if I missed something like that.

> > +static int etun_open(struct net_device *tx_dev)
> > +{
> > +	struct etun_info *tx_info = tx_dev->priv;
> > +	struct net_device *rx_dev = tx_info->rx_dev;
> > +	if (rx_dev->flags & IFF_UP) {
> > +		netif_carrier_on(tx_dev);
> > +		netif_carrier_on(rx_dev);
> > +	}
> > +	netif_start_queue(tx_dev);
>
> Does this carrier logic keep etun0 from transmitting to etun1 if etun0
> is UP but etun1 is not UP yet?

A little bit. It also allows user space to see that there really is not a connection. I think I was just having fun when I implemented that bit.

> > +
> > +	random_ether_addr(dev->dev_addr);
> > +	dev->tx_queue_len	= 0; /* A queue is silly for a loopback device */
> > +	dev->hard_start_xmit	= etun_xmit;
> > +	dev->get_stats		= etun_get_stats;
> > +	dev->open		= etun_open;
> > +	dev->stop		= etun_stop;
> > +	dev->set_multicast_list	= etun_set_multicast_list;
> > +	dev->do_ioctl		= etun_ioctl;
> > +	dev->features		= NETIF_F_FRAGLIST
> > +				  | NETIF_F_HIGHDMA
> > +				  | NETIF_F_LLTX;
> > +	dev->flags		= IFF_BROADCAST | IFF_MULTICAST |IFF_PROMISC;
> > +	dev->ethtool_ops	= &etun_ethtool_ops;
> > +	dev->destructor		= free_netdev;
>
> You should add ability to change MTU. I believe it is as trivial as this:
>
> int redirdev_change_mtu(struct net_device *dev, int new_mtu) {
> 	dev->mtu = new_mtu;
> 	return 0;
> }

It should be. If I missed that it was an oversight.

> > +	dev_hold(dev0);
> > +	dev_hold(dev1);
> > +	info0->rx_dev = dev1;
> > +	info1->rx_dev = dev0;
>
> Can this race such that someone could manage to tx on one of these
> devices before you assign the rx_dev? Maybe register-netdev after this
> assignment here, instead of in the alloc_etun method above?

Good paranoid thought.

> > +
> > +	/* Only place one member of the pair on the list
> > +	 * so I don't confuse list_for_each_entry_safe,
> > +	 * by deleting two list entries at once.
> > +	 */
> > +	rtnl_lock();
> > +	list_add(&info0->list, &etun_list);
> > +	INIT_LIST_HEAD(&info1->list);
> > +	rtnl_unlock();
> > +
> > +	return 0;
> > +}
> > +
> > +static int etun_unregister_pair(struct net_device *dev0)
> > +{
> > +	struct etun_info *info0, *info1;
> > +	struct net_device *dev1;
> > +
> > +	ASSERT_RTNL();
> > +
> > +	if (!dev0)
> > +		return -ENODEV;
> > +
> > +	info0 = dev0->priv;
> > +	dev1 = info0->rx_dev;
> > +	info1 =
Re: [PATCH RFC 2/31] net: Implement a place holder network namespace
Stephen Hemminger [EMAIL PROTECTED] writes:

>> +
>> +#define __per_net_start ((char *)0)
>> +#define __per_net_end ((char *)0)
>
> Don't use these, use NULL

NULL has the wrong data type. These are compiled-out character arrays that are normally generated by the linker script. I'm not even certain I need the above, but it allows for compile-time rather than link-time optimization, so it is probably better that way.

The fact that these happen to be equal to NULL is their least interesting property. The fact that you can subtract them and get 0 is much more interesting.

>> +
>> +static inline int copy_net(int flags, struct task_struct *tsk) { return 0; }
>> +
>> +/* Don't let the list of network namespaces change */
>> +static inline void net_lock(void) {}
>> +static inline void net_unlock(void) {}
>
> Don't make all one line, or use #define instead.

Why?

Anyway I appreciate the picking of the nits, and it should lead to better code. I guess this implies you are in favor of the general idea of where this is going?

Eric
Re: owner-Match in 2.6.20-rc5 (fwd)
From: Jozsef Kadlecsik [EMAIL PROTECTED] Date: Thu, 25 Jan 2007 21:31:56 +0100 (CET) The report below was posted on the netfilter user list. Isn't there any ill side effect by reverting the change? Performance regression :-( This optimization saves a whole handful of heavy atomic operations in the packet transmit path of TCP. As I understand it, the owner-Match is not in the upstream tree, and it's the only thing that cares, so I see no reason to cater for it.
Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection
On Thu, Jan 25, 2007 at 03:18:59PM -0500, Vlad Yasevich wrote: Hi Neil @@ -1027,15 +1029,17 @@ int ipv6_dev_get_saddr(struct net_device *daddr_dev, } } - /* Rule 3: Avoid deprecated address */ + /* Rule 3: Avoid deprecated and optimistic address */ if (hiscore.rule 3) { if (ipv6_saddr_preferred(hiscore.addr_type) || - !(ifa_result-flags IFA_F_DEPRECATED)) + ((!(ifa_result-flags IFA_F_DEPRECATED)) + (!(ifa_result-flags IFA_F_OPTIMISTIC One style comment. Looks like some extra parenthesis that I don't thing are needed. I think you can say + (!(ifa_result-flags IFA_F_DEPRECATED)) + !(ifa_result-flags IFA_F_OPTIMISTIC hiscore.attrs |= IPV6_SADDR_SCORE_PREFERRED; hiscore.rule++; } if (ipv6_saddr_preferred(score.addr_type) || - !(ifa-flags IFA_F_DEPRECATED)) { + ((!(ifa-flags IFA_F_DEPRECATED)) + (!(ifa_result-flags IFA_F_OPTIMISTIC { same here. score.attrs |= IPV6_SADDR_SCORE_PREFERRED; if (!(hiscore.attrs IPV6_SADDR_SCORE_PREFERRED)) { score.rule = 3; -vlad I prefer to be more explicit in my order of operation, but that does seem more consistent with the prevaling style. New patch attached. 
Thanks Regards Neil Signed-off-by: Neil Horman [EMAIL PROTECTED] include/linux/if_addr.h |1 include/linux/ipv6.h|2 + include/linux/sysctl.h |1 include/net/addrconf.h |4 +- net/ipv6/addrconf.c | 69 net/ipv6/mcast.c|4 +- net/ipv6/ndisc.c| 82 +++- 7 files changed, 125 insertions(+), 38 deletions(-) diff --git a/include/linux/if_addr.h b/include/linux/if_addr.h index d557e4c..43f3bed 100644 --- a/include/linux/if_addr.h +++ b/include/linux/if_addr.h @@ -39,6 +39,7 @@ enum #define IFA_F_TEMPORARYIFA_F_SECONDARY #defineIFA_F_NODAD 0x02 +#define IFA_F_OPTIMISTIC 0x04 #defineIFA_F_HOMEADDRESS 0x10 #define IFA_F_DEPRECATED 0x20 #define IFA_F_TENTATIVE0x40 diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h index f824113..5d37abf 100644 --- a/include/linux/ipv6.h +++ b/include/linux/ipv6.h @@ -177,6 +177,7 @@ struct ipv6_devconf { #endif #endif __s32 proxy_ndp; + __s32 optimistic_dad; void*sysctl; }; @@ -205,6 +206,7 @@ enum { DEVCONF_RTR_PROBE_INTERVAL, DEVCONF_ACCEPT_RA_RT_INFO_MAX_PLEN, DEVCONF_PROXY_NDP, + DEVCONF_OPTIMISTIC_DAD, DEVCONF_MAX }; diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h index 81480e6..972a33a 100644 --- a/include/linux/sysctl.h +++ b/include/linux/sysctl.h @@ -570,6 +570,7 @@ enum { NET_IPV6_RTR_PROBE_INTERVAL=21, NET_IPV6_ACCEPT_RA_RT_INFO_MAX_PLEN=22, NET_IPV6_PROXY_NDP=23, + NET_IPV6_OPTIMISTIC_DAD=24, __NET_IPV6_MAX }; diff --git a/include/net/addrconf.h b/include/net/addrconf.h index 88df8fc..d248a19 100644 --- a/include/net/addrconf.h +++ b/include/net/addrconf.h @@ -73,7 +73,9 @@ extern intipv6_get_saddr(struct dst_entry *dst, extern int ipv6_dev_get_saddr(struct net_device *dev, struct in6_addr *daddr, struct in6_addr *saddr); -extern int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *); +extern int ipv6_get_lladdr(struct net_device *dev, + struct in6_addr *, + unsigned char banned_flags); extern int ipv6_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2); extern voidaddrconf_join_solict(struct 
net_device *dev, diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 2a7e461..057a260 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -830,7 +830,8 @@ retry: ift = !max_addresses || ipv6_count_addresses(idev) max_addresses ? ipv6_add_addr(idev, addr, tmp_plen, - ipv6_addr_type(addr)IPV6_ADDR_SCOPE_MASK, IFA_F_TEMPORARY) : NULL; + ipv6_addr_type(addr)IPV6_ADDR_SCOPE_MASK,
Re: [PATCH] Fix sorting of SACK blocks
From: Baruch Even [EMAIL PROTECTED] Date: Thu, 25 Jan 2007 20:29:03 +0200 The sorting of SACK blocks actually munges them rather than sort, causing the TCP stack to ignore some SACK information and breaking the assumption of ordered SACK blocks after sorting. The sort takes the data from a second buffer which isn't moved causing subsequent data moves to occur from the wrong location. The fix is to use a temporary buffer as a normal sort does. Signed-Off-By: Baruch Even [EMAIL PROTECTED] Thanks for finding this bug Baruch. It probably explains some weird TCP traces I've seen over the years :-) I'll review this and apply it later today, thanks again.
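To make the description of the fix concrete, here is a rough standalone sketch of an in-place sort that swaps whole SACK blocks through a temporary, as "a normal sort does". It is not the patch itself: the struct, the function names and the choice of bubble sort are assumptions made for illustration, and the comparison is a generic wrap-safe sequence-number test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative SACK block: a [start_seq, end_seq) range. */
struct sack_block {
    uint32_t start_seq;
    uint32_t end_seq;
};

/* Wrap-safe "a comes after b" test for 32-bit sequence numbers. */
static int seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Bubble sort by start_seq. Both elements are read from and written
 * to the same array, with a temporary holding the block in flight. */
void sort_sack_blocks(struct sack_block *sp, size_t n)
{
    for (size_t i = n; i > 1; i--) {
        for (size_t j = 0; j + 1 < i; j++) {
            if (seq_after(sp[j].start_seq, sp[j + 1].start_seq)) {
                struct sack_block tmp = sp[j];

                sp[j] = sp[j + 1];
                sp[j + 1] = tmp;
            }
        }
    }
}

void sack_demo(void)
{
    struct sack_block b[3] = { {300, 400}, {100, 200}, {200, 250} };

    sort_sack_blocks(b, 3);
    assert(b[0].start_seq == 100 && b[0].end_seq == 200);
    assert(b[1].start_seq == 200 && b[1].end_seq == 250);
    assert(b[2].start_seq == 300 && b[2].end_seq == 400);
}
```

The point of the sketch is only that source and destination of every move live in the array being sorted, so start and end of a block always travel together.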
Re: owner-Match in 2.6.20-rc5 (fwd)
From: Jan Engelhardt [EMAIL PROTECTED] Date: Thu, 25 Jan 2007 22:07:07 +0100 (MET) The report below was posted on the netfilter user list. Isn't there any ill side effect by reverting the change? Performance regression :-( This optimization saves a whole handful of heavy atomic operations in the packet transmit path of TCP. As I understand it, the owner-Match is not in the upstream tree, and it's the only thing that cares, so I see no reason to cater for it. For me, it's there. -rw-r--r-- 1 jengelh users 2247 Jan 25 21:37 /erk/kernel/linux-2.6.20-rc6/net/ipv4/netfilter/ipt_owner.c Ok, I'll see what I can do about this :-)
Re: owner-Match in 2.6.20-rc5 (fwd)
The report below was posted on the netfilter user list. Isn't there any ill side effect by reverting the change? Performance regression :-( This optimization saves a whole handful of heavy atomic operations in the packet transmit path of TCP. As I understand it, the owner-Match is not in the upstream tree, and it's the only thing that cares, so I see no reason to cater for it. For me, it's there. -rw-r--r-- 1 jengelh users 2247 Jan 25 21:37 /erk/kernel/linux-2.6.20-rc6/net/ipv4/netfilter/ipt_owner.c Ok, I'll see what I can do about this :-)

People really depend on this. Much more than the pid/comm/smpunsafe stuff. For example, a web server [cgi enabled, etc.] which also runs squid, to force all web traffic through it:

-A OUTPUT -p tcp --dport 80 -m owner ! --uid-owner squid -j REDIRECT --to-ports 3128

-`J' --
Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection
Hi Neil

I went through the RFC again and it seems like the following is missing:

Section 3.3:
* (modifies section 5.4.2) The host MUST join the all-nodes multicast address and the solicited-node multicast address of the Tentative address. The host SHOULD NOT delay before sending Neighbor Solicitation messages.

For this, addrconf_dad_kick() should pass 0 to addrconf_mod_timer when the address is optimistic. Otherwise we'll delay DAD, and some of the purpose of optimistic addresses is lost.

-vlad
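The suggestion above amounts to skipping the random start delay for optimistic addresses. A minimal sketch of that decision follows, assuming a helper that computes the delay handed to the DAD timer; the function name and its parameters are hypothetical, and only the flag value comes from the patch in this thread.

```c
#include <assert.h>

#define IFA_F_OPTIMISTIC 0x04 /* value from the patch in this thread */

/*
 * Hypothetical helper: choose the delay (in timer ticks) before the
 * first Neighbor Solicitation is sent. rand_val stands in for a random
 * number source and max_delay for the configured solicitation delay.
 */
unsigned long dad_initial_delay(unsigned int flags,
                                unsigned long rand_val,
                                unsigned long max_delay)
{
    /* RFC 4429 section 3.3: the host SHOULD NOT delay. */
    if (flags & IFA_F_OPTIMISTIC)
        return 0;

    /* Otherwise pick a random delay below max_delay (avoid % 0). */
    return rand_val % (max_delay ? max_delay : 1);
}

void dad_delay_demo(void)
{
    assert(dad_initial_delay(IFA_F_OPTIMISTIC, 137, 100) == 0);
    assert(dad_initial_delay(0, 137, 100) == 37);
    assert(dad_initial_delay(0, 5, 0) == 0);
}
```

In the real code the result would be passed on to the timer call; that wiring is left out here because it is not shown in the thread.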
[PATCH] RFC: Broadcom PHY forcing fix
Maciej, I've got a BCM5461 that requires this fix to be able to force the speeds on the PHY. Not sure if its needed on the other variants or not. The problem is the genphy_config_aneg resets the PHY when forcing the speed and once we reset the BCM5461 it doesn't remember any of its settings. Let me know if this works for you or not. - k diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c index 29666c8..bf752f4 100644 --- a/drivers/net/phy/broadcom.c +++ b/drivers/net/phy/broadcom.c @@ -99,6 +99,61 @@ static int bcm54xx_config_intr(struct ph return err; } +/* bcm_setup_forced + * + * description: Configures MII_BMCR to force speed/duplex + * to the values in phydev. Assumes that the values are valid. + * Please see phy_sanitize_settings() */ +static int bcm54xx_setup_forced(struct phy_device *phydev) +{ + int ctl = 0; + phydev-pause = phydev-asym_pause = 0; + + if (SPEED_100 == phydev-speed) + ctl |= BMCR_SPEED100; + + if (DUPLEX_FULL == phydev-duplex) + ctl |= BMCR_FULLDPLX; + + ctl = phy_write(phydev, MII_BMCR, ctl); + + if (ctl 0) + return ctl; + + return ctl; +} + +int bcm54xx_config_aneg(struct phy_device *phydev) +{ + int err = 0; + + if (AUTONEG_ENABLE == phydev-autoneg) { + err = genphy_config_advert(phydev); + + if (err 0) + return err; + + err = genphy_restart_aneg(phydev); + } else { + if (SPEED_1000 == phydev-speed) { + int adv; + adv = phy_read(phydev, MII_ADVERTISE); + adv = ~(ADVERTISE_ALL | ADVERTISE_100BASE4); + + err = phy_write(phydev, MII_ADVERTISE, adv); + + if (err 0) + return err; + + err = genphy_restart_aneg(phydev); + } else { + err = bcm54xx_setup_forced(phydev); + } + } + + return err; +} + static struct phy_driver bcm5411_driver = { .phy_id = 0x00206070, .phy_id_mask= 0xfff0, @@ -106,7 +161,7 @@ static struct phy_driver bcm5411_driver .features = PHY_GBIT_FEATURES, .flags = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT, .config_init= bcm54xx_config_init, - .config_aneg= genphy_config_aneg, + .config_aneg= bcm54xx_config_aneg, 
.read_status= genphy_read_status, .ack_interrupt = bcm54xx_ack_interrupt, .config_intr= bcm54xx_config_intr, @@ -120,7 +175,7 @@ static struct phy_driver bcm5421_driver .features = PHY_GBIT_FEATURES, .flags = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT, .config_init= bcm54xx_config_init, - .config_aneg= genphy_config_aneg, + .config_aneg= bcm54xx_config_aneg, .read_status= genphy_read_status, .ack_interrupt = bcm54xx_ack_interrupt, .config_intr= bcm54xx_config_intr, @@ -134,7 +189,7 @@ static struct phy_driver bcm5461_driver .features = PHY_GBIT_FEATURES, .flags = PHY_HAS_MAGICANEG | PHY_HAS_INTERRUPT, .config_init= bcm54xx_config_init, - .config_aneg= genphy_config_aneg, + .config_aneg= bcm54xx_config_aneg, .read_status= genphy_read_status, .ack_interrupt = bcm54xx_ack_interrupt, .config_intr= bcm54xx_config_intr, - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BNX2]: Fix 2nd port's MAC address.
From: Michael Chan [EMAIL PROTECTED] Date: Wed, 24 Jan 2007 21:35:45 -0800 [BNX2]: Fix 2nd port's MAC address. On the 5709, we need to add the proper offset to calculate the shared memory base address of the 2nd port correctly. Otherwise, the 2nd port's MAC address and other information will be the same as the 1st port. Update version to 1.5.4. Signed-off-by: Michael Chan [EMAIL PROTECTED] Applied, thanks Michael.
Re: [PATCH] net: decnet handle a failure in neigh_parms_alloc (take 2)
From: Steven Whitehouse [EMAIL PROTECTED] Date: Thu, 25 Jan 2007 11:43:18 + Hi, On Wed, Jan 24, 2007 at 09:55:45PM -0700, Eric W. Biederman wrote: While enhancing the neighbour code to handle multiple network namespaces I noticed that decnet is assuming neigh_parms_alloc will allways succeed, which is clearly wrong. So handle the failure. Signed-off-by: Eric W. Biederman [EMAIL PROTECTED] Acked-by: Steven Whitehouse [EMAIL PROTECTED] Applied, thanks everyone. Also you should cc Patrick as he is now the maintainer, Yep, would be a good idea in the future.
Re: [BUG] problem with BPF in PF_PACKET sockets, introduced in linux-2.6.19
From: Alexey Kuznetsov [EMAIL PROTECTED] Date: Thu, 25 Jan 2007 16:22:20 +0300 Actually, it can. Return value was used only as sign of error, so that the mistake was to return original unsigned result casted to int. Alternative fix is enclosed. To be honest, it is not better than yours: duplication of couple lines of code against passing return value by pointer. Yes, this version of a fix would work as well.
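The class of mistake described above (an unsigned result returned through an int whose sign is read as an error) can be shown in a few lines. This is only a sketch under stated assumptions: the function names are invented, the "filter" is a stub that echoes its argument, and the pointer-based variant illustrates the "passing return value by pointer" option mentioned in the message rather than either actual patch.

```c
#include <assert.h>

/* Hypothetical stand-in for running a socket filter: returns how many
 * bytes of the packet to keep (0 would mean drop). */
static unsigned int run_filter_stub(unsigned int verdict)
{
    return verdict;
}

/* Problematic shape: the unsigned verdict travels through an int,
 * and the caller treats negative values as errors. */
static int filter_as_int(unsigned int verdict)
{
    return (int)run_filter_stub(verdict);
}

/* One possible fix: status in the return value, verdict via pointer. */
static int filter_by_pointer(unsigned int verdict, unsigned int *keep)
{
    *keep = run_filter_stub(verdict);
    return 0;
}

void filter_demo(void)
{
    unsigned int keep = 0;

    /* A "keep everything" verdict looks like an error once cast
     * (on the usual two's-complement ABIs with 32-bit int). */
    assert(filter_as_int(0xffffffffu) < 0);

    /* With the verdict passed by pointer, no information is lost. */
    assert(filter_by_pointer(0xffffffffu, &keep) == 0);
    assert(keep == 0xffffffffu);
}
```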
Re: [PATCH] Fix sorting of SACK blocks
From: Baruch Even [EMAIL PROTECTED] Date: Thu, 25 Jan 2007 20:29:03 +0200 The sorting of SACK blocks actually munges them rather than sort, causing the TCP stack to ignore some SACK information and breaking the assumption of ordered SACK blocks after sorting. The sort takes the data from a second buffer which isn't moved causing subsequent data moves to occur from the wrong location. The fix is to use a temporary buffer as a normal sort does. Signed-Off-By: Baruch Even [EMAIL PROTECTED] BTW, in reviewing this I note that there is now only one remaining use of tp-recv_sack_cache[] and that is the code earlier in this function which is trying to detect if all we are doing is extending the leading edge of a SACK block. It would be nice to be able to clear out that usage as well, and remove recv_sack_cache[] and thus make tcp_sock smaller.
Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection
In article [EMAIL PROTECTED] (at Thu, 25 Jan 2007 14:45:00 -0500), Neil Horman [EMAIL PROTECTED] says: diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 2a7e461..46f91ee 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -830,7 +830,8 @@ retry: ift = !max_addresses || ipv6_count_addresses(idev) max_addresses ? ipv6_add_addr(idev, addr, tmp_plen, - ipv6_addr_type(addr)IPV6_ADDR_SCOPE_MASK, IFA_F_TEMPORARY) : NULL; + ipv6_addr_type(addr)IPV6_ADDR_SCOPE_MASK, + IFA_F_TEMPORARY|IFA_F_OPTIMISTIC) : NULL; if (!ift || IS_ERR(ift)) { in6_ifa_put(ifp); in6_dev_put(idev); If optimistic_dad is disabled, flags should be IFA_F_TEMPORARY, not IFA_F_TEMPORARY|IFA_F_OPTIMISTIC. Another idea is to use IFA_F_OPTIMISTIC not IFA_F_OPTIMISTIC|IFA_F_TENTATIVE until the DAD has been finished. @@ -1027,15 +1029,17 @@ int ipv6_dev_get_saddr(struct net_device *daddr_dev, : + /* Rule 3: Avoid deprecated and optimistic address */ if (hiscore.rule 3) { if (ipv6_saddr_preferred(hiscore.addr_type) || - !(ifa_result-flags IFA_F_DEPRECATED)) + ((!(ifa_result-flags IFA_F_DEPRECATED)) + (!(ifa_result-flags IFA_F_OPTIMISTIC hiscore.attrs |= IPV6_SADDR_SCORE_PREFERRED; hiscore.rule++; ((ifa_result-flags (IFA_F_DEPRECATED|IFA_F_OPTIMISTIC)) == 0) } if (ipv6_saddr_preferred(score.addr_type) || - !(ifa-flags IFA_F_DEPRECATED)) { + ((!(ifa-flags IFA_F_DEPRECATED)) + (!(ifa_result-flags IFA_F_OPTIMISTIC { score.attrs |= IPV6_SADDR_SCORE_PREFERRED; if (!(hiscore.attrs IPV6_SADDR_SCORE_PREFERRED)) { score.rule = 3; ditto. @@ -2123,7 +2133,8 @@ static void addrconf_add_linklocal(struct inet6_dev *idev, struct in6_addr *addr { struct inet6_ifaddr * ifp; - ifp = ipv6_add_addr(idev, addr, 64, IFA_LINK, IFA_F_PERMANENT); + ifp = ipv6_add_addr(idev, addr, 64, IFA_LINK, + IFA_F_PERMANENT|IFA_F_OPTIMISTIC); if (!IS_ERR(ifp)) { addrconf_dad_start(ifp, 0); in6_ifa_put(ifp); Please do not always put IFA_F_OPTIMISTIC. 
+ /* + * Optimistic nodes need to joing the anycast address + * right away + */ + if (ifp-flags IFA_F_OPTIMISTIC) + addrconf_join_anycast(ifp); + if (ifp-prefix_len != 128 (ifp-flagsIFA_F_PERMANENT)) addrconf_prefix_route(ifp-addr, ifp-prefix_len, dev, 0, flags); Should we join anycast even if the node is a host (not a router)?! When you add a call to addrconf_join_anycast(), you must consider when to leave this. @@ -2573,6 +2594,18 @@ static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags) addrconf_dad_stop(ifp); return; } + + /* + * Forwarding devices (routers) should not use + * optimistic addresses + * Nor should interfaces that don't know the + * Source address for their default gateway + * RFC 4429 Sec 3.3 + */ + if ((ipv6_devconf.forwarding) || +(ifp-rt == NULL)) + ifp-flags = ~IFA_F_OPTIMISTIC; + addrconf_dad_kick(ifp); spin_unlock_bh(ifp-lock); out: Please test this condition when you are adding the address. BTW, you have not implemented the later condition, right? Sefault gatewa is not tested. index 6a9f616..fcd22e3 100644 --- a/net/ipv6/ndisc.c +++ b/net/ipv6/ndisc.c @@ -498,7 +498,21 @@ static void ndisc_send_na(struct net_device *dev, struct neighbour *neigh, msg-icmph.icmp6_unused = 0; msg-icmph.icmp6_router= router; msg-icmph.icmp6_solicited = solicited; -msg-icmph.icmp6_override = override; + if (!ifp || !(ifp-flags IFA_F_OPTIMISTIC)) + msg-icmph.icmp6_override = override; + else { + /* + * We must clear the override flag on all + * neighbor advertisements from source + * addresses that are OPTIMISTIC - RFC 4429 + * section 2.2 + */ + if (override) + printk(KERN_WARNING + Disallowing override flag for OPTIMISTIC addr\n); + msg-icmph.icmp6_override = 0; + } + Ifp is already put. Please clear override in the code where we try getting
[PATCH] d80211: configure hardware when the interface is brought up
ieee80211_hw_config() is called from scanning functions and ioctl handlers, but not when the interface is brought up. This is unreasonable. Since the config function is provided by hardware drivers to d80211, the latter should be responsible for calling it in all situations when the hardware needs to be reconfigured.

Without this patch, bcm43xx_d80211 needs the channel to be set again after the interface goes down and up. Similar problems are reported for rt2x00 drivers.

Failure in ieee80211_hw_config() leads to the interface staying down.

Signed-off-by: Pavel Roskin [EMAIL PROTECTED]
---
 net/d80211/ieee80211.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/net/d80211/ieee80211.c b/net/d80211/ieee80211.c
index 2f1dce5..7219416 100644
--- a/net/d80211/ieee80211.c
+++ b/net/d80211/ieee80211.c
@@ -2239,6 +2239,8 @@ static int ieee80211_open(struct net_device *dev)
 	res = 0;
 	if (local->ops->open)
 		res = local->ops->open(local_to_hw(local));
+	if (res == 0)
+		res = ieee80211_hw_config(local);
 	if (res == 0) {
 		res = dev_open(local->mdev);
 		if (res) {
--
Regards,
Pavel Roskin
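The ordering the patch establishes (driver open, then hardware configuration, then bringing the device up, stopping at the first failure) can be modelled outside the kernel. The sketch below uses invented stand-in types and functions throughout; nothing in it is d80211 API, and the error value is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the hardware; hooks return 0 or -errno. */
struct fake_hw {
    int (*open)(struct fake_hw *hw);   /* optional driver hook */
    int (*config)(struct fake_hw *hw); /* reapply channel etc. */
    int config_calls;
    int is_up;
};

static int fake_dev_open(struct fake_hw *hw)
{
    hw->is_up = 1;
    return 0;
}

/* Mirrors the shape of the patched open path. */
static int fake_iface_open(struct fake_hw *hw)
{
    int res = 0;

    if (hw->open)
        res = hw->open(hw);
    if (res == 0)
        res = hw->config(hw); /* the step the patch adds */
    if (res == 0)
        res = fake_dev_open(hw);
    return res;
}

static int config_ok(struct fake_hw *hw)
{
    hw->config_calls++;
    return 0;
}

static int config_fail(struct fake_hw *hw)
{
    hw->config_calls++;
    return -5;
}

void open_demo(void)
{
    struct fake_hw good = { NULL, config_ok, 0, 0 };
    struct fake_hw bad = { NULL, config_fail, 0, 0 };

    /* Configuration runs on every open, before the device goes up. */
    assert(fake_iface_open(&good) == 0);
    assert(good.config_calls == 1 && good.is_up == 1);

    /* A configuration failure leaves the interface down. */
    assert(fake_iface_open(&bad) == -5);
    assert(bad.is_up == 0);
}
```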
[RFC PATCH 0/31] An introduction and A path for merging network namespace work
The idea of a network namespace is fundamentally quite simple. We create a mechanism that from the user's perspective allows creation of separate instances of the network stack. When combined with mechanisms like chroot this results in a much more complete isolation. When seen in the context of application migration this allows for taking your IP address and other global identifiers with you.

What does this mean in the context of the networking stack?

The basic idea is to tag processes with a network namespace that is used when they create new sockets or otherwise initiate a new fresh communication with the networking stack.

The idea is to tag all sockets with a network namespace they will always be in and all operations on them will be relative to.

The idea is to tag all network devices with a network namespace they are a member of, which may be changed during the lifetime of a device.

Mostly a network namespace at its most basic level is about names. It is about creating a view of the networking stack where you can name the network devices that are members anything you want. Likewise for iptables rules and all of the rest of the state. It is a lot like creating a new directory in a filesystem. The underlying data structures don't really change, just the user's view of those data structures, and we continue to have a single network stack.

My goal today is that even if we can't agree on a specific set of patches, we come to an agreement on roughly what those patches should accomplish, and what process we should go through to get them merged.

For implementing a network namespace the core problem is that there is a lot of networking code, and it is continually evolving. This means that the task of implementing a network namespace is not a small one: a lot of code must be read, touched and updated, while hoping someone doesn't change something important before you get your changes in.
To do this sanely means we need an incremental path to our goal, one that allows small pieces to be reviewed and merged as they are ready.

The path I am recommending today is to first lay down some basic infrastructure. Then, one layer at a time, modify the existing code to handle multiple simultaneous network namespaces, but modify each component of that layer to refuse to operate in the context of anything but the initial network namespace, thus protecting code that has not yet been updated from situations it does not know how to deal with. Eventually this will get down to the real meat of the problem and practical things like ipv4 sockets will work.

This should allow for a network stack that compiles, builds and works at each step of the way. Not too far into the process, working support for multiple network namespaces should be available, with the limitation that every namespace except the initial one will look like a kernel with most parts of the networking stack compiled out; within those parts that are present it should be fully usable.

To make my thinking clear I have provided an initial patchset that makes quite a bit of progress, especially in laying the groundwork. My goal is to ask the question: does this basic path make sense? To that end I have omitted posting some of the prerequisite cleanup and infrastructure patches (like my sysctl work) that are just noise in this context, and I have not rebased my patchset against Dave Miller's latest networking tree. Those are important details but they are not important to this conversation.

If my basic path and the basic patches look like they are heading in the right direction we can start moving towards what needs to happen to ensure a review of the patches, and what we need to do to start merging them. If the basic path does not appear reasonable, well, that would be good to know as well.
There are essentially two different approaches to modifying networking code to handle multiple network namespaces. Either all of the global variables can be replicated once for each network namespace and we build up parallel namespace-specific data structures, or the data elements in the data structure are tagged with the namespace they belong to and we filter them. Which is more appropriate and easier depends on the context. As a general rule, large hash tables call for filtering and a small global variable set calls for simply having multiple instances of the data structure.

The biggest intrusion I expect to see in the logic of the networking stack is initialization and teardown, as we need to initialize and clean up all of those per network namespace variables when we create and destroy a network namespace.

A git tree with all of my patches against 2.6.20-rc5 is available at: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-netns.git

In addition to what I have posted here and all of its prerequisites the tree includes further patches that get the basics of ipv4 and iptables
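The two approaches described in this message (replicating a small set of globals per namespace versus tagging entries in one shared structure and filtering on lookup) can be contrasted in a toy sketch. Everything below is hypothetical: namespaces are plain integer ids, the table is a flat array rather than a hash table, and none of the names correspond to anything in the patchset.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NS 4

/* Approach 1: a small former global, replicated once per namespace. */
struct ns_settings {
    int default_ttl;
};
static struct ns_settings per_ns[MAX_NS];

/* Approach 2: one shared table whose entries carry a namespace tag. */
struct tagged_entry {
    int ns_id; /* which namespace owns this entry */
    int key;
    int value;
};

/* Lookup filters on the tag, so equal keys in different namespaces
 * never see each other. */
static const struct tagged_entry *
lookup_tagged(const struct tagged_entry *tbl, size_t n, int ns_id, int key)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].ns_id == ns_id && tbl[i].key == key)
            return &tbl[i];
    return NULL;
}

void netns_demo(void)
{
    static const struct tagged_entry table[] = {
        { 0, 80, 1 },
        { 1, 80, 2 },
    };

    /* Replicated instances are fully independent. */
    per_ns[0].default_ttl = 64;
    per_ns[1].default_ttl = 32;
    assert(per_ns[0].default_ttl != per_ns[1].default_ttl);

    /* Tagged entries share storage but are filtered per namespace. */
    assert(lookup_tagged(table, 2, 0, 80)->value == 1);
    assert(lookup_tagged(table, 2, 1, 80)->value == 2);
    assert(lookup_tagged(table, 2, 2, 80) == NULL);
}
```

The trade-off sketched here matches the rule of thumb in the text: replication is simplest for a handful of variables, while filtering avoids duplicating a large table for every namespace.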
[PATCH RFC 3/31] net: Add a network namespace parameter to tasks
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

This is the network namespace from which all sockets and anything else under user control ultimately get their network namespace parameters.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/nsproxy.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 0b9f0dc..cc76610 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -3,6 +3,7 @@
 
 #include <linux/spinlock.h>
 #include <linux/sched.h>
+#include <linux/net_namespace_type.h>
 
 struct mnt_namespace;
 struct uts_namespace;
@@ -28,6 +29,7 @@ struct nsproxy {
 	struct ipc_namespace *ipc_ns;
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns;
+	net_t net_ns;
 };
 
 extern struct nsproxy init_nsproxy;
--
1.4.4.1.g278f
[PATCH RFC 8/31] net: Make /sys/class/net handle multiple network namespaces
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

In combination with the sysfs support I am in the process of merging with gregkh, this creates a separate instance of the /sys/class/net directory for each network namespace, so two devices with the same name do not conflict. A network namespace sensitive follow link method on the /sys/class/net directory then ensures that you see the directory instance for your current network namespace, so all existing applications continue to see what is currently present in sysfs.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/core/net-sysfs.c |   53 +-
 1 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 5d08cc9..b08c1be 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -11,12 +11,14 @@
 
 #include <linux/capability.h>
 #include <linux/kernel.h>
+#include <linux/sysfs.h>
 #include <linux/netdevice.h>
 #include <linux/if_arp.h>
 #include <net/sock.h>
 #include <linux/rtnetlink.h>
 #include <linux/wireless.h>
 #include <net/iw_handler.h>
+#include <net/net_namespace.h>
 
 #define to_class_dev(obj) container_of(obj,struct class_device,kobj)
 #define to_net_dev(class) container_of(class, struct net_device, class_dev)
@@ -431,6 +433,24 @@ static void netdev_release(struct class_device *cd)
 	kfree((char *)dev - dev->padded);
 }
 
+static DEFINE_PER_NET(struct dentry *, net_shadow) = NULL;
+
+static struct dentry *net_class_device_dparent(struct class_device *cd)
+{
+	struct net_device *dev
+		= container_of(cd, struct net_device, class_dev);
+	net_t net = dev->nd_net;
+
+	return per_net(net_shadow, net);
+}
+
+static void *class_net_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+	dput(nd->dentry);
+	nd->dentry = dget(per_net(net_shadow, current->nsproxy->net_ns));
+	return NULL;
+}
+
 static struct class net_class = {
 	.name = "net",
 	.release = netdev_release,
@@ -438,6 +458,8 @@ static struct class net_class = {
 #ifdef CONFIG_HOTPLUG
 	.uevent = netdev_uevent,
 #endif
+	.class_device_dparent = net_class_device_dparent,
+	.class_follow_link = class_net_follow_link,
 };
 
 void netdev_unregister_sysfs(struct net_device * dev)
@@ -470,7 +492,36 @@ int netdev_register_sysfs(struct net_device *dev)
 	return class_device_add(class_dev);
 }
 
+static int netdev_sysfs_net_init(net_t net)
+{
+	struct dentry *shadow;
+	int error = 0;
+	shadow = sysfs_create_shadow_dir(&net_class.subsys.kset.kobj);
+	if (IS_ERR(shadow))
+		error = PTR_ERR(shadow);
+	else
+		per_net(net_shadow, net) = shadow;
+	return error;
+}
+
+static void netdev_sysfs_net_exit(net_t net)
+{
+	sysfs_remove_shadow_dir(per_net(net_shadow, net));
+	per_net(net_shadow, net) = NULL;
+}
+
+static struct pernet_operations netdev_sysfs_ops = {
+	.init = netdev_sysfs_net_init,
+	.exit = netdev_sysfs_net_exit,
+};
+
 int netdev_sysfs_init(void)
 {
-	return class_register(&net_class);
+	int rc;
+	if ((rc = class_register(&net_class)))
+		goto out;
+	if ((rc = register_pernet_subsys(&netdev_sysfs_ops)))
+		goto out;
+out:
+	return rc;
 }
--
1.4.4.1.g278f
[PATCH RFC 1/31] net: Add net_namespace_type.h to allow for per network namespace variables.
The problem:

To properly implement a ``level 2'' network namespace we need to move
many of the networking stack global variables into the network
namespace.

We want to keep it explicit that the code is accessing a variable in
a network namespace.

We want to be able to completely compile out the network namespace
support so we can do comparative performance testing, and so as not
to penalize users who don't need network namespace support.

Because the network stack is a moving target we want something simple
that allows the bulk of the changes to be merged before we enable
network namespace support.

My biggest challenge when looking into this was to find an approach
that would allow the code to compile out in a way that does not yield
any performance overhead and does not make the code ugly. While
playing with the different possibilities I discovered that gcc will
not pass 0 byte structures that are arguments to functions and
instead will simply optimize them away. This appears to be true on
i386 all of the way back to gcc-2.95, and I verified that it also
works with gcc 4.1 on x86_64. Since this is part of the ABI I never
expect it to change. Hopefully gcc uses this nice optimization on all
architectures; I suspect so, as C++ allows passing function arguments
of type void in certain circumstances.

Using this observation I was able to come up with a network namespace
implementation that allows the changes to completely compile out when
we don't build the kernel with network namespace support.

This patch implements my dummy network namespace support that should
completely compile out. Further patches will add the real version.
Starting with the dummy gives a quick hint of where I am going and
allows for dependencies to be overcome.

When doing my proof of concept implementation one of the other
problems I had was that, as the network stack comes in so many
modular pieces, figuring out how to get their global variables into
the network namespace structure was a challenge. The basic technique
used by our per cpu variables for having the linker build and
dynamically change structures for us appears applicable here, and is
a lot less nuisance than what I did before, so I am implementing a
tailored version of that technique as well. Again this makes it very
simple to compile the code out.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/net_namespace_type.h | 52
 1 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/include/linux/net_namespace_type.h b/include/linux/net_namespace_type.h
new file mode 100644
index 0000000..8173f59
--- /dev/null
+++ b/include/linux/net_namespace_type.h
@@ -0,0 +1,52 @@
+/*
+ * Definition of the network namespace reference type
+ * And operations upon it.
+ */
+#ifndef __LINUX_NET_NAMESPACE_TYPE_H
+#define __LINUX_NET_NAMESPACE_TYPE_H
+
+#define __pernetname(name) per_net__##name
+
+typedef struct {} net_t;
+
+#define __data_pernet
+
+/* Look up a per network namespace variable */
+static inline unsigned long __per_net_offset(net_t net) { return 0; }
+
+/* Like per_net but returns a pseudo variable address that must be moved
+ * __per_net_offset() bytes before it will point to a real variable.
+ * Useful for static initializers.
+ */
+#define __per_net_base(name) __pernetname(name)
+
+/* Get the network namespace reference from a per_net variable address */
+#define net_of(ptr, name) ({ net_t net; ptr; net; })
+
+/* Look up a per network namespace variable */
+#define per_net(name, net) \
+	(*(__per_net_offset(net), &__per_net_base(name)))
+
+/* Are the two network namespaces the same */
+static inline int net_eq(net_t a, net_t b) { return 1; }
+/* Get an unsigned value appropriate for hashing the network namespace */
+static inline unsigned int net_hval(net_t net) { return 0; }
+
+/* Convert to and from void pointers */
+static inline void *net_to_voidp(net_t net) { return NULL; }
+static inline net_t net_from_voidp(void *ptr) { net_t net; return net; }
+
+static inline int null_net(net_t net) { return 0; }
+
+#define DEFINE_PER_NET(type, name) \
+	__data_pernet __typeof__(type) __pernetname(name)
+
+#define DECLARE_PER_NET(type, name) \
+	extern __typeof__(type) __pernetname(name)
+
+#define EXPORT_PER_NET_SYMBOL(var) \
+	EXPORT_SYMBOL(__pernetname(var))
+#define EXPORT_PER_NET_SYMBOL_GPL(var) \
+	EXPORT_SYMBOL_GPL(__pernetname(var))
+
+#endif /* __LINUX_NET_NAMESPACE_TYPE_H */
-- 
1.4.4.1.g278f
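The compile-out trick can be tried outside the kernel. This is a minimal userspace sketch, assuming GNU C (an empty struct is a gcc/clang extension and has size zero in C, not in C++). The macro names follow the header above with the leading underscores dropped, and demo_counter, demo_net and demo_bump are hypothetical names added for illustration.

```c
#include <stddef.h>

/* GNU C extension: a struct with no members has size zero. */
typedef struct {} net_t;

#define pernetname(name) per_net__##name

/* In the dummy implementation every namespace maps to offset 0. */
static inline unsigned long per_net_offset(net_t net) { (void)net; return 0; }

/* Evaluate the (no-op) offset lookup, then dereference the one real variable. */
#define per_net(name, net) (*(per_net_offset(net), &pernetname(name)))

#define DEFINE_PER_NET(type, name) __typeof__(type) pernetname(name)

/* Hypothetical per-namespace variable. */
static DEFINE_PER_NET(int, demo_counter) = 10;

/* Produce a namespace reference; there is nothing to initialise in it. */
static net_t demo_net(void) { net_t net; return net; }

/* Increment the per-namespace counter and return its new value. */
static int demo_bump(net_t net)
{
	per_net(demo_counter, net) += 1;
	return per_net(demo_counter, net);
}
```

Because net_t carries no bytes, passing it around should cost nothing after optimisation, while every access still names the namespace it refers to. Inspecting the generated assembly (for example with gcc -O2 -S) is the way to confirm that on a given compiler and architecture.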
[PATCH RFC 30/31] net: Make AF_UNIX per network namespace safe.
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Because of the global nature of garbage collection, and because of
the cost of per namespace hash tables, unix_socket_table has been
kept global, with a filter added on lookups so we don't see sockets
from the wrong namespace.

Currently I don't fold the namespace into the hash, so multiple
namespaces using the same socket name will be guaranteed a hash
collision.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/net/af_unix.h      |  10
 net/unix/af_unix.c         | 116
 net/unix/sysctl_net_unix.c |  24
 3 files changed, 103 insertions(+), 47 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index c0398f5..1f40dd2 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -89,12 +89,12 @@ struct unix_sock {
 #define unix_sk(__sk) ((struct unix_sock *)__sk)
 
 #ifdef CONFIG_SYSCTL
-extern int sysctl_unix_max_dgram_qlen;
-extern void unix_sysctl_register(void);
-extern void unix_sysctl_unregister(void);
+DECLARE_PER_NET(int, sysctl_unix_max_dgram_qlen);
+extern void unix_sysctl_register(net_t net);
+extern void unix_sysctl_unregister(net_t net);
 #else
-static inline void unix_sysctl_register(void) {}
-static inline void unix_sysctl_unregister(void) {}
+static inline void unix_sysctl_register(net_t net) {}
+static inline void unix_sysctl_unregister(net_t net) {}
 #endif
 #endif
 #endif
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 8015a03..3f57cb2 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -118,7 +118,7 @@
 #include <linux/security.h>
 #include <net/net_namespace.h>
 
-int sysctl_unix_max_dgram_qlen __read_mostly = 10;
+DEFINE_PER_NET(int, sysctl_unix_max_dgram_qlen) = 10;
 
 struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1];
 DEFINE_SPINLOCK(unix_table_lock);
@@ -245,7 +245,8 @@ static inline void unix_insert_socket(struct hlist_head *list, struct sock *sk)
 	spin_unlock(&unix_table_lock);
 }
 
-static struct sock *__unix_find_socket_byname(struct sockaddr_un *sunname,
+static struct sock *__unix_find_socket_byname(net_t net,
+					      struct sockaddr_un *sunname,
 					      int len, int type, unsigned hash)
 {
 	struct sock *s;
@@ -254,6 +255,9 @@ static struct sock *__unix_find_socket_byname(struct sockaddr_un *sunname,
 	sk_for_each(s, node, &unix_socket_table[hash ^ type]) {
 		struct unix_sock *u = unix_sk(s);
 
+		if (!net_eq(s->sk_net, net))
+			continue;
+
 		if (u->addr->len == len &&
 		    !memcmp(u->addr->name, sunname, len))
 			goto found;
@@ -263,21 +267,22 @@ found:
 	return s;
 }
 
-static inline struct sock *unix_find_socket_byname(struct sockaddr_un *sunname,
+static inline struct sock *unix_find_socket_byname(net_t net,
+						   struct sockaddr_un *sunname,
 						   int len, int type,
 						   unsigned hash)
 {
 	struct sock *s;
 
 	spin_lock(&unix_table_lock);
-	s = __unix_find_socket_byname(sunname, len, type, hash);
+	s = __unix_find_socket_byname(net, sunname, len, type, hash);
 	if (s)
 		sock_hold(s);
 	spin_unlock(&unix_table_lock);
 	return s;
 }
 
-static struct sock *unix_find_socket_byinode(struct inode *i)
+static struct sock *unix_find_socket_byinode(net_t net, struct inode *i)
 {
 	struct sock *s;
 	struct hlist_node *node;
@@ -287,6 +292,9 @@ static struct sock *unix_find_socket_byinode(struct inode *i)
 		    &unix_socket_table[i->i_ino & (UNIX_HASH_SIZE - 1)]) {
 		struct dentry *dentry = unix_sk(s)->dentry;
 
+		if (!net_eq(s->sk_net, net))
+			continue;
+
 		if(dentry && dentry->d_inode == i)
 		{
 			sock_hold(s);
@@ -588,7 +596,7 @@ static struct sock * unix_create1(net_t net, struct socket *sock)
 				&af_unix_sk_receive_queue_lock_key);
 
 	sk->sk_write_space = unix_write_space;
-	sk->sk_max_ack_backlog = sysctl_unix_max_dgram_qlen;
+	sk->sk_max_ack_backlog = per_net(sysctl_unix_max_dgram_qlen, net);
 	sk->sk_destruct = unix_sock_destructor;
 	u = unix_sk(sk);
 	u->dentry = NULL;
@@ -604,9 +612,6 @@ out:
 
 static int unix_create(net_t net, struct socket *sock, int protocol)
 {
-	if (!net_eq(net, init_net()))
-		return -EAFNOSUPPORT;
-
 	if (protocol && protocol != PF_UNIX)
 		return -EPROTONOSUPPORT;
 
@@ -650,6 +655,7 @@ static int unix_release(struct socket *sock)
 static int unix_autobind(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
+	net_t net = sk->sk_net;
 	struct
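The "global table plus lookup filter" approach described in this patch can be sketched in userspace. Everything below is hypothetical scaffolding (demo_sock, demo_find_byname, integer namespace ids, a singly linked bucket); it only illustrates why two namespaces may bind the same name in one shared bucket without seeing each other's sockets.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical socket record: the namespace tag plays the role of sk_net. */
struct demo_sock {
	int net_id;
	const char *name;
	struct demo_sock *next;	/* next entry in the same hash bucket */
};

/*
 * Walk one shared bucket and skip entries that belong to another
 * namespace before comparing names, as the patched lookup does.
 */
static struct demo_sock *demo_find_byname(struct demo_sock *bucket,
					  int net_id, const char *name)
{
	struct demo_sock *s;

	for (s = bucket; s; s = s->next) {
		if (s->net_id != net_id)
			continue;
		if (strcmp(s->name, name) == 0)
			return s;
	}
	return NULL;
}
```

Since the namespace is not folded into the hash, same-named sockets from different namespaces always land in the same bucket; the filter keeps lookups correct but each such chain grows with the number of namespaces using that name.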
[PATCH RFC 31/31] net: Add etun driver
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

etun is a simple two headed tunnel driver that at the link layer
looks like ethernet. Its target audience is communication between
network namespaces, but it is general enough that it may have other
uses as well.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 drivers/net/Kconfig  |  14
 drivers/net/Makefile |   1
 drivers/net/etun.c   | 470
 3 files changed, 485 insertions(+), 0 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 8aa8dd0..969d3df 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -119,6 +119,20 @@ config TUN
 
 	  If you don't know what to use this for, you don't need it.
 
+config ETUN
+	tristate "Ethernet tunnel device driver support"
+	depends on SYSFS
+	---help---
+	  ETUN provides a pair of network devices that can be used for
+	  configuring interesting topologies. What one device transmits
+	  the other receives and vice versa. The link level framing
+	  is ethernet for wide compatibility with network stacks.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called etun.
+
+	  If you don't know what to use this for, you don't need it.
+
 config NET_SB1000
 	tristate "General Instruments Surfboard 1000"
 	depends on PNP
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 4c0d4e5..396af4f 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -185,6 +185,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o
 obj-$(CONFIG_MACMACE) += macmace.o
 obj-$(CONFIG_MAC89x0) += mac89x0.o
 obj-$(CONFIG_TUN) += tun.o
+obj-$(CONFIG_ETUN) += etun.o
 obj-$(CONFIG_NET_NETX) += netx-eth.o
 obj-$(CONFIG_DL2K) += dl2k.o
 obj-$(CONFIG_R8169) += r8169.o
diff --git a/drivers/net/etun.c b/drivers/net/etun.c
new file mode 100644
index 0000000..1dd8cd8
--- /dev/null
+++ b/drivers/net/etun.c
@@ -0,0 +1,470 @@
+/*
+ *  ETUN - Universal ETUN device driver.
+ *  Copyright (C) 2006 Linux Networx
+ *
+ */
+
+#define DRV_NAME	"etun"
+#define DRV_VERSION	"1.0"
+#define DRV_DESCRIPTION	"Ethernet pseudo tunnel device driver"
+#define DRV_COPYRIGHT	"(C) 2007 Linux Networx"
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_ether.h>
+#include <linux/ctype.h>
+#include <net/net_namespace.h>
+#include <net/dst.h>
+
+
+/* Device checksum strategy.
+ *
+ * etun is designed to be a pair of virtual devices
+ * connecting two network stack instances.
+ *
+ * Typically it will either be used with ethernet bridging or
+ * it will be used to route packets between the two stacks.
+ *
+ * The only checksum offloading I can do is to completely
+ * skip the checksumming step all together.
+ *
+ * When used for ethernet bridging I don't believe any
+ * checksum off loading is safe.
+ * - If my source is an external interface the checksum may be
+ *   invalid so I don't want to report I have already checked it.
+ * - If my destination is an external interface I don't want to put
+ *   a packet on the wire with someone computing the checksum.
+ *
+ * When used for routing between two stacks checksums should
+ * be as unnecessary as they are on the loopback device.
+ *
+ * So by default I am safe and disable checksumming and
+ * other advanced features like SG and TSO.
+ *
+ * However because I think these features could be useful
+ * I provide the ethtool functions to enable/disable
+ * them at runtime.
+ *
+ * If you think you can correctly enable these go ahead.
+ * For checksums both the transmitter and the receiver must
+ * agree before they are actually disabled.
+ */
+
+#define ETUN_NUM_STATS 1
+static struct {
+	const char string[ETH_GSTRING_LEN];
+} ethtool_stats_keys[ETUN_NUM_STATS] = {
+	{ "partner_ifindex" },
+};
+
+struct etun_info {
+	struct net_device	*rx_dev;
+	unsigned		ip_summed;
+	struct net_device_stats	stats;
+	struct list_head	list;
+	struct net_device	*dev;
+};
+
+/*
+ * I have to hold the rtnl_lock during device delete.
+ * So I use the rtnl_lock to protect my list manipulations
+ * as well.  Crude but simple.
+ */
+static LIST_HEAD(etun_list);
+
+/*
+ * The higher levels take care of making this non-reentrant (it's
+ * called with bh's disabled).
+ */
+static int etun_xmit(struct sk_buff *skb, struct net_device *tx_dev)
+{
+	struct etun_info *tx_info = tx_dev->priv;
+	struct net_device *rx_dev = tx_info->rx_dev;
+	struct etun_info *rx_info = rx_dev->priv;
+
+	tx_info->stats.tx_packets++;
+	tx_info->stats.tx_bytes += skb->len;
+
+	/* Drop the skb state that was needed to get here */
+	skb_orphan(skb);
+
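The "what one device transmits the other receives" behaviour can be modelled without any kernel types. This is a rough userspace sketch: demo_etun, demo_etun_pair and demo_etun_xmit are hypothetical names, and the receive-side accounting is an assumption on my part because the quoted etun_xmit() is cut off before the receive path.

```c
#include <stddef.h>

struct demo_stats {
	unsigned long tx_packets, tx_bytes;
	unsigned long rx_packets, rx_bytes;
};

/* One end of the pair; peer plays the role of etun_info.rx_dev. */
struct demo_etun {
	struct demo_etun *peer;
	struct demo_stats stats;
};

/* Cross-link two ends so each one's transmit feeds the other's receive. */
static void demo_etun_pair(struct demo_etun *a, struct demo_etun *b)
{
	a->peer = b;
	b->peer = a;
}

/* "Transmit" len bytes: count them on the sender and on the partner. */
static int demo_etun_xmit(struct demo_etun *tx, size_t len)
{
	struct demo_etun *rx = tx->peer;

	if (!rx)
		return -1;	/* unpaired device, nothing can receive */

	tx->stats.tx_packets++;
	tx->stats.tx_bytes += len;
	rx->stats.rx_packets++;	/* assumed receive-side accounting */
	rx->stats.rx_bytes += len;
	return 0;
}
```

Placing one end in each of two network namespaces is what would turn such a pair into a link between otherwise isolated stacks.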
[PATCH RFC 23/31] net: Modify all rtnetlink methods to only work in the initial namespace
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Before I can enable rtnetlink to work in all network namespaces I
need to be certain that something won't break. So this patch
deliberately disables all of the methods, and when they are audited
this extra check can be disabled.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/bridge/br_netlink.c |  9
 net/core/fib_rules.c    |  7
 net/core/neighbour.c    | 18
 net/core/rtnetlink.c    | 13
 net/decnet/dn_dev.c     | 12
 net/decnet/dn_fib.c     |  8
 net/decnet/dn_route.c   |  8
 net/decnet/dn_rules.c   |  5
 net/decnet/dn_table.c   |  4
 net/ipv4/devinet.c      | 12
 net/ipv4/fib_frontend.c | 12
 net/ipv4/fib_rules.c    |  5
 net/ipv6/addrconf.c     | 31
 net/ipv6/fib6_rules.c   |  5
 net/ipv6/ip6_fib.c      |  4
 net/ipv6/route.c        | 12
 net/sched/act_api.c     |  8
 net/sched/cls_api.c     |  8
 net/sched/sch_api.c     | 20
 19 files changed, 201 insertions(+), 0 deletions(-)

diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 119b97d..85165a1 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -14,6 +14,7 @@
 #include <linux/rtnetlink.h>
 #include <net/netlink.h>
 #include <net/net_namespace.h>
+#include <net/sock.h>
 #include "br_private.h"
 
 static inline size_t br_nlmsg_size(void)
@@ -104,9 +105,13 @@ errout:
  */
 static int br_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 {
+	net_t net = skb->sk->sk_net;
 	struct net_device *dev;
 	int idx;
 
+	if (!net_eq(net, init_net()))
+		return 0;
+
 	read_lock(&per_net(dev_base_lock, init_net()));
 	for (dev = per_net(dev_base, init_net()), idx = 0; dev; dev = dev->next) {
 		/* not a bridge port */
@@ -133,12 +138,16 @@ skip:
  */
 static int br_rtm_setlink(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 {
+	net_t net = skb->sk->sk_net;
 	struct ifinfomsg *ifm;
 	struct nlattr *protinfo;
 	struct net_device *dev;
 	struct net_bridge_port *p;
 	u8 new_state;
 
+	if (!net_eq(net, init_net()))
+		return -EINVAL;
+
 	if (nlmsg_len(nlh) < sizeof(*ifm))
 		return -EINVAL;
 
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 2fa2708..00b4148 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -163,6 +163,9 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
 	struct nlattr *tb[FRA_MAX+1];
 	int err = -EINVAL;
 
+	if (!net_eq(net, init_net()))
+		return -EINVAL;
+
 	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
 		goto errout;
 
@@ -244,12 +247,16 @@ errout:
 
 int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
 {
+	net_t net = skb->sk->sk_net;
 	struct fib_rule_hdr *frh = nlmsg_data(nlh);
 	struct fib_rules_ops *ops = NULL;
 	struct fib_rule *rule;
 	struct nlattr *tb[FRA_MAX+1];
 	int err = -EINVAL;
 
+	if (!net_eq(net, init_net()))
+		return -EINVAL;
+
 	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
 		goto errout;
 
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index f5d4f92..d89c6fe 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1445,6 +1445,9 @@ int neigh_delete(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	struct net_device *dev = NULL;
 	int err = -EINVAL;
 
+	if (!net_eq(net, init_net()))
+		return -EINVAL;
+
 	if (nlmsg_len(nlh) < sizeof(*ndm))
 		goto out;
 
@@ -1511,6 +1514,9 @@ int neigh_add(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	struct net_device *dev = NULL;
 	int err;
 
+	if (!net_eq(net, init_net()))
+		return -EINVAL;
+
 	err = nlmsg_parse(nlh, sizeof(*ndm), tb, NDA_MAX, NULL);
 	if (err < 0)
 		goto out;
@@ -1783,11 +1789,15 @@ static struct nla_policy nl_ntbl_parm_policy[NDTPA_MAX+1] __read_mostly = {
 
 int neightbl_set(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 {
+	net_t net = skb->sk->sk_net;
 	struct neigh_table *tbl;
 	struct ndtmsg *ndtmsg;
 	struct nlattr *tb[NDTA_MAX+1];
 	int err;
 
+	if (!net_eq(net, init_net()))
+		return -EINVAL;
+
 	err = nlmsg_parse(nlh, sizeof(*ndtmsg), tb, NDTA_MAX,
 			  nl_neightbl_policy);
 	if (err < 0)
@@ -1907,11 +1917,15 @@ errout:
 
 int neightbl_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
 {
+	net_t net = skb->sk->sk_net;
 	int family, tidx, nidx = 0;
 	int tbl_skip = cb->args[0];
 	int neigh_skip = cb->args[1];
[PATCH RFC 15/31] net: Make the loopback device per network namespace
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

This patch makes the loopback_dev per network namespace. The loopback
device registers itself as a pernet_device so we can register the new
loopback_dev instance when we add a new network namespace and so we
can unregister the loopback device when we destroy the network
namespace.

Currently the loopback device statistics are kept across all loopback
devices, a minor glitch that will not affect correct operation but
something we may want to fix.

This patch modifies all users of the loopback_dev so they access it
as per_net(loopback_dev, init_net()), keeping all of the code
compiling and working. A later pass will be needed to update the
users to use something other than the initial network namespace.

The only non-trivial modification was the ipv6 code in route.c, as
the loopback_dev can no longer be used in static initializers, and
even that change was very simple.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 drivers/net/loopback.c           | 24
 include/linux/netdevice.h        |  2
 net/core/dst.c                   |  8
 net/decnet/dn_dev.c              |  4
 net/decnet/dn_route.c            | 14
 net/ipv4/devinet.c               |  4
 net/ipv4/ipconfig.c              |  8
 net/ipv4/ipvs/ip_vs_core.c       |  2
 net/ipv4/route.c                 | 18
 net/ipv4/xfrm4_policy.c          |  2
 net/ipv6/addrconf.c              |  8
 net/ipv6/netfilter/ip6t_REJECT.c |  2
 net/ipv6/route.c                 | 24
 net/ipv6/xfrm6_policy.c          |  2
 net/xfrm/xfrm_policy.c           |  4
 15 files changed, 75 insertions(+), 51 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 22b672d..e9abf3f 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -57,6 +57,7 @@
 #include <linux/ip.h>
 #include <linux/tcp.h>
 #include <linux/percpu.h>
+#include <net/net_namespace.h>
 
 struct pcpu_lstats {
 	unsigned long packets;
@@ -204,7 +205,7 @@ static const struct ethtool_ops loopback_ethtool_ops = {
  * The loopback device is special. There is only one instance and
  * it is statically allocated. Don't do this for other devices.
  */
-struct net_device loopback_dev = {
+DEFINE_PER_NET(struct net_device, loopback_dev) = {
 	.name			= "lo",
 	.get_stats		= &get_stats,
 	.priv			= &loopback_stats,
@@ -228,13 +229,28 @@ struct net_device loopback_dev = {
 	.ethtool_ops		= &loopback_ethtool_ops,
 };
 
+static int loopback_net_init(net_t net)
+{
+	per_net(loopback_dev, net).nd_net = net;
+	return register_netdev(&per_net(loopback_dev, net));
+}
+
+static void loopback_net_exit(net_t net)
+{
+	unregister_netdev(&per_net(loopback_dev, net));
+}
+
+static struct pernet_operations loopback_net_ops = {
+	.init = loopback_net_init,
+	.exit = loopback_net_exit,
+};
+
 /* Setup and register the loopback device. */
 static int __init loopback_init(void)
 {
-	loopback_dev.nd_net = init_net();
-	return register_netdev(&loopback_dev);
+	return register_pernet_device(&loopback_net_ops);
 };
 
 module_init(loopback_init);
 
-EXPORT_SYMBOL(loopback_dev);
+EXPORT_PER_NET_SYMBOL(loopback_dev);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9e28671..73931a0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -570,7 +570,7 @@ struct packet_type {
 #include <linux/interrupt.h>
 #include <linux/notifier.h>
 
-extern struct net_device		loopback_dev;		/* The loopback */
+DECLARE_PER_NET(struct net_device, loopback_dev); /* The loopback */
 extern struct net_device		*dev_base;		/* All devices */
 extern rwlock_t				dev_base_lock;		/* Device list lock */
 
diff --git a/net/core/dst.c b/net/core/dst.c
index 8c4a272..3435771 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -241,13 +241,13 @@ static inline void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 		dst->input = dst_discard_in;
 		dst->output = dst_discard_out;
 	} else {
-		dst->dev = &loopback_dev;
-		dev_hold(&loopback_dev);
+		dst->dev = &per_net(loopback_dev, init_net());
+		dev_hold(dst->dev);
 		dev_put(dev);
 		if (dst->neighbour && dst->neighbour->dev == dev) {
-			dst->neighbour->dev = &loopback_dev;
+			dst->neighbour->dev = &per_net(loopback_dev, init_net());
 			dev_put(dev);
-			dev_hold(&loopback_dev);
+			dev_hold(dst->neighbour->dev);
 		}
 	}
 }
diff --git a/net/decnet/dn_dev.c b/net/decnet/dn_dev.c
index 19b1469..dbaf001 100644
[PATCH RFC 19/31] net: sysfs interface support for moving devices between network namespaces.
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

I haven't a clue if this interface will meet with widespread approval
but at this point it is simple, and very useful.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/core/net-sysfs.c | 35
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 1be6f94..f8a5c6b 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -188,6 +188,40 @@ static ssize_t store_mtu(struct class_device *cd, const char *buf, size_t len)
 	return netdev_store(cd, buf, len, change_mtu);
 }
 
+static ssize_t show_new_ns_pid(struct class_device *cd, char *buf)
+{
+	return -EPERM;
+}
+static int change_new_ns_pid(struct net_device *dev, unsigned long new_ns_pid)
+{
+	struct task_struct *tsk;
+	int err;
+	net_t net;
+	/* Look up the network namespace */
+	err = -ESRCH;
+	rcu_read_lock();
+	tsk = find_task_by_pid(new_ns_pid);
+	if (tsk) {
+		task_lock(tsk);
+		if (tsk->nsproxy) {
+			err = 0;
+			net = get_net(tsk->nsproxy->net_ns);
+		}
+		task_unlock(tsk);
+	}
+	rcu_read_unlock();
+	/* If I found a network namespace move the device */
+	if (!err) {
+		err = dev_change_net_namespace(dev, net, NULL);
+		put_net(net);
+	}
+	return err;
+}
+static ssize_t store_new_ns_pid(struct class_device *cd, const char *buf, size_t len)
+{
+	return netdev_store(cd, buf, len, change_new_ns_pid);
+}
+
 NETDEVICE_SHOW(flags, fmt_hex);
 
 static int change_flags(struct net_device *dev, unsigned long new_flags)
@@ -243,6 +277,7 @@ static struct class_device_attribute net_class_attributes[] = {
 	__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
 	       store_tx_queue_len),
 	__ATTR(weight, S_IRUGO | S_IWUSR, show_weight, store_weight),
+	__ATTR(new_ns_pid, S_IWUSR, show_new_ns_pid, store_new_ns_pid),
 	{}
 };
 
-- 
1.4.4.1.g278f
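From userspace, the interface would presumably be driven by writing a pid into the new attribute. The sketch below is an assumption-laden illustration, not a tested recipe: demo_new_ns_pid_path and demo_move_device are hypothetical helpers, the path is inferred from the attribute being added to net_class_attributes, and the write only has a chance of succeeding on a kernel carrying this RFC series, with sufficient privileges.

```c
#include <stdio.h>
#include <string.h>

/* Build /sys/class/net/<ifname>/new_ns_pid; returns -1 if buf is too small. */
static int demo_new_ns_pid_path(char *buf, size_t size, const char *ifname)
{
	int n = snprintf(buf, size, "/sys/class/net/%s/new_ns_pid", ifname);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Ask the kernel to move ifname into the network namespace of the task
 * with the given pid by writing the pid to the attribute.
 */
static int demo_move_device(const char *ifname, long pid)
{
	char path[128];
	FILE *f;
	int ok;

	if (demo_new_ns_pid_path(path, sizeof(path), ifname))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%ld\n", pid) > 0;
	if (fclose(f) != 0)	/* a rejected write may only surface at close */
		ok = 0;
	return ok ? 0 : -1;
}
```

The attribute is write-only in the patch (S_IWUSR, with show returning -EPERM), so there is nothing to read back; a caller would have to confirm the move by looking for the device inside the target namespace.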
[PATCH RFC 4/31] net: Add a network namespace tag to struct net_device
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Please note that network devices do not increase the reference count
on the network namespace. They are inside the network namespace, so
the network namespace tag is in the nature of a back pointer, and
getting and putting the network namespace is unnecessary.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/netdevice.h | 4
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4cb8b39..6a1579d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -38,6 +38,7 @@
 #include <linux/device.h>
 #include <linux/percpu.h>
 #include <linux/dmaengine.h>
+#include <linux/net_namespace_type.h>
 
 struct vlan_group;
 struct ethtool_ops;
@@ -525,6 +526,9 @@ struct net_device
 	void			(*poll_controller)(struct net_device *dev);
 #endif
 
+	/* Network namespace this network device is inside */
+	net_t			nd_net;
+
 	/* bridge stuff */
 	struct net_bridge_port	*br_port;
 
-- 
1.4.4.1.g278f
[PATCH RFC 5/31] net: Add a network namespace parameter to struct sock
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Sockets need to get a reference to their network namespace, or
possibly a simple hold if someone registers on the network namespace
notifier and will free the sockets when the namespace is going to be
destroyed.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/net/inet_timewait_sock.h | 1
 include/net/sock.h               | 3
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index f7be1ac..162c2b9 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -115,6 +115,7 @@ struct inet_timewait_sock {
 #define tw_refcnt		__tw_common.skc_refcnt
 #define tw_hash			__tw_common.skc_hash
 #define tw_prot			__tw_common.skc_prot
+#define tw_net			__tw_common.skc_net
 	volatile unsigned char	tw_substate;
 	/* 3 bits hole, try to pack */
 	unsigned char		tw_rcv_wscale;
diff --git a/include/net/sock.h b/include/net/sock.h
index 03684e7..5bf6bb5 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -105,6 +105,7 @@ struct proto;
  *	@skc_refcnt: reference count
  *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_prot: protocol handlers inside a network family
+ *	@skc_net: reference to the network namespace of this socket
  *
  *	This is the minimal network layer representation of sockets, the header
  *	for struct sock and struct inet_timewait_sock.
@@ -119,6 +120,7 @@ struct sock_common {
 	atomic_t		skc_refcnt;
 	unsigned int		skc_hash;
 	struct proto		*skc_prot;
+	net_t			skc_net;
 };
 
 /**
@@ -195,6 +197,7 @@ struct sock {
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_hash			__sk_common.skc_hash
 #define sk_prot			__sk_common.skc_prot
+#define sk_net			__sk_common.skc_net
 	unsigned char		sk_shutdown : 2,
 				sk_no_check : 2,
 				sk_userlocks : 4;
-- 
1.4.4.1.g278f
[PATCH RFC 6/31] net: Add a helper to get a reference to the initial network namespace.
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

The initial network namespace is special and we need to use it for
various things. Probably the biggest initial use will be to ensure
that code which can't cope with multiple namespaces only sees the
initial network namespace. For that reason, and because getting at
the initial network namespace is just a little clumsy, add a helper
function.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/net/net_namespace.h | 6
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 06a9ba1..9208e2e 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -27,6 +27,12 @@ struct net_namespace_head {
 	struct work_struct work;
 };
 
+/* Get the initial network namespace */
+static inline net_t init_net(void)
+{
+	return init_nsproxy.net_ns;
+}
+
 static inline net_t get_net(net_t net) { return net; }
 static inline void put_net(net_t net) {}
 static inline net_t hold_net(net_t net) { return net; }
-- 
1.4.4.1.g278f
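How the helper keeps unconverted code confined to the initial namespace can be shown with a small userspace model. The names here (demo_net_t, demo_init_net, demo_net_eq, demo_setlink) are hypothetical, and the namespace is an integer id so that the comparison is real rather than the always-true dummy net_eq().

```c
#include <errno.h>

/* Hypothetical stand-in for net_t with a real identity. */
typedef int demo_net_t;

/* The initial namespace is id 0 in this model. */
static demo_net_t demo_init_net(void) { return 0; }

static int demo_net_eq(demo_net_t a, demo_net_t b) { return a == b; }

/*
 * A handler that has not been audited for multiple namespaces:
 * refuse any caller outside the initial namespace before touching state.
 */
static int demo_setlink(demo_net_t net, int *state, int new_state)
{
	if (!demo_net_eq(net, demo_init_net()))
		return -EINVAL;

	*state = new_state;
	return 0;
}
```

Once a handler has been audited, dropping the guard and using the caller's namespace in place of the initial one is the remaining conversion step.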
[PATCH RFC 26/31] net: Make the netlink methods in rtnetlink handle multiple network namespaces
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

It turns out after a quick audit that, except for removing the
checks, there is really nothing to do here.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/core/rtnetlink.c | 21
 1 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 29a81bf..0a42258 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -409,9 +409,6 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 	int s_idx = cb->args[0];
 	struct net_device *dev;
 
-	if (!net_eq(net, init_net()))
-		return 0;
-
 	read_lock(&per_net(dev_base_lock, net));
 	for (dev=per_net(dev_base, net), idx=0; dev; dev = dev->next, idx++) {
 		if (idx < s_idx)
@@ -446,9 +443,6 @@ static int rtnl_setlink(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	struct nlattr *tb[IFLA_MAX+1];
 	char ifname[IFNAMSIZ];
 
-	if (!net_eq(net, init_net()))
-		return -EINVAL;
-
 	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy);
 	if (err < 0)
 		goto errout;
@@ -622,9 +616,6 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
 	int iw_buf_len = 0;
 	int err;
 
-	if (!net_eq(net, init_net()))
-		return -EINVAL;
-
 	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy);
 	if (err < 0)
 		return err;
@@ -673,13 +664,9 @@ errout:
 
 static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
 {
-	net_t net = skb->sk->sk_net;
 	int idx;
 	int s_idx = cb->family;
 
-	if (!net_eq(net, init_net()))
-		return 0;
-
 	if (s_idx == 0)
 		s_idx = 1;
 	for (idx=1; idx<NPROTO; idx++) {
@@ -701,6 +688,7 @@ static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
 
 void rtmsg_ifinfo(int type, struct net_device *dev, unsigned change)
 {
+	net_t net = dev->nd_net;
 	struct sk_buff *skb;
 	int err = -ENOBUFS;
 
@@ -712,10 +700,10 @@ void rtmsg_ifinfo(int type, struct net_device *dev, unsigned change)
 	/* failure implies BUG in if_nlmsg_size() */
 	BUG_ON(err < 0);
 
-	err = rtnl_notify(skb, init_net(), 0, RTNLGRP_LINK, NULL, GFP_KERNEL);
+	err = rtnl_notify(skb, net, 0, RTNLGRP_LINK, NULL, GFP_KERNEL);
 errout:
 	if (err < 0)
-		rtnl_set_sk_err(init_net(), RTNLGRP_LINK, err);
+		rtnl_set_sk_err(net, RTNLGRP_LINK, err);
 }
 
 /* Protected by RTNL sempahore. */
@@ -862,9 +850,6 @@ static int rtnetlink_event(struct notifier_block *this, unsigned long event, voi
 {
 	struct net_device *dev = ptr;
 
-	if (!net_eq(dev->nd_net, init_net()))
-		return NOTIFY_DONE;
-
 	switch (event) {
 	case NETDEV_UNREGISTER:
 		rtmsg_ifinfo(RTM_DELLINK, dev, ~0U);
-- 
1.4.4.1.g278f
[PATCH RFC 22/31] net: Add network namespace clone support.
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

This patch allows you to create a new network namespace
using sys_clone(...).

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/sched.h    |    1 +
 kernel/nsproxy.c         |   11 +++
 net/core/net_namespace.c |   38 ++
 3 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4463735..9e0f91a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -26,6 +26,7 @@
 #define CLONE_STOPPED		0x02000000	/* Start in stopped state */
 #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
 #define CLONE_NEWIPC		0x08000000	/* New ipcs */
+#define CLONE_NEWNET		0x20000000	/* New network namespace */
 
 /*
  * Scheduling policies
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 4f3c95a..7861c4c 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -20,6 +20,7 @@
 #include <linux/mnt_namespace.h>
 #include <linux/utsname.h>
 #include <linux/pid_namespace.h>
+#include <net/net_namespace.h>
 
 struct nsproxy init_nsproxy = INIT_NSPROXY(init_nsproxy);
 EXPORT_SYMBOL_GPL(init_nsproxy);
@@ -70,6 +71,7 @@ struct nsproxy *dup_namespaces(struct nsproxy *orig)
 			get_ipc_ns(ns->ipc_ns);
 		if (ns->pid_ns)
 			get_pid_ns(ns->pid_ns);
+		get_net(ns->net_ns);
 	}
 
 	return ns;
@@ -117,10 +119,18 @@ int copy_namespaces(int flags, struct task_struct *tsk)
 	if (err)
 		goto out_pid;
 
+	err = copy_net(flags, tsk);
+	if (err)
+		goto out_net;
+
 out:
 	put_nsproxy(old_ns);
 	return err;
 
+out_net:
+	if (new_ns->pid_ns)
+		put_pid_ns(new_ns->pid_ns);
+
 out_pid:
 	if (new_ns->ipc_ns)
 		put_ipc_ns(new_ns->ipc_ns);
@@ -146,5 +156,6 @@ void free_nsproxy(struct nsproxy *ns)
 		put_ipc_ns(ns->ipc_ns);
 	if (ns->pid_ns)
 		put_pid_ns(ns->pid_ns);
+	put_net(ns->net_ns);
 	kfree(ns);
 }
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 93e3879..cc56105 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -175,6 +175,44 @@ out_undo:
 	goto out;
 }
 
+int copy_net(int flags, struct task_struct *tsk)
+{
+	net_t old_net = tsk->nsproxy->net_ns;
+	net_t new_net;
+	int err;
+
+	get_net(old_net);
+
+	if (!(flags & CLONE_NEWNET))
+		return 0;
+
+	err = -EPERM;
+	if (!capable(CAP_SYS_ADMIN))
+		goto out;
+
+	err = -ENOMEM;
+	new_net = net_alloc();
+	if (null_net(new_net))
+		goto out;
+
+	mutex_lock(&net_mutex);
+	err = setup_net(new_net);
+	if (err)
+		goto out_unlock;
+
+	net_lock();
+	net_list_append(new_net);
+	net_unlock();
+
+	tsk->nsproxy->net_ns = new_net;
+
+out_unlock:
+	mutex_unlock(&net_mutex);
+out:
+	put_net(old_net);
+	return err;
+}
+
 void pernet_modcopy(void *pnetdst, const void *src, unsigned long size)
 {
 	net_t net;
-- 
1.4.4.1.g278f
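For readers following the control flow, here is a minimal user-space
sketch of the copy_net() logic above. It is not kernel code: struct
sketch_net, struct sketch_task, the is_admin field and
SKETCH_CLONE_NEWNET are stand-ins invented for illustration, and the
setup_net() step, net_mutex and the namespace list are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical flag value for the sketch; the real constant is the one in the patch. */
#define SKETCH_CLONE_NEWNET 0x1

struct sketch_net {
	int refcount;
};

struct sketch_task {
	struct sketch_net *net;	/* stands in for tsk->nsproxy->net_ns */
	int is_admin;		/* stands in for capable(CAP_SYS_ADMIN) */
};

static void sketch_get_net(struct sketch_net *net) { net->refcount++; }
static void sketch_put_net(struct sketch_net *net) { net->refcount--; }

int sketch_copy_net(int flags, struct sketch_task *tsk)
{
	struct sketch_net *old_net = tsk->net;
	struct sketch_net *new_net;
	int err;

	assert(old_net != NULL);
	sketch_get_net(old_net);

	/* No new namespace requested: keep sharing old_net and keep the reference. */
	if (!(flags & SKETCH_CLONE_NEWNET))
		return 0;

	err = -EPERM;
	if (!tsk->is_admin)
		goto out;

	err = -ENOMEM;
	new_net = calloc(1, sizeof(*new_net));
	if (!new_net)
		goto out;

	new_net->refcount = 1;
	tsk->net = new_net;
	err = 0;
out:
	/* Reached on failure and on success with a new namespace: drop the extra reference. */
	sketch_put_net(old_net);
	return err;
}
```

The single exit path mirrors the out/out_unlock labels in the patch:
every branch taken after the get drops the reference exactly once,
except the shared case, which keeps it on purpose.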
[PATCH RFC 27/31] net: Make the xfrm sysctls per network namespace.
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

In particular I moved:
/proc/sys/net/core/xfrm_aevent_etime
/proc/sys/net/core/xfrm_aevent_rseqth

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/net/xfrm.h         |    4 ++--
 net/core/sysctl_net_core.c |   37 ++---
 net/xfrm/xfrm_state.c      |    8 
 net/xfrm/xfrm_user.c       |   10 ++
 4 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index e476541..9b2e727 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -24,8 +24,8 @@
 	MODULE_ALIAS("xfrm-mode-" __stringify(family) "-" __stringify(encap))
 
 extern struct sock *xfrm_nl;
-extern u32 sysctl_xfrm_aevent_etime;
-extern u32 sysctl_xfrm_aevent_rseqth;
+DECLARE_PER_NET(u32, sysctl_xfrm_aevent_etime);
+DECLARE_PER_NET(u32, sysctl_xfrm_aevent_rseqth);
 
 extern struct mutex xfrm_cfg_mutex;
 
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 76f7a29..90f2a39 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -88,24 +88,6 @@ ctl_table core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec
 	},
-#ifdef CONFIG_XFRM
-	{
-		.ctl_name	= NET_CORE_AEVENT_ETIME,
-		.procname	= "xfrm_aevent_etime",
-		.data		= &sysctl_xfrm_aevent_etime,
-		.maxlen		= sizeof(u32),
-		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
-	},
-	{
-		.ctl_name	= NET_CORE_AEVENT_RSEQTH,
-		.procname	= "xfrm_aevent_rseqth",
-		.data		= &sysctl_xfrm_aevent_rseqth,
-		.maxlen		= sizeof(u32),
-		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
-	},
-#endif /* CONFIG_XFRM */
 #endif /* CONFIG_NET */
 	{
 		.ctl_name	= NET_CORE_SOMAXCONN,
@@ -127,6 +109,23 @@ ctl_table core_table[] = {
 };
 
 DEFINE_PER_NET(struct ctl_table, multi_core_table[]) = {
-	/* Stub for holding per network namespace sysctls */
+#ifdef CONFIG_XFRM
+	{
+		.ctl_name	= NET_CORE_AEVENT_ETIME,
+		.procname	= "xfrm_aevent_etime",
+		.data		= &__per_net_base(sysctl_xfrm_aevent_etime),
+		.maxlen		= sizeof(u32),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec
+	},
+	{
+		.ctl_name	= NET_CORE_AEVENT_RSEQTH,
+		.procname	= "xfrm_aevent_rseqth",
+		.data		= &__per_net_base(sysctl_xfrm_aevent_rseqth),
+		.maxlen		= sizeof(u32),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec
+	},
+#endif /* CONFIG_XFRM */
 	{}
 };
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index fdb08d9..3304a2d 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -27,11 +27,11 @@
 struct sock *xfrm_nl;
 EXPORT_SYMBOL(xfrm_nl);
 
-u32 sysctl_xfrm_aevent_etime = XFRM_AE_ETIME;
-EXPORT_SYMBOL(sysctl_xfrm_aevent_etime);
+DEFINE_PER_NET(u32, sysctl_xfrm_aevent_etime) = XFRM_AE_ETIME;
+EXPORT_PER_NET_SYMBOL(sysctl_xfrm_aevent_etime);
 
-u32 sysctl_xfrm_aevent_rseqth = XFRM_AE_SEQT_SIZE;
-EXPORT_SYMBOL(sysctl_xfrm_aevent_rseqth);
+DEFINE_PER_NET(u32, sysctl_xfrm_aevent_rseqth) = XFRM_AE_SEQT_SIZE;
+EXPORT_PER_NET_SYMBOL(sysctl_xfrm_aevent_rseqth);
 
 /* Each xfrm_state may be linked to two tables:
 
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 55affa7..15e962b 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -375,7 +375,8 @@ error:
 	return err;
 }
 
-static struct xfrm_state *xfrm_state_construct(struct xfrm_usersa_info *p,
+static struct xfrm_state *xfrm_state_construct(net_t net,
+					       struct xfrm_usersa_info *p,
 					       struct rtattr **xfrma,
 					       int *errp)
 {
@@ -411,9 +412,9 @@ static struct xfrm_state *xfrm_state_construct(struct xfrm_usersa_info *p,
 		goto error;
 
 	x->km.seq = p->seq;
-	x->replay_maxdiff = sysctl_xfrm_aevent_rseqth;
+	x->replay_maxdiff = per_net(sysctl_xfrm_aevent_rseqth, net);
 	/* sysctl_xfrm_aevent_etime is in 100ms units */
-	x->replay_maxage = (sysctl_xfrm_aevent_etime*HZ)/XFRM_AE_ETH_M;
+	x->replay_maxage = (per_net(sysctl_xfrm_aevent_etime, net)*HZ)/XFRM_AE_ETH_M;
 	x->preplay.bitmap = 0;
 	x->preplay.seq = x->replay.seq+x->replay_maxdiff;
 	x->preplay.oseq = x->replay.oseq +x->replay_maxdiff;
@@ -437,6 +438,7 @@ error_no_put:
 
 static int xfrm_add_sa(struct sk_buff *skb, struct nlmsghdr *nlh, struct rtattr **xfrma)
 {
+	net_t net = skb->sk->sk_net;
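To make the replay_maxage arithmetic in the last hunk concrete, here is
a small user-space sketch of a per-namespace lookup followed by the
unit conversion. SKETCH_HZ, SKETCH_AE_ETH_M, the array and the integer
namespace index are all assumptions made for illustration; the kernel
takes HZ from its configuration and XFRM_AE_ETH_M from its own headers,
and per_net() is not an array index.

```c
#include <assert.h>

/* Assumed values for illustration only. */
#define SKETCH_HZ	250	/* timer ticks per second */
#define SKETCH_AE_ETH_M	10	/* divisor turning 100ms units into seconds */
#define SKETCH_MAX_NETS	4

/* One copy of the tunable per namespace, indexed by a hypothetical namespace id. */
static unsigned int sketch_aevent_etime[SKETCH_MAX_NETS] = { 10, 10, 10, 10 };

/* Convert the namespace's timeout (in 100ms units) into timer ticks. */
unsigned long sketch_replay_maxage(int net)
{
	assert(net >= 0 && net < SKETCH_MAX_NETS);
	return ((unsigned long)sketch_aevent_etime[net] * SKETCH_HZ) / SKETCH_AE_ETH_M;
}
```

With these assumed values, a setting of 10 (one second) comes out as
250 ticks, and changing the value in one namespace leaves the others
untouched.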
[PATCH RFC 28/31] net: Make the SOMAXCONN sysctl per network namespace
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/socket.h     |    3 ++-
 net/core/sysctl_net_core.c |   16 
 net/socket.c               |    7 ---
 3 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 92cd38e..aa159ea 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -23,8 +23,9 @@ struct __kernel_sockaddr_storage {
 #include <linux/uio.h>			/* iovec support		*/
 #include <linux/types.h>		/* pid_t			*/
 #include <linux/compiler.h>		/* __user			*/
+#include <linux/net_namespace_type.h>
 
-extern int sysctl_somaxconn;
+DECLARE_PER_NET(int, sysctl_somaxconn);
 #ifdef CONFIG_PROC_FS
 struct seq_file;
 extern void socket_seq_show(struct seq_file *seq);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 90f2a39..14eca68 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -90,14 +90,6 @@ ctl_table core_table[] = {
 	},
 #endif /* CONFIG_NET */
 	{
-		.ctl_name	= NET_CORE_SOMAXCONN,
-		.procname	= "somaxconn",
-		.data		= &sysctl_somaxconn,
-		.maxlen		= sizeof(int),
-		.mode		= 0644,
-		.proc_handler	= &proc_dointvec
-	},
-	{
 		.ctl_name	= NET_CORE_BUDGET,
 		.procname	= "netdev_budget",
 		.data		= &netdev_budget,
@@ -127,5 +119,13 @@ DEFINE_PER_NET(struct ctl_table, multi_core_table[]) = {
 		.proc_handler	= &proc_dointvec
 	},
 #endif /* CONFIG_XFRM */
+	{
+		.ctl_name	= NET_CORE_SOMAXCONN,
+		.procname	= "somaxconn",
+		.data		= &__per_net_base(sysctl_somaxconn),
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec
+	},
 	{}
 };
diff --git a/net/socket.c b/net/socket.c
index 7371654..ab2aeea 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1305,7 +1305,7 @@ asmlinkage long sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen)
  *	ready for listening.
  */
 
-int sysctl_somaxconn __read_mostly = SOMAXCONN;
+DEFINE_PER_NET(int, sysctl_somaxconn)= SOMAXCONN;
 
 asmlinkage long sys_listen(int fd, int backlog)
 {
@@ -1314,8 +1314,9 @@ asmlinkage long sys_listen(int fd, int backlog)
 
 	sock = sockfd_lookup_light(fd, &err, &fput_needed);
 	if (sock) {
-		if ((unsigned)backlog > sysctl_somaxconn)
-			backlog = sysctl_somaxconn;
+		net_t net = sock->sk->sk_net;
+		if ((unsigned)backlog > per_net(sysctl_somaxconn, net))
+			backlog = per_net(sysctl_somaxconn, net);
 
 		err = security_socket_listen(sock, backlog);
 		if (!err)
-- 
1.4.4.1.g278f
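The effect of the sys_listen() change can be sketched in user space as
a clamp against a limit that belongs to the socket's namespace rather
than to the whole system. struct sketch_net and sketch_clamp_backlog()
are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a network namespace carrying its own somaxconn value. */
struct sketch_net {
	int somaxconn;
};

/* Clamp a requested listen backlog to the limit of the given namespace. */
int sketch_clamp_backlog(const struct sketch_net *net, int backlog)
{
	assert(net != NULL && net->somaxconn >= 0);
	/* The unsigned cast makes a negative backlog compare as a huge value, so it is clamped too. */
	if ((unsigned)backlog > (unsigned)net->somaxconn)
		backlog = net->somaxconn;
	return backlog;
}
```

Two sockets asking for the same backlog can therefore end up with
different effective values if they live in namespaces with different
settings.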
[PATCH RFC 11/31] net: Initialize the network namespace of network devices.
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Except for carefully selected pseudo devices all network
interfaces should start out in the initial network namespace.
Ultimately it will be register_netdev that examines what
dev->nd_net is set to and places a device in a network namespace.

This patch modifies alloc_netdev to initialize the network
namespace a device is in with the initial network namespace.
This gets it right for the vast majority of devices so their
drivers need not be modified and for those few pseudo devices
that need something different they can change this parameter
before calling register_netdevice.

The network namespace parameter on a network device is not
reference counted as the devices are inside of a network namespace
and cannot remain in that namespace past the lifetime of the
network namespace.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 drivers/net/loopback.c |    1 +
 net/core/dev.c         |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 2b739fd..22b672d 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -231,6 +231,7 @@ struct net_device loopback_dev = {
 /* Setup and register the loopback device. */
 static int __init loopback_init(void)
 {
+	loopback_dev.nd_net = init_net();
 	return register_netdev(&loopback_dev);
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 90e4c0e..a3ee150 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3192,6 +3192,7 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	dev = (struct net_device *)
 		(((long)p + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST);
 	dev->padded = (char *)dev - (char *)p;
+	dev->nd_net = init_net();
 
 	if (sizeof_priv)
 		dev->priv = netdev_priv(dev);
-- 
1.4.4.1.g278f
[PATCH RFC 24/31] net: Make rtnetlink network namespace aware
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

After this patch none of the netlink callbacks support anything
except the initial network namespace, but the rtnetlink
infrastructure now handles multiple network namespaces.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/rtnetlink.h |    8 ++--
 net/bridge/br_netlink.c   |    4 +-
 net/core/fib_rules.c      |    4 +-
 net/core/neighbour.c      |    4 +-
 net/core/rtnetlink.c      |   74 +++-
 net/core/wireless.c       |    5 ++-
 net/decnet/dn_dev.c       |    4 +-
 net/decnet/dn_route.c     |    2 +-
 net/decnet/dn_table.c     |    4 +-
 net/ipv4/devinet.c        |    4 +-
 net/ipv4/fib_semantics.c  |    4 +-
 net/ipv4/ipmr.c           |    4 +-
 net/ipv4/route.c          |    2 +-
 net/ipv6/addrconf.c       |   14 
 net/ipv6/route.c          |    6 ++--
 net/sched/cls_api.c       |    2 +-
 net/sched/sch_api.c       |    4 +-
 17 files changed, 98 insertions(+), 51 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 4a629ea..6c8281d 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -581,11 +581,11 @@ struct rtnetlink_link
 };
 
 extern struct rtnetlink_link * rtnetlink_links[NPROTO];
-extern int rtnetlink_send(struct sk_buff *skb, u32 pid, u32 group, int echo);
-extern int rtnl_unicast(struct sk_buff *skb, u32 pid);
-extern int rtnl_notify(struct sk_buff *skb, u32 pid, u32 group,
+extern int rtnetlink_send(struct sk_buff *skb, net_t net, u32 pid, u32 group, int echo);
+extern int rtnl_unicast(struct sk_buff *skb, net_t net, u32 pid);
+extern int rtnl_notify(struct sk_buff *skb, net_t net, u32 pid, u32 group,
 		       struct nlmsghdr *nlh, gfp_t flags);
-extern void rtnl_set_sk_err(u32 group, int error);
+extern void rtnl_set_sk_err(net_t net, u32 group, int error);
 extern int rtnetlink_put_metrics(struct sk_buff *skb, u32 *metrics);
 extern int rtnl_put_cacheinfo(struct sk_buff *skb, struct dst_entry *dst,
 			      u32 id, u32 ts, u32 tsage, long expires,
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 85165a1..372fb18 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -94,10 +94,10 @@ void br_ifinfo_notify(int event, struct net_bridge_port *port)
 	/* failure implies BUG in br_nlmsg_size() */
 	BUG_ON(err < 0);
 
-	err = rtnl_notify(skb, 0, RTNLGRP_LINK, NULL, GFP_ATOMIC);
+	err = rtnl_notify(skb, init_net(), 0, RTNLGRP_LINK, NULL, GFP_ATOMIC);
 errout:
 	if (err < 0)
-		rtnl_set_sk_err(RTNLGRP_LINK, err);
+		rtnl_set_sk_err(init_net(), RTNLGRP_LINK, err);
 }
 
 /*
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 00b4148..5f65973 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -418,10 +418,10 @@ static void notify_rule_change(int event, struct fib_rule *rule,
 	/* failure implies BUG in fib_rule_nlmsg_size() */
 	BUG_ON(err < 0);
 
-	err = rtnl_notify(skb, pid, ops->nlgroup, nlh, GFP_KERNEL);
+	err = rtnl_notify(skb, init_net(), pid, ops->nlgroup, nlh, GFP_KERNEL);
 errout:
 	if (err < 0)
-		rtnl_set_sk_err(ops->nlgroup, err);
+		rtnl_set_sk_err(init_net(), ops->nlgroup, err);
 }
 
 static void attach_rules(struct list_head *rules, struct net_device *dev)
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index d89c6fe..6f61207 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2453,10 +2453,10 @@ static void __neigh_notify(struct neighbour *n, int type, int flags)
 	/* failure implies BUG in neigh_nlmsg_size() */
 	BUG_ON(err < 0);
 
-	err = rtnl_notify(skb, 0, RTNLGRP_NEIGH, NULL, GFP_ATOMIC);
+	err = rtnl_notify(skb, init_net(), 0, RTNLGRP_NEIGH, NULL, GFP_ATOMIC);
 errout:
 	if (err < 0)
-		rtnl_set_sk_err(RTNLGRP_NEIGH, err);
+		rtnl_set_sk_err(init_net(), RTNLGRP_NEIGH, err);
 }
 
 void neigh_app_ns(struct neighbour *n)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 9be586c..29a81bf 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -58,7 +58,7 @@
 #endif	/* CONFIG_NET_WIRELESS_RTNETLINK */
 
 static DEFINE_MUTEX(rtnl_mutex);
-static struct sock *rtnl;
+static DEFINE_PER_NET(struct sock *, rtnl);
 
 void rtnl_lock(void)
 {
@@ -72,9 +72,17 @@ void __rtnl_unlock(void)
 
 void rtnl_unlock(void)
 {
+	net_t net;
 	mutex_unlock(&rtnl_mutex);
-	if (rtnl && rtnl->sk_receive_queue.qlen)
-		rtnl->sk_data_ready(rtnl, 0);
+
+	net_lock();
+	for_each_net(net) {
+		struct sock *rtnl = per_net(rtnl, net);
+		if (rtnl && rtnl->sk_receive_queue.qlen)
+			rtnl->sk_data_ready(rtnl, 0);
+	}
+	net_unlock();
+
 	netdev_run_todo();
 }
 
@@ -151,8 +159,9 @@ size_t rtattr_strlcpy(char *dest, const struct rtattr *rta, size_t
[PATCH RFC 17/31] net: Factor out __dev_alloc_name from dev_alloc_name
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

When forcibly changing the network namespace of a device
I need something that can generate a name for the device
in the new namespace without overwriting the old name.

__dev_alloc_name provides me that functionality.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/core/dev.c |   44 +---
 1 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 32fe905..fc0d2af 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -655,9 +655,10 @@ int dev_valid_name(const char *name)
 }
 
 /**
- *	dev_alloc_name - allocate a name for a device
- *	@dev: device
+ *	__dev_alloc_name - allocate a name for a device
+ *	@net: network namespace to allocate the device name in
  *	@name: name format string
+ *	@buf:  scratch buffer and result name string
  *
  *	Passed a format string - eg "lt%d" it will try and find a suitable
  *	id. It scans list of devices to build up a free map, then chooses
@@ -668,18 +669,13 @@ int dev_valid_name(const char *name)
  *	Returns the number of the unit assigned or a negative errno code.
  */
 
-int dev_alloc_name(struct net_device *dev, const char *name)
+static int __dev_alloc_name(net_t net, const char *name, char buf[IFNAMSIZ])
 {
 	int i = 0;
-	char buf[IFNAMSIZ];
 	const char *p;
 	const int max_netdevices = 8*PAGE_SIZE;
 	long *inuse;
 	struct net_device *d;
-	net_t net;
-
-	BUG_ON(null_net(dev->nd_net));
-	net = dev->nd_net;
 
 	p = strnchr(name, IFNAMSIZ-1, '%');
 	if (p) {
@@ -713,10 +709,8 @@ int dev_alloc_name(struct net_device *dev, const char *name)
 	}
 
 	snprintf(buf, sizeof(buf), name, i);
-	if (!__dev_get_by_name(net, buf)) {
-		strlcpy(dev->name, buf, IFNAMSIZ);
+	if (!__dev_get_by_name(net, buf))
 		return i;
-	}
 
 	/* It is possible to run out of possible slots
 	 * when the name is long and there isn't enough space left
@@ -725,6 +719,34 @@ int dev_alloc_name(struct net_device *dev, const char *name)
 	return -ENFILE;
 }
 
+/**
+ *	dev_alloc_name - allocate a name for a device
+ *	@dev: device
+ *	@name: name format string
+ *
+ *	Passed a format string - eg "lt%d" it will try and find a suitable
+ *	id. It scans list of devices to build up a free map, then chooses
+ *	the first empty slot. The caller must hold the dev_base or rtnl lock
+ *	while allocating the name and adding the device in order to avoid
+ *	duplicates.
+ *	Limited to bits_per_byte * page size devices (ie 32K on most platforms).
+ *	Returns the number of the unit assigned or a negative errno code.
+ */
+
+int dev_alloc_name(struct net_device *dev, const char *name)
+{
+	char buf[IFNAMSIZ];
+	net_t net;
+	int ret;
+
+	BUG_ON(null_net(dev->nd_net));
+	net = dev->nd_net;
+	ret = __dev_alloc_name(net, name, buf);
+	if (ret >= 0)
+		strlcpy(dev->name, buf, IFNAMSIZ);
+	return ret;
+}
+
 
 /**
  *	dev_change_name - change name of a device
-- 
1.4.4.1.g278f
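The split above separates "find a free name in this namespace" from
"store it in the device". A user-space sketch of the first half might
look as follows, assuming the existing names are handed in as a plain
array. sketch_alloc_name(), SKETCH_IFNAMSIZ and SKETCH_MAX_UNITS are
invented for the illustration, and a byte array stands in for the
kernel's page-sized bitmap.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_IFNAMSIZ		16
#define SKETCH_MAX_UNITS	64

/*
 * Pick the lowest unused unit number for a format such as "eth%d",
 * given the names already present in one namespace.  The result is
 * written to buf; the return value is the unit, or -1 if none is free.
 * The length is passed as a constant because sizeof on an array
 * parameter would only give the size of a pointer.
 */
int sketch_alloc_name(const char *const *names, size_t count,
		      const char *fmt, char buf[SKETCH_IFNAMSIZ])
{
	unsigned char inuse[SKETCH_MAX_UNITS] = { 0 };
	size_t n;
	int i;

	assert(fmt != NULL && strstr(fmt, "%d") != NULL);

	/* Build the in-use map from names that match the format exactly. */
	for (n = 0; n < count; n++) {
		if (sscanf(names[n], fmt, &i) != 1)
			continue;
		if (i < 0 || i >= SKETCH_MAX_UNITS)
			continue;
		/* Format the number back to reject near-misses such as "eth1x". */
		snprintf(buf, SKETCH_IFNAMSIZ, fmt, i);
		if (strcmp(buf, names[n]) == 0)
			inuse[i] = 1;
	}

	/* Take the first empty slot. */
	for (i = 0; i < SKETCH_MAX_UNITS; i++) {
		if (!inuse[i]) {
			snprintf(buf, SKETCH_IFNAMSIZ, fmt, i);
			return i;
		}
	}
	return -1;
}
```

Because the result lands in a caller-supplied buffer, a caller moving a
device can probe for a name in the destination namespace without
touching the name the device currently has.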
[PATCH RFC 20/31] net: Implement CONFIG_NET_NS
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Add the config option to enable multiple network namespaces.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/Kconfig |    7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/net/Kconfig b/net/Kconfig
index 7dfc949..4671398 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -27,6 +27,13 @@ if NET
 
 menu "Networking options"
 
+config NET_NS
+	bool "Network namespace support"
+	depends on EXPERIMENTAL
+	help
+	  Support what appear to user space as multiple instances of the
+	  network stack.
+
 config NETDEBUG
 	bool "Network packet debugging"
 	help
-- 
1.4.4.1.g278f
[PATCH RFC 25/31] net: Make wireless netlink event generation handle multiple network namespaces
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/core/wireless.c |   15 ++-
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/net/core/wireless.c b/net/core/wireless.c
index 9036359..d534617 100644
--- a/net/core/wireless.c
+++ b/net/core/wireless.c
@@ -1934,8 +1934,13 @@ static void wireless_nlevent_process(unsigned long data)
 {
 	struct sk_buff *skb;
 
-	while ((skb = skb_dequeue(&wireless_nlevent_queue)))
-		rtnl_notify(skb, init_net(), 0, RTNLGRP_LINK, NULL, GFP_ATOMIC);
+	while ((skb = skb_dequeue(&wireless_nlevent_queue))) {
+		struct net_device *dev = skb->dev;
+		net_t net = dev->nd_net;
+		skb->dev = NULL;
+		rtnl_notify(skb, net, 0, RTNLGRP_LINK, NULL, GFP_ATOMIC);
+		dev_put(dev);
+	}
 }
 
 static DECLARE_TASKLET(wireless_nlevent_tasklet, wireless_nlevent_process, 0);
@@ -1992,9 +1997,6 @@ static inline void rtmsg_iwinfo(struct net_device *	dev,
 	struct sk_buff *skb;
 	int size = NLMSG_GOODSIZE;
 
-	if (!net_eq(dev->nd_net, init_net()))
-		return;
-
 	skb = alloc_skb(size, GFP_ATOMIC);
 	if (!skb)
 		return;
@@ -2004,6 +2006,9 @@ static inline void rtmsg_iwinfo(struct net_device *	dev,
 		kfree_skb(skb);
 		return;
 	}
+	/* Remember the device until we are in process context */
+	dev_hold(dev);
+	skb->dev = dev;
 	NETLINK_CB(skb).dst_group = RTNLGRP_LINK;
 	skb_queue_tail(&wireless_nlevent_queue, skb);
 	tasklet_schedule(&wireless_nlevent_tasklet);
-- 
1.4.4.1.g278f
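The hold-then-put pairing in this patch can be modelled in user space
as below: the producer pins the device when it queues the event, and
the consumer reads the namespace from the device before dropping the
pin. All names here (struct sketch_dev, struct sketch_event and the two
functions) are hypothetical, and the integer net_id merely stands in
for dev->nd_net.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_dev {
	int refcnt;
	int net_id;		/* stands in for dev->nd_net */
};

struct sketch_event {
	struct sketch_dev *dev;	/* stands in for skb->dev */
	int delivered_net;	/* namespace the event went to, -1 while pending */
};

/* Producer side: pin the device so it outlives the queued event. */
void sketch_queue_event(struct sketch_event *ev, struct sketch_dev *dev)
{
	assert(ev != NULL && dev != NULL);
	dev->refcnt++;		/* dev_hold() in the patch */
	ev->dev = dev;
	ev->delivered_net = -1;
}

/* Consumer side: read the namespace, detach the device, then unpin it. */
void sketch_process_event(struct sketch_event *ev)
{
	struct sketch_dev *dev = ev->dev;

	assert(dev != NULL);
	ev->delivered_net = dev->net_id;
	ev->dev = NULL;
	dev->refcnt--;		/* dev_put() in the patch */
}
```

The point of the pin is that the namespace is looked up at delivery
time, so the device must still exist when the deferred handler runs.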
[PATCH RFC 29/31] net: Make AF_PACKET handle multiple network namespaces
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

This is done by making all of the relevant global variables
per network namespace.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 net/packet/af_packet.c |  125 +++-
 1 files changed, 81 insertions(+), 44 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 4ac9f9f..c772491 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -152,8 +152,8 @@ dev->hard_header == NULL (ll header is added by device, we cannot control it)
  */
 
 /* List of all packet sockets. */
-static HLIST_HEAD(packet_sklist);
-static DEFINE_RWLOCK(packet_sklist_lock);
+static DEFINE_PER_NET(rwlock_t, packet_sklist_lock);
+static DEFINE_PER_NET(struct hlist_head, packet_sklist);
 
 static atomic_t packet_socks_nr;
 
@@ -264,9 +264,6 @@ static int packet_rcv_spkt(struct sk_buff *skb, struct packet_type *pt, struct n
 	struct sock *sk;
 	struct sockaddr_pkt *spkt;
 
-	if (!net_eq(dev->nd_net, init_net()))
-		goto out;
-
 	/*
 	 *	When we registered the protocol we saved the socket in the data
 	 *	field for just this event.
@@ -288,6 +285,9 @@ static int packet_rcv_spkt(struct sk_buff *skb, struct packet_type *pt, struct n
 	if (skb->pkt_type == PACKET_LOOPBACK)
 		goto out;
 
+	if (!net_eq(dev->nd_net, sk->sk_net))
+		goto out;
+
 	if ((skb = skb_share_check(skb, GFP_ATOMIC)) == NULL)
 		goto oom;
 
@@ -359,7 +359,7 @@ static int packet_sendmsg_spkt(struct kiocb *iocb, struct socket *sock,
 	 */
 
 	saddr->spkt_device[13] = 0;
-	dev = dev_get_by_name(init_net(), saddr->spkt_device);
+	dev = dev_get_by_name(sk->sk_net, saddr->spkt_device);
 	err = -ENODEV;
 	if (dev == NULL)
 		goto out_unlock;
@@ -475,15 +475,15 @@ static int packet_rcv(struct sk_buff *skb, struct packet_type *pt, struct net_de
 	int skb_len = skb->len;
 	unsigned snaplen;
 
-	if (!net_eq(dev->nd_net, init_net()))
-		goto drop;
-
 	if (skb->pkt_type == PACKET_LOOPBACK)
 		goto drop;
 
 	sk = pt->af_packet_priv;
 	po = pkt_sk(sk);
 
+	if (!net_eq(dev->nd_net, sk->sk_net))
+		goto drop;
+
 	skb->dev = dev;
 
 	if (dev->hard_header) {
@@ -583,15 +583,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct packet_type *pt, struct net_d
 	unsigned short macoff, netoff;
 	struct sk_buff *copy_skb = NULL;
 
-	if (!net_eq(dev->nd_net, init_net()))
-		goto drop;
-
 	if (skb->pkt_type == PACKET_LOOPBACK)
 		goto drop;
 
 	sk = pt->af_packet_priv;
 	po = pkt_sk(sk);
 
+	if (!net_eq(dev->nd_net, sk->sk_net))
+		goto drop;
+
 	if (dev->hard_header) {
 		if (sk->sk_type != SOCK_DGRAM)
 			skb_push(skb, skb->data - skb->mac.raw);
@@ -744,7 +744,7 @@ static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
 	}
 
 
-	dev = dev_get_by_index(init_net(), ifindex);
+	dev = dev_get_by_index(sk->sk_net, ifindex);
 	err = -ENXIO;
 	if (dev == NULL)
 		goto out_unlock;
@@ -817,15 +817,17 @@ static int packet_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
 	struct packet_sock *po;
+	net_t net;
 
 	if (!sk)
 		return 0;
 
+	net = sk->sk_net;
 	po = pkt_sk(sk);
 
-	write_lock_bh(&packet_sklist_lock);
+	write_lock_bh(&per_net(packet_sklist_lock, net));
 	sk_del_node_init(sk);
-	write_unlock_bh(&packet_sklist_lock);
+	write_unlock_bh(&per_net(packet_sklist_lock, net));
 
 	/*
 	 *	Unhook packet receive handler.
@@ -943,7 +945,7 @@ static int packet_bind_spkt(struct socket *sock, struct sockaddr *uaddr, int add
 		return -EINVAL;
 	strlcpy(name,uaddr->sa_data,sizeof(name));
 
-	dev = dev_get_by_name(init_net(), name);
+	dev = dev_get_by_name(sk->sk_net, name);
 	if (dev) {
 		err = packet_do_bind(sk, dev, pkt_sk(sk)->num);
 		dev_put(dev);
@@ -971,7 +973,7 @@ static int packet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len
 
 	if (sll->sll_ifindex) {
 		err = -ENODEV;
-		dev = dev_get_by_index(init_net(), sll->sll_ifindex);
+		dev = dev_get_by_index(sk->sk_net, sll->sll_ifindex);
 		if (dev == NULL)
 			goto out;
 	}
@@ -1000,9 +1002,6 @@ static int packet_create(net_t net, struct socket *sock, int protocol)
 	__be16 proto = (__force __be16)protocol; /* weird, but documented */
 	int err;
 
-	if (!net_eq(net, init_net()))
-		return -EAFNOSUPPORT;
-
 	if (!capable(CAP_NET_RAW))
 		return -EPERM;
 	if (sock->type != SOCK_DGRAM && sock->type !=
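The receive-path change replaces "only the initial namespace" with
"only the socket's own namespace". A minimal user-space sketch of that
delivery test, with hypothetical struct names and an integer id
standing in for net_t and net_eq():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_dev  { int net_id; };	/* stands in for dev->nd_net */
struct sketch_sock { int net_id; };	/* stands in for sk->sk_net */

/* Decide whether a packet seen on dev may be delivered to the packet socket sk. */
bool sketch_should_deliver(const struct sketch_dev *dev,
			   const struct sketch_sock *sk, bool loopback)
{
	assert(dev != NULL && sk != NULL);
	if (loopback)			/* PACKET_LOOPBACK frames are dropped first */
		return false;
	return dev->net_id == sk->net_id;
}
```

In the patch the check has to move below the point where sk is looked
up, since the comparison now needs the socket.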
[PATCH RFC 9/31] net: Implement the per network namespace sysctl infrastructure
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

The user interface is register_net_sysctl_table and
unregister_net_sysctl_table, very much like the current
interface except that there is a network namespace parameter.

With this, any sysctl in the net_root_table and its subdirectories
that is registered with register_net_sysctl_table shows up only to
tasks in the same network namespace. All other sysctls continue to
be globally visible.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/sysctl.h     |    7 
 include/net/sock.h         |    1 +
 kernel/sysctl.c            |   71 ++-
 net/core/sysctl_net_core.c |    5 +++
 net/sysctl_net.c           |   20 
 5 files changed, 102 insertions(+), 2 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 8eba2d2..286e723 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -1044,6 +1044,13 @@ struct ctl_table_header * register_sysctl_table(ctl_table * table);
 
 void unregister_sysctl_table(struct ctl_table_header * table);
 
+#ifdef CONFIG_NET
+#include <linux/net_namespace_type.h>
+extern struct ctl_table_header *register_net_sysctl_table(net_t net, struct ctl_table *table);
+extern void unregister_net_sysctl_table(struct ctl_table_header *header);
+DECLARE_PER_NET(struct ctl_table, net_root_table[]);
+#endif
+
 #else /* __KERNEL__ */
 
 #endif /* __KERNEL__ */
diff --git a/include/net/sock.h b/include/net/sock.h
index 5bf6bb5..01a2781 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1414,6 +1414,7 @@ extern void sk_init(void);
 
 #ifdef CONFIG_SYSCTL
 extern struct ctl_table core_table[];
+DECLARE_PER_NET(struct ctl_table, multi_core_table[]);
 #endif
 
 extern int sysctl_optmem_max;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7da313e..ae6a424 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -45,6 +45,7 @@
 #include <linux/syscalls.h>
 #include <linux/nfs_fs.h>
 #include <linux/acpi.h>
+#include <net/net_namespace.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -135,6 +136,10 @@ static int proc_do_cad_pid(ctl_table *table, int write, struct file *filp,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
 #endif
 
+#ifdef CONFIG_NET
+static DEFINE_PER_NET(struct ctl_table_header, net_table_header);
+#endif
+
 static ctl_table root_table[];
 static struct ctl_table_header root_table_header =
 	{ root_table, LIST_HEAD_INIT(root_table_header.ctl_entry) };
@@ -1059,6 +1064,7 @@ struct ctl_table_header *sysctl_head_next(struct ctl_table_header *prev)
 {
 	struct ctl_table_header *head;
 	struct list_head *tmp;
+	net_t net = current->nsproxy->net_ns;
 	spin_lock(&sysctl_lock);
 	if (prev) {
 		tmp = &prev->ctl_entry;
@@ -1076,6 +1082,10 @@ struct ctl_table_header *sysctl_head_next(struct ctl_table_header *prev)
 	next:
 		tmp = tmp->next;
 		if (tmp == &root_table_header.ctl_entry)
+#ifdef CONFIG_NET
+			tmp = &per_net(net_table_header, net).ctl_entry;
+		else if (tmp == &per_net(net_table_header, net).ctl_entry)
+#endif
 			break;
 	}
 	spin_unlock(&sysctl_lock);
@@ -1290,7 +1300,8 @@ int do_sysctl_strategy (ctl_table *table,
  * This routine returns %NULL on a failure to register, and a pointer
  * to the table header on success.
  */
-struct ctl_table_header *register_sysctl_table(ctl_table * table)
+static struct ctl_table_header *__register_sysctl_table(
+	struct ctl_table_header *root, ctl_table * table)
 {
 	struct ctl_table_header *tmp;
 	tmp = kmalloc(sizeof(struct ctl_table_header), GFP_KERNEL);
@@ -1301,11 +1312,16 @@ struct ctl_table_header *register_sysctl_table(ctl_table * table)
 	tmp->used = 0;
 	tmp->unregistering = NULL;
 	spin_lock(&sysctl_lock);
-	list_add_tail(&tmp->ctl_entry, &root_table_header.ctl_entry);
+	list_add_tail(&tmp->ctl_entry, &root->ctl_entry);
 	spin_unlock(&sysctl_lock);
 	return tmp;
 }
 
+struct ctl_table_header *register_sysctl_table(ctl_table *table)
+{
+	return __register_sysctl_table(&root_table_header, table);
+}
+
 /**
  * unregister_sysctl_table - unregister a sysctl table hierarchy
  * @header: the header returned from register_sysctl_table
@@ -1322,6 +1338,57 @@ void unregister_sysctl_table(struct ctl_table_header * header)
 	kfree(header);
 }
 
+#ifdef CONFIG_NET
+
+static void *fixup_per_net_addr(net_t net, void *addr)
+{
+	char *ptr = addr;
+	if ((ptr >= __per_net_start) && (ptr < __per_net_end))
+		ptr += __per_net_offset(net);
+	return ptr;
+}
+
+static void sysctl_net_table_fixup(net_t net, struct ctl_table *table)
+{
+	for (; table->ctl_name || table->procname; table++) {
+		table->child = fixup_per_net_addr(net, table->child);
+		table->data  = fixup_per_net_addr(net, table->data);
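The address fixup at the end of this (truncated) hunk relocates any
pointer that falls inside the per-namespace template region to the same
offset in one namespace's private copy, and leaves every other pointer
alone. A user-space sketch of that idea, assuming a fixed-size template
array; sketch_template and sketch_fixup_addr() are hypothetical, and
integer comparison is used so that unrelated pointers can be compared
portably.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Template region holding the initial image of every per-namespace variable. */
static char sketch_template[64];

/*
 * If addr points into the template region, return the matching slot in
 * net_copy (one namespace's private copy); otherwise return addr unchanged.
 */
void *sketch_fixup_addr(char *net_copy, void *addr)
{
	uintptr_t p = (uintptr_t)addr;
	uintptr_t start = (uintptr_t)sketch_template;
	uintptr_t end = start + sizeof(sketch_template);

	assert(net_copy != NULL);
	if (p >= start && p < end)
		return net_copy + (p - start);
	return addr;
}
```

Applied to a sysctl table, this is what lets one statically initialised
table description serve every namespace: only the data and child
pointers that refer to per-namespace storage get rewritten.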
[PATCH RFC 18/31] net: Implement network device movement between namespaces
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted This patch introduces NETIF_F_NETNS_LOCAL a flag to indicate a network device is local to a single network namespace and should never be moved. Useful for pseudo devices that we need an instance in each network namespace (like the loopback device) and for any device we find that cannot handle multiple network namespaces so we may trap them in the initial network namespace. This patch introduces the function dev_change_net_namespace a function used to move a network device from one network namespace to another. To the network device nothing special appears to happen, to the components of the network stack it appears as if the network device was unregistered in the network namespace it is in, and a new device was registered in the network namespace the device was moved to. This patch sets up a namespace device destructor that upon the exit of a network namespace moves all of the movable network devices to the initial network namespace so they are not lost. Signed-off-by: Eric W. Biederman [EMAIL PROTECTED] --- drivers/net/loopback.c|3 +- include/linux/netdevice.h |3 + net/core/dev.c| 222 +++- 3 files changed, 201 insertions(+), 27 deletions(-) diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c index e9abf3f..7d15de0 100644 --- a/drivers/net/loopback.c +++ b/drivers/net/loopback.c @@ -225,7 +225,8 @@ DEFINE_PER_NET(struct net_device, loopback_dev) = { | NETIF_F_TSO #endif | NETIF_F_NO_CSUM | NETIF_F_HIGHDMA - | NETIF_F_LLTX, + | NETIF_F_LLTX + | NETIF_F_NETNS_LOCAL, .ethtool_ops= loopback_ethtool_ops, }; diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 0b4a4dc..3fcaf60 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -324,6 +324,7 @@ struct net_device #define NETIF_F_VLAN_CHALLENGED1024/* Device cannot handle VLAN packets */ #define NETIF_F_GSO2048/* Enable software GSO. 
*/ #define NETIF_F_LLTX 4096/* LockLess TX */ +#define NETIF_F_NETNS_LOCAL8192/* Does not change network namespaces */ /* Segmentation offload features */ #define NETIF_F_GSO_SHIFT 16 @@ -710,6 +711,8 @@ extern int dev_ethtool(net_t net, struct ifreq *); extern unsigneddev_get_flags(const struct net_device *); extern int dev_change_flags(struct net_device *, unsigned); extern int dev_change_name(struct net_device *, char *); +extern int dev_change_net_namespace(struct net_device *, net_t, +const char *); extern int dev_set_mtu(struct net_device *, int); extern int dev_set_mac_address(struct net_device *, struct sockaddr *); diff --git a/net/core/dev.c b/net/core/dev.c index fc0d2af..52994e4 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -198,6 +198,52 @@ static inline struct hlist_head *dev_index_hash(net_t net, int ifindex) return per_net(dev_index_head, net)[ifindex ((1NETDEV_HASHBITS)-1)]; } +/* Device list insertion */ +static int list_netdevice(struct net_device *dev) +{ + net_t net = dev-nd_net; + + ASSERT_RTNL(); + + dev-next = NULL; + write_lock_bh(per_net(dev_base_lock, net)); + *per_net(dev_tail, net) = dev; + per_net(dev_tail, net) = dev-next; + hlist_add_head(dev-name_hlist, dev_name_hash(net, dev-name)); + hlist_add_head(dev-index_hlist, dev_index_hash(net, dev-ifindex)); + write_unlock_bh(per_net(dev_base_lock, net)); + return 0; +} + +/* Device list removal */ +static int unlist_netdevice(struct net_device *dev) +{ + struct net_device *d, **dp; + net_t net = dev-nd_net; + + ASSERT_RTNL(); + + /* Unlink dev from the device chain */ + for (dp = per_net(dev_base, net); (d = *dp) != NULL; dp = d-next) { + if (d == dev) { + write_lock_bh(per_net(dev_base_lock, net)); + hlist_del(dev-name_hlist); + hlist_del(dev-index_hlist); + if (per_net(dev_tail, net) == dev-next) + per_net(dev_tail, net) = dp; + *dp = d-next; + write_unlock_bh(per_net(dev_base_lock, net)); + break; + } + } + if (!d) { + printk(KERN_ERR unlist net_device: '%s' not found\n, + 
dev-name); + return -ENODEV; + } + return 0; +} + /* * Our notifier list */ @@ -3054,15 +3100,9 @@ int register_netdevice(struct net_device *dev)
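For illustration, a minimal userspace sketch of the move semantics the description above gives for dev_change_net_namespace and NETIF_F_NETNS_LOCAL. Only the flag value comes from the patch; struct dev_sketch, the function names, the use of plain integers as namespaces and the -EINVAL return are assumptions, since the posted diff is cut off before the real implementation.

```c
#include <errno.h>
#include <stddef.h>

/* Flag value taken from the patch above. */
#define NETIF_F_NETNS_LOCAL 8192

/* Hypothetical stand-in for struct net_device; namespaces are plain ints. */
struct dev_sketch {
	unsigned int features;
	int net;
};

/*
 * Move a device to another namespace unless it is pinned.
 * The -EINVAL return is an assumption; the posted patch is truncated
 * before its real error path.
 */
int dev_change_net_sketch(struct dev_sketch *dev, int net)
{
	if (dev->features & NETIF_F_NETNS_LOCAL)
		return -EINVAL;
	dev->net = net;
	return 0;
}

/*
 * On namespace exit, push every movable device found in the dying
 * namespace back to the initial one. Returns the number of devices moved.
 */
size_t net_exit_sketch(struct dev_sketch *devs, size_t n, int dying, int init_net)
{
	size_t i, moved = 0;

	for (i = 0; i < n; i++) {
		if (devs[i].net != dying)
			continue;
		if (dev_change_net_sketch(&devs[i], init_net) == 0)
			moved++;
	}
	return moved;
}
```

Pinned devices such as the loopback device stay behind in this sketch; what the real destructor does with them is not visible in the truncated patch.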
[PATCH RFC 2/31] net: Implement a place holder network namespace
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Many of the changes to the network stack will simply be adding a network namespace parameter to function calls or moving variables from globals to being per network namespace. When those variables have initializers that cannot statically compute the proper value, a function that runs at the creation and destruction of network namespaces will need to be registered, and the logic will need to be changed to accommodate that. Adding unconditional support for these functions ensures that even when everything else is compiled out the modified network stack logic will continue to run correctly.

This patch adds struct pernet_operations, which has an init (constructor) and an exit (destructor) method. When registered, the init method is called for every existing namespace, and when unregistered, the exit method is called for every existing namespace. When a new network namespace is created, all of the init methods are called in the order in which they were registered, and when a network namespace is destroyed, the exit methods are called in the reverse order in which they were registered.

There are two distinct types of pernet_operations recognized: subsys and device. At creation all subsys init functions are called before device init functions, and at destruction all device exit functions are called before subsys exit functions. For other ordering, the preservation of the order of registration combined with the various kinds of kernel initcalls should be sufficient.

Signed-off-by: Eric W.
Biederman [EMAIL PROTECTED] --- include/net/net_namespace.h | 62 ++ net/core/Makefile |2 +- net/core/net_namespace.c| 149 +++ 3 files changed, 212 insertions(+), 1 deletions(-) diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h new file mode 100644 index 000..06a9ba1 --- /dev/null +++ b/include/net/net_namespace.h @@ -0,0 +1,62 @@ +/* + * Operations on the network namespace + */ +#ifndef __NET_NET_NAMESPACE_H +#define __NET_NET_NAMESPACE_H + +#include asm/atomic.h +#include linux/workqueue.h +#include linux/nsproxy.h +#include linux/net_namespace_type.h + +/* How many bytes in each network namespace should we allocate + * for use by modules when they are loaded. + */ +#ifdef CONFIG_MODULES +# define PER_NET_MODULE_RESERVE 2048 +#else +# define PER_NET_MODULE_RESERVE 0 +#endif + +struct net_namespace_head { + atomic_t count; /* To decided when the network namespace +* should go +*/ + atomic_t use_count; /* For references we destroy on demand */ + struct list_head list; + struct work_struct work; +}; + +static inline net_t get_net(net_t net) { return net; } +static inline void put_net(net_t net) {} +static inline net_t hold_net(net_t net) { return net; } +static inline void release_net(net_t net) {} + +#define __per_net_start((char *)0) +#define __per_net_end ((char *)0) + +static inline int copy_net(int flags, struct task_struct *tsk) { return 0; } + +/* Don't let the list of network namespaces change */ +static inline void net_lock(void) {} +static inline void net_unlock(void) {} + +#define for_each_net(VAR) if (1) + +extern net_t net_template; + +#define NET_CREATE 0x0001 /* A network namespace has been created */ +#define NET_DESTROY0x0002 /* A network namespace is being destroyed */ + +struct pernet_operations { + struct list_head list; + int (*init)(net_t net); + void (*exit)(net_t net); +}; + +extern int register_pernet_subsys(struct pernet_operations *); +extern void unregister_pernet_subsys(struct pernet_operations *); +extern int 
register_pernet_device(struct pernet_operations *); +extern void unregister_pernet_device(struct pernet_operations *); + +#endif /* __NET_NET_NAMESPACE_H */ diff --git a/net/core/Makefile b/net/core/Makefile index 73272d5..554dbdc 100644 --- a/net/core/Makefile +++ b/net/core/Makefile @@ -3,7 +3,7 @@ # obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o stream.o scm.o \ -gen_stats.o gen_estimator.o +gen_stats.o gen_estimator.o net_namespace.o obj-$(CONFIG_SYSCTL) += sysctl_net_core.o diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c new file mode 100644 index 000..4ae266d --- /dev/null +++ b/net/core/net_namespace.c @@ -0,0 +1,149 @@ +#include linux/rtnetlink.h +#include net/net_namespace.h + +/* + * Our network namespace constructor/destructor lists + */ + +static LIST_HEAD(pernet_list); +static struct list_head *first_device = pernet_list; +static DEFINE_MUTEX(net_mutex); +net_t net_template; + +static int register_pernet_operations(struct list_head *list, + struct pernet_operations *ops) +{ + net_t net, undo_net; + int error; + + error = 0; + list_add_tail(ops-list, list); +
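The ordering rule in the description above (init in registration order, exit in reverse) can be sketched in userspace as follows. This is only an illustration under stated assumptions: every name ending in _sketch, the fixed-size array, the trace buffer and the two example operations are hypothetical, the walk over already existing namespaces at registration time is left out, and unwinding on an init failure during creation is an assumed policy rather than something the posted code shows.

```c
#include <stddef.h>
#include <string.h>

#define MAX_OPS 8

/* Hypothetical stand-in for struct pernet_operations; net is a plain int. */
struct pernet_ops_sketch {
	int  (*init)(int net);
	void (*exit)(int net);
};

static struct pernet_ops_sketch *ops_list[MAX_OPS];
static size_t ops_count;

/* Trace of calls, so the ordering can be inspected. */
static char trace[32];

static void trace_add(char c)
{
	size_t len = strlen(trace);

	if (len + 1 < sizeof(trace))
		trace[len] = c;
}

/* Two example operations: upper case marks init, lower case marks exit. */
static int  a_init(int net) { (void)net; trace_add('A'); return 0; }
static void a_exit(int net) { (void)net; trace_add('a'); }
static int  b_init(int net) { (void)net; trace_add('B'); return 0; }
static void b_exit(int net) { (void)net; trace_add('b'); }

static struct pernet_ops_sketch ops_a = { a_init, a_exit };
static struct pernet_ops_sketch ops_b = { b_init, b_exit };

/* Append to the list; registration order is the list order. */
int register_ops_sketch(struct pernet_ops_sketch *ops)
{
	if (ops_count == MAX_OPS)
		return -1;
	ops_list[ops_count++] = ops;
	return 0;
}

/* Namespace creation: run init methods in registration order. */
int net_create_sketch(int net)
{
	size_t i;
	int err;

	for (i = 0; i < ops_count; i++) {
		if (!ops_list[i]->init)
			continue;
		err = ops_list[i]->init(net);
		if (err) {
			/* Assumed policy: unwind what already ran, newest first. */
			while (i-- > 0)
				if (ops_list[i]->exit)
					ops_list[i]->exit(net);
			return err;
		}
	}
	return 0;
}

/* Namespace destruction: run exit methods in reverse registration order. */
void net_destroy_sketch(int net)
{
	size_t i = ops_count;

	while (i-- > 0)
		if (ops_list[i]->exit)
			ops_list[i]->exit(net);
}
```

Registering ops_a and then ops_b, creating a namespace and destroying it again leaves the trace at A, B, b, a.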
[PATCH RFC 21/31] net: Implement the guts of the network namespace infrastructure
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Support is added for the .data.pernet section, where all of the variables that have a single instance in each network namespace will live. Every architecture's linker script is modified, so it should work.

Summarizing the functions:

net_ns_init creates a slab and allocates the template and the initial network namespace.

pernet_modcopy keeps the network namespaces in sync with the loaded modules, initializing new data variables as they are added.

Because the last reference can come from interrupt context, network namespace destruction queues itself for later with schedule_work. Then we alert everyone that the network namespace is disappearing. If a buggy user is still holding a reference to the network namespace we print a nasty message and leak the network namespace.

The rest are just light-weight wrapper functions to make things more convenient.

A little should probably be said about net_head, the variable at the start of my network namespace structure. It is the only variable with a location decided by the C code instead of the linker, and I string them together in a linked list so I can iterate.

Probably more interesting is that it looks like it is saner not to directly use a pointer to my network namespace but instead to use an offset. All of the references to data in my network namespace are coming from per_net(...), which takes the address of the variable in the .data.pernet section and then adds my magic offset. If I used a pointer I would have to subtract an additional value and export an extra symbol. Not good for performance or maintenance :)

The expected usage of network namespace variables is to replace sequences like: loopback_dev with per_net(loopback_dev, net), where net is some network namespace reference.

In my preliminary tests only a single additional addition is inserted, so it appears to be an efficient idiom. Hopefully it is also easy to comprehend and use.

Signed-off-by: Eric W.
Biederman [EMAIL PROTECTED] --- arch/alpha/kernel/vmlinux.lds.S|2 + arch/arm/kernel/vmlinux.lds.S |3 + arch/arm26/kernel/vmlinux-arm26-xip.lds.in |3 + arch/arm26/kernel/vmlinux-arm26.lds.in |3 + arch/avr32/kernel/vmlinux.lds.c|3 + arch/cris/arch-v10/vmlinux.lds.S |2 + arch/cris/arch-v32/vmlinux.lds.S |2 + arch/frv/kernel/vmlinux.lds.S |2 + arch/h8300/kernel/vmlinux.lds.S|3 + arch/i386/kernel/vmlinux.lds.S |3 + arch/ia64/kernel/vmlinux.lds.S |2 + arch/m32r/kernel/vmlinux.lds.S |3 + arch/m68k/kernel/vmlinux-std.lds |3 + arch/m68k/kernel/vmlinux-sun3.lds |3 + arch/m68knommu/kernel/vmlinux.lds.S|3 + arch/mips/kernel/vmlinux.lds.S |3 + arch/parisc/kernel/vmlinux.lds.S |3 + arch/powerpc/kernel/vmlinux.lds.S |2 + arch/ppc/kernel/vmlinux.lds.S |2 + arch/s390/kernel/vmlinux.lds.S |3 + arch/sh/kernel/vmlinux.lds.S |3 + arch/sh64/kernel/vmlinux.lds.S |3 + arch/sparc/kernel/vmlinux.lds.S|3 + arch/sparc64/kernel/vmlinux.lds.S |3 + arch/v850/kernel/vmlinux.lds.S |6 +- arch/x86_64/kernel/vmlinux.lds.S |3 + arch/xtensa/kernel/vmlinux.lds.S |2 + include/asm-generic/vmlinux.lds.h |8 + include/asm-um/common.lds.S|4 +- include/linux/module.h |3 + include/linux/net_namespace_type.h | 63 - include/net/net_namespace.h| 49 ++- kernel/module.c| 211 - net/core/net_namespace.c | 232 34 files changed, 631 insertions(+), 15 deletions(-) diff --git a/arch/alpha/kernel/vmlinux.lds.S b/arch/alpha/kernel/vmlinux.lds.S index 76bf071..ad20077 100644 --- a/arch/alpha/kernel/vmlinux.lds.S +++ b/arch/alpha/kernel/vmlinux.lds.S @@ -72,6 +72,8 @@ SECTIONS .data.percpu : { *(.data.percpu) } __per_cpu_end = .; + DATA_PER_NET + . 
= ALIGN(2*8192); __init_end = .; /* Freed after init ends here */ diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S index a8fa75e..5b003f9 100644 --- a/arch/arm/kernel/vmlinux.lds.S +++ b/arch/arm/kernel/vmlinux.lds.S @@ -61,6 +61,9 @@ SECTIONS __per_cpu_start = .; *(.data.percpu) __per_cpu_end = .; + + DATA_PER_NET + #ifndef CONFIG_XIP_KERNEL __init_begin = _stext; *(.init.data) diff --git a/arch/arm26/kernel/vmlinux-arm26-xip.lds.in b/arch/arm26/kernel/vmlinux-arm26-xip.lds.in index ca61ec8..69d5772 100644 --- a/arch/arm26/kernel/vmlinux-arm26-xip.lds.in +++
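The offset idiom from the description above (take the address of the variable in the template section, then add the namespace's offset) can be approximated in userspace as below. This is a sketch under loud assumptions: a struct stands in for the .data.pernet linker section, malloc stands in for the slab, the macro takes an explicit type instead of using typeof, and all names and the value 1500 are made up for illustration. It also relies on integer/pointer round trips through uintptr_t, which are implementation-defined in C although they behave as expected on common platforms.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the .data.pernet section: the template instance. */
struct pernet_section_sketch {
	int mtu;
	int dev_count;
};

static struct pernet_section_sketch pernet_template = { 1500, 0 };

/* A namespace reference is just a byte offset from the template. */
typedef struct {
	uintptr_t offset;
} net_sketch_t;

/* Address of the template variable plus the namespace's offset. */
#define per_net_sketch(type, field, net) \
	(*(type *)((uintptr_t)&pernet_template.field + (net).offset))

/* Allocate a namespace by copying the template; returns 0 on success. */
int net_alloc_sketch(net_sketch_t *net)
{
	struct pernet_section_sketch *copy = malloc(sizeof(*copy));

	if (!copy)
		return -1;
	memcpy(copy, &pernet_template, sizeof(*copy));
	/* Unsigned arithmetic wraps, so adding the offset back is exact. */
	net->offset = (uintptr_t)copy - (uintptr_t)&pernet_template;
	return 0;
}
```

With two namespaces allocated this way, writing per_net_sketch(int, dev_count, a) changes only the copy belonging to a, and each lookup costs one addition on top of the template address.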
[PATCH RFC 14/31] net: Support multiple network namespaces with netlink
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted

Each netlink socket will live in exactly one network namespace; this includes the controlling kernel sockets.

This patch updates all of the existing netlink protocols to only support the initial network namespace. Requests by clients in other namespaces will get -ECONNREFUSED, as they would if the kernel did not have the support for that netlink protocol compiled in.

As each netlink protocol is updated to be multiple network namespace safe, it can register multiple kernel sockets to acquire a presence in the rest of the network namespaces.

The implementation in af_netlink is a simple filter implementation at hash table insertion and hash table lookup time.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 drivers/scsi/scsi_netlink.c         |2 +-
 drivers/scsi/scsi_transport_iscsi.c |2 +-
 include/linux/netlink.h             |3 +-
 kernel/audit.c                      |4 +-
 lib/kobject_uevent.c                |4 +-
 net/bridge/netfilter/ebt_ulog.c     |5 +-
 net/core/rtnetlink.c                |4 +-
 net/decnet/netfilter/dn_rtmsg.c     |3 +-
 net/ipv4/fib_frontend.c             |3 +-
 net/ipv4/inet_diag.c                |4 +-
 net/ipv4/netfilter/ip_queue.c       |6 +-
 net/ipv4/netfilter/ipt_ULOG.c       |4 +-
 net/ipv6/netfilter/ip6_queue.c      |4 +-
 net/netfilter/nfnetlink.c           |2 +-
 net/netfilter/nfnetlink_log.c       |3 +-
 net/netfilter/nfnetlink_queue.c     |3 +-
 net/netlink/af_netlink.c            | 104 ++-
 net/netlink/genetlink.c             |4 +-
 net/xfrm/xfrm_user.c                |2 +-
 19 files changed, 112 insertions(+), 54 deletions(-)

diff --git a/drivers/scsi/scsi_netlink.c b/drivers/scsi/scsi_netlink.c
index 1b59b27..02c2c1e 100644
--- a/drivers/scsi/scsi_netlink.c
+++ b/drivers/scsi/scsi_netlink.c
@@ -167,7 +167,7 @@ scsi_netlink_init(void)
 		return;
 	}
-	scsi_nl_sock = netlink_kernel_create(NETLINK_SCSITRANSPORT,
+	scsi_nl_sock = netlink_kernel_create(init_net(), NETLINK_SCSITRANSPORT,
 				SCSI_NL_GRP_CNT, scsi_nl_rcv, THIS_MODULE);
 	if (!scsi_nl_sock) {
 		printk(KERN_ERR "%s: register of recieve handler failed\n",
diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
index 9c22f13..1ad22c2 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -1435,7 +1435,7 @@ static __init int iscsi_transport_init(void)
 	if (err)
 		goto unregister_conn_class;
-	nls = netlink_kernel_create(NETLINK_ISCSI, 1, iscsi_if_rx,
+	nls = netlink_kernel_create(init_net(), NETLINK_ISCSI, 1, iscsi_if_rx,
 			THIS_MODULE);
 	if (!nls) {
 		err = -ENOBUFS;
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index b3b9b60..9dacd00 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -151,7 +151,7 @@ struct netlink_skb_parms
 #define NETLINK_CREDS(skb)	(NETLINK_CB((skb)).creds)
-extern struct sock *netlink_kernel_create(int unit, unsigned int groups, void (*input)(struct sock *sk, int len), struct module *module);
+extern struct sock *netlink_kernel_create(net_t net, int unit, unsigned int groups, void (*input)(struct sock *sk, int len), struct module *module);
 extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err);
 extern int netlink_has_listeners(struct sock *sk, unsigned int group);
 extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 pid, int nonblock);
@@ -188,6 +188,7 @@ struct netlink_callback
 struct netlink_notify
 {
+	net_t net;
 	int pid;
 	int protocol;
 };
diff --git a/kernel/audit.c b/kernel/audit.c
index d9b690a..b0c5c61 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -696,8 +696,8 @@ static int __init audit_init(void)
 	printk(KERN_INFO "audit: initializing netlink socket (%s)\n",
 	       audit_default ? "enabled" : "disabled");
-	audit_sock = netlink_kernel_create(NETLINK_AUDIT, 0, audit_receive,
-					   THIS_MODULE);
+	audit_sock = netlink_kernel_create(init_net(), NETLINK_AUDIT, 0,
+					   audit_receive, THIS_MODULE);
 	if (!audit_sock)
 		audit_panic("cannot initialize netlink socket");
 	else
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 84272ed..9a5d4ca 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -292,8 +292,8 @@ EXPORT_SYMBOL_GPL(add_uevent_var);
 #if defined(CONFIG_NET)
 static int __init kobject_uevent_init(void)
 {
-	uevent_sock = netlink_kernel_create(NETLINK_KOBJECT_UEVENT, 1, NULL,
-					    THIS_MODULE);
+	uevent_sock = netlink_kernel_create(init_net(), NETLINK_KOBJECT_UEVENT, 1,
+
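The "filter at insertion and lookup time" idea from the description can be sketched as follows. This is not the af_netlink code: it uses a small linear table instead of a hash table, plain integers for namespaces, and hypothetical names throughout. The -EADDRINUSE and -ENOMEM returns are assumptions; only the refusal of requests from namespaces without a kernel socket follows the patch description.

```c
#include <errno.h>
#include <stddef.h>

#define NL_TABLE_SIZE 16

/* Hypothetical socket record: namespace, protocol and port id. */
struct nl_sock_sketch {
	int net;
	int protocol;
	int pid;
};

static struct nl_sock_sketch nl_table[NL_TABLE_SIZE];
static size_t nl_count;

/* Lookup only matches sockets that live in the caller's namespace. */
static const struct nl_sock_sketch *nl_lookup_sketch(int net, int protocol, int pid)
{
	size_t i;

	for (i = 0; i < nl_count; i++) {
		const struct nl_sock_sketch *s = &nl_table[i];

		if (s->net == net && s->protocol == protocol && s->pid == pid)
			return s;
	}
	return NULL;
}

/* Insertion applies the same filter, so pids only collide within one namespace. */
int nl_insert_sketch(int net, int protocol, int pid)
{
	if (nl_lookup_sketch(net, protocol, pid))
		return -EADDRINUSE;
	if (nl_count == NL_TABLE_SIZE)
		return -ENOMEM;
	nl_table[nl_count].net = net;
	nl_table[nl_count].protocol = protocol;
	nl_table[nl_count].pid = pid;
	nl_count++;
	return 0;
}

/*
 * A request aimed at the kernel socket (pid 0) is refused when the
 * protocol has no kernel socket in the sender's namespace.
 */
int nl_send_to_kernel_sketch(int net, int protocol)
{
	return nl_lookup_sketch(net, protocol, 0) ? 0 : -ECONNREFUSED;
}
```

A protocol that registers its kernel socket only in namespace 0 is reachable from namespace 0 and refused from namespace 1, while user sockets in different namespaces may reuse the same pid.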
[PATCH RFC 10/31] net: Make socket creation namespace safe.
From: Eric W. Biederman [EMAIL PROTECTED] - unquoted This patch passes in the namespace a new socket should be created in and has the socket code do the appropriate reference counting. By virtue of this all socket create methods are touched. In addition the socket create methods are modified so that they will fail if you attempt to create a socket in a non-default network namespace. Failing if we attempt to create a socket outside of the default socket namespace ensures that as we incrementally make the network stack network namespace aware we will not export functionality that someone has not audited and made certain is network namespace safe. Allowing us to partially enable network namespaces before all of the exotic protocols are supported. Any protocol layers I have missed will fail to compile because I now pass an extra parameter into the socket creation code. Signed-off-by: Eric W. Biederman [EMAIL PROTECTED] --- drivers/net/pppoe.c |4 ++-- drivers/net/pppox.c |7 +-- include/linux/if_pppox.h |2 +- include/linux/net.h |3 ++- include/net/llc_conn.h |2 +- include/net/sock.h |4 +++- net/appletalk/ddp.c |7 +-- net/atm/common.c |4 ++-- net/atm/common.h |2 +- net/atm/pvc.c|7 +-- net/atm/svc.c| 11 +++ net/ax25/af_ax25.c |9 ++--- net/bluetooth/af_bluetooth.c |7 +-- net/bluetooth/bnep/sock.c|4 ++-- net/bluetooth/cmtp/sock.c|4 ++-- net/bluetooth/hci_sock.c |4 ++-- net/bluetooth/hidp/sock.c|4 ++-- net/bluetooth/l2cap.c| 10 +- net/bluetooth/rfcomm/sock.c | 10 +- net/bluetooth/sco.c | 10 +- net/core/sock.c |6 -- net/decnet/af_decnet.c | 13 - net/econet/af_econet.c |7 +-- net/ipv4/af_inet.c |7 +-- net/ipv6/af_inet6.c |7 +-- net/ipx/af_ipx.c |7 +-- net/irda/af_irda.c | 11 +++ net/key/af_key.c |7 +-- net/llc/af_llc.c |7 +-- net/llc/llc_conn.c |6 +++--- net/netlink/af_netlink.c | 13 - net/netrom/af_netrom.c |9 ++--- net/packet/af_packet.c |7 +-- net/rose/af_rose.c |9 ++--- net/sctp/ipv6.c |2 +- net/sctp/protocol.c |2 +- net/socket.c |8 net/tipc/socket.c|9 ++--- 
net/unix/af_unix.c | 13 - net/wanrouter/af_wanpipe.c | 15 +-- net/x25/af_x25.c | 13 - 41 files changed, 182 insertions(+), 111 deletions(-) diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c index d34fe16..d09334d 100644 --- a/drivers/net/pppoe.c +++ b/drivers/net/pppoe.c @@ -475,12 +475,12 @@ static struct proto pppoe_sk_proto = { * Initialize a new struct sock. * **/ -static int pppoe_create(struct socket *sock) +static int pppoe_create(net_t net, struct socket *sock) { int error = -ENOMEM; struct sock *sk; - sk = sk_alloc(PF_PPPOX, GFP_KERNEL, pppoe_sk_proto, 1); + sk = sk_alloc(net, PF_PPPOX, GFP_KERNEL, pppoe_sk_proto, 1); if (!sk) goto out; diff --git a/drivers/net/pppox.c b/drivers/net/pppox.c index 9315046..0d5c7bc 100644 --- a/drivers/net/pppox.c +++ b/drivers/net/pppox.c @@ -106,10 +106,13 @@ int pppox_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) EXPORT_SYMBOL(pppox_ioctl); -static int pppox_create(struct socket *sock, int protocol) +static int pppox_create(net_t net, struct socket *sock, int protocol) { int rc = -EPROTOTYPE; + if (!net_eq(net, init_net())) + return -EAFNOSUPPORT; + if (protocol 0 || protocol PX_MAX_PROTO) goto out; @@ -118,7 +121,7 @@ static int pppox_create(struct socket *sock, int protocol) !try_module_get(pppox_protos[protocol]-owner)) goto out; - rc = pppox_protos[protocol]-create(sock); + rc = pppox_protos[protocol]-create(net, sock); module_put(pppox_protos[protocol]-owner); out: diff --git a/include/linux/if_pppox.h b/include/linux/if_pppox.h index 4fab3d0..f6ffd83 100644 --- a/include/linux/if_pppox.h +++ b/include/linux/if_pppox.h @@ -148,7 +148,7 @@ static inline struct sock *sk_pppox(struct pppox_sock *po) struct module; struct pppox_proto { - int (*create)(struct socket *sock); + int (*create)(net_t net, struct socket *sock); int (*ioctl)(struct socket *sock, unsigned int cmd, unsigned long arg); struct module *owner; diff --git
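The guard that the pppox hunk above adds to each create method can be reduced to the following sketch. The types and helpers are hypothetical userspace stand-ins (the compiled-out net_t in this series is an empty struct, which is replaced here by an integer id so the comparison is observable), and PX_MAX_PROTO_SKETCH is an invented limit.

```c
#include <errno.h>

#define PX_MAX_PROTO_SKETCH 4	/* hypothetical protocol limit */

/* Hypothetical namespace reference carrying just an id. */
typedef struct {
	int id;
} netns_sketch_t;

static netns_sketch_t init_net_sketch(void)
{
	netns_sketch_t net = { 0 };

	return net;
}

static int net_eq_sketch(netns_sketch_t a, netns_sketch_t b)
{
	return a.id == b.id;
}

/*
 * Shape of a create method after the patch: refuse any namespace other
 * than the initial one before doing protocol-specific checks.
 */
int family_create_sketch(netns_sketch_t net, int protocol)
{
	if (!net_eq_sketch(net, init_net_sketch()))
		return -EAFNOSUPPORT;
	if (protocol < 0 || protocol > PX_MAX_PROTO_SKETCH)
		return -EPROTOTYPE;
	return 0;
}
```

Placing the namespace check first means an unaudited protocol family looks unsupported from any non-initial namespace, regardless of the other arguments.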
Re: [PATCH RFC 2/31] net: Implement a place holder network namespace
On Thu, 25 Jan 2007 12:00:04 -0700 Eric W. Biederman [EMAIL PROTECTED] wrote:

> From: Eric W. Biederman [EMAIL PROTECTED] - unquoted
>
> Many of the changes to the network stack will simply be adding a network namespace parameter to function calls or moving variables from globals to being per network namespace. When those variables have initializers that cannot statically compute the proper value, a function that runs at the creation and destruction of network namespaces will need to be registered, and the logic will need to be changed to accommodate that. Adding unconditional support for these functions ensures that even when everything else is compiled out the modified network stack logic will continue to run correctly.
>
> This patch adds struct pernet_operations that has an init (constructor) and an exit (destructor) method. When registered the init method is called for every existing namespace, and when unregistered the exit method is called for every existing namespace. When a new network namespace is created all of the init methods are called in the order in which they were registered, and when a network namespace is destroyed the exit methods are called in the reverse order in which they were registered.
>
> There are two distinct types of pernet_operations recognized: subsys and device. At creation all subsys init functions are called before device init functions, and at destruction all device exit functions are called before subsys exit function. For other ordering the preservation of the order of registration combined with the various kinds of kernel initcalls should be sufficient.
>
> Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]

> +
> +static inline net_t get_net(net_t net) { return net; }
> +static inline void put_net(net_t net) {}
> +static inline net_t hold_net(net_t net) { return net; }
> +static inline void release_net(net_t net) {}
> +
> +#define __per_net_start	((char *)0)
> +#define __per_net_end		((char *)0)

Don't use these, use NULL.

> +
> +static inline int copy_net(int flags, struct task_struct *tsk) { return 0; }
> +
> +/* Don't let the list of network namespaces change */
> +static inline void net_lock(void) {}
> +static inline void net_unlock(void) {}

Don't make these all one line, or use #define instead.

> +
> +#define for_each_net(VAR) if (1)
> +
> +extern net_t net_template;
> +
> +#define NET_CREATE	0x0001	/* A network namespace has been created */
> +#define NET_DESTROY	0x0002	/* A network namespace is being destroyed */
> +
> +struct pernet_operations {
> +	struct list_head list;
> +	int (*init)(net_t net);
> +	void (*exit)(net_t net);
> +};
> +
> +extern int register_pernet_subsys(struct pernet_operations *);
> +extern void unregister_pernet_subsys(struct pernet_operations *);
> +extern int register_pernet_device(struct pernet_operations *);
> +extern void unregister_pernet_device(struct pernet_operations *);
> +
> +#endif /* __NET_NET_NAMESPACE_H */
> diff --git a/net/core/Makefile b/net/core/Makefile
> index 73272d5..554dbdc 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -3,7 +3,7 @@
>  #
>  obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o stream.o scm.o \
> -	gen_stats.o gen_estimator.o
> +	gen_stats.o gen_estimator.o net_namespace.o
>  obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> new file mode 100644
> index 000..4ae266d
> --- /dev/null
> +++ b/net/core/net_namespace.c
> @@ -0,0 +1,149 @@
> +#include <linux/rtnetlink.h>
> +#include <net/net_namespace.h>
> +
> +/*
> + * Our network namespace constructor/destructor lists
> + */
> +
> +static LIST_HEAD(pernet_list);
> +static struct list_head *first_device = &pernet_list;
> +static DEFINE_MUTEX(net_mutex);
> +net_t net_template;
> +
> +static int register_pernet_operations(struct list_head *list,
> +				      struct pernet_operations *ops)
> +{
> +	net_t net, undo_net;
> +	int error;
> +
> +	error = 0;
> +	list_add_tail(&ops->list, list);
> +	for_each_net(net) {
> +		if (ops->init) {
> +			error = ops->init(net);
> +			if (error)
> +				goto out_undo;
> +		}
> +	}
> +out:
> +	return error;
> +
> +out_undo:
> +	/* If I have an error cleanup all namespaces I initialized */
> +	list_del(&ops->list);
> +	for_each_net(undo_net) {
> +		if (net_eq(undo_net, net))
> +			goto undone;
> +		if (ops->exit)
> +			ops->exit(undo_net);
> +	}
> +undone:
> +	goto out;
> +}
> +
> +static void unregister_pernet_operations(struct pernet_operations *ops)
> +{
> +	net_t net;
> +
> +	list_del(&ops->list);
> +	for_each_net(net)
> +		if (ops->exit)
> +			ops->exit(net);
> +}
> +

You should use RCU for this because registering/unregistering network namespaces is obviously a much rarer occurrence than referencing them.

--
Stephen Hemminger
Re: [PATCH RFC 1/31] net: Add net_namespace_type.h to allow for per network namespace variables.
Can all this be a nop if a CONFIG option is not selected?

> diff --git a/include/linux/net_namespace_type.h b/include/linux/net_namespace_type.h
> new file mode 100644
> index 000..8173f59
> --- /dev/null
> +++ b/include/linux/net_namespace_type.h
> @@ -0,0 +1,52 @@
> +/*
> + * Definition of the network namespace reference type
> + * And operations upon it.
> + */
> +#ifndef __LINUX_NET_NAMESPACE_TYPE_H
> +#define __LINUX_NET_NAMESPACE_TYPE_H
> +
> +#define __pernetname(name)	per_net__##name

Code obfuscation, please don't do that

> +typedef struct {} net_t;

No typedef for this please.

> +
> +#define __data_pernet
> +
> +/* Look up a per network namespace variable */
> +static inline unsigned long __per_net_offset(net_t net) { return 0; }
> +
> +/* Like per_net but returns a pseudo variable address that must be moved
> + * __per_net_offset() bytes before it will point to a real variable.
> + * Useful for static initializers.
> + */
> +#define __per_net_base(name)	__pernetname(name)
> +
> +/* Get the network namespace reference from a per_net variable address */
> +#define net_of(ptr, name) ({ net_t net; ptr; net; })
> +
> +/* Look up a per network namespace variable */
> +#define per_net(name, net) \
> +	(*(__per_net_offset(net), __per_net_base(name)))
> +
> +/* Are the two network namespaces the same */
> +static inline int net_eq(net_t a, net_t b) { return 1; }
> +/* Get an unsigned value appropriate for hashing the network namespace */
> +static inline unsigned int net_hval(net_t net) { return 0; }
> +
> +/* Convert to and from to and from void pointers */
> +static inline void *net_to_voidp(net_t net) { return NULL; }
> +static inline net_t net_from_voidp(void *ptr) { net_t net; return net; }
> +
> +static inline int null_net(net_t net) { return 0; }
> +
> +#define DEFINE_PER_NET(type, name) \
> +	__data_pernet __typeof__(type) __pernetname(name)
> +
> +#define DECLARE_PER_NET(type, name) \
> +	extern __typeof__(type) __pernetname(name)
> +
> +#define EXPORT_PER_NET_SYMBOL(var) \
> +	EXPORT_SYMBOL(__pernetname(var))
> +#define EXPORT_PER_NET_SYMBOL_GPL(var) \
> +	EXPORT_SYMBOL_GPL(__pernetname(var))
> +
> +#endif /* __LINUX_NET_NAMESPACE_TYPE_H */

--
Stephen Hemminger [EMAIL PROTECTED]
Re: [PATCH RFC 1/31] net: Add net_namespace_type.h to allow for per network namespace variables.
Stephen Hemminger [EMAIL PROTECTED] writes:

> Can all this be a nop if a CONFIG option is not selected?

That is exactly what this infrastructure supports. What you see is the version that comes into effect when the CONFIG option is not selected, from using an empty structure to replace a pointer (to make that a NOP) to most of the rest below.

> > diff --git a/include/linux/net_namespace_type.h b/include/linux/net_namespace_type.h
> > new file mode 100644
> > index 000..8173f59
> > --- /dev/null
> > +++ b/include/linux/net_namespace_type.h
> > @@ -0,0 +1,52 @@
> > +/*
> > + * Definition of the network namespace reference type
> > + * And operations upon it.
> > + */
> > +#ifndef __LINUX_NET_NAMESPACE_TYPE_H
> > +#define __LINUX_NET_NAMESPACE_TYPE_H
> > +
> > +#define __pernetname(name)	per_net__##name
>
> Code obfuscation, please don't do that

It is a single point for making the naming rules, which gives better maintenance. The basic point is that variables that come through this path should not be accessed directly. Tweaking the name enforces that even in the compiled-out state.

> > +typedef struct {} net_t;
>
> No typedef for this please.

Why? That is conventionally how we do opaque types in Linux when someone is doing something sophisticated.

You probably want to look down to patch 21 to see what the compiled-in versions of these look like.

Eric
wireless-dev updated 2007-01-25
The following changes since commit a4893aa0bb61c7bbced8fcdea874cb8d0e1d3a8d: John W. Linville (1): Merge branch 'from-linus' are found in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-dev.git Gertjan van Wingerde (1): d80211: Select CRYPTO_ECB when enabler d80211. Ivo van Doorn (4): eeprom_93cx6 rt2x00 should use generic eeprom_93cx6 crc-itu-t rt2x00 should use generic crc-itu-t Jan Kiszka (1): d80211: Fix inconsistent sta_lock usage John W. Linville (2): Merge http://bu3sch.de/git/wireless-dev Merge git://git.kernel.org/.../jbenc/dscape Michael Buesch (47): bcm43xx-d80211: Add some PHY register definitions. bcm43xx-d80211: Move ILT stuff to OFDM table stuff bcm43xx-d80211: Remove PHY OFDM routing bit, if we are on A-PHY. bcm43xx-d80211: Merge new LO-control code. bcm43xx-d80211: Fix compilation: Missing files for LO and VSTACK. bcm43xx-d80211: Rename struct bcm43xx_phyinfo to struct bcm43xx_phy bcm43xx-d80211: merge struct bcm43xx_radioinfo into struct bcm43xx_phy bcm43xx-d80211: Merge all radio stuff into phy.c Merge branch 'master' of git://kernel.org/.../linville/wireless-dev bcm43xx-d80211: Drain TXstatus queue before enabling IRQs. bcm43xx-d80211: Fix antenna selection for TX and RX. bcm43xx-d80211: Fix bogus LO validation failure. bcm43xx-d80211: Remove netpoll and ethtool stuff. Merge branch 'master' of git://kernel.org/.../linville/wireless-dev Merge branch 'master' of git://kernel.org/.../linville/wireless-dev Merge branch 'master' of git://kernel.org/.../linville/wireless-dev Merge branch 'master' of git://zeus2.kernel.org/.../linville/wireless-dev Remove obsolete SSB driver library. Implement new SSB subsystem. bcm43xx-d80211: Port driver to the new SSB subsystem. bcm43xx-d80211: Fix LO feedthrough measurement. ssb: Fix dependencies. MIPS core must depend on MIPS platform. ssb: Fix typo. SSB_PCICORE_SBTOPCI1_CFG1 does not exist. ssb, bcm43xx-d80211: Move DMA translation logic to ssb. 
bcm43xx-d80211: Remove bogus call to refresh_templates in add_interface. ssb, bcm43xx-d80211: Add function to set DMA mask on SSB. ssb: Fix busnumber assignment. Must assign it before scanning the bus. ssb: Allow disabling of all PCI related stuff. ssb: PCMCIA-hostbus support. bcm43xx-d80211: Support for PCMCIA devices. bcm43xx-d80211: re-add chipid printk bcm43xx-d80211: Remove leds_exit() call in detach stage. bcm43xx-d80211: Get rid of PHY-connected semantics. bcm43xx-d80211: Fix wrong register write in lo_measure_feedthrough(). bcm43xx-d80211: Fix error return codes. bcm43xx-d80211: Various cleanups all over the code. bcm43xx-d80211: Fix semantical errors in LO measure setup. bcm43xx-d80211: gphy init: Some cleanups and some bugfixes. ssb: Add missing include to delay.h in ssb/pcmcia.c ssb, usb: Implement SSB based Broadcom USB OHCI driver. ssb: PCIcore hostmode fixes. ssb: export ssb_clockspeed() ssb: add PM config register definitions ssb: b44 related fixes. bcm43xx-d80211: Fix initial LO Calibration. bcm43xx-d80211: Fix loopback gain calculation. bcm43xx-d80211: Fix DMA TX skb doublefree Michael Wu (2): d80211: Only free WEP crypto ciphers when they have been allocated correctly. 
d80211: Fix __ieee80211_if_del on live interfaces Pavel Roskin (1): bcm43xx_d80211: Fix major memory corruption bug drivers/Kconfig|2 + drivers/Makefile |1 + drivers/misc/Kconfig |4 - drivers/misc/Makefile |1 - drivers/misc/ssb.c | 1074 - drivers/net/wireless/d80211/bcm43xx/Kconfig| 40 +- drivers/net/wireless/d80211/bcm43xx/Makefile | 11 +- drivers/net/wireless/d80211/bcm43xx/bcm43xx.h | 441 +- .../net/wireless/d80211/bcm43xx/bcm43xx_debugfs.c | 72 +- .../net/wireless/d80211/bcm43xx/bcm43xx_debugfs.h | 16 +- drivers/net/wireless/d80211/bcm43xx/bcm43xx_dma.c | 197 +- drivers/net/wireless/d80211/bcm43xx/bcm43xx_dma.h | 48 +- drivers/net/wireless/d80211/bcm43xx/bcm43xx_ilt.c | 337 -- drivers/net/wireless/d80211/bcm43xx/bcm43xx_ilt.h | 32 - drivers/net/wireless/d80211/bcm43xx/bcm43xx_leds.c | 89 +- drivers/net/wireless/d80211/bcm43xx/bcm43xx_leds.h | 12 +- drivers/net/wireless/d80211/bcm43xx/bcm43xx_lo.c | 1060 + drivers/net/wireless/d80211/bcm43xx/bcm43xx_lo.h | 91 + drivers/net/wireless/d80211/bcm43xx/bcm43xx_main.c | 3972 +++- drivers/net/wireless/d80211/bcm43xx/bcm43xx_main.h | 40 +- drivers/net/wireless/d80211/bcm43xx/bcm43xx_pci.c | 147 +
Re: [PATCH 2.6.20-rc5] IPV6: skb is unexpectedly freed.
David, Yoshifuji-san, Herbert,

I appreciate your feedback. I made another patch that simply replaces __kfree_skb() in the exit path with kfree_skb(). I tested it overnight with a chat benchmark tool and my test program, which can reproduce the original problem. As a result, I didn't see any problems (for example, neither an oops nor a memory leak happened). I will post the patch in a few moments. Please take a look at it.

Thanks,
Masa

David Miller wrote:
> From: YOSHIFUJI Hideaki [EMAIL PROTECTED]
> Date: Wed, 24 Jan 2007 13:37:25 +0900 (JST)
>
>> In article [EMAIL PROTECTED] (at Wed, 24 Jan 2007 15:31:47 +1100), Herbert Xu [EMAIL PROTECTED] says:
>>
>>> Masayuki Nakagawa [EMAIL PROTECTED] wrote:
>>>> I suggest to use kfree_skb() instead of __kfree_skb().
>>>
>>> I agree. In fact please do it for all paths in that function, i.e., just change __kfree_skb to kfree_skb rather than adding a special case for this path.
>>
>> I do think so, too.
>
> So do I, but initially I want to push his basic patch in so that I can push the same exact thing into -stable to fix this bug.
>
> So if you make the subsequent change, please make it relative to the original patch.
>
> Thank you.
[PATCH] TCP: Replace __kfree_skb() with kfree_skb()
This patch simply replaces __kfree_skb() in the exit path with kfree_skb(). In tcp_rcv_state_process(), skbs should generally be destroyed only when the ref count is zero. That is the way things are supposed to be done in the kernel. This change might reveal a memory leak of an skb. If that happens, it would be because someone doesn't deal with the skb properly.

Signed-off-by: Masayuki Nakagawa [EMAIL PROTECTED]

--- linux-2.6/net/ipv4/tcp_input.c.orig 2007-01-25 07:04:35.0 -0800
+++ linux-2.6/net/ipv4/tcp_input.c 2007-01-25 07:05:05.0 -0800
@@ -4423,8 +4423,6 @@ int tcp_rcv_state_process(struct sock *s
 			 * in the interest of security over speed unless
 			 * it's still in use.
 			 */
-			kfree_skb(skb);
-			return 0;
 		}
 		goto discard;
@@ -4634,7 +4632,7 @@ int tcp_rcv_state_process(struct sock *s
 	if (!queued) {
 discard:
-		__kfree_skb(skb);
+		kfree_skb(skb);
 	}
 	return 0;
 }
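The difference the patch relies on can be shown with a small userspace sketch: a release that always frees, versus a release that drops one reference and frees only on the last one. The struct and function names are hypothetical, and the counter is a plain int, whereas the kernel's skb user count is atomic.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff with a plain (non-atomic) count. */
struct buf_sketch {
	int users;
	int *freed;	/* set to 1 when the buffer is really released */
};

/* Allocate a buffer holding one reference for the caller. */
struct buf_sketch *buf_alloc_sketch(int *freed)
{
	struct buf_sketch *b = malloc(sizeof(*b));

	if (b) {
		b->users = 1;
		b->freed = freed;
	}
	return b;
}

/* Take an additional reference. */
void buf_get_sketch(struct buf_sketch *b)
{
	b->users++;
}

/* Like __kfree_skb(): release unconditionally, ignoring other holders. */
void buf_free_now_sketch(struct buf_sketch *b)
{
	*b->freed = 1;
	free(b);
}

/* Like kfree_skb(): drop one reference, release only on the last one. */
void buf_put_sketch(struct buf_sketch *b)
{
	if (!b)
		return;
	if (--b->users > 0)
		return;
	buf_free_now_sketch(b);
}
```

If a second holder has called buf_get_sketch(), calling buf_free_now_sketch() from the first holder would leave the second with a dangling pointer, which is the kind of unexpected free the earlier thread describes; buf_put_sketch() avoids it.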
Re: [PATCH] Fix sorting of SACK blocks
* David Miller [EMAIL PROTECTED] [070126 01:55]: From: Baruch Even [EMAIL PROTECTED] Date: Thu, 25 Jan 2007 20:29:03 +0200 The sorting of SACK blocks actually munges them rather than sorting them, causing the TCP stack to ignore some SACK information and breaking the assumption of ordered SACK blocks after sorting. The sort takes the data from a second buffer which isn't moved, causing subsequent data moves to occur from the wrong location. The fix is to use a temporary buffer as a normal sort does. Signed-Off-By: Baruch Even [EMAIL PROTECTED] BTW, in reviewing this I note that there is now only one remaining use of tp->recv_sack_cache[] and that is the code earlier in this function which is trying to detect if all we are doing is extending the leading edge of a SACK block. It would be nice to be able to clear out that usage as well, and remove recv_sack_cache[] and thus make tcp_sock smaller. You actually need recv_sack_cache to detect if you can use the fast path. Another alternative is to somehow hash the values of the sack blocks, but then you rely on probability to properly detect the ability to use the fast path. Hashing will save some space but you can't get rid of it completely unless you go back to the old and slow method of SACK processing. Some ideas were floated a while back about using a different data structure; I think you said you started working on something like that. If that comes to fruition the cache might go. FWIW, my other mail about possible bugs actually says that you might need to add another value to check, the number of sack blocks in the cache. Baruch
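The fix described in the commit message (swap through a temporary so that both fields of a block move together) can be sketched in plain C. This is an illustration under assumptions, not the code from the patch: struct sack_block, seq_after() and sort_sack_blocks() are hypothetical names, and the comparison is written in the usual wraparound-safe style for 32-bit sequence numbers.

```c
#include <stdint.h>

/* Hypothetical host-order SACK block: [start_seq, end_seq). */
struct sack_block {
    uint32_t start_seq;
    uint32_t end_seq;
};

/* Wraparound-safe test: is sequence number a after b? */
static int seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/*
 * Bubble sort by start_seq. Each swap goes through a temporary copy of
 * the whole block, so the array being compared is also the array being
 * moved; no second, unsorted buffer is consulted.
 */
static void sort_sack_blocks(struct sack_block *sp, int n)
{
    int i, j;

    for (i = n - 1; i > 0; i--) {
        for (j = 0; j < i; j++) {
            if (seq_after(sp[j].start_seq, sp[j + 1].start_seq)) {
                struct sack_block tmp = sp[j];

                sp[j] = sp[j + 1];
                sp[j + 1] = tmp;
            }
        }
    }
}
```

A bubble sort is used here only because the number of SACK blocks in one TCP option is small; any in-place sort that moves whole blocks would do.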
Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up
On Thu, 25 Jan 2007 01:50:54 -0500, Pavel Roskin wrote: It turns out d80211 uses the config method of the hardware drivers very sparingly. It's only used for scanning and in ioctl commands. It is not called after the interface has been brought up with the open method. I don't know whose responsibility it should be to apply the configuration when the interface is brought up. I'm not familiar with d80211 design principles. I think it should be done in the stack (actually, it has been on the todo list for quite some time). However, I don't consider this a big problem - just do it in a driver (like your patch does) for now. Jiri -- Jiri Benc SUSE Labs
Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up
Hi, I have discovered that while I can indeed associate without wpa_supplicant using bcm43xx_d80211 driver, I have to set the channel every time the interface is brought down and up. It turns out d80211 uses the config method of the hardware drivers very sparingly. It's only used for scanning and in ioctl commands. It is not called after the interface has been brought up with the open method. Correct, similar problems have been detected in rt2x00. The temporary solution there is to require a scanning operation after the interface has been brought up. I don't know whose responsibility it should be to apply the configuration when the interface is brought up. I'm not familiar with d80211 design principles. Well, my personal preference would be for the dscape stack to handle it, unless the stack guarantees that the conf structure has been initialized and contains valid data when the interface is being brought up. Ivo
Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up
On Thu, 25 Jan 2007 12:47:08 +0100, Ivo Van Doorn wrote: Correct, similar problems have been detected in rt2x00. The temporary solution in there is to demand a scanning operation after the interface has been brought up. Scanning? No no no, please! That would be a clear bug and misbehaviour. Jiri -- Jiri Benc SUSE Labs
Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up
Hi, Correct, similar problems have been detected in rt2x00. The temporary solution in there is to demand a scanning operation after the interface has been brought up. Scanning? No no no, please! That would be a clear bug and misbehaviour. Hmm, I think I forgot to add one little thing in my comment. The scanning operation is required by the rt2x00 README, so the driver doesn't start scanning automatically and just waits for user commands. The user is also free to change the channel to make the configuration active. But a scan command will also show whether it has found at least some AP, so without scanning results attempting to scan will very likely fail. ;) Ivo
Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up
On Thu, 25 Jan 2007 09:05:32 -0500, Gene Heskett wrote: Oh? I'm sitting here watching the tty0 screen of my lappy after x has been started, and I have established a connection, but SoftMAC is still logging its scan activity, starting with channel 1 and scanning 14 channels. It's doing this at approximately 2 minute intervals. So I think we have your definition of a clear bug and misbehaviour. Yes. As well as some other wireless drivers. But that's not worth fixing. Also, please note that softmac isn't going to do user space MLME so it is not relevant at all. Jiri -- Jiri Benc SUSE Labs
Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up
On Thursday 25 January 2007 09:23, Jiri Benc wrote: On Thu, 25 Jan 2007 09:05:32 -0500, Gene Heskett wrote: Oh? I'm sitting here watching the tty0 screen of my lappy after x has been started, and I have established a connection, but SoftMAC is still logging its scan activity, starting with channel 1 and scanning 14 channels. It's doing this at approximately 2 minute intervals. So I think we have your definition of a clear bug and misbehaviour. Yes. As well as some other wireless drivers. But that's not worth fixing. Also, please note that softmac isn't going to do user space MLME so it is not relevant at all. MLME? More acronyms I've not put in my wet dictionary.. :) Jiri -- Cheers, Gene There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order. -Ed Howdershelt (Author) Yahoo.com and AOL/TW attorneys please note, additions to the above message by Gene Heskett are: Copyright 2007 by Maurice Eugene Heskett, all rights reserved.
Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up
On Thursday 25 January 2007 07:50, Jiri Benc wrote: On Thu, 25 Jan 2007 12:47:08 +0100, Ivo Van Doorn wrote: Correct, similar problems have been detected in rt2x00. The temporary solution in there is to demand a scanning operation after the interface has been brought up. Scanning? No no no, please! That would be a clear bug and misbehaviour. Jiri Oh? I'm sitting here watching the tty0 screen of my lappy after x has been started, and I have established a connection, but SoftMAC is still logging its scan activity, starting with channel 1 and scanning 14 channels. It's doing this at approximately 2 minute intervals. So I think we have your definition of a clear bug and misbehaviour. -- Cheers, Gene There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order. -Ed Howdershelt (Author) Yahoo.com and AOL/TW attorneys please note, additions to the above message by Gene Heskett are: Copyright 2007 by Maurice Eugene Heskett, all rights reserved.
Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up
Gene Heskett wrote: On Thursday 25 January 2007 07:50, Jiri Benc wrote: On Thu, 25 Jan 2007 12:47:08 +0100, Ivo Van Doorn wrote: Correct, similar problems have been detected in rt2x00. The temporary solution in there is to demand a scanning operation after the interface has been brought up. Scanning? No no no, please! That would be a clear bug and misbehaviour. Jiri Oh? I'm sitting here watching the tty0 screen of my lappy after x has been started, and I have established a connection, but SoftMAC is still logging its scan activity, starting with channel 1 and scanning 14 channels. It's doing this at approximately 2 minute intervals. So I think we have your definition of a clear bug and misbehaviour. Are you running NetworkManager? If so, that is the source of the scanning. Larry
Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up
On Thu, 2007-01-25 at 09:36 -0600, Larry Finger wrote: Gene Heskett wrote: On Thursday 25 January 2007 07:50, Jiri Benc wrote: On Thu, 25 Jan 2007 12:47:08 +0100, Ivo Van Doorn wrote: Correct, similar problems have been detected in rt2x00. The temporary solution in there is to demand a scanning operation after the interface has been brought up. Scanning? No no no, please! That would be a clear bug and misbehaviour. Jiri Oh? I'm sitting here watching the tty0 screen of my lappy after x has been started, and I have established a connection, but SoftMAC is still logging its scan activity, starting with channel 1 and scanning 14 channels. It's doing this at approximately 2 minute intervals. So I think we have your definition of a clear bug and misbehaviour. Are you running NetworkManager? If so, that is the source of the scanning. Right; and NM scans at 2m intervals by default, unless you've clicked the menu (or in a few other cases), where it will drop to 20s intervals and then back off to 2m again. dan Larry
Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up
On Thursday 25 January 2007 01:50, Pavel Roskin wrote: If the hardware drivers are supposed to do it, here's my patch. It is working fine for me and ready to be applied. The changelog is in the subject. Let's fix this in the stack. This problem will be fixed for most users once auto channel selection is implemented, and fixing it for users manually setting the channel should be trivial. -Michael Wu
Re: [RFC PATCH] bcm43xx: set channel when the interface is brought up
On Thu, Jan 25, 2007 at 11:51:27AM -0500, Michael Wu wrote: On Thursday 25 January 2007 01:50, Pavel Roskin wrote: If the hardware drivers are supposed to do it, here's my patch. It is working fine for me and ready to be applied. The changelog is in the subject. Let's fix this in the stack. This problem will be fixed for most users once auto channel selection is implemented, and fixing it for users manually setting the channel should be trivial. ACK...fixing the stack makes the most sense. -- John W. Linville [EMAIL PROTECTED]
Re: [PATCH] bcm43xx-d80211: Interrogate hardware-enable switch and update LEDs
On Sat, Dec 30, 2006 at 11:25:15PM -0600, Larry Finger wrote: The current bcm43xx driver ignores any wireless-enable switches on mini-PCI and mini-PCI-E cards. This patch implements a new routine to interrogate the radio hardware enabled bit in the interface, logs the initial state and any changes in the switch (if debugging enabled), activates the LED to show the state, and changes the periodic work handler to provide 1 second response to switch changes and to account for changes in the periodic work specs. It also incorporates changes in the LED state that were accepted into mainline some time ago. Larry, I wanted to merge this, but a slew of changes pulled from Michael makes this patch fail to apply. Would you mind refactoring the patch and resubmitting? Thanks, John -- John W. Linville [EMAIL PROTECTED]
[take34 6/10] kevent: Pipe notifications.
Pipe notifications. diff --git a/fs/pipe.c b/fs/pipe.c index 68090e8..0c75bf1 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -16,6 +16,7 @@ #include linux/uio.h #include linux/highmem.h #include linux/pagemap.h +#include linux/kevent.h #include asm/uaccess.h #include asm/ioctls.h @@ -313,6 +314,7 @@ redo: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible_sync(pipe-wait); kill_fasync(pipe-fasync_writers, SIGIO, POLL_OUT); } @@ -322,6 +324,7 @@ redo: /* Signal writers asynchronously that there is more room. */ if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible(pipe-wait); kill_fasync(pipe-fasync_writers, SIGIO, POLL_OUT); } @@ -484,6 +487,7 @@ redo2: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible_sync(pipe-wait); kill_fasync(pipe-fasync_readers, SIGIO, POLL_IN); do_wakeup = 0; @@ -495,6 +499,7 @@ redo2: out: mutex_unlock(inode-i_mutex); if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible(pipe-wait); kill_fasync(pipe-fasync_readers, SIGIO, POLL_IN); } @@ -590,6 +595,7 @@ pipe_release(struct inode *inode, int decr, int decw) free_pipe_info(inode); } else { wake_up_interruptible(pipe-wait); + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); kill_fasync(pipe-fasync_readers, SIGIO, POLL_IN); kill_fasync(pipe-fasync_writers, SIGIO, POLL_OUT); } diff --git a/kernel/kevent/kevent_pipe.c b/kernel/kevent/kevent_pipe.c new file mode 100644 index 000..91dc1eb --- /dev/null +++ b/kernel/kevent/kevent_pipe.c @@ -0,0 +1,123 @@ +/* + * kevent_pipe.c + * + * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/kernel.h +#include linux/types.h +#include linux/slab.h +#include linux/spinlock.h +#include linux/file.h +#include linux/fs.h +#include linux/kevent.h +#include linux/pipe_fs_i.h + +static int kevent_pipe_callback(struct kevent *k) +{ + struct inode *inode = k-st-origin; + struct pipe_inode_info *pipe = inode-i_pipe; + int nrbufs = pipe-nrbufs; + + if (k-event.event KEVENT_SOCKET_RECV nrbufs 0) { + if (!pipe-writers) + return -1; + return 1; + } + + if (k-event.event KEVENT_SOCKET_SEND nrbufs PIPE_BUFFERS) { + if (!pipe-readers) + return -1; + return 1; + } + + return 0; +} + +int kevent_pipe_enqueue(struct kevent *k) +{ + struct file *pipe; + int err = -EBADF; + struct inode *inode; + + pipe = fget(k-event.id.raw[0]); + if (!pipe) + goto err_out_exit; + + inode = igrab(pipe-f_dentry-d_inode); + if (!inode) + goto err_out_fput; + + err = -EINVAL; + if (!S_ISFIFO(inode-i_mode)) + goto err_out_iput; + + err = kevent_storage_enqueue(inode-st, k); + if (err) + goto err_out_iput; + + if (k-event.req_flags KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + err = 0; + } else { + err = k-callbacks.callback(k); + if (err) + goto err_out_dequeue; + } + + fput(pipe); + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k-st, k); +err_out_iput: + iput(inode); +err_out_fput: + fput(pipe); +err_out_exit: + return err; +} + +int kevent_pipe_dequeue(struct kevent *k) +{ + struct inode *inode = k-st-origin; + + kevent_storage_dequeue(k-st, k); + iput(inode); + + return 0; +} + +void 
kevent_pipe_notify(struct inode
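The readiness test in kevent_pipe_callback() above can be restated as a small user-space function. This is a sketch under assumptions: the archive has dropped the operators from the patch text, so the `&`, `&&`, `>` and `<` below are a reconstruction, and EV_RECV, EV_SEND and MODEL_PIPE_BUFFERS are hypothetical stand-ins for KEVENT_SOCKET_RECV, KEVENT_SOCKET_SEND and PIPE_BUFFERS with illustrative values.

```c
/* Hypothetical stand-ins; the values are illustrative only. */
#define EV_RECV            0x1
#define EV_SEND            0x2
#define MODEL_PIPE_BUFFERS 16

/*
 * Models the return convention of the callback:
 *   1  the requested event is ready,
 *  -1  the requested event can never complete (peer side is gone),
 *   0  not ready yet, keep the event queued.
 */
static int pipe_ready(unsigned int events, int nrbufs, int readers, int writers)
{
    /* Readable: at least one buffer holds data. */
    if ((events & EV_RECV) && nrbufs > 0) {
        if (!writers)
            return -1;
        return 1;
    }

    /* Writable: at least one buffer slot is free. */
    if ((events & EV_SEND) && nrbufs < MODEL_PIPE_BUFFERS) {
        if (!readers)
            return -1;
        return 1;
    }

    return 0;
}
```

Note that, as in the patch text, the receive check comes first, so an event that asks for both directions reports on readability whenever data is pending.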
[take34 4/10] kevent: Socket notifications.
Socket notifications. This patch includes socket send/recv/accept notifications. With a trivial web server based on kevent, using these features instead of epoll increased its performance quite noticeably. More details about the various benchmarks and the server itself (evserver_kevent.c) can be found on the project's homepage. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/fs/inode.c b/fs/inode.c index bf21dc6..82817b1 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -21,6 +21,7 @@ #include linux/cdev.h #include linux/bootmem.h #include linux/inotify.h +#include linux/kevent.h #include linux/mount.h /* @@ -164,12 +165,18 @@ static struct inode *alloc_inode(struct super_block *sb) } inode-i_private = NULL; inode-i_mapping = mapping; +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE + kevent_storage_init(inode, inode-st); +#endif } return inode; } void destroy_inode(struct inode *inode) { +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE + kevent_storage_fini(inode-st); +#endif BUG_ON(inode_has_buffers(inode)); security_inode_free(inode); if (inode-i_sb-s_op-destroy_inode) diff --git a/include/net/sock.h b/include/net/sock.h index 03684e7..d840399 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -49,6 +49,7 @@ #include linux/skbuff.h /* struct sk_buff */ #include linux/mm.h #include linux/security.h +#include linux/kevent.h #include linux/filter.h @@ -451,6 +452,21 @@ static inline int sk_stream_memory_free(struct sock *sk) extern void sk_stream_rfree(struct sk_buff *skb); +struct socket_alloc { + struct socket socket; + struct inode vfs_inode; +}; + +static inline struct socket *SOCKET_I(struct inode *inode) +{ + return container_of(inode, struct socket_alloc, vfs_inode)-socket; +} + +static inline struct inode *SOCK_INODE(struct socket *socket) +{ + return container_of(socket, struct socket_alloc, socket)-vfs_inode; +} + static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk) { skb-sk = sk; @@ -478,6 
+494,7 @@ static inline void sk_add_backlog(struct sock *sk, struct sk_buff *skb) sk-sk_backlog.tail = skb; } skb-next = NULL; + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); } #define sk_wait_event(__sk, __timeo, __condition) \ @@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kiocb(struct sock_iocb *si) return si-kiocb; } -struct socket_alloc { - struct socket socket; - struct inode vfs_inode; -}; - -static inline struct socket *SOCKET_I(struct inode *inode) -{ - return container_of(inode, struct socket_alloc, vfs_inode)-socket; -} - -static inline struct inode *SOCK_INODE(struct socket *socket) -{ - return container_of(socket, struct socket_alloc, socket)-vfs_inode; -} - extern void __sk_stream_mem_reclaim(struct sock *sk); extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind); diff --git a/include/net/tcp.h b/include/net/tcp.h index b7d8317..2763b30 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -864,6 +864,7 @@ static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb) tp-ucopy.memory = 0; } else if (skb_queue_len(tp-ucopy.prequeue) == 1) { wake_up_interruptible(sk-sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 000..d1a2701 --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,144 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/kernel.h +#include linux/types.h +#include linux/list.h +#include linux/slab.h +#include linux/spinlock.h +#include linux/timer.h +#include linux/file.h
[take34 5/10] kevent: Timer notifications.
Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. This subsystem uses high-resolution timers. id.raw[0] is used as number of seconds id.raw[1] is used as number of nanoseconds Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 000..c21a155 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,114 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/kernel.h +#include linux/types.h +#include linux/list.h +#include linux/slab.h +#include linux/spinlock.h +#include linux/hrtimer.h +#include linux/jiffies.h +#include linux/kevent.h + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t-ktimer_event; + + kevent_storage_ready(t-ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer-base-softirq_time, + ktime_set(k-event.id.raw[0], k-event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(t-ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t-ktimer.expires = ktime_set(k-event.id.raw[0], k-event.id.raw[1]); + t-ktimer.function = kevent_timer_func; + t-ktimer_event = k; + + err = kevent_storage_init(t-ktimer, t-ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(t-ktimer_storage.lock, kevent_timer_key); + + err = kevent_storage_enqueue(t-ktimer_storage, k); + if (err) + goto err_out_st_fini; + + hrtimer_start(t-ktimer, t-ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(t-ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage *st = k-st; + struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage); + + hrtimer_cancel(t-ktimer); + kevent_storage_dequeue(st, k); + 
kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k-event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = kevent_timer_callback, + .enqueue = kevent_timer_enqueue, + .dequeue = kevent_timer_dequeue, + .flags = 0, + }; + + return kevent_add_callbacks(tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
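To make the timer description concrete: the interval is carried as seconds in id.raw[0] and nanoseconds in id.raw[1], and the handler re-arms the timer by pushing the expiry forward in whole intervals. The sketch below models that arithmetic in user space. It assumes the two raw fields are 32-bit words; struct timer_id and the helper names are hypothetical, and timer_forward() only imitates the behaviour of hrtimer_forward() rather than calling it.

```c
#include <stdint.h>

/* Hypothetical model of the id union: raw[0] = seconds, raw[1] = nanoseconds. */
struct timer_id {
    uint32_t raw[2];
};

/* Encodes an interval given in milliseconds into the seconds/nanoseconds pair. */
static struct timer_id timer_id_from_ms(uint64_t ms)
{
    struct timer_id id;

    id.raw[0] = (uint32_t)(ms / 1000);
    id.raw[1] = (uint32_t)((ms % 1000) * 1000000u);
    return id;
}

/* Converts the pair back to a single nanosecond count. */
static uint64_t timer_id_to_ns(const struct timer_id *id)
{
    return (uint64_t)id->raw[0] * 1000000000ull + id->raw[1];
}

/*
 * Imitates forwarding a periodic timer: moves *expires past now in whole
 * intervals and returns how many intervals were added (0 if the timer
 * has not expired yet).
 */
static uint64_t timer_forward(uint64_t *expires, uint64_t now, uint64_t interval)
{
    uint64_t overruns;

    if (interval == 0 || now < *expires)
        return 0;

    overruns = (now - *expires) / interval + 1;
    *expires += overruns * interval;
    return overruns;
}
```

For example, a 1500 ms period would be requested with raw[0] = 1 and raw[1] = 500000000.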
[take34 7/10] kevent: Signal notifications.
Signal notifications. This type of notifications allows to deliver signals through kevent queue. One can find example application signal.c on project homepage. If KEVENT_SIGNAL_NOMASK bit is set in raw_u64 id then signal will be delivered only through queue, otherwise both delivery types are used - old through update of mask of pending signals and through queue. If signal is delivered only through kevent queue mask of pending signals is not updated at all, which is equal to putting signal into blocked mask, but with delivery of that signal through kevent queue. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/include/linux/sched.h b/include/linux/sched.h index 4463735..e7372f2 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -82,6 +82,7 @@ struct sched_param { #include linux/resource.h #include linux/timer.h #include linux/hrtimer.h +#include linux/kevent_storage.h #include linux/task_io_accounting.h #include asm/processor.h @@ -1048,6 +1049,10 @@ struct task_struct { #ifdef CONFIG_TASK_DELAY_ACCT struct task_delay_info *delays; #endif +#ifdef CONFIG_KEVENT_SIGNAL + struct kevent_storage st; + u32 kevent_signals; +#endif #ifdef CONFIG_FAULT_INJECTION int make_it_fail; #endif diff --git a/kernel/fork.c b/kernel/fork.c index fc723e5..fd7c749 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -49,6 +49,7 @@ #include linux/delayacct.h #include linux/taskstats_kern.h #include linux/random.h +#include linux/kevent.h #include asm/pgtable.h #include asm/pgalloc.h @@ -118,6 +119,9 @@ void __put_task_struct(struct task_struct *tsk) WARN_ON(atomic_read(tsk-usage)); WARN_ON(tsk == current); +#ifdef CONFIG_KEVENT_SIGNAL + kevent_storage_fini(tsk-st); +#endif security_task_free(tsk); free_uid(tsk-user); put_group_info(tsk-group_info); @@ -1126,6 +1130,10 @@ static struct task_struct *copy_process(unsigned long clone_flags, if (retval) goto bad_fork_cleanup_namespaces; +#ifdef CONFIG_KEVENT_SIGNAL + kevent_storage_init(p, p-st); +#endif + 
p-set_child_tid = (clone_flags CLONE_CHILD_SETTID) ? child_tidptr : NULL; /* * Clear TID on mm_release()? diff --git a/kernel/kevent/kevent_signal.c b/kernel/kevent/kevent_signal.c new file mode 100644 index 000..abe3972 --- /dev/null +++ b/kernel/kevent/kevent_signal.c @@ -0,0 +1,94 @@ +/* + * kevent_signal.c + * + * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/kernel.h +#include linux/types.h +#include linux/slab.h +#include linux/spinlock.h +#include linux/file.h +#include linux/fs.h +#include linux/kevent.h + +static int kevent_signal_callback(struct kevent *k) +{ + struct task_struct *tsk = k-st-origin; + int sig = k-event.id.raw[0]; + int ret = 0; + + if (sig == tsk-kevent_signals) + ret = 1; + + if (ret (k-event.id.raw_u64 KEVENT_SIGNAL_NOMASK)) + tsk-kevent_signals |= 0x8000; + + return ret; +} + +int kevent_signal_enqueue(struct kevent *k) +{ + int err; + + err = kevent_storage_enqueue(current-st, k); + if (err) + goto err_out_exit; + + if (k-event.req_flags KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + err = 0; + } else { + err = k-callbacks.callback(k); + if (err) + goto err_out_dequeue; + } + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k-st, k); +err_out_exit: + 
return err; +} + +int kevent_signal_dequeue(struct kevent *k) +{ + kevent_storage_dequeue(k-st, k); + return 0; +} + +int kevent_signal_notify(struct task_struct *tsk, int sig) +{ + tsk-kevent_signals = sig; + kevent_storage_ready(tsk-st, NULL, KEVENT_SIGNAL_DELIVERY); + return (tsk-kevent_signals 0x8000); +} + +static int __init kevent_init_signal(void) +{ + struct kevent_callbacks sc = { + .callback = kevent_signal_callback, + .enqueue = kevent_signal_enqueue, + .dequeue = kevent_signal_dequeue, + .flags = 0, +
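The two delivery modes described in the changelog (queue only when KEVENT_SIGNAL_NOMASK is set, otherwise both the queue and the pending-signal mask) can be modeled as follows. This is a user-space sketch under assumptions: struct task_model and struct sig_event are hypothetical, the stripped operators in the patch text are reconstructed as `&&` and `&`, and QUEUE_ONLY_BIT is assumed to be the top bit of a 32-bit word (the archived text shows 0x8000, which may have been shortened by the archive).

```c
#include <stddef.h>
#include <stdint.h>

#define QUEUE_ONLY_BIT 0x80000000u   /* assumed value, see lead-in */

/* Hypothetical model of the per-task state added by the patch. */
struct task_model {
    uint32_t kevent_signals;
};

/* Hypothetical model of one queued signal event. */
struct sig_event {
    uint32_t signo;   /* signal number the event waits for */
    int nomask;       /* models KEVENT_SIGNAL_NOMASK in the event id */
    int ready;        /* set once the event has fired */
};

/* Models the callback: returns 1 if the event matches the raised signal. */
static int sig_event_callback(struct task_model *t, const struct sig_event *e)
{
    /* As in the patch text, the whole word is compared. */
    int ret = (e->signo == t->kevent_signals);

    if (ret && e->nomask)
        t->kevent_signals |= QUEUE_ONLY_BIT;
    return ret;
}

/*
 * Models the notify path: returns nonzero when the signal was consumed
 * by the queue alone and the pending mask should be left untouched.
 */
static int signal_notify(struct task_model *t, struct sig_event *ev,
                         size_t n, uint32_t sig)
{
    size_t i;

    t->kevent_signals = sig;
    for (i = 0; i < n; i++) {
        if (sig_event_callback(t, &ev[i]))
            ev[i].ready = 1;
    }
    return (t->kevent_signals & QUEUE_ONLY_BIT) != 0;
}
```

A caller in this model would skip the normal pending-mask update whenever signal_notify() returns nonzero, which corresponds to the "delivered only through queue" case in the changelog.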
[take34 8/10] kevent: Kevent posix timer notifications.
Kevent posix timer notifications.

Simple extensions to POSIX timers which allow timer expiration
notifications to be delivered through the kevent queue.

Example application posix_timer.c can be found in the archive on the
project homepage.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 8786e01..3768746 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -235,6 +235,7 @@ typedef struct siginfo {
 #define SIGEV_NONE	1	/* other notification: meaningless */
 #define SIGEV_THREAD	2	/* deliver via thread creation */
 #define SIGEV_THREAD_ID 4	/* deliver to thread */
+#define SIGEV_KEVENT	8	/* deliver through kevent queue */
 
 /*
  * This works because the alignment is ok on all current architectures
@@ -260,6 +261,8 @@ typedef struct sigevent {
 			void (*_function)(sigval_t);
 			void *_attribute;	/* really pthread_attr_t */
 		} _sigev_thread;
+
+		int kevent_fd;
 	} _sigev_un;
 } sigevent_t;
 
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index a7dd38f..4b9deb4 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -4,6 +4,7 @@
 #include <linux/spinlock.h>
 #include <linux/list.h>
 #include <linux/sched.h>
+#include <linux/kevent_storage.h>
 
 union cpu_time_count {
 	cputime_t cpu;
@@ -49,6 +50,9 @@ struct k_itimer {
 	sigval_t it_sigev_value;	/* value word of sigevent struct */
 	struct task_struct *it_process;	/* process to send signal to */
 	struct sigqueue *sigq;		/* signal queue entry. */
+#ifdef CONFIG_KEVENT_TIMER
+	struct kevent_storage st;
+#endif
 	union {
 		struct {
 			struct hrtimer timer;
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 5fe87de..5ec805e 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -48,6 +48,8 @@
 #include <linux/wait.h>
 #include <linux/workqueue.h>
 #include <linux/module.h>
+#include <linux/kevent.h>
+#include <linux/file.h>
 
 /*
  * Management arrays for POSIX timers. Timers are kept in slab memory
@@ -224,6 +226,100 @@ static int posix_ktime_get_ts(clockid_t which_clock, struct timespec *tp)
 	return 0;
 }
 
+#ifdef CONFIG_KEVENT_TIMER
+static int posix_kevent_enqueue(struct kevent *k)
+{
+	/*
+	 * It is not ugly - there is no pointer in the id field union,
+	 * but its size is 64bits, which is ok for any known pointer size.
+	 */
+	struct k_itimer *tmr = (struct k_itimer *)(unsigned long)k->event.id.raw_u64;
+	return kevent_storage_enqueue(&tmr->st, k);
+}
+static int posix_kevent_dequeue(struct kevent *k)
+{
+	struct k_itimer *tmr = (struct k_itimer *)(unsigned long)k->event.id.raw_u64;
+	kevent_storage_dequeue(&tmr->st, k);
+	return 0;
+}
+static int posix_kevent_callback(struct kevent *k)
+{
+	return 1;
+}
+static int posix_kevent_init(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = &posix_kevent_callback,
+		.enqueue = &posix_kevent_enqueue,
+		.dequeue = &posix_kevent_dequeue,
+		.flags = KEVENT_CALLBACKS_KERNELONLY};
+
+	return kevent_add_callbacks(&tc, KEVENT_POSIX_TIMER);
+}
+
+extern struct file_operations kevent_user_fops;
+
+static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
+{
+	struct ukevent uk;
+	struct file *file;
+	struct kevent_user *u;
+	int err;
+
+	file = fget(fd);
+	if (!file) {
+		err = -EBADF;
+		goto err_out;
+	}
+
+	if (file->f_op != &kevent_user_fops) {
+		err = -EINVAL;
+		goto err_out_fput;
+	}
+
+	u = file->private_data;
+
+	memset(&uk, 0, sizeof(struct ukevent));
+
+	uk.event = KEVENT_MASK_ALL;
+	uk.type = KEVENT_POSIX_TIMER;
+	uk.id.raw_u64 = (unsigned long)(tmr); /* Just cast to something unique */
+	uk.req_flags = KEVENT_REQ_ONESHOT | KEVENT_REQ_ALWAYS_QUEUE;
+	uk.ptr = tmr->it_sigev_value.sival_ptr;
+
+	err = kevent_user_add_ukevent(&uk, u);
+	if (err)
+		goto err_out_fput;
+
+	fput(file);
+
+	return 0;
+
+err_out_fput:
+	fput(file);
+err_out:
+	return err;
+}
+
+static void posix_kevent_fini_timer(struct k_itimer *tmr)
+{
+	kevent_storage_fini(&tmr->st);
+}
+#else
+static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
+{
+	return -ENOSYS;
+}
+static int posix_kevent_init(void)
+{
+	return 0;
+}
+static void posix_kevent_fini_timer(struct k_itimer *tmr)
+{
+}
+#endif
+
+
 /*
  * Initialize everything, well, just everything in Posix clocks/timers ;)
  */
@@ -241,6 +337,11 @@ static __init int init_posix_timers(void)
 	register_posix_clock(CLOCK_REALTIME, &clock_realtime);
[take34 3/10] kevent: poll/select() notifications.
poll/select() notifications.

This patch includes generic poll/select notifications.
kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller, but through
process wakeup, there are a lot of allocations, and so on).

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/fs/file_table.c b/fs/file_table.c
index 4c17a18..46f458c 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -20,6 +20,7 @@
 #include <linux/cdev.h>
 #include <linux/fsnotify.h>
 #include <linux/sysctl.h>
+#include <linux/kevent.h>
 #include <linux/percpu_counter.h>
 
 #include <asm/atomic.h>
@@ -119,6 +120,7 @@ struct file *get_empty_filp(void)
 	f->f_uid = tsk->fsuid;
 	f->f_gid = tsk->fsgid;
 	eventpoll_init_file(f);
+	kevent_init_file(f);
 	/* f->f_version: 0 */
 	return f;
 
@@ -164,6 +166,7 @@ void fastcall __fput(struct file *file)
 	 * in the file cleanup chain.
 	 */
 	eventpoll_release(file);
+	kevent_cleanup_file(file);
 	locks_remove_flock(file);
 
 	if (file->f_op && file->f_op->release)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 186da81..59e6069 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -280,6 +280,7 @@ extern int dir_notify_enable;
 #include <linux/init.h>
 #include <linux/pid.h>
 #include <linux/mutex.h>
+#include <linux/kevent_storage.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -408,6 +409,8 @@ struct address_space_operations {
 
 	int (*readpages)(struct file *filp, struct address_space *mapping,
 			struct list_head *pages, unsigned nr_pages);
+	int (*aio_readpages)(struct file *filp, struct address_space *mapping,
+			struct list_head *pages, unsigned nr_pages, void *priv);
 
 	/*
 	 * ext3 requires that a successful prepare_write() call be followed
@@ -578,6 +581,10 @@ struct inode {
 	struct mutex		inotify_mutex;	/* protects the watches list */
 #endif
 
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+	struct kevent_storage	st;
+#endif
+
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
@@ -737,6 +744,9 @@ struct file {
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..58129fa
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,234 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static struct kmem_cache *kevent_poll_container_cache;
+static struct kmem_cache *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+	struct poll_table_struct	pt;
+	struct kevent			*k;
+};
+
+struct kevent_poll_wait_container
+{
+	struct list_head		container_entry;
+	wait_queue_head_t		*whead;
+	wait_queue_t			wait;
+	struct kevent			*k;
+};
+
+struct kevent_poll_private
+{
+	struct list_head		container_list;
+	spinlock_t			container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+		unsigned mode, int sync, void *key)
+{
+	struct kevent_poll_wait_container *cont =
+		container_of(wait, struct kevent_poll_wait_container, wait);
+	struct kevent *k = cont->k;
+
+	kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+	return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+		struct poll_table_struct *poll_table)
+{
+	struct kevent *k =
+		container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+	struct kevent_poll_private *priv = k->priv;
+
[take34 9/10] kevent: Private userspace notifications.
Private userspace notifications.

Allows registering notifications of any private userspace events over
kevent. Events can be marked as ready using the kevent_ctl(KEVENT_READY)
command.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/kernel/kevent/kevent_unotify.c b/kernel/kevent/kevent_unotify.c
new file mode 100644
index 0000000..618c09c
--- /dev/null
+++ b/kernel/kevent/kevent_unotify.c
@@ -0,0 +1,62 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/kevent.h>
+
+static int kevent_unotify_callback(struct kevent *k)
+{
+	return 1;
+}
+
+int kevent_unotify_enqueue(struct kevent *k)
+{
+	int err;
+
+	err = kevent_storage_enqueue(&k->user->st, k);
+	if (err)
+		goto err_out_exit;
+
+	if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE)
+		kevent_requeue(k);
+
+	return 0;
+
+err_out_exit:
+	return err;
+}
+
+int kevent_unotify_dequeue(struct kevent *k)
+{
+	kevent_storage_dequeue(k->st, k);
+	return 0;
+}
+
+static int __init kevent_init_unotify(void)
+{
+	struct kevent_callbacks sc = {
+		.callback = &kevent_unotify_callback,
+		.enqueue = &kevent_unotify_enqueue,
+		.dequeue = &kevent_unotify_dequeue,
+		.flags = 0,
+	};
+
+	return kevent_add_callbacks(&sc, KEVENT_UNOTIFY);
+}
+module_init(kevent_init_unotify);
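
The registration step at the end of this patch (a set of callbacks handed to kevent_add_callbacks() for one event type) can be pictured with a small userspace sketch. This is an illustration only, not code from the kevent tree: struct demo_callbacks, demo_add_callbacks(), demo_enqueue() and DEMO_TYPE_MAX are hypothetical names, and the idea that the callbacks are copied into a per-type table is an assumption. The -ENOSYS result for an unregistered type follows the behaviour listed in the cover letter's changelog.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_TYPE_MAX 8	/* hypothetical number of event types */

struct demo_event;	/* opaque in this sketch */

/* Hypothetical analogue of struct kevent_callbacks. */
struct demo_callbacks {
	int (*callback)(struct demo_event *e);
	int (*enqueue)(struct demo_event *e);
	int (*dequeue)(struct demo_event *e);
	unsigned int flags;
};

/* One slot per event type; an all-zero slot means "not registered". */
static struct demo_callbacks demo_table[DEMO_TYPE_MAX];

/* Copy a complete callback set into the slot for the given event type. */
int demo_add_callbacks(const struct demo_callbacks *cb, int type)
{
	if (type < 0 || type >= DEMO_TYPE_MAX)
		return -EINVAL;
	if (cb == NULL || !cb->callback || !cb->enqueue || !cb->dequeue)
		return -EINVAL;
	demo_table[type] = *cb;
	return 0;
}

/* Dispatch an enqueue request to whatever handler owns this type. */
int demo_enqueue(struct demo_event *e, int type)
{
	if (type < 0 || type >= DEMO_TYPE_MAX || !demo_table[type].enqueue)
		return -ENOSYS;
	return demo_table[type].enqueue(e);
}

/* Trivial handlers used to exercise the table. */
int demo_ready(struct demo_event *e) { (void)e; return 1; }
int demo_noop(struct demo_event *e) { (void)e; return 0; }
```

Copying the structure by value, as assumed here, is what would let the caller build it on the stack inside an init function, which is how the patch above declares its callback set.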
Re: [tipc-discussion] [RFC: 2.6 patch] net/tipc/: possible cleanups
Adrian Bunk wrote:
> This patch contains the following possible cleanups:
> - make needlessly global functions static
> - #if 0 unused functions

Thanks. I think most of those were due for our next release, anyway.
But we'll get it in, one way or another.

> - remove all EXPORT_SYMBOL's
>
> My impression is that most of this might have users that are not yet
> submitted for inclusion in the kernel - one year after TIPC was merged.

Not quite. The exported symbols belong to a public API for driver
programmers. We know about several users of this API, and there will be
more, but I don't think any of them are aspiring to have their code be
included in the kernel.

> If this is true, please submit the users for inclusion in the kernel.
>
> Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
>
> ---
>
>  include/net/tipc/tipc.h      |   60
>  include/net/tipc/tipc_port.h |    9
>  net/tipc/addr.c              |    2
>  net/tipc/cluster.c           |    2
>  net/tipc/cluster.h           |    1
>  net/tipc/config.c            |    9 +++-
>  net/tipc/config.h            |    7 ---
>  net/tipc/core.c              |   74 ++-
>  net/tipc/core.h              |   14 --
>  net/tipc/dbg.c               |   15 +--
>  net/tipc/dbg.h               |    3 -
>  net/tipc/discover.c          |    4 +
>  net/tipc/discover.h          |    1
>  net/tipc/link.c              |   14 +++---
>  net/tipc/link.h              |    4 -
>  net/tipc/name_table.c        |    3 -
>  net/tipc/node.c              |    6 +-
>  net/tipc/node.h              |    1
>  net/tipc/port.c              |   59 +--
>  net/tipc/port.h              |    2
>  net/tipc/subscr.c            |    2
>  net/tipc/zone.c              |    2
>  net/tipc/zone.h              |    1
>  23 files changed, 97 insertions(+), 198 deletions(-)
>
> --- linux-2.6.20-rc4-mm1/include/net/tipc/tipc.h.old	2007-01-24 19:12:15.000000000 +0100
> +++ linux-2.6.20-rc4-mm1/include/net/tipc/tipc.h	2007-01-24 20:40:58.000000000 +0100
> @@ -50,8 +50,6 @@
>   * TIPC operating mode routines
>   */
>  
> -u32 tipc_get_addr(void);
> -
>  #define TIPC_NOT_RUNNING  0
>  #define TIPC_NODE_MODE    1
>  #define TIPC_NET_MODE     2
> @@ -62,8 +60,6 @@
>  
>  void tipc_detach(unsigned int userref);
>  
> -int tipc_get_mode(void);
> -
>  /*
>   * TIPC port manipulation routines
>   */
> @@ -153,12 +149,8 @@
>  
>  int tipc_shutdown(u32 ref); /* Sends SHUTDOWN msg */
>  
> -int tipc_isconnected(u32 portref, int *isconnected);
> -
>  int tipc_peer(u32 portref, struct tipc_portid *peer);
>  
> -int tipc_ref_valid(u32 portref);
> -
>  /*
>   * TIPC messaging routines
>   */
> @@ -170,38 +162,12 @@
>  	      unsigned int num_sect,
>  	      struct iovec const *msg_sect);
>  
> -int tipc_send_buf(u32 portref,
> -		  struct sk_buff *buf,
> -		  unsigned int dsz);
> -
>  int tipc_send2name(u32 portref,
>  		   struct tipc_name const *name,
>  		   u32 domain,	/* 0:own zone */
>  		   unsigned int num_sect,
>  		   struct iovec const *msg_sect);
>  
> -int tipc_send_buf2name(u32 portref,
> -		       struct tipc_name const *name,
> -		       u32 domain,
> -		       struct sk_buff *buf,
> -		       unsigned int dsz);
> -
> -int tipc_forward2name(u32 portref,
> -		      struct tipc_name const *name,
> -		      u32 domain,   /*0: own zone */
> -		      unsigned int section_count,
> -		      struct iovec const *msg_sect,
> -		      struct tipc_portid const *origin,
> -		      unsigned int importance);
> -
> -int tipc_forward_buf2name(u32 portref,
> -			  struct tipc_name const *name,
> -			  u32 domain,
> -			  struct sk_buff *buf,
> -			  unsigned int dsz,
> -			  struct tipc_portid const *orig,
> -			  unsigned int importance);
> -
>  int tipc_send2port(u32 portref,
>  		   struct tipc_portid const *dest,
>  		   unsigned int num_sect,
> @@ -212,20 +178,6 @@
>  		       struct sk_buff *buf,
>  		       unsigned int dsz);
>  
> -int tipc_forward2port(u32 portref,
> -		      struct tipc_portid const *dest,
> -		      unsigned int num_sect,
> -		      struct iovec const *msg_sect,
> -		      struct tipc_portid const *origin,
> -		      unsigned int importance);
> -
> -int tipc_forward_buf2port(u32 portref,
> -			  struct tipc_portid const *dest,
> -			  struct sk_buff *buf,
> -			  unsigned int dsz,
> -			  struct tipc_portid const *orig,
> -			  unsigned int importance);
> -
>  int tipc_multicast(u32 portref,
>  		   struct tipc_name_seq const *seq,
>  		   u32 domain,	/* 0:own zone */
> @@ -240,18 +192,6 @@
>  		       unsigned int size);
>  #endif
>  
> -/*
> - * TIPC subscription routines
> - */
> -
> -int tipc_ispublished(struct tipc_name const *name);
> -
> -/*
> - * Get number of available nodes
[ANNOUNCE] PRO/1000 PCI-e Software Developer Manual is now available
The Software Developer Manual for the PRO/1000 PCI-e controllers is now
available via the http://e1000.sf.net/ web site. The file is
OpenSDM_8257x-10.pdf.

I know it's been a long time coming but sometimes that's just how it
goes. Enjoy.

-- 
Cheers,
John
Re: [ANNOUNCE] PRO/1000 PCI-e Software Developer Manual is now available
John Ronciak wrote:
> The Software Developer Manual for the PRO/1000 PCI-e controllers is now
> available via the http://e1000.sf.net/ web site. The file is
> OpenSDM_8257x-10.pdf.
>
> I know it's been a long time coming but sometimes that's just how it
> goes. Enjoy.

Nice, thanks for posting it!

	Jeff
Re: [ANNOUNCE] PRO/1000 PCI-e Software Developer Manual is now available
On Thu, Jan 25, 2007 at 01:40:23PM -0800, John Ronciak wrote:
> The Software Developer Manual for the PRO/1000 PCI-e controllers is now
> available via the http://e1000.sf.net/ web site. The file is
> OpenSDM_8257x-10.pdf.
>
> I know it's been a long time coming but sometimes that's just how it
> goes. Enjoy.

I congratulate (and thank) you, sir!

Now, if we could only get such an announcement from the wireless side
of Intel's house... :-)

John
-- 
John W. Linville
[EMAIL PROTECTED]
Re: + oops-in-drivers-net-shaperc.patch added to -mm tree
From: [EMAIL PROTECTED]
Date: Wed, 24 Jan 2007 19:54:51 -0800

> Hi,
>
> The following code:
> [...]
> Causes the following oops:
> ...
> [ 66.355188] [c0396c74] error_code+0x7c/0x84
> [ 66.355192] [f8adaf03] packet_sendmsg+0x147/0x201 [af_packet]
> [ 66.355199] [c030e1c5] sock_sendmsg+0xf9/0x116
> [ 66.355204] [c030eb54] sys_sendto+0xbf/0xe0
> [ 66.355208] [c030f494] sys_socketcall+0x1aa/0x277
> [ 66.355212] [c01041ea] sysenter_past_esp+0x5f/0x99
> [ 66.355216] ===
> [ 66.355218] Code: Bad EIP value.
> [ 66.355223] EIP: [] 0x0 SS:ESP 0068:f6261d70
>
> shaper_header() should check for shaper->dev not being NULL (ie. the
> shaper was actually attached) as in the following patch.
> This happens in mainline too (tested 2.6.19.2).
>
> Signed-off-by: Frederik Deweerdt [EMAIL PROTECTED]
> Cc: David S. Miller [EMAIL PROTECTED]
> Cc: Stephen Hemminger [EMAIL PROTECTED]
> Signed-off-by: Andrew Morton [EMAIL PROTECTED]

Shaper is actually OK.

None of these hardware header callbacks should be invoked if the
device is down. Yet this is what is accidentally being allowed
in the AF_PACKET socket layer.

Shaper makes sure to fail ->open() if shaper->dev is NULL, in order to
prevent this. But AF_PACKET does its check of device state too
late, after the dev->header() call. That's the bug.

I'll fix it like this:

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 594c078..6dc01bd 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -359,6 +359,10 @@ static int packet_sendmsg_spkt(struct kiocb *iocb, struct socket *sock,
 	if (dev == NULL)
 		goto out_unlock;
 
+	err = -ENETDOWN;
+	if (!(dev->flags & IFF_UP))
+		goto out_unlock;
+
 	/*
 	 * You may not queue a frame bigger than the mtu. This is the lowest level
 	 * raw protocol and you must do your own fragmentation at this level.
@@ -407,10 +411,6 @@ static int packet_sendmsg_spkt(struct kiocb *iocb, struct socket *sock,
 	if (err)
 		goto out_free;
 
-	err = -ENETDOWN;
-	if (!(dev->flags & IFF_UP))
-		goto out_free;
-
 	/*
 	 * Now send it
 	 */
@@ -738,6 +738,10 @@ static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
 	if (sock->type == SOCK_RAW)
 		reserve = dev->hard_header_len;
 
+	err = -ENETDOWN;
+	if (!(dev->flags & IFF_UP))
+		goto out_unlock;
+
 	err = -EMSGSIZE;
 	if (len > dev->mtu+reserve)
 		goto out_unlock;
@@ -770,10 +774,6 @@ static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
 	skb->dev = dev;
 	skb->priority = sk->sk_priority;
 
-	err = -ENETDOWN;
-	if (!(dev->flags & IFF_UP))
-		goto out_free;
-
 	/*
 	 * Now send it
 	 */
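
The reordering in this fix (test the device state before any per-device callback runs) can be shown in isolation with a small userspace model. This is only a sketch under stated assumptions: struct fake_dev, fake_send(), ok_header() and FAKE_IFF_UP are hypothetical stand-ins and do not exist in the kernel, and the -ENODEV result for a missing device is just a choice made for the sketch.

```c
#include <errno.h>
#include <stddef.h>

#define FAKE_IFF_UP 0x1	/* hypothetical stand-in for IFF_UP */

/* Hypothetical, simplified model of a network device. */
struct fake_dev {
	unsigned int flags;
	/* Header builder; only safe to call while the device is up. */
	int (*build_header)(struct fake_dev *dev, size_t len);
};

/* A header callback that always succeeds, for the "up" case. */
int ok_header(struct fake_dev *dev, size_t len)
{
	(void)dev;
	(void)len;
	return 0;
}

/*
 * Send path with the state check placed ahead of the callback,
 * mirroring the ordering introduced by the patch above.
 */
int fake_send(struct fake_dev *dev, size_t len)
{
	if (dev == NULL)
		return -ENODEV;

	/* Check device state first ... */
	if (!(dev->flags & FAKE_IFF_UP))
		return -ENETDOWN;

	/* ... so the callback never runs against a down device. */
	return dev->build_header(dev, len);
}
```

In this model a down device may leave build_header unset, just as a detached shaper has no underlying device; with the check first, fake_send() returns -ENETDOWN without dereferencing the callback.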
[take34 1/10] kevent: Description.
Description.

diff --git a/Documentation/kevent.txt b/Documentation/kevent.txt
new file mode 100644
index 0000000..d6e126f
--- /dev/null
+++ b/Documentation/kevent.txt
@@ -0,0 +1,271 @@
+Description.
+
+int kevent_init(struct kevent_ring *ring, unsigned int ring_size,
+	unsigned int flags);
+
+num - size of the ring buffer in events
+ring - pointer to allocated ring buffer
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value: kevent control file descriptor or negative error value.
+
+ struct kevent_ring
+ {
+   unsigned int ring_kidx, ring_over;
+   struct ukevent event[0];
+ }
+
+ring_kidx - index in the ring buffer where kernel will put new events
+		when kevent_wait() or kevent_get_events() is called
+ring_over - number of overflows of ring_uidx happend from the start.
+	Overflow counter is used to prevent situation when two threads
+	are going to free the same events, but one of them was scheduled
+	away for too long, so ring indexes were wrapped, so when that
+	thread will be awakened, it will free not those events, which
+	it suppose to free.
+
+Example userspace code (ring_buffer.c) can be found on project's homepage.
+
+Each kevent syscall can be so called cancellation point in glibc, i.e. when
+thread has been cancelled in kevent syscall, thread can be safely removed
+and no events will be lost, since each syscall (kevent_wait() or
+kevent_get_events()) will copy event into special ring buffer, accessible
+from other threads or even processes (if shared memory is used).
+
+When kevent is removed (not dequeued when it is ready, but just removed),
+even if it was ready, it is not copied into ring buffer, since if it is
+removed, no one cares about it (otherwise user would wait until it becomes
+ready and got it through usual way using kevent_get_events() or kevent_wait())
+and thus no need to copy it to the ring buffer.
+
+---
+
+
+int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg);
+
+fd - is the file descriptor referring to the kevent queue to manipulate.
+It is created by opening /dev/kevent char device, which is created with
+dynamic minor number and major number assigned for misc devices.
+
+cmd - is the requested operation. It can be one of the following:
+    KEVENT_CTL_ADD - add event notification
+    KEVENT_CTL_REMOVE - remove event notification
+    KEVENT_CTL_MODIFY - modify existing notification
+    KEVENT_CTL_READY - mark existing events as ready, if number of events is zero,
+    	it just wakes up parked in syscall thread
+
+num - number of struct ukevent in the array pointed to by arg
+arg - array of struct ukevent
+
+Return value:
+ number of events processed or negative error value.
+
+When called, kevent_ctl will carry out the operation specified in the
+cmd parameter.
+---
+
+ int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+ 		struct timespec timeout, struct ukevent *buf, unsigned flags);
+
+ctl_fd - file descriptor referring to the kevent queue
+min_nr - minimum number of completed events that kevent_get_events will block
+	 waiting for
+max_nr - number of struct ukevent in buf
+timeout - time to wait before returning less than min_nr
+	  events. If this is -1, then wait forever.
+buf - pointer to an array of struct ukevent.
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value:
+ number of events copied or negative error value.
+
+kevent_get_events will wait timeout milliseconds for at least min_nr completed
+events, copying completed struct ukevents to buf and deleting any
+KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many
+events as possible, but not more than max_nr. In blocking mode it waits until
+timeout or if at least min_nr events are ready.
+
+This function copies event into ring buffer if it was initialized, if ring buffer
+is full, KEVENT_RET_COPY_FAILED flag is set in ret_flags field.
+---
+
+ int kevent_wait(int ctl_fd, unsigned int num, unsigned int old_uidx,
+ 	struct timespec timeout, unsigned int flags);
+
+ctl_fd - file descriptor referring to the kevent queue
+num - number of processed kevents
+old_uidx - the last index user is aware of
+timeout - time to wait until there is free space in kevent queue
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value:
+ number of events copied into ring buffer or negative error value.
+
+This syscall waits until either timeout expires or at least one event becomes
+ready. It also copies events into special ring buffer. If ring buffer is full,
+it waits until there are ready events and then return.
+If kevent is one-shot kevent it is
[take34 0/10] kevent: Generic event handling mechanism.
Generic event handling mechanism.

Kevent is a generic subsystem for handling event notifications.
It supports both level- and edge-triggered events. It is similar to
poll/epoll in some cases, but it is more scalable, it is faster, and
it works with essentially any kind of event.

Events are passed into the kernel through a control syscall and can be
read back through a ring buffer or using the usual syscalls.
Kevent update (i.e. readiness switching) happens directly from the
internals of the appropriate state machine of the underlying subsystem
(like network, filesystem, timer or any other).

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Documentation page:
http://linux-net.osdl.org/index.php/Kevent

Consider for inclusion.

P.S. If you want to be removed from Cc: list just drop me a mail.

Changes from 'take33' patchset:
 * Added optional header pointer and its size into aio_sendfile_path(),
   which allows to send header and file in one syscall instead of
   send(header), open file, sendfile(file).

Changes from 'take32' patchset:
 * Updated documentation (aio_sendfile_path()).
 * Fixed typo in forward declaration.

Changes from 'take31' patchset:
 * Added aio_sendfile_path() - this syscall allows to asynchronously
   transfer file specified by provided pathname to destination socket.
   Opened file descriptor is returned.
 * Added trivial scheduler which selects execution thread. It allows to
   specify given thread 'by-hands', but since kaio provides '-1' it uses
   round-robin to get processing thread. In theory it can be bound to
   scheduler statistics or gamma-ray receiver data.
 * Number of bug fixes in kevent based AIO mpage_readpages().

   Benchmark of the 100 1MB files transfer (files are in VFS already)
   using sync sendfile or this new version shows about 10MB/sec
   performance win for aio_sendfile_path().

Changes from 'take30' patchset:
 * AIO state machine.
 * aio_sendfile() implementation.
 * moved kevent_user_get/kevent_user_put into header.
 * use *zalloc where needed.

Changes from 'take29' patchset:
 * new private userspace notifications - allows to queue any userspace
   private event and then mark it as ready using
   kevent_ctl(KEVENT_READY) command
 * KEVENT_REQ_READY flag - if set kevent will be marked as ready at
   enqueue time
 * port to 2.6.20-rc2 tree (54abb5fcdae74a811ed440ec6556cabc6b24f404
   commit)
 * use struct kmem_cache instead of kmem_cache_t
 * added notification type into search key, this allows to have the same
   id for different types of notifications

Changes from 'take28' patchset:
 * optimized af_unix to use socket notifications
 * changed ALWAYS_QUEUE behaviour with poll/select notifications -
   previously kevent was not queued into poll wait queue when
   ALWAYS_QUEUE flag is set
 * added KEVENT_POLL_POLLRDHUP definition into ukevent.h header
 * libevent-1.2 patch (Jamal, your request is completed, so I'm waiting
   two weeks before starting final countdown :)
   All regression tests passed successfully except test_evbuffer(),
   which crashed on my amd64 linux 2.6 test machine for all types of
   notifications, probably it was fixed in libevent-1.2a version, I did
   not check. Patch and README can be found at project homepage.

Changes from 'take27' patchset:
 * made kevent default yes in non embedded case.
 * added flags to callback structures - currently used to check if
   kevent can be requested from kernelspace only (posix timers) or
   userspace (all others)

Changes from 'take26' patchset:
 * made kevent visible in config only in case of embedded setup.
 * added comment about KEVENT_MAX number.
 * spell fix.

Changes from 'take25' patchset:
 * use timespec as timeout parameter.
 * added high-resolution timer to handle absolute timeouts.
 * added flags to waiting and initialization syscalls.
 * kevent_commit() has new_uidx parameter.
 * kevent_wait() has old_uidx parameter, which, if not equal to u->uidx,
   results in immediate wakeup (useful for the case when entries are
   added asynchronously from kernel (not supported for now)).
 * added interface to mark any event as ready.
 * event POSIX timers support.
 * return -ENOSYS if there is no registered event type.
 * provided file descriptor must be checked for fifo type (spotted by
   Eric Dumazet).
 * signal notifications.
 * documentation update.
 * lighttpd patch updated (the latest benchmarks with lighttpd patch can
   be found in blog).

Changes from 'take24' patchset:
 * new (old (new)) ring buffer implementation with kernel and user
   indexes.
 * added initialization syscall instead of opening /dev/kevent
 * kevent_commit() syscall to commit ring buffer entries
 * changed KEVENT_REQ_WAKEUP_ONE flag to KEVENT_REQ_WAKEUP_ALL, kevent
   wakes only first thread always if that flag is not set
 * KEVENT_REQ_ALWAYS_QUEUE flag. If set, kevent will be queued into
   ready queue instead of copying back
[take34 10/10] kevent: Kevent based AIO (aio_sendfile()/aio_sendfile_path()).
Kevent based AIO (aio_sendfile()/aio_sendfile_path()).

aio_sendfile()/aio_sendfile_path() consists of two major parts: the AIO
state machine and the page processing code.

The former is just a small subsystem which allows callbacks to be queued
for invocation in process context on behalf of a pool of kernel
threads. It allows caches of callbacks to be queued to the local thread
or to any other specified thread. Each cache of callbacks is processed
as long as there are callbacks in it; callbacks can requeue themselves
into the same cache.

The real work is done in the page processing code, which populates
pages into the VFS cache and then sends them to the destination socket
via ->sendpage(). Unlike the previous aio_sendfile() implementation, the
new one does not require low-level filesystem specific callbacks
(->get_block()) at all. Instead I extended struct
address_space_operations to contain a new member called
->aio_readpages(), which is exactly the same as ->readpage() (read:
mpage_readpages()) except for different BIO allocation and submission
routines.

I changed mpage_readpages() to provide mpage_alloc() and
mpage_bio_submit() to a new function called __mpage_readpages(), which
is exactly the old mpage_readpages() with invocation of the provided
callbacks instead of direct use of the old functions.
mpage_readpages_aio() provides kevent specific callbacks, which call the
old functions, but with different destructor callbacks, which are
essentially the same, except that they reschedule AIO processing.

aio_sendfile_path() is essentially aio_sendfile(), except that it takes
the source filename as a parameter and returns the opened file
descriptor.

Benchmark of the 100 1MB files transfer (files are in VFS already) using
sync sendfile() against aio_sendfile_path() shows about 10MB/sec
performance win (78 MB/s vs 66-72 MB/s over 1 Gb network, sendfile
sending server is one-way AMD Athlon 64 3500+) for aio_sendfile_path().
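
The processing rule described above for a cache of callbacks (run until the cache is empty, with callbacks allowed to requeue themselves) can be sketched in userspace as follows. This is a hedged illustration, not the kaio code from the patch: struct cb_cache, cache_queue(), cache_run(), countdown() and CACHE_CAP are hypothetical names, the fixed-size ring is an assumption, and the kernel-thread pool is left out entirely.

```c
#include <stddef.h>

#define CACHE_CAP 16	/* hypothetical fixed capacity */

struct cb_cache;

/* A queued callback; returning nonzero asks to be queued again. */
typedef int (*cache_fn)(struct cb_cache *c, void *arg);

struct cb_entry {
	cache_fn fn;
	void *arg;
};

/* Hypothetical ring-shaped cache of pending callbacks. */
struct cb_cache {
	struct cb_entry slot[CACHE_CAP];
	size_t head, count;
};

/* Append a callback; returns -1 when the cache is full. */
int cache_queue(struct cb_cache *c, cache_fn fn, void *arg)
{
	size_t tail;

	if (c->count == CACHE_CAP)
		return -1;
	tail = (c->head + c->count) % CACHE_CAP;
	c->slot[tail].fn = fn;
	c->slot[tail].arg = arg;
	c->count++;
	return 0;
}

/* Run callbacks until the cache is empty; returns the number of calls. */
size_t cache_run(struct cb_cache *c)
{
	size_t calls = 0;

	while (c->count > 0) {
		struct cb_entry e = c->slot[c->head];

		c->head = (c->head + 1) % CACHE_CAP;
		c->count--;
		calls++;
		if (e.fn(c, e.arg))
			cache_queue(c, e.fn, e.arg);	/* requeue into the same cache */
	}
	return calls;
}

/* Example callback: decrement a counter, requeue while it stays positive. */
int countdown(struct cb_cache *c, void *arg)
{
	int *left = arg;

	(void)c;
	return --(*left) > 0;
}
```

A callback that still has work left simply returns nonzero and is appended behind whatever else is pending, which gives the "processed until empty" behaviour without recursion.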
The AIO state machine is a base for network AIO (which becomes quite
trivial), but I will not start the implementation until the roadmap of
kevent as a whole and of the AIO implementation becomes clearer.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/fs/bio.c b/fs/bio.c
index 7618bcb..291e7e8 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -120,7 +120,7 @@ void bio_free(struct bio *bio, struct bio_set *bio_set)
 /*
  * default destructor for a bio allocated with bio_alloc_bioset()
  */
-static void bio_fs_destructor(struct bio *bio)
+void bio_fs_destructor(struct bio *bio)
 {
 	bio_free(bio, fs_bio_set);
 }
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index beaf25f..f08c957 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1650,6 +1650,13 @@ ext3_readpages(struct file *file, struct address_space *mapping,
 	return mpage_readpages(mapping, pages, nr_pages, ext3_get_block);
 }
 
+static int
+ext3_readpages_aio(struct file *file, struct address_space *mapping,
+		struct list_head *pages, unsigned nr_pages, void *priv)
+{
+	return mpage_readpages_aio(mapping, pages, nr_pages, ext3_get_block, priv);
+}
+
 static void ext3_invalidatepage(struct page *page, unsigned long offset)
 {
 	journal_t *journal = EXT3_JOURNAL(page->mapping->host);
@@ -1768,6 +1775,7 @@ static int ext3_journalled_set_page_dirty(struct page *page)
 }
 
 static const struct address_space_operations ext3_ordered_aops = {
+	.aio_readpages	= ext3_readpages_aio,
 	.readpage	= ext3_readpage,
 	.readpages	= ext3_readpages,
 	.writepage	= ext3_ordered_writepage,
diff --git a/fs/mpage.c b/fs/mpage.c
index 692a3e5..e5ba44b 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -102,7 +102,7 @@ static struct bio *mpage_bio_submit(int rw, struct bio *bio)
 static struct bio *
 mpage_alloc(struct block_device *bdev,
 		sector_t first_sector, int nr_vecs,
-		gfp_t gfp_flags)
+		gfp_t gfp_flags, void *priv)
 {
 	struct bio *bio;
 
@@ -116,6 +116,7 @@ mpage_alloc(struct block_device *bdev,
 	if (bio) {
 		bio->bi_bdev = bdev;
 		bio->bi_sector = first_sector;
+		bio->bi_private = priv;
 	}
 	return bio;
 }
@@ -175,7 +176,10 @@ map_buffer_to_page(struct page *page, struct buffer_head *bh, int page_block)
 static struct bio *
 do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 		sector_t *last_block_in_bio, struct buffer_head *map_bh,
-		unsigned long *first_logical_block, get_block_t get_block)
+		unsigned long *first_logical_block, get_block_t get_block,
+		struct bio *(*alloc)(struct block_device *bdev, sector_t first_sector,
+			int nr_vecs, gfp_t gfp_flags, void *priv),
+		struct bio *(*submit)(int rw, struct bio *bio), void *priv)
 {
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
@@ -302,25 +306,25 @@ do_mpage_readpage(struct bio *bio, struct page *page,
Re: [take34 0/10] kevent: Generic event handling mechanism.
On Thu, Jan 25, 2007 at 04:48:30PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> Changes from 'take33' patchset:
>  * Added optional header pointer and its size into aio_sendfile_path(),
>    which allows to send header and file in one syscall instead of
>    send(header), open file, sendfile(file).

Btw, aio_sendfile() and aio_sendfile_path() use a naive and actually the
simplest approach to async IO: they just stupidly block on sending or
resend (like the repeated sending approach). I'm a bit lazy to use
kevent there, since after a somewhat deeper analysis there is _no_ gain
(hint: there are multiple IO threads, some of them might block), and
network AIO does not exist (yet; kevent status is in a hinged state, and
I was asked to postpone additional feature add-ons, which could otherwise
happen a bit more frequently than the current kevent/kernel releases),
because the future of kevent is indeterminate...

-- 
	Evgeniy Polyakov