Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()
On Fri, 14 Dec 2007 16:10:44 +0800 Herbert Xu [EMAIL PROTECTED] wrote:

> [EMAIL PROTECTED] wrote:
>
> > diff -puN drivers/net/cxgb3/cxgb3_main.c~net-use-mutex_is_locked-for-assert_rtnl drivers/net/cxgb3/cxgb3_main.c
> > --- a/drivers/net/cxgb3/cxgb3_main.c~net-use-mutex_is_locked-for-assert_rtnl
> > +++ a/drivers/net/cxgb3/cxgb3_main.c
> > @@ -2191,7 +2191,7 @@ static void check_t3b2_mac(struct adapte
> >  {
> >  	int i;
> >
> > -	if (!rtnl_trylock())	/* synchronize with ifdown */
> > +	if (rtnl_is_locked())	/* synchronize with ifdown */
> >  		return;
> >
> >  	for_each_port(adapter, i) {
> > @@ -2219,7 +2219,6 @@ static void check_t3b2_mac(struct adapte
> >  			p->mac.stats.num_resets++;
> >  		}
> >  	}
> > -	rtnl_unlock();
>
> This doesn't look right.  It seems that they really want trylock
> here so we should just fix it by removing the bang.

doh.

> Also, does ASSERT_RTNL still warn when someone calls it from an
> atomic context?  We definitely don't want to lose that check.

I don't see how it could warn about that.  Nor should it - one might want
to check that rtnl_lock is held inside preempt_disable() or spin_lock or
whatever.

It might make sense to warn if ASSERT_RTNL is called in in_interrupt()
contexts though.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()
[EMAIL PROTECTED] wrote:

> diff -puN drivers/net/cxgb3/cxgb3_main.c~net-use-mutex_is_locked-for-assert_rtnl drivers/net/cxgb3/cxgb3_main.c
> --- a/drivers/net/cxgb3/cxgb3_main.c~net-use-mutex_is_locked-for-assert_rtnl
> +++ a/drivers/net/cxgb3/cxgb3_main.c
> @@ -2191,7 +2191,7 @@ static void check_t3b2_mac(struct adapte
>  {
>  	int i;
>
> -	if (!rtnl_trylock())	/* synchronize with ifdown */
> +	if (rtnl_is_locked())	/* synchronize with ifdown */
>  		return;
>
>  	for_each_port(adapter, i) {
> @@ -2219,7 +2219,6 @@ static void check_t3b2_mac(struct adapte
>  			p->mac.stats.num_resets++;
>  		}
>  	}
> -	rtnl_unlock();

This doesn't look right.  It seems that they really want trylock
here so we should just fix it by removing the bang.

Also, does ASSERT_RTNL still warn when someone calls it from an
atomic context?  We definitely don't want to lose that check.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()
On Fri, Dec 14, 2007 at 12:22:09AM -0800, Andrew Morton wrote:

> I don't see how it could warn about that.  Nor should it - one might want
> to check that rtnl_lock is held inside preempt_disable() or spin_lock or
> whatever.
>
> It might make sense to warn if ASSERT_RTNL is called in in_interrupt()
> contexts though.

Well the paths where ASSERT_RTNL is used should never be in an atomic
context.  In the past it has been quite useful in pointing out bogus
locking practices.

There is currently one path where it's known to warn because of this and
it (promiscuous mode) is on my todo list.

Oh and it only warns when you have mutex debugging enabled.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [patch] add tcp congestion control relevant parts
Hello Linux networking folk,

I received the patch below for the tcp.7 man page.  Would anybody here be
prepared to review the new material / double-check the details?

Cheers,

Michael

-------- Original Message --------
Subject: [patch] add tcp congestion control relevant parts
Date: Wed, 12 Dec 2007 16:40:23 +0100
From: Thomas Egerer [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]

Hello *,

man-pages version: 2.70 from http://www.kernel.org/pub/linux/docs/man-pages/

All required information was obtained by reading the kernel
code/documentation.  I'm not sure whether it is completely bullet-proof
on when the sysctl variables/socket option first appeared in the kernel,
so you might as well drop this information, but I'm pretty sure about
how it works.  Here we go with my patch:

diff -ru man-pages-2.70/man7/tcp.7 man-pages-2.70.new/man7/tcp.7
--- man-pages-2.70/man7/tcp.7	2007-11-24 14:33:34.0 +0100
+++ man-pages-2.70.new/man7/tcp.7	2007-12-12 16:34:52.0 +0100
@@ -177,8 +177,6 @@
 .\" FIXME As at Sept 2006, kernel 2.6.18-rc5, the following are
 .\" not yet documented (shown with default values):
 .\"
-.\" /proc/sys/net/ipv4/tcp_congestion_control (since 2.6.13)
-.\"	bic
 .\" /proc/sys/net/ipv4/tcp_moderate_rcvbuf
 .\"	1
 .\" /proc/sys/net/ipv4/tcp_no_metrics_save
@@ -224,6 +222,20 @@
 are reserved for the application buffer.
 A value of 0 implies that no amount is reserved.
+.TP
+.BR tcp_allowed_congestion_control \
+ (String; default: cubic reno) (since 2.6.13)
+Show/set the congestion control choices available to non-privileged
+processes.  The list is a subset of those listed in
+.IR tcp_available_congestion_control .
+Default is "cubic reno" and the default setting
+.RI ( tcp_congestion_control ).
+.TP
+.BR tcp_available_congestion_control \
+ (String; default: cubic reno) (since 2.6.13)
+Lists the TCP congestion control algorithms available on the system.
+This value can only be changed by loading/unloading modules responsible
+for congestion control.
 .\"
 .\" The following is from 2.6.12: Documentation/networking/ip-sysctl.txt
 .TP
@@ -257,6 +269,17 @@
 Allows two flows sharing the same connection to converge
 more rapidly.
 .TP
+.BR tcp_congestion_control " (String; default: cubic reno) (since 2.6.13)"
+Determines the congestion control algorithm used for newly created TCP
+sockets.  By default Linux uses cubic with reno as fallback.  If you want
+to have more control over the algorithm used, you must enable the symbol
+CONFIG_TCP_CONG_ADVANCED in your kernel config.
+You can use
+.BR setsockopt (2)
+to individually change the algorithm on a single socket.
+Requires CAP_NET_ADMIN or the congestion algorithm to be listed in
+.IR tcp_allowed_congestion_control .
+.TP
 .BR tcp_dsack (Boolean; default: enabled)
 Enable RFC\ 2883 TCP Duplicate SACK support.
 .TP
@@ -649,7 +672,21 @@
 socket options are valid on TCP sockets.
 For more information see
 .BR ip (7).
-.\" FIXME Document TCP_CONGESTION (new in 2.6.13)
+.TP
+.BR TCP_CONGESTION (new since kernel version 2.6.13)
+If set to the name of an available congestion control algorithm,
+it will henceforth be used for the socket.  To get a list of
+available congestion control algorithms, consult the sysctl variable
+.IR net.ipv4.tcp_available_congestion_control .
+The algorithm that is used by default for all newly created
+TCP sockets can be viewed/changed via the sysctl variable
+.IR net.ipv4.tcp_congestion_control .
+If you feel you are missing an algorithm in the list,
+you may try to load the corresponding module using
+.BR modprobe (8);
+or, if your kernel is built with module autoloading support
+.RI ( CONFIG_KMOD )
+and the algorithm has been compiled as a module, it will be autoloaded.
 .TP
 .B TCP_CORK
 If set, don't send out partial frames.

--
Michael Kerrisk
Maintainer of the Linux man-pages project
http://www.kernel.org/doc/man-pages/
Want to report a man-pages bug?
Look here: http://www.kernel.org/doc/man-pages/reporting_bugs.html
Re: [PATCH 1/2] add driver for enc28j60 ethernet chip
Hi Stephen,

thank you for your suggestions.  I already applied the trivial fixes, but
I have questions on some points, see inline.

Stephen Hemminger wrote:

> General comments:
> * device driver does no carrier detection.  This makes it useless for
>   bridging, bonding, or any form of failover.
> * use msglevel method (via ethtool) to control debug messages rather
>   than kernel configuration.  This allows enabling debugging without
>   recompilation, which is important in distributions.
> * Please add ethtool support
> * Consider using NAPI

Can you point me to a possibly simple driver that uses ethtool and NAPI?
Or another example that I can use for reference.  Maybe the skeleton
should be updated.

> * use netdev_priv(netdev) rather than netdev->priv

I can't find where I used netdev->priv; maybe you mean priv->netdev?

> My comments:
>
> > diff --git a/drivers/net/enc28j60.c b/drivers/net/enc28j60.c
> > new file mode 100644
> > index 000..6182473
> > --- /dev/null
> > +++ b/drivers/net/enc28j60.c
> > @@ -0,0 +1,1400 @@
> > +/*
> > + * Microchip ENC28J60 ethernet driver (MAC + PHY)
> > + *
> > + * Copyright (C) 2007 Eurek srl
> > + * Author: Claudio Lanconelli [EMAIL PROTECTED]
> > + * based on enc28j60.c written by David Anders for 2.4 kernel version
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * $Id: enc28j60.c,v 1.10 2007/12/10 16:59:37 claudio Exp $
> > + */
> > +
> > +#include <linux/autoconf.h>
>
> Use msglvl instead, see netdevice.h

Ok

> > +#if CONFIG_ENC28J60_DBGLEVEL > 1
> > +# define VERBOSE_DEBUG
> > +#endif
> > +#if CONFIG_ENC28J60_DBGLEVEL > 0
> > +# define DEBUG
> > +#endif
> > +
> ...
> > +
> > +#define MY_TX_TIMEOUT	((500*HZ)/1000)
>
> That is a really short TX timeout, should be 2 seconds at least, not
> 1/2 sec.  Having it less than a second causes increased wakeups.

Ok

> > +
> > +/* Max TX retries in case of collision as suggested by errata datasheet */
> > +#define MAX_TX_RETRYCOUNT	16
> > +
> > +/* Driver local data */
> > +struct enc28j60_net_local {
>
> Rename something shorter like enc28j60_net or just enc28j60?

Ok, renamed enc28j60_net

> > +	struct net_device_stats stats;
>
> net_device_stats are now in net_device.
>
> > +	struct net_device *netdev;
> > +	struct spi_device *spi;
> > +	struct semaphore semlock;	/* protect spi_transfer_buf */
>
> Use mutex (or spin_lock) rather than semaphore

Ok

> > +	uint8_t *spi_transfer_buf;
> > +	struct sk_buff *tx_skb;
> > +	struct work_struct tx_work;
> > +	struct work_struct irq_work;
>
> Not sure why you need to have workqueue's for tx_work and irq_work,
> rather than using a spin_lock and doing directly.

I need irq_work for sure because it needs to go to sleep.  All accesses
to enc28j60 registers go through a blocking SPI transaction, spi_sync().
I'm not sure if hard_start_xmit() can sleep, so I used a work queue for
tx too.

> > +	int bank;		/* current register bank selected */
>
> bank is really unsigned.
>
> > +	uint16_t next_pk_ptr;	/* next packet pointer within FIFO */
> > +	int max_pk_counter;	/* statistics: max packet counter */
> > +	int tx_retry_count;
>
> these are used as unsigned.
>
> > +	int hw_enable;
> > +};
> > +
> > +/* Selects Full duplex vs. Half duplex mode */
> > +static int full_duplex = 0;
>
> Use ethtool for this.

Ok

> > +
> > +static int enc28j60_send_packet(struct sk_buff *skb, struct net_device *dev);
> > +static int enc28j60_net_close(struct net_device *dev);
> > +static struct net_device_stats *enc28j60_net_get_stats(struct net_device *dev);
> > +static void enc28j60_set_multicast_list(struct net_device *dev);
> > +static void enc28j60_net_tx_timeout(struct net_device *ndev);
> > +
> > +static int enc28j60_chipset_init(struct net_device *dev);
> > +static void enc28j60_hw_disable(struct enc28j60_net_local *priv);
> > +static void enc28j60_hw_enable(struct enc28j60_net_local *priv);
> > +static void enc28j60_hw_rx(struct enc28j60_net_local *priv);
> > +static void enc28j60_hw_tx(struct enc28j60_net_local *priv);
>
> If you order functions correctly in code, you don't have to waste lots
> of space with all these forward declarations.
> ...

Ok

> > +			const char *msg);
> > +
> > +/*
> > + * SPI read buffer
> > + * wait for the SPI transfer and copy received data to destination
> > + */
> > +static int
> > +spi_read_buf(struct enc28j60_net_local *priv, int len, uint8_t *data)
> > +{
> > +	uint8_t *rx_buf;
> > +	uint8_t *tx_buf;
> > +	struct spi_transfer t;
> > +	struct spi_message msg;
> > +	int ret, slen;
> > +
> > +	slen = 1;
> > +	memset(&t, 0, sizeof(t));
> > +	t.tx_buf = tx_buf = priv->spi_transfer_buf;
> > +	t.rx_buf = rx_buf = priv->spi_transfer_buf + 4;
> > +	t.len = slen + len;
>
> If you use structure initializer you can avoid having to do the memset

Ok

> > +
> > +	down(&priv->semlock);
> > +	tx_buf[0] = ENC28J60_READ_BUF_MEM;
> > +	tx_buf[1] = tx_buf[2] = tx_buf[3] = 0;	/* don't care
Re: [NETFILTER] xt_hashlimit : speedups hash_dst()
Eric Dumazet wrote:

> 1) Using jhash2() instead of jhash() is a little bit faster if applicable.
>
> 2) Thanks to jhash, hash value uses full 32 bits.  Instead of returning
>    hash % size (implying a divide) we return the high 32 bits of the
>    (hash * size) that will give results between [0 and size-1] and same
>    hash distribution.  On most cpus, a multiply is less expensive than
>    a divide, by an order of magnitude.

Clever :)

Applied, thanks Eric.
[patch] authorize some users to bind on specifics priv ports
Really simpler and more usable than POSIX capabilities, and I think it
covers the basic needs of sysadmins... at least, it covers mine :-)

www-data$ nc -l -p 80 -v
Can't grab 0.0.0.0:80 with bind : Permission denied

root# id -u www-data
33
root# port_acl_set +80 www-data
root# cat /proc/net/port_acl
80: 33

www-data$ nc -l -p 80 -v
listening on [any] 80 ...

diff -r --unidirectional-new-file -u linux-2.6.23/arch/i386/kernel/syscall_table.S linux-2.6.23-patched/arch/i386/kernel/syscall_table.S
--- linux-2.6.23/arch/i386/kernel/syscall_table.S	2007-10-09 22:31:38.0 +0200
+++ linux-2.6.23-patched/arch/i386/kernel/syscall_table.S	2007-12-13 14:29:40.0 +0100
@@ -324,3 +324,4 @@
 	.long sys_timerfd
 	.long sys_eventfd
 	.long sys_fallocate
+	.long sys_port_acl_set		/* 325 */
diff -r --unidirectional-new-file -u linux-2.6.23/include/asm-i386/unistd.h linux-2.6.23-patched/include/asm-i386/unistd.h
--- linux-2.6.23/include/asm-i386/unistd.h	2007-10-09 22:31:38.0 +0200
+++ linux-2.6.23-patched/include/asm-i386/unistd.h	2007-12-13 14:29:40.0 +0100
@@ -330,10 +330,11 @@
 #define __NR_timerfd		322
 #define __NR_eventfd		323
 #define __NR_fallocate		324
+#define __NR_port_acl_set	325

 #ifdef __KERNEL__

-#define NR_syscalls 325
+#define NR_syscalls 326

 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff -r --unidirectional-new-file -u linux-2.6.23/include/net/port_acl.h linux-2.6.23-patched/include/net/port_acl.h
--- linux-2.6.23/include/net/port_acl.h	1970-01-01 01:00:00.0 +0100
+++ linux-2.6.23-patched/include/net/port_acl.h	2007-12-13 15:17:40.0 +0100
@@ -0,0 +1,15 @@
+#include <linux/types.h>
+
+struct port_acl {
+	uid_t uid;
+	struct port_acl *next;
+};
+
+#ifdef __PORT_ACL__
+	struct port_acl *port_acl_list[1024];
+#else
+	extern struct port_acl *port_acl_list[1024];
+#endif
+
+extern int port_acl(short int);
+extern int port_acl_get_info(char *, char **, off_t, int);
diff -r --unidirectional-new-file -u linux-2.6.23/kernel/sys.c linux-2.6.23-patched/kernel/sys.c
--- linux-2.6.23/kernel/sys.c	2007-10-09 22:31:38.0 +0200
+++ linux-2.6.23-patched/kernel/sys.c	2007-12-13 16:01:24.0 +0100
@@ -43,6 +43,9 @@
 #include <asm/io.h>
 #include <asm/unistd.h>

+#define __PORT_ACL__
+#include <net/port_acl.h>
+
 #ifndef SET_UNALIGN_CTL
 # define SET_UNALIGN_CTL(a,b)	(-EINVAL)
 #endif
@@ -2356,4 +2359,89 @@
 	return ret;
 }

+/*
+ * The following lines were added to implement the port_acl security
+ * mechanism:
+ * port_acl_add - grant a user permission to bind to a particular port
+ * port_acl_remove - revoke that permission
+ * sys_port_acl_set - front end for port_acl_add and port_acl_remove
+ */
+long port_acl_add(short int snum, uid_t uid)
+{
+	struct port_acl *ptr, *new;
+
+	/* check whether the permission is already set for that user */
+	ptr = port_acl_list[snum];
+
+	while (ptr != NULL) {
+		if (ptr->uid == uid)
+			return -EBUSY;
+		if (ptr->next == NULL)
+			break;
+		ptr = ptr->next;
+	}
+
+	/* ok, we haven't found the user and ptr points to the
+	   last structure */
+	new = kmalloc(sizeof(struct port_acl), GFP_KERNEL);
+	if (!new)
+		return -ENOMEM;
+	new->next = NULL;
+	new->uid = uid;
+
+	if (ptr == NULL)
+		port_acl_list[snum] = new;
+	else
+		ptr->next = new;
+
+	return 0;
+}
+
+long port_acl_remove(short int snum, uid_t uid)
+{
+	struct port_acl *ptr, *prev = NULL;
+
+	/* look for the user in the list for that port */
+	ptr = port_acl_list[snum];
+
+	while (ptr != NULL) {
+		/* we found the user */
+		if (ptr->uid == uid) {
+			if (ptr == port_acl_list[snum])
+				port_acl_list[snum] = ptr->next;
+			else
+				prev->next = ptr->next;
+			kfree(ptr);
+			return 0;
+		}
+		prev = ptr;
+		ptr = ptr->next;
+	}
+
+	return -ENODATA;
+}
+
+asmlinkage long sys_port_acl_set(short int snum, uid_t uid, int act)
+{
+	/* the owner of the process must be root */
+	if (current->uid != 0)
+		return -EACCES;
+
+	/* verify that the port is valid */
+	if (snum < 0 || snum > 1023)
+		return -EINVAL;
+
+	if (uid < 1 || uid > 65534)
+		return -EINVAL;
+
+	if (act == 0)
+		return port_acl_remove(snum, uid);
+	else if (act == 1)
+		return port_acl_add(snum, uid);
+	else
+		return -EPERM;
+}
+
+
[NETFILTER] xt_hashlimit : speedups hash_dst()
1) Using jhash2() instead of jhash() is a little bit faster if applicable.

2) Thanks to jhash, hash value uses full 32 bits.  Instead of returning
hash % size (implying a divide) we return the high 32 bits of the
(hash * size) that will give results between [0 and size-1] and same hash
distribution.  On most cpus, a multiply is less expensive than a divide,
by an order of magnitude.

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 033d448..7cc04e8 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -105,7 +105,16 @@ static inline bool dst_cmp(const struct dsthash_ent *ent,
 static u_int32_t
 hash_dst(const struct xt_hashlimit_htable *ht, const struct dsthash_dst *dst)
 {
-	return jhash(dst, sizeof(*dst), ht->rnd) % ht->cfg.size;
+	u_int32_t hash = jhash2((const u32 *)dst,
+				sizeof(*dst)/sizeof(u32),
+				ht->rnd);
+	/*
+	 * Instead of returning hash % ht->cfg.size (implying a divide)
+	 * we return the high 32 bits of the (hash * ht->cfg.size) that will
+	 * give results between [0 and cfg.size-1] and same hash distribution,
+	 * but using a multiply, less expensive than a divide
+	 */
+	return ((u64)hash * ht->cfg.size) >> 32;
 }

 static struct dsthash_ent *
Re: [PATCH] PS3: gelic: Add wireless support for PS3
On Fri, 2007-12-14 at 14:03 +0900, Masakazu Mokuno wrote:

> On Thu, 13 Dec 2007 16:13:38 -0500 Dan Williams [EMAIL PROTECTED] wrote:
>
> > One more question; does the driver work with wpa_supplicant for WPA,
> > or does the firmware capture the EAPOL frames and handle the 4-way
> > handshake internally?  Ideally the firmware would have the ability to
> > pass those frames up unmodified so the driver would at least have a
> > _hope_ of 802.1x capability.  Does the firmware handle Dynamic WEP at
> > all?  Basically, what happens when the AP you've just associated with
> > starts sending you EAPOL traffic to start the 802.1x process?
>
> The PS3 wireless device does the association and 4-way handshake in its
> firmware/hypervisor.  No interventions between them are allowed to the
> guest OSes.  All frames sent/received before the connection process
> completes seem to be dropped by the hardware.  Only static WEP is
> supported.

That sort of sucks; but I guess there's not too much you can do about it.
That probably means that using wpa_supplicant + WPA is completely out of
the picture, which unfortunately makes the PS3 wireless unlike any other
card and would require special-casing the PS3 in userspace tools.

Dan
Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()
> > I agree with this.  IIRC I removed some ASSERT_RTNL()s in the wireless
> > code (or maybe it was only during testing patches) where we had a
> > function that required only the rtnl to be held but in certain
> > contexts was called from within an RCU section.
>
> Please point me to the actual code so I can see if this is legit or not.

I don't think I have that case any more since now my interface list is
either protected by RCU or the rtnl.

johannes
Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()
On Fri, Dec 14, 2007 at 01:37:40PM +0100, Johannes Berg wrote:

> I agree with this.  IIRC I removed some ASSERT_RTNL()s in the wireless
> code (or maybe it was only during testing patches) where we had a
> function that required only the rtnl to be held but in certain contexts
> was called from within an RCU section.

Please point me to the actual code so I can see if this is legit or not.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: 2.6.24-rc5-mm1
Hi Andrew,

I hit this just now.  Not sure if I can reproduce it though.

WARNING: at net/ipv4/tcp_input.c:2533 tcp_fastretrans_alert()
Pid: 4624, comm: yield Not tainted 2.6.24-rc5-mm1 #5
 [<c010582a>] show_trace_log_lvl+0x12/0x22
 [<c0105847>] show_trace+0xd/0xf
 [<c0105959>] dump_stack+0x57/0x5e
 [<c03db95b>] tcp_fastretrans_alert+0xde/0x5bd
 [<c03dcab2>] tcp_ack+0x236/0x2e4
 [<c03dea01>] tcp_rcv_established+0x51e/0x5c0
 [<c03e56f1>] tcp_v4_do_rcv+0x22/0xc4
 [<c03e5c49>] tcp_v4_rcv+0x4b6/0x7f5
 [<c03cd5ad>] ip_local_deliver_finish+0xb9/0x169
 [<c03cd68a>] ip_local_deliver+0x2d/0x34
 [<c03cd91d>] ip_rcv_finish+0x28c/0x2ab
 [<c03cdb16>] ip_rcv+0x1da/0x204
 [<c03b800a>] netif_receive_skb+0x23c/0x26f
 [<c02db326>] tg3_rx+0x246/0x353
 [<c02db4ac>] tg3_poll_work+0x79/0x86
 [<c02db4e8>] tg3_poll+0x2f/0x16f
 [<c03b822b>] net_rx_action+0xbb/0x1a8
 [<c0129596>] __do_softirq+0x73/0xe6
 [<c0129642>] do_softirq+0x39/0x51
 [<c01296c0>] irq_exit+0x47/0x49
 [<c01064f4>] do_IRQ+0x55/0x69
 [<c0105492>] common_interrupt+0x2e/0x34
 =======================

--
regards,
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] add tcp congestion control relevant parts
On Fri, 14 Dec 2007 09:48:32 +0100 Michael Kerrisk [EMAIL PROTECTED] wrote:

> Hello Linux networking folk,
>
> I received the patch below for the tcp.7 man page.  Would anybody here
> be prepared to review the new material / double-check the details?
>
> Cheers,
>
> Michael
>
> -------- Original Message --------
> Subject: [patch] add tcp congestion control relevant parts
> Date: Wed, 12 Dec 2007 16:40:23 +0100
> From: Thomas Egerer [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> CC: [EMAIL PROTECTED]
>
> Hello *,
>
> man-pages version: 2.70 from
> http://www.kernel.org/pub/linux/docs/man-pages/
>
> All required information was obtained by reading the kernel
> code/documentation.  I'm not sure whether it is completely bullet-proof
> on when the sysctl variables/socket option first appeared in the
> kernel, so you might as well drop this information, but I'm pretty sure
> about how it works.  Here we go with my patch:
>
> diff -ru man-pages-2.70/man7/tcp.7 man-pages-2.70.new/man7/tcp.7
> --- man-pages-2.70/man7/tcp.7	2007-11-24 14:33:34.0 +0100
> +++ man-pages-2.70.new/man7/tcp.7	2007-12-12 16:34:52.0 +0100
> @@ -177,8 +177,6 @@
>  .\" FIXME As at Sept 2006, kernel 2.6.18-rc5, the following are
>  .\" not yet documented (shown with default values):
>  .\"
> -.\" /proc/sys/net/ipv4/tcp_congestion_control (since 2.6.13)
> -.\"	bic
>  .\" /proc/sys/net/ipv4/tcp_moderate_rcvbuf
>  .\"	1
>  .\" /proc/sys/net/ipv4/tcp_no_metrics_save
> @@ -224,6 +222,20 @@
>  are reserved for the application buffer.
>  A value of 0 implies that no amount is reserved.
> +.TP
> +.BR tcp_allowed_congestion_control \
> + (String; default: cubic reno) (since 2.6.13)
> +Show/set the congestion control choices available to non-privileged
> +processes.  The list is a subset of those listed in
> +.IR tcp_available_congestion_control .
> +Default is "cubic reno" and the default setting
> +.RI ( tcp_congestion_control ).
> +.TP
> +.BR tcp_available_congestion_control \
> + (String; default: cubic reno) (since 2.6.13)
> +Lists the TCP congestion control algorithms available on the system.
> +This value can only be changed by loading/unloading modules responsible
> +for congestion control.
>  .\"
>  .\" The following is from 2.6.12: Documentation/networking/ip-sysctl.txt
>  .TP
> @@ -257,6 +269,17 @@
>  Allows two flows sharing the same connection to converge
>  more rapidly.
>  .TP
> +.BR tcp_congestion_control " (String; default: cubic reno) (since 2.6.13)"
> +Determines the congestion control algorithm used for newly created TCP
> +sockets.  By default Linux uses cubic with reno as fallback.  If you
> +want to have more control over the algorithm used, you must enable the
> +symbol CONFIG_TCP_CONG_ADVANCED in your kernel config.

You can choose the default congestion control as well as part of the
kernel configuration.

--
Stephen Hemminger [EMAIL PROTECTED]
Packet per Second
Hi all,

It's my first time using usenet...  Well, I work at an ISP and we have a
Linux box acting as a bridge+firewall.  With this bridge+firewall we
control the packet rate per second for each client and for our repeaters.
But I can't measure the packet rate per IP.  Is there any tool for this?

Actually, what I want is to measure the packet rate per IP and generate
graphics with mrtg or rrdtool, but for this I must have the number of
packets per second for each client :)

Thank you all

--
Flávio
--
I'm trying a new usenet client for Mac, Nemo OS X.
You can download it at http://www.malcom-mac.com/nemo
Re: [RFC] mac80211: clean up frame receive handling
> Is there any way for a user space application to figure out whether a
> received EAPOL frame was encrypted?  In theory, WPA/WPA2 Authenticators
> (e.g., hostapd) should verify that the frame was encrypted if pairwise
> keys are set (whereas an IEEE 802.1X Authenticator accepts unencrypted
> EAPOL frames).

Unfortunately not.  Does that really matter?  It seems that the
verification whether the frame was encrypted would either always require
encryption when pairwise keys are in use (which this patch doesn't do
right now but could trivially be done) or simply not care, since it
doesn't really matter.

> Did you/someone already verify that the Linux bridge code does not
> bridge EAPOL frames?  The use of a separate interface for this removed
> the need for doing such filtering based on ethertype, but with EAPOL
> frames using the same netdev as other data frames, the bridge code
> should filter these out (mainly the PAE group addressed ones, but if I
> remember correctly, IEEE 802.1X specified all frames using the EAPOL
> ethertype not to be bridged).

Actually, 802.1X doesn't specify that; as I said previously it
*recommends* it in C.3.3 (not C.1.1 as the 802.11 specs lead you to
believe).  Also, a patch to do this was rejected by Stephen Hemminger, so
I decided to only pass up EAPOL frames that are either for our own
unicast address or the link-local EAPOL address, both of which won't be
bridged.

> I haven't looked into the current implementations and/or proposed
> patches for the TX part, but I would assume that it is possible to
> select whether an EAPOL frame will be encrypted when injecting it(?).

Yes, by setting the F_WEP flag on any frame you decide whether it will be
encrypted (if possible) or not.  Right now, the corresponding hostapd
patch always sets that flag.

johannes
Re: [RFC] mac80211: clean up frame receive handling
> > +static bool ieee80211_frame_allowed(struct ieee80211_txrx_data *rx)
> > +{
> > +	static const u8 pae_group_addr[ETH_ALEN]
> > +		= { 0x01, 0x80, 0xC2, 0x00, 0x00, 0x03 };
> > +	struct ethhdr *ehdr = (struct ethhdr *)rx->skb->data;
> > +
> > +	if (rx->skb->protocol == htons(ETH_P_PAE) &&
> > +	    (compare_ether_addr(ehdr->h_dest, pae_group_addr) == 0 ||
> > +	     compare_ether_addr(ehdr->h_dest, rx->dev->dev_addr) == 0))
> > +		return true;
>
> Should you reverse these two compare_ether_addr calls?
> rx->dev->dev_addr seems more likely for any given packet.  It probably
> makes little difference but it seems like checking for that first would
> still be better.

I think in theory all EAPOL frames are sent to the PAE group address, but
I have no idea which of the checks would be more efficient.  It seems
that the first could be optimised a lot because it's constant too...

johannes
Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()
> I don't see how it could warn about that.  Nor should it - one might
> want to check that rtnl_lock is held inside preempt_disable() or
> spin_lock or whatever.

I agree with this.  IIRC I removed some ASSERT_RTNL()s in the wireless
code (or maybe it was only during testing patches) where we had a
function that required only the rtnl to be held but in certain contexts
was called from within an RCU section.

johannes
Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)
On Wed, 12 Dec 2007, Jay Vosburgh wrote: Herbert Xu [EMAIL PROTECTED] wrote: diff -puN drivers/net/bonding/bond_sysfs.c~bonding-locking-fix drivers/net/bonding/bond_sysfs.c --- a/drivers/net/bonding/bond_sysfs.c~bonding-locking-fix +++ a/drivers/net/bonding/bond_sysfs.c @@ -,8 +,6 @@ static ssize_t bonding_store_primary(str out: write_unlock_bh(bond-lock); - rtnl_unlock(); - Looking at the changeset that added this perhaps the intention is to hold the lock? If so we should add an rtnl_lock to the start of the function. Yes, this function needs to hold locks, and more than just what's there now. I believe the following should be correct; I haven't tested it, though (I'm supposedly on vacation right now). The following change should be correct for the bonding_store_primary case discussed in this thread, and also corrects the bonding_store_active case which performs similar functions. The bond_change_active_slave and bond_select_active_slave functions both require rtnl, bond-lock for read and curr_slave_lock for write_bh, and no other locks. This is so that the lower level mode-specific functions can release locks down to just rtnl in order to call, e.g., dev_set_mac_address with the locks it expects (rtnl only). 
Signed-off-by: Jay Vosburgh [EMAIL PROTECTED] diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c index 11b76b3..28a2d80 100644 --- a/drivers/net/bonding/bond_sysfs.c +++ b/drivers/net/bonding/bond_sysfs.c @@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d, struct slave *slave; struct bonding *bond = to_bond(d); - write_lock_bh(bond-lock); + rtnl_lock(); + read_lock(bond-lock); + write_lock_bh(bond-curr_slave_lock); + if (!USES_PRIMARY(bond-params.mode)) { printk(KERN_INFO DRV_NAME : %s: Unable to set primary slave; %s is in mode %d\n, @@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d, } } out: - write_unlock_bh(bond-lock); - + write_unlock_bh(bond-curr_slave_lock); + read_unlock(bond-lock); rtnl_unlock(); return count; @@ -1190,7 +1193,8 @@ static ssize_t bonding_store_active_slave(struct device *d, struct bonding *bond = to_bond(d); rtnl_lock(); - write_lock_bh(bond-lock); + read_lock(bond-lock); + write_lock_bh(bond-curr_slave_lock); if (!USES_PRIMARY(bond-params.mode)) { printk(KERN_INFO DRV_NAME @@ -1247,7 +1251,8 @@ static ssize_t bonding_store_active_slave(struct device *d, } } out: - write_unlock_bh(bond-lock); + write_unlock_bh(bond-curr_slave_lock); + read_unlock(bond-lock); rtnl_unlock(); return count; Vanilla 2.6.24-rc5 plus this patch: = [ INFO: possible irq lock inversion dependency detected ] 2.6.24-rc5 #1 - events/0/9 just changed the state of lock: (mc-mca_lock){-+..}, at: [c0411c7a] mld_ifc_timer_expire+0x130/0x1fb but this lock took another, soft-read-irq-unsafe lock in the past: (bond-lock){-.--} and interrupts could create inverse lock ordering between them. 
other info that might help us debug this: 4 locks held by events/0/9: #0: (events){--..}, at: [c0133c57] run_workqueue+0x87/0x1b6 #1: ((linkwatch_work).work){--..}, at: [c0133c57] run_workqueue+0x87/0x1b6 #2: (rtnl_mutex){--..}, at: [c03abd50] linkwatch_event+0x5/0x22 #3: (&ndev->lock){-.-+}, at: [c0411b61] mld_ifc_timer_expire+0x17/0x1fb the first lock's dependencies: -> (&mc->mca_lock){-+..} ops: 10 { initial-use at: [c0104ee2] dump_trace+0x83/0x8d [c014289c] __lock_acquire+0x4ba/0xc07 [c0109ef2] save_stack_trace+0x20/0x3a [c0142fa1] __lock_acquire+0xbbf/0xc07 [c0412452] ipv6_dev_mc_inc+0x24d/0x31c [c0143062] lock_acquire+0x79/0x93 [c04120d6] igmp6_group_added+0x18/0x11d [c0439d62] _spin_lock_bh+0x3b/0x64 [c04120d6] igmp6_group_added+0x18/0x11d [c04120d6] igmp6_group_added+0x18/0x11d [c0141f9f] trace_hardirqs_on+0x122/0x14c [c04124a8] ipv6_dev_mc_inc+0x2a3/0x31c [c0412452] ipv6_dev_mc_inc+0x24d/0x31c [c04124dd] ipv6_dev_mc_inc+0x2d8/0x31c [c0412205] ipv6_dev_mc_inc+0x0/0x31c [c0401834] ipv6_add_dev+0x21c/0x24b [c040b07d] ndisc_ifinfo_sysctl_change+0x0/0x1ef [c05c5b40] addrconf_init+0x13/0x193 [c0199f63] proc_net_fops_create+0x10/0x21
Re: [Bridge] Packet per Second
On Fri, 14 Dec 2007 15:34:10 + (UTC) Flávio Pires [EMAIL PROTECTED] wrote: Hi all, It's my first time using usenet... Well, I work on an ISP and we have a linux box acting as a bridge+firewall. With this bridge+firewall we control the packet rate per second from each client and from our repeaters. But I can`t measure the packet rate per IP. Is there any tool for this? Actually, what I want is to measure the packet rate per IP and generate graphics with mrtg or rrdtool, but for this I must have the number of packets per second of each client :) Thank you all -- Flávio Not that I know of, but you might look at: http://www.bandwidtharbitrator.com/ -- Stephen Hemminger [EMAIL PROTECTED]
Re: [Bridge] Packet per Second
In article [EMAIL PROTECTED] Stephen Hemminger [EMAIL PROTECTED] wrote: On Fri, 14 Dec 2007 15:34:10 + (UTC) Flávio Pires [EMAIL PROTECTED] wrote: Hi all, It's my first time using usenet... Well, I work on an ISP and we have a linux box acting as a bridge+firewall. With this bridge+firewall we control the packet rate per second from each client and from our repeaters. But I can`t measure the packet rate per IP. Is there any tool for this? Actually, what I want is to measure the packet rate per IP and generate graphics with mrtg or rrdtool, but for this I must have the number of packets per second of each client :) Thank you all -- Flávio Not that I know of, but you might look at: http://www.bandwidtharbitrator.com/ Yeah, we have a proprietary solution from etinc - it does bandwidth control and firewall... but using its firewall made this machine too slow, so we created a box just for the firewall... Then we need a way to measure pps per host so we can determine which limits fit best for our clients and our own needs. -- I'm trying a new usenet client for Mac, Nemo OS X. You can download it at http://www.malcom-mac.com/nemo
RE: What was the reason for 2.6.22 SMP kernels to change how sendmsg is called?
do not express your frustration ... wasn't frustration but rather playful sarcasm. This was not a bug report at all... Wasn't really meant to be a true blue bug report (my bad I guess). Anywho, I know you guys have big fish to fry so I tried to keep it short and to the point. I knew something had changed and I was truly stumped in trying to figure out what it was so I decided to ask for some general guidance. Without having your code it is virtually impossible to say. I know this is partly my own fault for not stating so explicitly in my first email. However, as I stated in my second email, I would have been happy to send it to anyone that expressed an interest (even though the issue wasn't interesting in itself) in my post. I just thought, being the experts you are, that having only the small subset of code that is actually involved in the offending call, you'd be able to say go take a look at commit such-n-such which you have now done. Thanks a million! I'm not trying to start a nag war here, I know you guys are busy. Having both (mostly me) made pitiful assumptions I think we have reached an understanding on this (now dead) topic. I'm just starting out in the drivers arena with hopes of being a decent contributor to the kernel ecosystem some day so there will be some growing pains and this thread was one of them. Thanks for the help and suggestions, I appreciate them immensely. Kevin -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Evgeniy Polyakov Sent: Friday, December 14, 2007 00:33 To: Kevin Wilson Cc: David Miller; netdev@vger.kernel.org Subject: Re: What was the reason for 2.6.22 SMP kernels to change how sendmsg is called? Hi Kevin. On Thu, Dec 13, 2007 at 04:00:02PM -0600, Kevin Wilson ([EMAIL PROTECTED]) wrote: I see your point but it just so happens it is a GPL'd driver, as is all of our Linux code we produce for our hardware. Granted it is out of tree, and after you saw it you would want it to stay that way. 
However, I would have sent you the whole thing if that is a pre-req to cordial exchanges on this list. Nonetheless, a somewhat recent change in your tree, that I could not pinpoint on my own, caused the driver to stop functioning properly. So after much searching in git/google/sources with no luck, I decided to ask for a little assistance, maybe just a hint as to where the culprit may be in the tree so I could investigate for myself. For SNGs I tried the method that now works but I am still at a loss as to (can't find) what changes in the tree caused it to fail. Without having your code it is virtually impossible to say why you have a bug. And do not express your frustration telling 'zero people responded to my bug report'. This was not a bug report at all, but an empty message about 'my code stopped working after some network changes'. Now in 2.6.22 and later kernels you must use the higher level SOCKET to make a call to PROTO_OPS then to sendmsg(), e.g., socket->ops->sendmsg(). It was done because of a bug found in inet_sendmsg(), which tried to autobind a socket it should not have. -- Evgeniy Polyakov
Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)
On Fri, Dec 14, 2007 at 05:14:57PM +0100, Krzysztof Oledzki wrote: On Wed, 12 Dec 2007, Jay Vosburgh wrote: Herbert Xu [EMAIL PROTECTED] wrote: diff -puN drivers/net/bonding/bond_sysfs.c~bonding-locking-fix drivers/net/bonding/bond_sysfs.c --- a/drivers/net/bonding/bond_sysfs.c~bonding-locking-fix +++ a/drivers/net/bonding/bond_sysfs.c @@ -,8 +,6 @@ static ssize_t bonding_store_primary(str out: write_unlock_bh(&bond->lock); - rtnl_unlock(); - Looking at the changeset that added this perhaps the intention is to hold the lock? If so we should add an rtnl_lock to the start of the function. Yes, this function needs to hold locks, and more than just what's there now. I believe the following should be correct; I haven't tested it, though (I'm supposedly on vacation right now). The following change should be correct for the bonding_store_primary case discussed in this thread, and also corrects the bonding_store_active case which performs similar functions. The bond_change_active_slave and bond_select_active_slave functions both require rtnl, bond->lock for read and curr_slave_lock for write_bh, and no other locks. This is so that the lower-level mode-specific functions can release locks down to just rtnl in order to call, e.g., dev_set_mac_address with the locks it expects (rtnl only). 
Signed-off-by: Jay Vosburgh [EMAIL PROTECTED] diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c index 11b76b3..28a2d80 100644 --- a/drivers/net/bonding/bond_sysfs.c +++ b/drivers/net/bonding/bond_sysfs.c @@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d, struct slave *slave; struct bonding *bond = to_bond(d); -write_lock_bh(&bond->lock); +rtnl_lock(); +read_lock(&bond->lock); +write_lock_bh(&bond->curr_slave_lock); + if (!USES_PRIMARY(bond->params.mode)) { printk(KERN_INFO DRV_NAME ": %s: Unable to set primary slave; %s is in mode %d\n", @@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d, } } out: -write_unlock_bh(&bond->lock); - +write_unlock_bh(&bond->curr_slave_lock); +read_unlock(&bond->lock); rtnl_unlock(); return count; @@ -1190,7 +1193,8 @@ static ssize_t bonding_store_active_slave(struct device *d, struct bonding *bond = to_bond(d); rtnl_lock(); -write_lock_bh(&bond->lock); +read_lock(&bond->lock); +write_lock_bh(&bond->curr_slave_lock); if (!USES_PRIMARY(bond->params.mode)) { printk(KERN_INFO DRV_NAME @@ -1247,7 +1251,8 @@ static ssize_t bonding_store_active_slave(struct device *d, } } out: -write_unlock_bh(&bond->lock); +write_unlock_bh(&bond->curr_slave_lock); +read_unlock(&bond->lock); rtnl_unlock(); return count; Vanilla 2.6.24-rc5 plus this patch: = [ INFO: possible irq lock inversion dependency detected ] 2.6.24-rc5 #1 - events/0/9 just changed the state of lock: (&mc->mca_lock){-+..}, at: [c0411c7a] mld_ifc_timer_expire+0x130/0x1fb but this lock took another, soft-read-irq-unsafe lock in the past: (&bond->lock){-.--} and interrupts could create inverse lock ordering between them. Grrr, I should have seen that -- sorry. 
Try your luck with this instead: diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c index 11b76b3..0694254 100644 --- a/drivers/net/bonding/bond_sysfs.c +++ b/drivers/net/bonding/bond_sysfs.c @@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d, struct slave *slave; struct bonding *bond = to_bond(d); - write_lock_bh(&bond->lock); + rtnl_lock(); + read_lock_bh(&bond->lock); + write_lock_bh(&bond->curr_slave_lock); + if (!USES_PRIMARY(bond->params.mode)) { printk(KERN_INFO DRV_NAME ": %s: Unable to set primary slave; %s is in mode %d\n", @@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d, } } out: - write_unlock_bh(&bond->lock); - + write_unlock_bh(&bond->curr_slave_lock); + read_unlock_bh(&bond->lock); rtnl_unlock(); return count; @@ -1190,7 +1193,8 @@ static ssize_t bonding_store_active_slave(struct device *d, struct bonding *bond = to_bond(d); rtnl_lock(); - write_lock_bh(&bond->lock); + read_lock_bh(&bond->lock); + write_lock_bh(&bond->curr_slave_lock); if (!USES_PRIMARY(bond->params.mode)) { printk(KERN_INFO DRV_NAME @@ -1247,7 +1251,8 @@ static ssize_t bonding_store_active_slave(struct device *d, } } out: - write_unlock_bh(&bond->lock); +
Re: [NETFILTER] xt_hashlimit : speedups hash_dst()
From: Eric Dumazet [EMAIL PROTECTED] Date: Fri, 14 Dec 2007 12:09:31 +0100 1) Using jhash2() instead of jhash() is a little bit faster if applicable. 2) Thanks to jhash, hash value uses full 32 bits. Instead of returning hash % size (implying a divide) we return the high 32 bits of the (hash * size) that will give results between [0 and size-1] and same hash distribution. On most cpus, a multiply is less expensive than a divide, by an order of magnitude. Signed-off-by: Eric Dumazet [EMAIL PROTECTED] As a side note, Jenkins performs nearly optimally (unlike most traditional hash functions) with power of two hash table sizes. Using a pow2 hash table size would completely obviate the issues solved by #2. I don't know if that is feasible here in xt_hashlimit, but if it is that is how we should solve this expensive modulo.
Re: [PATCH 1/2] add driver for enc28j60 ethernet chip
On Fri, 14 Dec 2007 10:21:51 +0100 Claudio Lanconelli [EMAIL PROTECTED] wrote: Hi Stephen, thank you for your suggestions. I already applied trivial fixes, but I have questions on some points, see inline. Stephen Hemminger wrote: General comments: * device driver does no carrier detection. This makes it useless for bridging, bonding, or any form of failover. * use msglevel method (via ethtool) to control debug messages rather than kernel configuration. This allows enabling debugging without recompilation which is important in distributions. * Please add ethtool support * Consider using NAPI Can you point me to a possibly simple driver that uses ethtool and NAPI? No driver stays simple! but look at tg3, sky2, r8169 for examples. Or other example that I can use for reference. May be the skeleton should be updated. * use netdev_priv(netdev) rather than netdev->priv I can't find where I used netdev->priv, may be do you mean priv->netdev? yes (skipping other comments) +static int __devinit enc28j60_probe(struct spi_device *spi) +{ + struct net_device *dev; + struct enc28j60_net_local *priv; + int ret = 0; + + dev_dbg(&spi->dev, "%s() start\n", __FUNCTION__); + + dev = alloc_etherdev(sizeof(struct enc28j60_net_local)); + if (!dev) { + ret = -ENOMEM; + goto error_alloc; + } + priv = netdev_priv(dev); + + priv->netdev = dev; /* priv to netdev reference */ + priv->spi = spi; /* priv to spi reference */ + priv->spi_transfer_buf = kmalloc(SPI_TRANSFER_BUF_LEN, GFP_KERNEL); Why not declare the transfer buffer as an array in spi? I don't understand exactly what do you mean here. spi field points to struct spi_device from the SPI subsystem. Other SPI client drivers use an allocated buffer too. I just noticed that you alloc an ether device then do an additional allocation for the buffer. It makes sense if there are other uses. You do need to be careful for cases where transfer_buf might be used after free: module unload (you're probably safe), and client driver using during shutdown. 
-- Stephen Hemminger [EMAIL PROTECTED]
Re: [PATCH net-2.6.25 6/8] sctp: Use ipv4_is_type
Joe Perches wrote: Signed-off-by: Joe Perches [EMAIL PROTECTED] Thanks Joe. I've put this into my tree. -vlad
Re: [PATCH net-2.6.25 6/8] sctp: Use ipv4_is_type
From: Vlad Yasevich [EMAIL PROTECTED] Date: Fri, 14 Dec 2007 14:11:44 -0500 Joe Perches wrote: Signed-off-by: Joe Perches [EMAIL PROTECTED] Thanks Joe. I've put this into my tree. You can't, because your tree won't build without his first patch and I have to approve that and stick it into my tree first. Just let me review and potentially pick up all of Joe's stuff since we have this dependency.
Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()
From: Herbert Xu [EMAIL PROTECTED] Date: Fri, 14 Dec 2007 16:30:37 +0800 On Fri, Dec 14, 2007 at 12:22:09AM -0800, Andrew Morton wrote: I don't see how it could warn about that. Nor should it - one might want to check that rtnl_lock is held inside preempt_disable() or spin_lock or whatever. It might make sense to warn if ASSERT_RTNL is called in in_interrupt() contexts though. Well the paths where ASSERT_RTNL is used should never be in an atomic context. In the past it has been quite useful in pointing out bogus locking practices. There is currently one path where it's known to warn because of this and it (promiscuous mode) is on my todo list. Oh and it only warns when you have mutex debugging enabled. Right, this change is just totally bogus. I'm all for using existing facilities to replace hand-crafted copies, but this case is removing useful debugging functionality so it's wrong.
Re: [PATCH net-2.6.25 6/8] sctp: Use ipv4_is_type
David Miller wrote: From: Vlad Yasevich [EMAIL PROTECTED] Date: Fri, 14 Dec 2007 14:11:44 -0500 Joe Perches wrote: Signed-off-by: Joe Perches [EMAIL PROTECTED] Thanks Joe. I've put this into my tree. You can't, because your tree won't build without his first patch and I have to approve that and stick it into my tree first. Just let me review and potentially pick up all of Joe's stuff since we have this dependency. Ok. In that case, you can have my ACK for the SCTP stuff. -vlad
Re: [GIT PULL] [NET]: Use {hton{s,l},cpu_to_be{16,32}}() where appropriate.
From: YOSHIFUJI Hideaki / 吉藤英明 [EMAIL PROTECTED] Date: Fri, 14 Dec 2007 16:28:35 +0900 (JST) Please consider pulling the following changes from the branch net-2.6-dev-20071214 available at git://git.linux-ipv6.org/gitroot/yoshfuji/linux-2.6-dev.git which is on top of your net-2.6-devel tree. Pulled, thank you. Could you please provide the full pull URL all in one line, instead of splitting the base URL and the HEAD name onto separate lines? I have to cut and paste from multiple places in your mail in order to compose the pull command line and I wish I didn't have to do that every time. I should be able to just cut and paste one line, which should be the fully specified URL, for maximum efficiency. Thanks.
[PATCH 20/29] netfilter: NF_QUEUE vs emergency skbs
Avoid memory getting stuck waiting for userspace, drop all emergency packets. This of course requires the regular storage route to not include an NF_QUEUE target ;-) Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- net/netfilter/core.c | 3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6/net/netfilter/core.c === --- linux-2.6.orig/net/netfilter/core.c +++ linux-2.6/net/netfilter/core.c @@ -176,9 +176,12 @@ next_hook: ret = 1; goto unlock; } else if (verdict == NF_DROP) { +drop: kfree_skb(skb); ret = -EPERM; } else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) { + if (skb_emergency(*pskb)) + goto drop; if (!nf_queue(skb, elem, pf, hook, indev, outdev, okfn, verdict & NF_VERDICT_BITS)) goto next_hook;
[PATCH 00/29] Swap over NFS -v15
Hi, Another posting of the full swap over NFS series. Andrew/Linus, could we start thinking of sticking this in -mm? [ patches against 2.6.24-rc5-mm1, also to be found online at: http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v2.6.24-rc5-mm1/ ] The patch-set can be split in roughly 5 parts, for each of which I shall give a description. Part 1, patches 1-11 The problem with swap over network is the generic swap problem: needing memory to free memory. Normally this is solved using mempools, as can be seen in the BIO layer. Swap over network has the problem that the network subsystem does not use fixed sized allocations, but heavily relies on kmalloc(). This makes mempools unusable. This first part provides a generic reserve framework. Care is taken to only affect the slow paths - when we're low on memory. Caveats: it currently doesn't do SLOB. 1 - mm: gfp_to_alloc_flags() 2 - mm: tag reseve pages 3 - mm: sl[au]b: add knowledge of reserve pages 4 - mm: kmem_estimate_pages() 5 - mm: allow PF_MEMALLOC from softirq context 6 - mm: serialize access to min_free_kbytes 7 - mm: emergency pool 8 - mm: system wide ALLOC_NO_WATERMARK 9 - mm: __GFP_MEMALLOC 10 - mm: memory reserve management 11 - selinux: tag avc cache alloc as non-critical Part 2, patches 12-14 Provide some generic network infrastructure needed later on. 12 - net: wrap sk->sk_backlog_rcv() 13 - net: packet split receive api 14 - net: sk_allocation() - concentrate socket related allocations Part 3, patches 15-21 Now that we have a generic memory reserve system, use it on the network stack. The thing that makes this interesting is that, contrary to BIO, both the transmit and receive path require memory allocations. That is, in the BIO layer write back completion is usually just an ISR flipping a bit and waking stuff up. A network write back completion involved receiving packets, which when there is no memory, is rather hard. 
And even when there is memory there is no guarantee that the required packet comes in in the window that that memory buys us. The solution to this problem is found in the fact that network is to be assumed lossy. Even now, when there is no memory to receive packets the network card will have to discard packets. What we do is move this into the network stack. So we reserve a little pool to act as a receive buffer, this allows us to inspect packets before tossing them. This way, we can filter out those packets that ensure progress (writeback completion) and disregard the others (as would have happened anyway). [ NOTE: this is a stable mode of operation with limited memory usage, exactly the kind of thing we need ] Again, care is taken to keep much of the overhead of this to only affect the slow path. Only packets allocated from the reserves will suffer the extra atomic overhead needed for accounting. 15 - netvm: network reserve infrastructure 16 - netvm: INET reserves. 17 - netvm: hook skb allocation to reserves 18 - netvm: filter emergency skbs. 19 - netvm: prevent a TCP specific deadlock 20 - netfilter: NF_QUEUE vs emergency skbs 21 - netvm: skb processing Part 4, patches 22-24 Generic vm infrastructure to handle swapping to a filesystem instead of a block device. This provides new a_ops to handle swapcache pages and could be used to obsolete the bmap usage for swapfiles. 22 - mm: prepare swap entry methods for use in page methods 23 - mm: add support for non block device backed swap files 24 - mm: methods for teaching filesystems about PG_swapcache pages Part 5, patches 25-29 Finally, convert NFS to make use of the new network and vm infrastructure to provide swap over NFS. 25 - nfs: remove mempools 26 - nfs: teach the NFS client how to treat PG_swapcache pages 27 - nfs: disable data cache revalidation for swapfiles 28 - nfs: enable swap on NFS 29 - nfs: fix various memory recursions possible with swap over NFS. 
Changes since -v14: - SLAB support - a_ops rework - various bug fixes and cleanups
[PATCH 07/29] mm: emergency pool
Provide means to reserve a specific amount of pages. The emergency pool is separated from the min watermark because ALLOC_HARDER and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure a strict minimum. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/mmzone.h | 3 + mm/page_alloc.c | 82 +++-- mm/vmstat.c | 6 +-- 3 files changed, 78 insertions(+), 13 deletions(-) Index: linux-2.6/include/linux/mmzone.h === --- linux-2.6.orig/include/linux/mmzone.h +++ linux-2.6/include/linux/mmzone.h @@ -213,7 +213,7 @@ enum zone_type { struct zone { /* Fields commonly accessed by the page allocator */ - unsigned long pages_min, pages_low, pages_high; + unsigned long pages_emerg, pages_min, pages_low, pages_high; /* * We don't know if the memory that we're going to allocate will be freeable * or/and it will be released eventually, so to avoid totally wasting several @@ -682,6 +682,7 @@ int sysctl_min_unmapped_ratio_sysctl_han struct file *, void __user *, size_t *, loff_t *); int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); +int adjust_memalloc_reserve(int pages); extern int numa_zonelist_order_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); Index: linux-2.6/mm/page_alloc.c === --- linux-2.6.orig/mm/page_alloc.c +++ linux-2.6/mm/page_alloc.c @@ -118,6 +118,8 @@ static char * const zone_names[MAX_NR_ZO static DEFINE_SPINLOCK(min_free_lock); int min_free_kbytes = 1024; +static DEFINE_MUTEX(var_free_mutex); +int var_free_kbytes; unsigned long __meminitdata nr_kernel_pages; unsigned long __meminitdata nr_all_pages; @@ -1252,7 +1254,7 @@ int zone_watermark_ok(struct zone *z, in if (alloc_flags & ALLOC_HARDER) min -= min / 4; - if (free_pages <= min + z->lowmem_reserve[classzone_idx]) + if (free_pages <= min + z->lowmem_reserve[classzone_idx] + z->pages_emerg) return 0; for (o = 0; o < order; o++) { /* At the next order, this order's pages become unavailable */ 
@@ -1733,8 +1735,8 @@ nofail_alloc: nopage: if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) { printk(KERN_WARNING "%s: page allocation failure." - " order:%d, mode:0x%x\n", - p->comm, order, gfp_mask); + " order:%d, mode:0x%x, alloc_flags:0x%x, pflags:0x%x\n", + p->comm, order, gfp_mask, alloc_flags, p->flags); dump_stack(); show_mem(); } @@ -1952,9 +1954,9 @@ void show_free_areas(void) \n, zone->name, K(zone_page_state(zone, NR_FREE_PAGES)), - K(zone->pages_min), - K(zone->pages_low), - K(zone->pages_high), + K(zone->pages_emerg + zone->pages_min), + K(zone->pages_emerg + zone->pages_low), + K(zone->pages_emerg + zone->pages_high), K(zone_page_state(zone, NR_ACTIVE)), K(zone_page_state(zone, NR_INACTIVE)), K(zone->present_pages), @@ -4113,7 +4115,7 @@ static void calculate_totalreserve_pages } /* we treat pages_high as reserved pages. */ - max += zone->pages_high; + max += zone->pages_high + zone->pages_emerg; if (max > zone->present_pages) max = zone->present_pages; @@ -4170,7 +4172,8 @@ static void setup_per_zone_lowmem_reserv */ static void __setup_per_zone_pages_min(void) { - unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10); + unsigned pages_min = min_free_kbytes >> (PAGE_SHIFT - 10); + unsigned pages_emerg = var_free_kbytes >> (PAGE_SHIFT - 10); unsigned long lowmem_pages = 0; struct zone *zone; unsigned long flags; @@ -4182,11 +4185,13 @@ static void __setup_per_zone_pages_min(v } for_each_zone(zone) { - u64 tmp; + u64 tmp, tmp_emerg; spin_lock_irqsave(&zone->lru_lock, flags); tmp = (u64)pages_min * zone->present_pages; do_div(tmp, lowmem_pages); + tmp_emerg = (u64)pages_emerg * zone->present_pages; + do_div(tmp_emerg, lowmem_pages); if (is_highmem(zone)) { /* * __GFP_HIGH and PF_MEMALLOC allocations usually don't @@
[PATCH 28/29] nfs: enable swap on NFS
Implement all the new swapfile a_ops for NFS. This will set the NFS socket to SOCK_MEMALLOC and run socket reconnect under PF_MEMALLOC as well as reset SOCK_MEMALLOC before engaging the protocol ->connect() method. PF_MEMALLOC should allow the allocation of struct socket and related objects and the early (re)setting of SOCK_MEMALLOC should allow us to receive the packets required for the TCP connection buildup. (swapping continues over a server reset during heavy network traffic) Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- fs/Kconfig | 18 fs/nfs/file.c | 12 fs/nfs/write.c | 19 + include/linux/nfs_fs.h | 2 + include/linux/sunrpc/xprt.h | 5 ++- net/sunrpc/sched.c | 9 -- net/sunrpc/xprtsock.c | 63 7 files changed, 125 insertions(+), 3 deletions(-) Index: linux-2.6/fs/nfs/file.c === --- linux-2.6.orig/fs/nfs/file.c +++ linux-2.6/fs/nfs/file.c @@ -371,6 +371,13 @@ static int nfs_launder_page(struct page return nfs_wb_page(page_file_mapping(page)->host, page); } +#ifdef CONFIG_NFS_SWAP +static int nfs_swapfile(struct address_space *mapping, int enable) +{ + return xs_swapper(NFS_CLIENT(mapping->host)->cl_xprt, enable); +} +#endif + const struct address_space_operations nfs_file_aops = { .readpage = nfs_readpage, .readpages = nfs_readpages, @@ -385,6 +392,11 @@ const struct address_space_operations nf .direct_IO = nfs_direct_IO, #endif .launder_page = nfs_launder_page, +#ifdef CONFIG_NFS_SWAP + .swapfile = nfs_swapfile, + .swap_out = nfs_swap_out, + .swap_in = nfs_readpage, +#endif }; static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page) Index: linux-2.6/fs/nfs/write.c === --- linux-2.6.orig/fs/nfs/write.c +++ linux-2.6/fs/nfs/write.c @@ -365,6 +365,25 @@ int nfs_writepage(struct page *page, str return ret; } +int nfs_swap_out(struct file *file, struct page *page, +struct writeback_control *wbc) +{ + struct nfs_open_context *ctx = nfs_file_open_context(file); + int status; + + status = nfs_writepage_setup(ctx, page, 0, nfs_page_length(page)); + if (status < 0) { + nfs_set_pageerror(page); + goto out; + } + + status = nfs_writepage_locked(page, wbc); + +out: + unlock_page(page); + return status; +} + static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data) { int ret; Index: linux-2.6/include/linux/nfs_fs.h === --- linux-2.6.orig/include/linux/nfs_fs.h +++ linux-2.6/include/linux/nfs_fs.h @@ -413,6 +413,8 @@ extern int nfs_flush_incompatible(struc extern int nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int); extern int nfs_writeback_done(struct rpc_task *, struct nfs_write_data *); extern void nfs_writedata_release(void *); +extern int nfs_swap_out(struct file *file, struct page *page, +struct writeback_control *wbc); /* * Try to write back everything synchronously (but check the Index: linux-2.6/fs/Kconfig === --- linux-2.6.orig/fs/Kconfig +++ linux-2.6/fs/Kconfig @@ -1692,6 +1692,18 @@ config NFS_DIRECTIO causes open() to return EINVAL if a file residing in NFS is opened with the O_DIRECT flag. +config NFS_SWAP + bool "Provide swap over NFS support" + default n + depends on NFS_FS + select SUNRPC_SWAP + help + This option enables swapon to work on files located on NFS mounts. + + For more details, see Documentation/vm_deadlock.txt + + If unsure, say N. + config NFSD tristate "NFS server support" depends on INET @@ -1835,6 +1847,12 @@ config SUNRPC_BIND34 If unsure, say N to get traditional behavior (version 2 rpcbind requests only). 
+config SUNRPC_SWAP + def_bool n + depends on SUNRPC + select NETVM + select SWAP_FILE + config RPCSEC_GSS_KRB5 tristate "Secure RPC: Kerberos V mechanism (EXPERIMENTAL)" depends on SUNRPC && EXPERIMENTAL Index: linux-2.6/include/linux/sunrpc/xprt.h === --- linux-2.6.orig/include/linux/sunrpc/xprt.h +++ linux-2.6/include/linux/sunrpc/xprt.h @@ -143,7 +143,9 @@ struct rpc_xprt { unsigned int max_reqs; /* total slots */ unsigned long state; /* transport state */ unsigned char shutdown : 1, /* being shut down */ - resvport : 1; /* use a reserved
[PATCH 01/29] mm: gfp_to_alloc_flags()
Factor out the gfp to alloc_flags mapping so it can be used in other places. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- mm/internal.h | 11 ++ mm/page_alloc.c | 98 2 files changed, 67 insertions(+), 42 deletions(-) Index: linux-2.6/mm/internal.h === --- linux-2.6.orig/mm/internal.h +++ linux-2.6/mm/internal.h @@ -47,4 +47,15 @@ static inline unsigned long page_order(s VM_BUG_ON(!PageBuddy(page)); return page_private(page); } + +#define ALLOC_HARDER 0x01 /* try to alloc harder */ +#define ALLOC_HIGH 0x02 /* __GFP_HIGH set */ +#define ALLOC_WMARK_MIN 0x04 /* use pages_min watermark */ +#define ALLOC_WMARK_LOW 0x08 /* use pages_low watermark */ +#define ALLOC_WMARK_HIGH 0x10 /* use pages_high watermark */ +#define ALLOC_NO_WATERMARKS 0x20 /* don't check watermarks at all */ +#define ALLOC_CPUSET 0x40 /* check for correct cpuset */ + +int gfp_to_alloc_flags(gfp_t gfp_mask); + #endif Index: linux-2.6/mm/page_alloc.c === --- linux-2.6.orig/mm/page_alloc.c +++ linux-2.6/mm/page_alloc.c @@ -1139,14 +1139,6 @@ failed: return NULL; } -#define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */ -#define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */ -#define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */ -#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */ -#define ALLOC_HARDER 0x10 /* try to alloc harder */ -#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */ -#define ALLOC_CPUSET 0x40 /* check for correct cpuset */ - #ifdef CONFIG_FAIL_PAGE_ALLOC static struct fail_page_alloc_attr { @@ -1535,6 +1527,44 @@ static void set_page_owner(struct page * #endif /* CONFIG_PAGE_OWNER */ /* + * get the deepest reaching allocation flags for the given gfp_mask + */ +int gfp_to_alloc_flags(gfp_t gfp_mask) +{ + struct task_struct *p = current; + int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET; + const gfp_t wait = gfp_mask & __GFP_WAIT; + + /* +* The caller may dip into page reserves a bit more if the caller +* cannot run direct reclaim, or if the caller has realtime scheduling +* policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will +* set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH). +*/ + if (gfp_mask & __GFP_HIGH) + alloc_flags |= ALLOC_HIGH; + + if (!wait) { + alloc_flags |= ALLOC_HARDER; + /* +* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc. +* See also cpuset_zone_allowed() comment in kernel/cpuset.c. +*/ + alloc_flags &= ~ALLOC_CPUSET; + } else if (unlikely(rt_task(p)) && !in_interrupt()) + alloc_flags |= ALLOC_HARDER; + + if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) { + if (!in_interrupt() && + ((p->flags & PF_MEMALLOC) || + unlikely(test_thread_flag(TIF_MEMDIE)))) + alloc_flags |= ALLOC_NO_WATERMARKS; + } + + return alloc_flags; +} + +/* * This is the 'heart' of the zoned buddy allocator. */ struct page * fastcall @@ -1589,48 +1619,28 @@ restart: * OK, we're below the kswapd watermark and have kicked background * reclaim. Now things get more complex, so set up alloc_flags according * to how we want to proceed. -* -* The caller may dip into page reserves a bit more if the caller -* cannot run direct reclaim, or if the caller has realtime scheduling -* policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will -* set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH). */ - alloc_flags = ALLOC_WMARK_MIN; - if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait) - alloc_flags |= ALLOC_HARDER; - if (gfp_mask & __GFP_HIGH) - alloc_flags |= ALLOC_HIGH; - if (wait) - alloc_flags |= ALLOC_CPUSET; + alloc_flags = gfp_to_alloc_flags(gfp_mask); - /* -* Go through the zonelist again. Let __GFP_HIGH and allocations -* coming from realtime tasks go deeper into reserves. -* -* This is the last chance, in general, before the goto nopage. -* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc. -* See also cpuset_zone_allowed() comment in kernel/cpuset.c. -*/ - page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags); + /* This is the last chance, in general, before the goto nopage. */ + page = get_page_from_freelist(gfp_mask, order, zonelist, +
[PATCH 23/29] mm: add support for non block device backed swap files
New addres_space_operations methods are added: int swapfile(struct address_space *, int); int swap_out(struct file *, struct page *, struct writeback_control *); int swap_in(struct file *, struct page *); When during sys_swapon() the swapfile() method is found and returns no error the swapper_space.a_ops will proxy to sis-swap_file-f_mapping-a_ops, and make use of swap_{out,in}() to write/read swapcache pages. The swapfile method will be used to communicate to the address_space that the VM relies on it, and the address_space should take adequate measures (like reserving memory for mempools or the like). This new interface can be used to obviate the need for -bmap in the swapfile code. A filesystem would need to load (and maybe even allocate) the full block map for a file into memory and pin it there on -swapfile(,1) so that -swap_{out,in}() have instant access to it. It can be released on -swapfile(,0). The reason to provide -swap_{out,in}() over using {write,read}page() is to 1) make a distinction between swapcache and pagecache pages, and 2) to provide a struct file * for credential context (normally not needed in the context of writepage, as the page content is normally dirtied using either of the following interfaces: write_{begin,end}() {prepare,commit}_write() page_mkwrite() which do have the file context. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- Documentation/filesystems/Locking | 19 Documentation/filesystems/vfs.txt | 17 +++ include/linux/buffer_head.h |2 - include/linux/fs.h|8 + include/linux/swap.h |3 + mm/Kconfig|3 + mm/page_io.c | 58 ++ mm/swap_state.c |5 +++ mm/swapfile.c | 22 +- 9 files changed, 135 insertions(+), 2 deletions(-) Index: linux-2.6/include/linux/swap.h === --- linux-2.6.orig/include/linux/swap.h +++ linux-2.6/include/linux/swap.h @@ -164,6 +164,7 @@ enum { SWP_USED= (1 0), /* is slot in swap_info[] used? 
*/ SWP_WRITEOK = (1 1), /* ok to write to this swap?*/ SWP_ACTIVE = (SWP_USED | SWP_WRITEOK), + SWP_FILE= (1 2), /* file swap area */ /* add others here before... */ SWP_SCANNING= (1 8), /* refcount in scan_swap_map */ }; @@ -261,6 +262,8 @@ extern void swap_unplug_io_fn(struct bac /* linux/mm/page_io.c */ extern int swap_readpage(struct file *, struct page *); extern int swap_writepage(struct page *page, struct writeback_control *wbc); +extern void swap_sync_page(struct page *page); +extern int swap_set_page_dirty(struct page *page); extern void end_swap_bio_read(struct bio *bio, int err); /* linux/mm/swap_state.c */ Index: linux-2.6/mm/page_io.c === --- linux-2.6.orig/mm/page_io.c +++ linux-2.6/mm/page_io.c @@ -17,6 +17,7 @@ #include linux/bio.h #include linux/swapops.h #include linux/writeback.h +#include linux/buffer_head.h #include asm/pgtable.h static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index, @@ -102,6 +103,18 @@ int swap_writepage(struct page *page, st unlock_page(page); goto out; } +#ifdef CONFIG_SWAP_FILE + { + struct swap_info_struct *sis = page_swap_info(page); + if (sis-flags SWP_FILE) { + ret = sis-swap_file-f_mapping- + a_ops-swap_out(sis-swap_file, page, wbc); + if (!ret) + count_vm_event(PSWPOUT); + return ret; + } + } +#endif bio = get_swap_bio(GFP_NOIO, page_private(page), page, end_swap_bio_write); if (bio == NULL) { @@ -120,6 +133,39 @@ out: return ret; } +#ifdef CONFIG_SWAP_FILE +void swap_sync_page(struct page *page) +{ + struct swap_info_struct *sis = page_swap_info(page); + + if (sis-flags SWP_FILE) { + const struct address_space_operations * a_ops = + sis-swap_file-f_mapping-a_ops; + if (a_ops-sync_page) + a_ops-sync_page(page); + } else + block_sync_page(page); +} + +int swap_set_page_dirty(struct page *page) +{ + struct swap_info_struct *sis = page_swap_info(page); + + if (sis-flags SWP_FILE) { + const struct address_space_operations * a_ops = + sis-swap_file-f_mapping-a_ops; + int (*spd)(struct page *) = 
a_ops-set_page_dirty; +#ifdef CONFIG_BLOCK + if (!spd) + spd = __set_page_dirty_buffers; +#endif +
[PATCH 10/29] mm: memory reserve management
Generic reserve management code. It provides methods to reserve and charge. Upon this, generic alloc/free style reserve pools could be build, which could fully replace mempool_t functionality. It should also allow for a Banker's algorithm replacement of __GFP_NOFAIL. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/reserve.h | 54 + mm/Makefile |2 mm/reserve.c| 438 3 files changed, 493 insertions(+), 1 deletion(-) Index: linux-2.6/include/linux/reserve.h === --- /dev/null +++ linux-2.6/include/linux/reserve.h @@ -0,0 +1,54 @@ +/* + * Memory reserve management. + * + * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra [EMAIL PROTECTED] + * + * This file contains the public data structure and API definitions. + */ + +#ifndef _LINUX_RESERVE_H +#define _LINUX_RESERVE_H + +#include linux/list.h +#include linux/spinlock.h + +struct mem_reserve { + struct mem_reserve *parent; + struct list_head children; + struct list_head siblings; + + const char *name; + + long pages; + long limit; + long usage; + spinlock_t lock;/* protects limit and usage */ +}; + +extern struct mem_reserve mem_reserve_root; + +void mem_reserve_init(struct mem_reserve *res, const char *name, + struct mem_reserve *parent); +int mem_reserve_connect(struct mem_reserve *new_child, + struct mem_reserve *node); +int mem_reserve_disconnect(struct mem_reserve *node); + +int mem_reserve_pages_set(struct mem_reserve *res, long pages); +int mem_reserve_pages_add(struct mem_reserve *res, long pages); +int mem_reserve_pages_charge(struct mem_reserve *res, long pages, +int overcommit); + +int mem_reserve_kmalloc_set(struct mem_reserve *res, long bytes); +int mem_reserve_kmalloc_charge(struct mem_reserve *res, long bytes, + int overcommit); + +struct kmem_cache; + +int mem_reserve_kmem_cache_set(struct mem_reserve *res, + struct kmem_cache *s, + int objects); +int mem_reserve_kmem_cache_charge(struct mem_reserve *res, + long objs, + int overcommit); + +#endif /* _LINUX_RESERVE_H */ Index: 
linux-2.6/mm/Makefile === --- linux-2.6.orig/mm/Makefile +++ linux-2.6/mm/Makefile @@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o page_alloc.o page-writeback.o pdflush.o \ readahead.o swap.o truncate.o vmscan.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ - page_isolation.o $(mmu-y) + page_isolation.o reserve.o $(mmu-y) obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o obj-$(CONFIG_BOUNCE) += bounce.o Index: linux-2.6/mm/reserve.c === --- /dev/null +++ linux-2.6/mm/reserve.c @@ -0,0 +1,438 @@ +/* + * Memory reserve management. + * + * Copyright (C) 2007, Red Hat, Inc., Peter Zijlstra [EMAIL PROTECTED] + * + * Description: + * + * Manage a set of memory reserves. + * + * A memory reserve is a reserve for a specified number of object of specified + * size. Since memory is managed in pages, this reserve demand is then + * translated into a page unit. + * + * So each reserve has a specified object limit, an object usage count and a + * number of pages required to back these objects. + * + * Usage is charged against a reserve, if the charge fails, the resource must + * not be allocated/used. + * + * The reserves are managed in a tree, and the resource demands (pages and + * limit) are propagated up the tree. Obviously the object limit will be + * meaningless as soon as the unit starts mixing, but the required page reserve + * (being of one unit) is still valid at the root. + * + * It is the page demand of the root node that is used to set the global + * reserve (adjust_memalloc_reserve() which sets zone-pages_emerg). + * + * As long as a subtree has the same usage unit, an aggregate node can be used + * to charge against, instead of the leaf nodes. However, do be consistent with + * who is charged, resource usage is not propagated up the tree (for + * performance reasons). 
+ */ + +#include linux/reserve.h +#include linux/mutex.h +#include linux/mmzone.h +#include linux/log2.h +#include linux/proc_fs.h +#include linux/seq_file.h +#include linux/module.h +#include linux/slab.h + +static DEFINE_MUTEX(mem_reserve_mutex); + +/** + * @mem_reserve_root - the global reserve root + * + * The global reserve is empty, and has no limit unit, it merely + * acts as an aggregation point for reserves and an interface to + * adjust_memalloc_reserve(). + */
[PATCH 17/29] netvm: hook skb allocation to reserves
Change the skb allocation api to indicate RX usage and use this to fall back to the reserve when needed. SKBs allocated from the reserve are tagged in skb-emergency. Teach all other skb ops about emergency skbs and the reserve accounting. Use the (new) packet split API to allocate and track fragment pages from the emergency reserve. Do this using an atomic counter in page-index. This is needed because the fragments have a different sharing semantic than that indicated by skb_shinfo()-dataref. Note that the decision to distinguish between regular and emergency SKBs allows the accounting overhead to be limited to the later kind. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/mm_types.h |1 include/linux/skbuff.h | 26 +-- net/core/skbuff.c| 173 +-- 3 files changed, 174 insertions(+), 26 deletions(-) Index: linux-2.6/include/linux/skbuff.h === --- linux-2.6.orig/include/linux/skbuff.h +++ linux-2.6/include/linux/skbuff.h @@ -314,7 +314,9 @@ struct sk_buff { __u16 tc_verd;/* traffic control verdict */ #endif #endif - /* 2 byte hole */ + __u8emergency:1; + /* 7 bit hole */ + /* 1 byte hole */ #ifdef CONFIG_NET_DMA dma_cookie_tdma_cookie; @@ -345,10 +347,22 @@ struct sk_buff { #include asm/system.h +#define SKB_ALLOC_FCLONE 0x01 +#define SKB_ALLOC_RX 0x02 + +static inline bool skb_emergency(const struct sk_buff *skb) +{ +#ifdef CONFIG_NETVM + return unlikely(skb-emergency); +#else + return false; +#endif +} + extern void kfree_skb(struct sk_buff *skb); extern void __kfree_skb(struct sk_buff *skb); extern struct sk_buff *__alloc_skb(unsigned int size, - gfp_t priority, int fclone, int node); + gfp_t priority, int flags, int node); static inline struct sk_buff *alloc_skb(unsigned int size, gfp_t priority) { @@ -358,7 +372,7 @@ static inline struct sk_buff *alloc_skb( static inline struct sk_buff *alloc_skb_fclone(unsigned int size, gfp_t priority) { - return __alloc_skb(size, priority, 1, -1); + return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, -1); } 
extern struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src); @@ -1303,7 +1317,8 @@ static inline void __skb_queue_purge(str static inline struct sk_buff *__dev_alloc_skb(unsigned int length, gfp_t gfp_mask) { - struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask); + struct sk_buff *skb = + __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, -1); if (likely(skb)) skb_reserve(skb, NET_SKB_PAD); return skb; @@ -1349,6 +1364,7 @@ static inline struct sk_buff *netdev_all } extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask); +extern void __netdev_free_page(struct net_device *dev, struct page *page); /** * netdev_alloc_page - allocate a page for ps-rx on a specific device @@ -1365,7 +1381,7 @@ static inline struct page *netdev_alloc_ static inline void netdev_free_page(struct net_device *dev, struct page *page) { - __free_page(page); + __netdev_free_page(dev, page); } /** Index: linux-2.6/net/core/skbuff.c === --- linux-2.6.orig/net/core/skbuff.c +++ linux-2.6/net/core/skbuff.c @@ -179,21 +179,28 @@ EXPORT_SYMBOL(skb_truesize_bug); * %GFP_ATOMIC. */ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, - int fclone, int node) + int flags, int node) { struct kmem_cache *cache; struct skb_shared_info *shinfo; struct sk_buff *skb; u8 *data; + int emergency = 0, memalloc = sk_memalloc_socks(); - cache = fclone ? skbuff_fclone_cache : skbuff_head_cache; + size = SKB_DATA_ALIGN(size); + cache = (flags SKB_ALLOC_FCLONE) + ? 
skbuff_fclone_cache : skbuff_head_cache; +#ifdef CONFIG_NETVM + if (memalloc (flags SKB_ALLOC_RX)) + gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN; +retry_alloc: +#endif /* Get the HEAD */ skb = kmem_cache_alloc_node(cache, gfp_mask ~__GFP_DMA, node); if (!skb) - goto out; + goto noskb; - size = SKB_DATA_ALIGN(size); data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info), gfp_mask, node); if (!data) @@ -203,6 +210,7 @@ struct sk_buff *__alloc_skb(unsigned int * See comment in sk_buff definition, just before the 'tail'
[PATCH 11/29] selinux: tag avc cache alloc as non-critical
Failing to allocate a cache entry will only harm performance, not correctness. Do not consume valuable reserve pages for something like that.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
Acked-by: James Morris [EMAIL PROTECTED]
---
 security/selinux/avc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-2/security/selinux/avc.c
===================================================================
--- linux-2.6-2.orig/security/selinux/avc.c
+++ linux-2.6-2/security/selinux/avc.c
@@ -334,7 +334,7 @@ static struct avc_node *avc_alloc_node(v
 {
 	struct avc_node *node;
 
-	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC);
+	node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
 
 	if (!node)
 		goto out;
--
--
To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 05/29] mm: allow PF_MEMALLOC from softirq context
Allow PF_MEMALLOC to be set in softirq context. When running softirqs from a borrowed context, save and restore current->flags; ksoftirqd has its own task_struct, so it is unaffected. This is needed to allow network softirq packet processing to make use of PF_MEMALLOC.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/sched.h |    4 ++++
 kernel/softirq.c      |    3 +++
 mm/page_alloc.c       |    7 ++++---
 3 files changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1557,9 +1557,10 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((p->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+		if (!in_irq() && (p->flags & PF_MEMALLOC))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_interrupt() &&
+			 unlikely(test_thread_flag(TIF_MEMDIE)))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 
Index: linux-2.6/kernel/softirq.c
===================================================================
--- linux-2.6.orig/kernel/softirq.c
+++ linux-2.6/kernel/softirq.c
@@ -211,6 +211,8 @@ asmlinkage void __do_softirq(void)
 	__u32 pending;
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu;
+	unsigned long pflags = current->flags;
+	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
 	account_system_vtime(current);
@@ -249,6 +251,7 @@ restart:
 
 	account_system_vtime(current);
 	_local_bh_enable();
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1389,6 +1389,10 @@ static inline void put_task_struct(struc
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+#define tsk_restore_flags(p, pflags, mask) \
+	do {	(p)->flags &= ~(mask); \
+		(p)->flags |= ((pflags) & (mask)); } while (0)
+
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask);
 #else
--
--
To unsubscribe from this list: send the line
unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/29] mm: kmem_estimate_pages()
Provide a method to get the upper bound on the pages needed to allocate a given number of objects from a given kmem_cache. This lays the foundation for a generic reserve framework as presented in a later patch in this series. This framework needs to convert object demand (kmalloc() bytes, kmem_cache_alloc() objects) to pages. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/slab.h |3 + mm/slab.c| 74 ++ mm/slub.c| 82 +++ 3 files changed, 159 insertions(+) Index: linux-2.6/include/linux/slab.h === --- linux-2.6.orig/include/linux/slab.h +++ linux-2.6/include/linux/slab.h @@ -60,6 +60,7 @@ void kmem_cache_free(struct kmem_cache * unsigned int kmem_cache_size(struct kmem_cache *); const char *kmem_cache_name(struct kmem_cache *); int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr); +unsigned kmem_estimate_pages(struct kmem_cache *cachep, gfp_t flags, int objects); /* * Please use this macro to create slab caches. Simply specify the @@ -94,6 +95,8 @@ int kmem_ptr_validate(struct kmem_cache void * __must_check krealloc(const void *, size_t, gfp_t); void kfree(const void *); size_t ksize(const void *); +unsigned kestimate_single(size_t, gfp_t, int); +unsigned kestimate(gfp_t, size_t); /* * Allocator specific definitions. These are mainly used to establish optimized Index: linux-2.6/mm/slub.c === --- linux-2.6.orig/mm/slub.c +++ linux-2.6/mm/slub.c @@ -2446,6 +2446,37 @@ const char *kmem_cache_name(struct kmem_ EXPORT_SYMBOL(kmem_cache_name); /* + * return the max number of pages required to allocated count + * objects from the given cache + */ +unsigned kmem_estimate_pages(struct kmem_cache *s, gfp_t flags, int objects) +{ + unsigned long slabs; + + if (WARN_ON(!s) || WARN_ON(!s-objects)) + return 0; + + slabs = DIV_ROUND_UP(objects, s-objects); + + /* +* Account the possible additional overhead if the slab holds more that +* one object. 
+*/ + if (s-objects 1) { + /* +* Account the possible additional overhead if per cpu slabs +* are currently empty and have to be allocated. This is very +* unlikely but a possible scenario immediately after +* kmem_cache_shrink. +*/ + slabs += num_online_cpus(); + } + + return slabs s-order; +} +EXPORT_SYMBOL_GPL(kmem_estimate_pages); + +/* * Attempt to free all slabs on a node. Return the number of slabs we * were unable to free. */ @@ -2800,6 +2831,57 @@ static unsigned long count_partial(struc } /* + * return the max number of pages required to allocate @count objects + * of @size bytes from kmalloc given @flags. + */ +unsigned kestimate_single(size_t size, gfp_t flags, int count) +{ + struct kmem_cache *s = get_slab(size, flags); + if (!s) + return 0; + + return kmem_estimate_pages(s, flags, count); + +} +EXPORT_SYMBOL_GPL(kestimate_single); + +/* + * return the max number of pages required to allocate @bytes from kmalloc + * in an unspecified number of allocation of heterogeneous size. + */ +unsigned kestimate(gfp_t flags, size_t bytes) +{ + int i; + unsigned long pages; + + /* +* multiply by two, in order to account the worst case slack space +* due to the power-of-two allocation sizes. +*/ + pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE); + + /* +* add the kmem_cache overhead of each possible kmalloc cache +*/ + for (i = 1; i PAGE_SHIFT; i++) { + struct kmem_cache *s; + +#ifdef CONFIG_ZONE_DMA + if (unlikely(flags SLUB_DMA)) + s = dma_kmalloc_cache(i, flags); + else +#endif + s = kmalloc_caches[i]; + + if (s) + pages += kmem_estimate_pages(s, flags, 0); + } + + return pages; +} +EXPORT_SYMBOL_GPL(kestimate); + +/* * kmem_cache_shrink removes empty slabs from the partial lists and sorts * the remaining slabs by the number of items in use. The slabs with the * most items in use come first. 
New allocations will then fill those up Index: linux-2.6/mm/slab.c === --- linux-2.6.orig/mm/slab.c +++ linux-2.6/mm/slab.c @@ -3844,6 +3844,80 @@ const char *kmem_cache_name(struct kmem_ EXPORT_SYMBOL_GPL(kmem_cache_name); /* + * return the max number of pages required to allocated count + * objects from the given cache + */ +unsigned kmem_estimate_pages(struct kmem_cache *cachep, gfp_t flags, int objects) +{ + /* +* (1) memory for objects, +*/ +
[PATCH 14/29] net: sk_allocation() - concentrate socket related allocations
Introduce sk_allocation(), this function allows to inject sock specific flags to each sock related allocation. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/net/sock.h|5 + net/ipv4/tcp.c|2 +- net/ipv4/tcp_output.c | 11 ++- net/ipv6/tcp_ipv6.c | 14 +- 4 files changed, 21 insertions(+), 11 deletions(-) Index: linux-2.6/net/ipv4/tcp_output.c === --- linux-2.6.orig/net/ipv4/tcp_output.c +++ linux-2.6/net/ipv4/tcp_output.c @@ -2063,7 +2063,7 @@ void tcp_send_fin(struct sock *sk) } else { /* Socket is locked, keep trying until memory is available. */ for (;;) { - skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL); + skb = alloc_skb_fclone(MAX_TCP_HEADER, sk-sk_allocation); if (skb) break; yield(); @@ -2096,7 +2096,7 @@ void tcp_send_active_reset(struct sock * struct sk_buff *skb; /* NOTE: No TCP options attached and we never retransmit this. */ - skb = alloc_skb(MAX_TCP_HEADER, priority); + skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, priority)); if (!skb) { NET_INC_STATS(LINUX_MIB_TCPABORTFAILED); return; @@ -2169,7 +2169,8 @@ struct sk_buff * tcp_make_synack(struct __u8 *md5_hash_location; #endif - skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC); + skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, + sk_allocation(sk, GFP_ATOMIC)); if (skb == NULL) return NULL; @@ -2428,7 +2429,7 @@ void tcp_send_ack(struct sock *sk) * tcp_transmit_skb() will set the ownership to this * sock. */ - buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC); + buff = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC)); if (buff == NULL) { inet_csk_schedule_ack(sk); inet_csk(sk)-icsk_ack.ato = TCP_ATO_MIN; @@ -2470,7 +2471,7 @@ static int tcp_xmit_probe_skb(struct soc struct sk_buff *skb; /* We don't queue it, tcp_transmit_skb() sets ownership. 
*/ - skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC); + skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC)); if (skb == NULL) return -1; Index: linux-2.6/include/net/sock.h === --- linux-2.6.orig/include/net/sock.h +++ linux-2.6/include/net/sock.h @@ -425,6 +425,11 @@ static inline int sock_flag(struct sock return test_bit(flag, sk-sk_flags); } +static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask) +{ + return gfp_mask; +} + static inline void sk_acceptq_removed(struct sock *sk) { sk-sk_ack_backlog--; Index: linux-2.6/net/ipv6/tcp_ipv6.c === --- linux-2.6.orig/net/ipv6/tcp_ipv6.c +++ linux-2.6/net/ipv6/tcp_ipv6.c @@ -574,7 +574,8 @@ static int tcp_v6_md5_do_add(struct sock } else { /* reallocate new list if current one is full. */ if (!tp-md5sig_info) { - tp-md5sig_info = kzalloc(sizeof(*tp-md5sig_info), GFP_ATOMIC); + tp-md5sig_info = kzalloc(sizeof(*tp-md5sig_info), + sk_allocation(sk, GFP_ATOMIC)); if (!tp-md5sig_info) { kfree(newkey); return -ENOMEM; @@ -587,7 +588,8 @@ static int tcp_v6_md5_do_add(struct sock } if (tp-md5sig_info-alloced6 == tp-md5sig_info-entries6) { keys = kmalloc((sizeof (tp-md5sig_info-keys6[0]) * - (tp-md5sig_info-entries6 + 1)), GFP_ATOMIC); + (tp-md5sig_info-entries6 + 1)), + sk_allocation(sk, GFP_ATOMIC)); if (!keys) { tcp_free_md5sig_pool(); @@ -711,7 +713,7 @@ static int tcp_v6_parse_md5_keys (struct struct tcp_sock *tp = tcp_sk(sk); struct tcp_md5sig_info *p; - p = kzalloc(sizeof(struct tcp_md5sig_info), GFP_KERNEL); + p = kzalloc(sizeof(struct tcp_md5sig_info), sk-sk_allocation); if (!p) return -ENOMEM; @@ -1012,7 +1014,7 @@ static void tcp_v6_send_reset(struct soc */ buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len, -GFP_ATOMIC); +sk_allocation(sk, GFP_ATOMIC)); if (buff == NULL) return; @@ -1091,10 +1093,12 @@
[PATCH 09/29] mm: __GFP_MEMALLOC
__GFP_MEMALLOC will allow the allocation to disregard the watermarks, much like PF_MEMALLOC.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/gfp.h |    3 ++-
 mm/page_alloc.c     |    4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -43,6 +43,7 @@ struct vm_area_struct;
 #define __GFP_REPEAT	((__force gfp_t)0x400u)	/* Retry the allocation.  Might fail */
 #define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* Retry for ever.  Cannot fail */
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* Do not retry.  Might fail */
+#define __GFP_MEMALLOC	((__force gfp_t)0x2000u)/* Use emergency reserves */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
@@ -88,7 +89,7 @@ struct vm_area_struct;
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
-			__GFP_NORETRY|__GFP_NOMEMALLOC)
+			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* Control allocation constraints */
 #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1560,7 +1560,9 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_irq() && (p->flags & PF_MEMALLOC))
+		if (gfp_mask & __GFP_MEMALLOC)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_irq() && (p->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (!in_interrupt() &&
 			 unlikely(test_thread_flag(TIF_MEMDIE)))
--
--
To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/29] mm: tag reserve pages
Tag pages allocated from the reserves with a non-zero page->reserve. This allows us to distinguish and account reserve pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/mm_types.h |    1 +
 mm/page_alloc.c          |    4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -70,6 +70,7 @@ struct page {
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
 		void *freelist;		/* SLUB: freelist req. slab lock */
+		int reserve;		/* page_alloc: page is a reserve page */
 	};
 	struct list_head lru;		/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1448,8 +1448,10 @@ zonelist_scan:
 		}
 
 		page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
-		if (page)
+		if (page) {
+			page->reserve = !!(alloc_flags & ALLOC_NO_WATERMARKS);
 			break;
+		}
this_zone_full:
 		if (NUMA_BUILD)
 			zlc_mark_zone_full(zonelist, z);
--
--
To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 06/29] mm: serialize access to min_free_kbytes
There is a small race between the procfs caller and the memory hotplug caller of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet another caller. Time to close the gap.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/page_alloc.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -116,6 +116,7 @@ static char * const zone_names[MAX_NR_ZO
 	"Movable",
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
@@ -4162,12 +4163,12 @@ static void setup_per_zone_lowmem_reserv
 }
 
 /**
- * setup_per_zone_pages_min - called when min_free_kbytes changes.
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.
 *
 * Ensures that the pages_{min,low,high} values for each zone are set correctly
 * with respect to min_free_kbytes.
 */
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
@@ -4222,6 +4223,15 @@ void setup_per_zone_pages_min(void)
 	calculate_totalreserve_pages();
 }
 
+void setup_per_zone_pages_min(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&min_free_lock, flags);
+	__setup_per_zone_pages_min();
+	spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
 * Initialise min_free_kbytes.
 *
@@ -4257,7 +4267,7 @@ static int __init init_per_zone_pages_mi
 		min_free_kbytes = 128;
 	if (min_free_kbytes > 65536)
 		min_free_kbytes = 65536;
-	setup_per_zone_pages_min();
+	__setup_per_zone_pages_min();
 	setup_per_zone_lowmem_reserve();
 	return 0;
 }
--
--
To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 13/29] net: packet split receive api
Add some packet-split receive hooks. For one this allows to do NUMA node affine page allocs. Later on these hooks will be extended to do emergency reserve allocations for fragments. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- drivers/net/e1000/e1000_main.c |8 ++-- drivers/net/sky2.c | 16 ++-- include/linux/skbuff.h | 23 +++ net/core/skbuff.c | 20 4 files changed, 51 insertions(+), 16 deletions(-) Index: linux-2.6/drivers/net/e1000/e1000_main.c === --- linux-2.6.orig/drivers/net/e1000/e1000_main.c +++ linux-2.6/drivers/net/e1000/e1000_main.c @@ -4392,12 +4392,8 @@ e1000_clean_rx_irq_ps(struct e1000_adapt pci_unmap_page(pdev, ps_page_dma-ps_page_dma[j], PAGE_SIZE, PCI_DMA_FROMDEVICE); ps_page_dma-ps_page_dma[j] = 0; - skb_fill_page_desc(skb, j, ps_page-ps_page[j], 0, - length); + skb_add_rx_frag(skb, j, ps_page-ps_page[j], 0, length); ps_page-ps_page[j] = NULL; - skb-len += length; - skb-data_len += length; - skb-truesize += length; } /* strip the ethernet crc, problem is we're using pages now so @@ -4605,7 +4601,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a if (j adapter-rx_ps_pages) { if (likely(!ps_page-ps_page[j])) { ps_page-ps_page[j] = - alloc_page(GFP_ATOMIC); + netdev_alloc_page(netdev); if (unlikely(!ps_page-ps_page[j])) { adapter-alloc_rx_buff_failed++; goto no_buffers; Index: linux-2.6/include/linux/skbuff.h === --- linux-2.6.orig/include/linux/skbuff.h +++ linux-2.6/include/linux/skbuff.h @@ -851,6 +851,9 @@ static inline void skb_fill_page_desc(st skb_shinfo(skb)-nr_frags = i + 1; } +extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, + int off, int size); + #define SKB_PAGE_ASSERT(skb) BUG_ON(skb_shinfo(skb)-nr_frags) #define SKB_FRAG_ASSERT(skb) BUG_ON(skb_shinfo(skb)-frag_list) #define SKB_LINEAR_ASSERT(skb) BUG_ON(skb_is_nonlinear(skb)) @@ -1344,6 +1347,26 @@ static inline struct sk_buff *netdev_all return __netdev_alloc_skb(dev, length, GFP_ATOMIC); } +extern struct page *__netdev_alloc_page(struct net_device *dev, 
gfp_t gfp_mask); + +/** + * netdev_alloc_page - allocate a page for ps-rx on a specific device + * @dev: network device to receive on + * + * Allocate a new page node local to the specified device. + * + * %NULL is returned if there is no free memory. + */ +static inline struct page *netdev_alloc_page(struct net_device *dev) +{ + return __netdev_alloc_page(dev, GFP_ATOMIC); +} + +static inline void netdev_free_page(struct net_device *dev, struct page *page) +{ + __free_page(page); +} + /** * skb_clone_writable - is the header of a clone writable * @skb: buffer to check Index: linux-2.6/net/core/skbuff.c === --- linux-2.6.orig/net/core/skbuff.c +++ linux-2.6/net/core/skbuff.c @@ -263,6 +263,24 @@ struct sk_buff *__netdev_alloc_skb(struc return skb; } +struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask) +{ + int node = dev-dev.parent ? dev_to_node(dev-dev.parent) : -1; + struct page *page; + + page = alloc_pages_node(node, gfp_mask, 0); + return page; +} + +void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off, + int size) +{ + skb_fill_page_desc(skb, i, page, off, size); + skb-len += size; + skb-data_len += size; + skb-truesize += size; +} + static void skb_drop_list(struct sk_buff **listp) { struct sk_buff *list = *listp; @@ -2466,6 +2484,8 @@ EXPORT_SYMBOL(kfree_skb); EXPORT_SYMBOL(__pskb_pull_tail); EXPORT_SYMBOL(__alloc_skb); EXPORT_SYMBOL(__netdev_alloc_skb); +EXPORT_SYMBOL(__netdev_alloc_page); +EXPORT_SYMBOL(skb_add_rx_frag); EXPORT_SYMBOL(pskb_copy); EXPORT_SYMBOL(pskb_expand_head); EXPORT_SYMBOL(skb_checksum); Index: linux-2.6/drivers/net/sky2.c === --- linux-2.6.orig/drivers/net/sky2.c +++ linux-2.6/drivers/net/sky2.c @@ -1198,7 +1198,7 @@ static struct sk_buff *sky2_rx_alloc(str } for (i = 0; i sky2-rx_nfrags; i++) { -
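The new helper folds the three length updates that e1000 previously did by hand (len, data_len, truesize) into one call next to the fragment attach. A minimal userspace sketch of that accounting, using a hypothetical `toy_skb` stand-in rather than the real `sk_buff`:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the three sk_buff length fields the patch updates. */
struct toy_skb {
	size_t len;       /* total data length */
	size_t data_len;  /* bytes held in page fragments */
	size_t truesize;  /* accounted memory footprint */
	int nr_frags;
};

/* Mirrors the shape of skb_add_rx_frag(): attach one fragment and bump
 * all three counters together, so no caller can forget one of them. */
static void toy_add_rx_frag(struct toy_skb *skb, size_t size)
{
	skb->nr_frags++;
	skb->len += size;
	skb->data_len += size;
	skb->truesize += size;
}
```

The point of the refactor is exactly this grouping: before the patch, each driver open-coded the three `+=` lines after `skb_fill_page_desc()`.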
[PATCH 12/29] net: wrap sk-sk_backlog_rcv()
Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h   |    5 +++++
 net/core/sock.c      |    4 ++--
 net/ipv4/tcp.c       |    2 +-
 net/ipv4/tcp_timer.c |    2 +-
 4 files changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -485,6 +485,11 @@ static inline void sk_add_backlog(struct
 	skb->next = NULL;
 }

+static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	return sk->sk_backlog_rcv(sk, skb);
+}
+
 #define sk_wait_event(__sk, __timeo, __condition)	\
 ({	int __rc;		\
 	release_sock(__sk);	\

Index: linux-2.6/net/core/sock.c
===
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -320,7 +320,7 @@ int sk_receive_skb(struct sock *sk, stru
 		 */
 		mutex_acquire(&sk->sk_lock.dep_map, 0, 1, _RET_IP_);

-		rc = sk->sk_backlog_rcv(sk, skb);
+		rc = sk_backlog_rcv(sk, skb);

 		mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
 	} else
@@ -1312,7 +1312,7 @@ static void __release_sock(struct sock *
 		struct sk_buff *next = skb->next;

 		skb->next = NULL;
-		sk->sk_backlog_rcv(sk, skb);
+		sk_backlog_rcv(sk, skb);

 		/*
 		 * We are in process context here with softirqs

Index: linux-2.6/net/ipv4/tcp.c
===
--- linux-2.6.orig/net/ipv4/tcp.c
+++ linux-2.6/net/ipv4/tcp.c
@@ -1134,7 +1134,7 @@ static void tcp_prequeue_process(struct
 	 * necessary */
 	local_bh_disable();
 	while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-		sk->sk_backlog_rcv(sk, skb);
+		sk_backlog_rcv(sk, skb);
 	local_bh_enable();

 	/* Clear memory counter. */

Index: linux-2.6/net/ipv4/tcp_timer.c
===
--- linux-2.6.orig/net/ipv4/tcp_timer.c
+++ linux-2.6/net/ipv4/tcp_timer.c
@@ -196,7 +196,7 @@ static void tcp_delack_timer(unsigned lo
 			NET_INC_STATS_BH(LINUX_MIB_TCPSCHEDULERFAILED);

 			while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-				sk->sk_backlog_rcv(sk, skb);
+				sk_backlog_rcv(sk, skb);

 			tp->ucopy.memory = 0;
 		}
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
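The whole patch is a mechanical refactor: funnel every indirect `sk->sk_backlog_rcv()` call through one static inline so that later patches can add generic policy in a single place. A userspace sketch of that pattern, with hypothetical toy types in place of `struct sock` and `struct sk_buff`:

```c
#include <assert.h>

struct toy_sock; /* forward declaration for the callback type */
struct toy_pkt { int len; };

struct toy_sock {
	/* per-protocol backlog handler, like sk->sk_backlog_rcv */
	int (*backlog_rcv)(struct toy_sock *sk, struct toy_pkt *skb);
	int received;
};

/* Central wrapper: the one place where generic behaviour can later be
 * layered on top of the per-protocol callback. */
static inline int toy_backlog_rcv(struct toy_sock *sk, struct toy_pkt *skb)
{
	return sk->backlog_rcv(sk, skb);
}

/* A hypothetical protocol handler, standing in for tcp_v4_do_rcv(). */
static int toy_proto_rcv(struct toy_sock *sk, struct toy_pkt *skb)
{
	sk->received += skb->len;
	return 0;
}
```

Being `static inline`, the wrapper costs nothing until a later patch (such as the emergency-skb handling further down the series) gives it something to do.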
[PATCH 16/29] netvm: INET reserves.
Add reserves for INET. The two big users seem to be the route cache and ip-fragment cache. Reserve the route cache under generic RX reserve, its usage is bounded by the high reclaim watermark, and thus does not need further accounting. Reserve the ip-fragement caches under SKB data reserve, these add to the SKB RX limit. By ensuring we can at least receive as much data as fits in the reassmbly line we avoid fragment attack deadlocks. Use proc conv() routines to update these limits and return -ENOMEM to user space. Adds to the reserve tree: total network reserve network TX reserve protocol TX pages network RX reserve + IPv6 route cache + IPv4 route cache SKB data reserve + IPv6 fragment cache + IPv4 fragment cache Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- net/ipv4/ip_fragment.c |7 net/ipv4/route.c | 64 +++-- net/ipv4/sysctl_net_ipv4.c | 57 ++-- net/ipv6/reassembly.c |7 net/ipv6/route.c | 64 +++-- net/ipv6/sysctl_net_ipv6.c | 57 ++-- 6 files changed, 248 insertions(+), 8 deletions(-) Index: linux-2.6/net/ipv4/sysctl_net_ipv4.c === --- linux-2.6.orig/net/ipv4/sysctl_net_ipv4.c +++ linux-2.6/net/ipv4/sysctl_net_ipv4.c @@ -21,6 +21,7 @@ #include net/tcp.h #include net/cipso_ipv4.h #include net/inet_frag.h +#include linux/reserve.h static int zero; static int tcp_retr1_max = 255; @@ -192,6 +193,57 @@ static int strategy_allowed_congestion_c } +static int ipv4_frag_bytes; +extern struct mem_reserve ipv4_frag_reserve; + +static int proc_dointvec_fragment(struct ctl_table *table, int write, + struct file *filp, void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int old_bytes, ret; + + if (!write) + ipv4_frag_bytes = ip4_frags_ctl.high_thresh; + old_bytes = ipv4_frag_bytes; + + ret = proc_dointvec(table, write, filp, buffer, lenp, ppos); + + if (!ret write) { + ret = mem_reserve_kmalloc_set(ipv4_frag_reserve, ipv4_frag_bytes); + if (!ret) + ip4_frags_ctl.high_thresh = ipv4_frag_bytes; + else + ipv4_frag_bytes = old_bytes; + } + + return ret; +} + +static int 
sysctl_intvec_fragment(struct ctl_table *table, + int __user *name, int nlen, + void __user *oldval, size_t __user *oldlenp, + void __user *newval, size_t newlen) +{ + int old_bytes, ret; + int write = (newval newlen); + + if (!write) + ipv4_frag_bytes = ip4_frags_ctl.high_thresh; + old_bytes = ipv4_frag_bytes; + + ret = sysctl_intvec(table, name, nlen, oldval, oldlenp, newval, newlen); + + if (!ret write) { + ret = mem_reserve_kmalloc_set(ipv4_frag_reserve, ipv4_frag_bytes); + if (!ret) + ip4_frags_ctl.high_thresh = ipv4_frag_bytes; + else + ipv4_frag_bytes = old_bytes; + } + + return ret; +} + static struct ctl_table ipv4_table[] = { { .ctl_name = NET_IPV4_TCP_TIMESTAMPS, @@ -285,10 +337,11 @@ static struct ctl_table ipv4_table[] = { { .ctl_name = NET_IPV4_IPFRAG_HIGH_THRESH, .procname = ipfrag_high_thresh, - .data = ip4_frags_ctl.high_thresh, + .data = ipv4_frag_bytes, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_dointvec + .proc_handler = proc_dointvec_fragment, + .strategy = sysctl_intvec_fragment, }, { .ctl_name = NET_IPV4_IPFRAG_LOW_THRESH, Index: linux-2.6/net/ipv6/sysctl_net_ipv6.c === --- linux-2.6.orig/net/ipv6/sysctl_net_ipv6.c +++ linux-2.6/net/ipv6/sysctl_net_ipv6.c @@ -13,6 +13,58 @@ #include net/ipv6.h #include net/addrconf.h #include net/inet_frag.h +#include linux/reserve.h + +static int ipv6_frag_bytes; +extern struct mem_reserve ipv6_frag_reserve; + +static int proc_dointvec_fragment(struct ctl_table *table, int write, + struct file *filp, void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int old_bytes, ret; + + if (!write) + ipv6_frag_bytes = ip6_frags_ctl.high_thresh; + old_bytes = ipv6_frag_bytes; + + ret = proc_dointvec(table, write, filp, buffer, lenp, ppos); + + if (!ret write) { + ret =
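Both the proc and sysctl handlers above follow the same commit-or-rollback shape: run the normal integer conversion, try to resize the reserve, and only publish the new `high_thresh` when the reserve charge succeeded, leaving the visible value untouched otherwise. A sketch of that pattern with a hypothetical `try_reserve()` in place of `mem_reserve_kmalloc_set()`:

```c
#include <assert.h>
#include <errno.h>

static int reserve_capacity = 4096; /* pretend ceiling on the reserve */
static int high_thresh = 1024;      /* the value the sysctl really controls */

/* Stand-in for mem_reserve_kmalloc_set(): fail when the request is too big. */
static int try_reserve(int bytes)
{
	return bytes <= reserve_capacity ? 0 : -ENOMEM;
}

/* Mirrors proc_dointvec_fragment(): commit the new threshold only after the
 * reserve was successfully grown; on failure the old value stays visible. */
static int set_frag_bytes(int new_bytes)
{
	int ret = try_reserve(new_bytes);
	if (!ret)
		high_thresh = new_bytes; /* commit */
	/* rollback is implicit: high_thresh was never changed */
	return ret;
}
```

This is what lets the patch return -ENOMEM to user space from a sysctl write instead of silently accepting a threshold the reserve cannot back.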
[PATCH 15/29] netvm: network reserve infrastructure
Provide the basic infrastructure to reserve and charge/account network memory. We provide the following reserve tree: 1) total network reserve 2)network TX reserve 3) protocol TX pages 4)network RX reserve 5) SKB data reserve [1] is used to make all the network reserves a single subtree, for easy manipulation. [2] and [4] are merely for eastetic reasons. The TX pages reserve [3] is assumed bounded by it being the upper bound of memory that can be used for sending pages (not quite true, but good enough) The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data against in the fallback path. The consumers for these reserves are sockets marked with: SOCK_MEMALLOC Such sockets are to be used to service the VM (iow. to swap over). They must be handled kernel side, exposing such a socket to user-space is a BUG. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/net/sock.h | 35 +++- net/Kconfig|3 + net/core/sock.c| 113 + 3 files changed, 150 insertions(+), 1 deletion(-) Index: linux-2.6/include/net/sock.h === --- linux-2.6.orig/include/net/sock.h +++ linux-2.6/include/net/sock.h @@ -51,6 +51,7 @@ #include linux/skbuff.h /* struct sk_buff */ #include linux/mm.h #include linux/security.h +#include linux/reserve.h #include linux/filter.h @@ -403,6 +404,7 @@ enum sock_flags { SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */ SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */ SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */ + SOCK_MEMALLOC, /* the VM depends on us - make sure we're serviced */ }; static inline void sock_copy_flags(struct sock *nsk, struct sock *osk) @@ -425,9 +427,40 @@ static inline int sock_flag(struct sock return test_bit(flag, sk-sk_flags); } +static inline int sk_has_memalloc(struct sock *sk) +{ + return sock_flag(sk, SOCK_MEMALLOC); +} + +/* + * Guestimate the per request queue TX upper bound. + * + * Max packet size is 64k, and we need to reserve that much since the data + * might need to bounce it. 
Double it to be on the safe side. + */ +#define TX_RESERVE_PAGES DIV_ROUND_UP(2*65536, PAGE_SIZE) + +extern atomic_t memalloc_socks; + +extern struct mem_reserve net_rx_reserve; +extern struct mem_reserve net_skb_reserve; + +static inline int sk_memalloc_socks(void) +{ + return atomic_read(memalloc_socks); +} + +extern int rx_emergency_get(int bytes); +extern int rx_emergency_get_overcommit(int bytes); +extern void rx_emergency_put(int bytes); + +extern int sk_adjust_memalloc(int socks, long tx_reserve_pages); +extern int sk_set_memalloc(struct sock *sk); +extern int sk_clear_memalloc(struct sock *sk); + static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask) { - return gfp_mask; + return gfp_mask | (sk-sk_allocation __GFP_MEMALLOC); } static inline void sk_acceptq_removed(struct sock *sk) Index: linux-2.6/net/core/sock.c === --- linux-2.6.orig/net/core/sock.c +++ linux-2.6/net/core/sock.c @@ -112,6 +112,7 @@ #include linux/tcp.h #include linux/init.h #include linux/highmem.h +#include linux/reserve.h #include asm/uaccess.h #include asm/system.h @@ -213,6 +214,111 @@ __u32 sysctl_rmem_default __read_mostly /* Maximal space eaten by iovec or ancilliary data plus some space */ int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512); +atomic_t memalloc_socks; + +static struct mem_reserve net_reserve; +struct mem_reserve net_rx_reserve; +struct mem_reserve net_skb_reserve; +static struct mem_reserve net_tx_reserve; +static struct mem_reserve net_tx_pages; + +EXPORT_SYMBOL_GPL(net_rx_reserve); /* modular ipv6 only */ +EXPORT_SYMBOL_GPL(net_skb_reserve); /* modular ipv6 only */ + +/* + * is there room for another emergency packet? 
+ */ +static int __rx_emergency_get(int bytes, bool overcommit) +{ + return mem_reserve_kmalloc_charge(net_skb_reserve, bytes, overcommit); +} + +int rx_emergency_get(int bytes) +{ + return __rx_emergency_get(bytes, false); +} + +int rx_emergency_get_overcommit(int bytes) +{ + return __rx_emergency_get(bytes, true); +} + +void rx_emergency_put(int bytes) +{ + mem_reserve_kmalloc_charge(net_skb_reserve, -bytes, 0); +} + +/** + * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX + * @socks: number of new %SOCK_MEMALLOC sockets + * @tx_resserve_pages: number of pages to (un)reserve for TX + * + * This function adjusts the memalloc reserve based on system demand. + * The RX reserve is a limit, and only added once, not for each socket. + * + * NOTE: + *@tx_reserve_pages is an upper-bound of memory used for TX hence + *we need not account the pages like we do for RX pages.
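The TX_RESERVE_PAGES bound is plain arithmetic: 64 KiB maximum packet size, doubled for a possible bounce copy, rounded up to whole pages. With a 4 KiB page that is 32 pages, which the following sketch (assuming PAGE_SIZE is 4096 for the example) checks:

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define TOY_PAGE_SIZE 4096 /* assume 4 KiB pages for this example */

/* Same expression as the TX_RESERVE_PAGES definition in the patch:
 * two maximal 64k packets' worth of pages. */
enum { TOY_TX_RESERVE_PAGES = DIV_ROUND_UP(2 * 65536, TOY_PAGE_SIZE) };
```

On architectures with larger pages the count shrinks accordingly; the reserve size in bytes stays 128 KiB either way.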
[PATCH 29/29] nfs: fix various memory recursions possible with swap over NFS.
GFP_NOFS is not enough, since swap traffic is IO, hence fall back to GFP_NOIO.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/pagelist.c |    2 +-
 fs/nfs/write.c    |    6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -44,7 +44,7 @@ static struct kmem_cache *nfs_wdata_cach
 struct nfs_write_data *nfs_commit_alloc(void)
 {
-	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);

 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -68,7 +68,7 @@ void nfs_commit_free(struct nfs_write_da
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+	struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);

 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -77,7 +77,7 @@ struct nfs_write_data *nfs_writedata_all
 		if (pagecount <= ARRAY_SIZE(p->page_array))
 			p->pagevec = p->page_array;
 		else {
-			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
+			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOIO);
 			if (!p->pagevec) {
 				kmem_cache_free(nfs_wdata_cachep, p);
 				p = NULL;

Index: linux-2.6/fs/nfs/pagelist.c
===
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -27,7 +27,7 @@ static inline struct nfs_page *
 nfs_page_alloc(void)
 {
 	struct nfs_page	*p;
-	p = kmem_cache_alloc(nfs_page_cachep, GFP_KERNEL);
+	p = kmem_cache_alloc(nfs_page_cachep, GFP_NOIO);
 	if (p) {
 		memset(p, 0, sizeof(*p));
 		INIT_LIST_HEAD(&p->wb_list);
--
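Why GFP_NOIO is the right fallback: in the 2.6-era flag layout each level strips one reclaim capability. GFP_KERNEL may recurse into filesystem and block I/O, GFP_NOFS only into block I/O, GFP_NOIO into neither, which is what an allocation on the swap-out path needs. A sketch using the flag values of that era's include/linux/gfp.h (values shown for illustration):

```c
#include <assert.h>

/* Flag values as in 2.6-era include/linux/gfp.h */
#define __GFP_WAIT 0x10u /* may sleep and reclaim */
#define __GFP_IO   0x40u /* may start block I/O during reclaim */
#define __GFP_FS   0x80u /* may call into filesystem code during reclaim */

#define GFP_NOIO   (__GFP_WAIT)
#define GFP_NOFS   (__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
```

Since swap traffic *is* I/O, an allocation made while writing swap pages must not itself start more I/O to make progress; GFP_NOFS still permits that, GFP_NOIO does not.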
[PATCH 25/29] nfs: remove mempools
With the introduction of the shared dirty page accounting in .19, NFS should not be able to surpise the VM with all dirty pages. Thus it should always be able to free some memory. Hence no more need for mempools. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- fs/nfs/read.c | 15 +++ fs/nfs/write.c | 27 +-- 2 files changed, 8 insertions(+), 34 deletions(-) Index: linux-2.6/fs/nfs/read.c === --- linux-2.6.orig/fs/nfs/read.c +++ linux-2.6/fs/nfs/read.c @@ -33,13 +33,10 @@ static const struct rpc_call_ops nfs_rea static const struct rpc_call_ops nfs_read_full_ops; static struct kmem_cache *nfs_rdata_cachep; -static mempool_t *nfs_rdata_mempool; - -#define MIN_POOL_READ (32) struct nfs_read_data *nfs_readdata_alloc(unsigned int pagecount) { - struct nfs_read_data *p = mempool_alloc(nfs_rdata_mempool, GFP_NOFS); + struct nfs_read_data *p = kmem_cache_alloc(nfs_rdata_cachep, GFP_NOFS); if (p) { memset(p, 0, sizeof(*p)); @@ -50,7 +47,7 @@ struct nfs_read_data *nfs_readdata_alloc else { p-pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS); if (!p-pagevec) { - mempool_free(p, nfs_rdata_mempool); + kmem_cache_free(nfs_rdata_cachep, p); p = NULL; } } @@ -63,7 +60,7 @@ static void nfs_readdata_rcu_free(struct struct nfs_read_data *p = container_of(head, struct nfs_read_data, task.u.tk_rcu); if (p (p-pagevec != p-page_array[0])) kfree(p-pagevec); - mempool_free(p, nfs_rdata_mempool); + kmem_cache_free(nfs_rdata_cachep, p); } static void nfs_readdata_free(struct nfs_read_data *rdata) @@ -597,16 +594,10 @@ int __init nfs_init_readpagecache(void) if (nfs_rdata_cachep == NULL) return -ENOMEM; - nfs_rdata_mempool = mempool_create_slab_pool(MIN_POOL_READ, -nfs_rdata_cachep); - if (nfs_rdata_mempool == NULL) - return -ENOMEM; - return 0; } void nfs_destroy_readpagecache(void) { - mempool_destroy(nfs_rdata_mempool); kmem_cache_destroy(nfs_rdata_cachep); } Index: linux-2.6/fs/nfs/write.c === --- linux-2.6.orig/fs/nfs/write.c +++ linux-2.6/fs/nfs/write.c @@ -28,9 +28,6 @@ 
#define NFSDBG_FACILITYNFSDBG_PAGECACHE -#define MIN_POOL_WRITE (32) -#define MIN_POOL_COMMIT(4) - /* * Local function declarations */ @@ -44,12 +41,10 @@ static const struct rpc_call_ops nfs_wri static const struct rpc_call_ops nfs_commit_ops; static struct kmem_cache *nfs_wdata_cachep; -static mempool_t *nfs_wdata_mempool; -static mempool_t *nfs_commit_mempool; struct nfs_write_data *nfs_commit_alloc(void) { - struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS); + struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS); if (p) { memset(p, 0, sizeof(*p)); @@ -63,7 +58,7 @@ static void nfs_commit_rcu_free(struct r struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu); if (p (p-pagevec != p-page_array[0])) kfree(p-pagevec); - mempool_free(p, nfs_commit_mempool); + kmem_cache_free(nfs_wdata_cachep, p); } void nfs_commit_free(struct nfs_write_data *wdata) @@ -73,7 +68,7 @@ void nfs_commit_free(struct nfs_write_da struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount) { - struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS); + struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS); if (p) { memset(p, 0, sizeof(*p)); @@ -84,7 +79,7 @@ struct nfs_write_data *nfs_writedata_all else { p-pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS); if (!p-pagevec) { - mempool_free(p, nfs_wdata_mempool); + kmem_cache_free(nfs_wdata_cachep, p); p = NULL; } } @@ -97,7 +92,7 @@ static void nfs_writedata_rcu_free(struc struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu); if (p (p-pagevec != p-page_array[0])) kfree(p-pagevec); - mempool_free(p, nfs_wdata_mempool); + kmem_cache_free(nfs_wdata_cachep, p); } static void nfs_writedata_free(struct nfs_write_data *wdata) @@ -1474,16 +1469,6 @@ int __init nfs_init_writepagecache(void) if (nfs_wdata_cachep == NULL) return -ENOMEM; -
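What the removed mempools provided was a private stash of preallocated objects that allocation fell back to when the slab allocator failed; the patch argues shared dirty-page accounting makes that guarantee unnecessary for NFS. A toy model of the fallback semantics being dropped (hypothetical types, not the real mempool API):

```c
#include <assert.h>
#include <stdlib.h>

#define POOL_MIN 4

struct toy_mempool {
	void *stash[POOL_MIN]; /* preallocated emergency objects */
	int nr_free;
};

/* Like mempool_alloc(): try the normal allocator first, and only dip
 * into the preallocated stash when it fails. */
static void *toy_mempool_alloc(struct toy_mempool *pool, void *(*alloc)(void))
{
	void *p = alloc();
	if (p)
		return p;
	if (pool->nr_free > 0)
		return pool->stash[--pool->nr_free];
	return NULL;
}

/* Simulate the allocator failing under memory pressure. */
static void *failing_alloc(void) { return NULL; }
```

After the patch, `nfs_readdata_alloc()` and friends simply fail when `kmem_cache_alloc()` fails, on the theory that the VM can always free *some* memory because NFS can no longer fill it with undiscovered dirty pages.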
[PATCH 18/29] netvm: filter emergency skbs.
Toss all emergency packets not for a SOCK_MEMALLOC socket. This ensures our
precious memory reserve doesn't get stuck waiting for user-space.

The correctness of this approach relies on the fact that networks must be
assumed lossy.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -930,6 +930,9 @@ static inline int sk_filter(struct sock
 {
 	int err;
 	struct sk_filter *filter;
+
+	if (skb_emergency(skb) && !sk_has_memalloc(sk))
+		return -ENOMEM;

 	err = security_sock_rcv_skb(sk, skb);
 	if (err)
--
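The added check, as a standalone predicate: a packet built from the emergency reserve that is headed anywhere but a SOCK_MEMALLOC socket is dropped with -ENOMEM before normal filtering even runs. A userspace sketch of just that decision:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Mirrors the check the patch adds to sk_filter(): emergency skbs may
 * only reach sockets flagged SOCK_MEMALLOC; everything else is tossed,
 * which is safe because the network is assumed lossy anyway. */
static int toy_filter_emergency(bool skb_is_emergency, bool sk_is_memalloc)
{
	if (skb_is_emergency && !sk_is_memalloc)
		return -ENOMEM;
	return 0; /* continue with normal socket filtering */
}
```

The drop is what keeps reserve memory from being pinned in receive queues that only user space (which may itself be blocked on memory) could drain.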
[PATCH 26/29] nfs: teach the NFS client how to treat PG_swapcache pages
Replace all relevant occurences of page-index and page-mapping in the NFS client with the new page_file_index() and page_file_mapping() functions. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- fs/nfs/file.c |8 fs/nfs/internal.h |7 --- fs/nfs/pagelist.c |6 +++--- fs/nfs/read.c |6 +++--- fs/nfs/write.c| 49 + 5 files changed, 39 insertions(+), 37 deletions(-) Index: linux-2.6/fs/nfs/file.c === --- linux-2.6.orig/fs/nfs/file.c +++ linux-2.6/fs/nfs/file.c @@ -357,7 +357,7 @@ static void nfs_invalidate_page(struct p if (offset != 0) return; /* Cancel any unstarted writes on this page */ - nfs_wb_page_cancel(page-mapping-host, page); + nfs_wb_page_cancel(page_file_mapping(page)-host, page); } static int nfs_release_page(struct page *page, gfp_t gfp) @@ -368,7 +368,7 @@ static int nfs_release_page(struct page static int nfs_launder_page(struct page *page) { - return nfs_wb_page(page-mapping-host, page); + return nfs_wb_page(page_file_mapping(page)-host, page); } const struct address_space_operations nfs_file_aops = { @@ -397,13 +397,13 @@ static int nfs_vm_page_mkwrite(struct vm loff_t offset; lock_page(page); - mapping = page-mapping; + mapping = page_file_mapping(page); if (mapping != vma-vm_file-f_path.dentry-d_inode-i_mapping) { unlock_page(page); return -EINVAL; } pagelen = nfs_page_length(page); - offset = (loff_t)page-index PAGE_CACHE_SHIFT; + offset = (loff_t)page_file_index(page) PAGE_CACHE_SHIFT; unlock_page(page); /* Index: linux-2.6/fs/nfs/pagelist.c === --- linux-2.6.orig/fs/nfs/pagelist.c +++ linux-2.6/fs/nfs/pagelist.c @@ -77,11 +77,11 @@ nfs_create_request(struct nfs_open_conte * update_nfs_request below if the region is not locked. 
*/ req-wb_page= page; atomic_set(req-wb_complete, 0); - req-wb_index = page-index; + req-wb_index = page_file_index(page); page_cache_get(page); BUG_ON(PagePrivate(page)); BUG_ON(!PageLocked(page)); - BUG_ON(page-mapping-host != inode); + BUG_ON(page_file_mapping(page)-host != inode); req-wb_offset = offset; req-wb_pgbase = offset; req-wb_bytes = count; @@ -383,7 +383,7 @@ void nfs_pageio_cond_complete(struct nfs * nfs_scan_list - Scan a list for matching requests * @nfsi: NFS inode * @dst: Destination list - * @idx_start: lower bound of page-index to scan + * @idx_start: lower bound of page_file_index(page) to scan * @npages: idx_start + npages sets the upper bound to scan. * @tag: tag to scan for * Index: linux-2.6/fs/nfs/read.c === --- linux-2.6.orig/fs/nfs/read.c +++ linux-2.6/fs/nfs/read.c @@ -460,11 +460,11 @@ static const struct rpc_call_ops nfs_rea int nfs_readpage(struct file *file, struct page *page) { struct nfs_open_context *ctx; - struct inode *inode = page-mapping-host; + struct inode *inode = page_file_mapping(page)-host; int error; dprintk(NFS: nfs_readpage (%p [EMAIL PROTECTED])\n, - page, PAGE_CACHE_SIZE, page-index); + page, PAGE_CACHE_SIZE, page_file_index(page)); nfs_inc_stats(inode, NFSIOS_VFSREADPAGE); nfs_add_stats(inode, NFSIOS_READPAGES, 1); @@ -511,7 +511,7 @@ static int readpage_async_filler(void *data, struct page *page) { struct nfs_readdesc *desc = (struct nfs_readdesc *)data; - struct inode *inode = page-mapping-host; + struct inode *inode = page_file_mapping(page)-host; struct nfs_page *new; unsigned int len; int error; Index: linux-2.6/fs/nfs/write.c === --- linux-2.6.orig/fs/nfs/write.c +++ linux-2.6/fs/nfs/write.c @@ -126,7 +126,7 @@ static struct nfs_page *nfs_page_find_re static struct nfs_page *nfs_page_find_request(struct page *page) { - struct inode *inode = page-mapping-host; + struct inode *inode = page_file_mapping(page)-host; struct nfs_page *req = NULL; spin_lock(inode-i_lock); @@ -138,13 +138,13 @@ static struct 
nfs_page *nfs_page_find_re /* Adjust the file length if we're writing beyond the end */ static void nfs_grow_file(struct page *page, unsigned int offset, unsigned int count) { - struct inode *inode = page-mapping-host; + struct inode *inode = page_file_mapping(page)-host; loff_t end, i_size = i_size_read(inode); pgoff_t end_index = (i_size - 1) PAGE_CACHE_SHIFT; - if (i_size 0 page-index end_index) + if (i_size 0
[PATCH 24/29] mm: methods for teaching filesystems about PG_swapcache pages
In order to teach filesystems to handle swap cache pages, two new page functions are introduced: pgoff_t page_file_index(struct page *); struct address_space *page_file_mapping(struct page *); page_file_index - gives the offset of this page in the file in PAGE_CACHE_SIZE blocks. Like page-index is for mapped pages, this function also gives the correct index for PG_swapcache pages. page_file_mapping - gives the mapping backing the actual page; that is for swap cache pages it will give swap_file-f_mapping. page_offset() is modified to use page_file_index(), so that it will give the expected result, even for PG_swapcache pages. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/mm.h | 26 ++ include/linux/pagemap.h |2 +- 2 files changed, 27 insertions(+), 1 deletion(-) Index: linux-2.6/include/linux/mm.h === --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -14,6 +14,7 @@ #include linux/mm_types.h #include linux/security.h #include linux/swap.h +#include linux/fs.h struct mempolicy; struct anon_vma; @@ -608,6 +609,16 @@ static inline struct swap_info_struct *p return get_swap_info_struct(swp_type(swap)); } +static inline +struct address_space *page_file_mapping(struct page *page) +{ +#ifdef CONFIG_SWAP_FILE + if (unlikely(PageSwapCache(page))) + return page_swap_info(page)-swap_file-f_mapping; +#endif + return page-mapping; +} + static inline int PageAnon(struct page *page) { return ((unsigned long)page-mapping PAGE_MAPPING_ANON) != 0; @@ -625,6 +636,21 @@ static inline pgoff_t page_index(struct } /* + * Return the file index of the page. 
Regular pagecache pages use -index + * whereas swapcache pages use swp_offset(-private) + */ +static inline pgoff_t page_file_index(struct page *page) +{ +#ifdef CONFIG_SWAP_FILE + if (unlikely(PageSwapCache(page))) { + swp_entry_t swap = { .val = page_private(page) }; + return swp_offset(swap); + } +#endif + return page-index; +} + +/* * The atomic page-_mapcount, like _count, starts from -1: * so that transitions both from it and to it can be tracked, * using atomic_inc_and_test and atomic_add_negative(-1). Index: linux-2.6/include/linux/pagemap.h === --- linux-2.6.orig/include/linux/pagemap.h +++ linux-2.6/include/linux/pagemap.h @@ -145,7 +145,7 @@ extern void __remove_from_page_cache(str */ static inline loff_t page_offset(struct page *page) { - return ((loff_t)page-index) PAGE_CACHE_SHIFT; + return ((loff_t)page_file_index(page)) PAGE_CACHE_SHIFT; } static inline pgoff_t linear_page_index(struct vm_area_struct *vma, -- -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
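The two helpers in userspace form: a toy page either carries a file index directly or, for a swap cache page, a packed swap entry in ->private from which the offset is decoded, exactly the branch structure of page_file_index() above (the types are simplified stand-ins for the real ones):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SWAPFILES_SHIFT 5

typedef unsigned long pgoff_t;
typedef struct { unsigned long val; } swp_entry_t;

#define SWP_TYPE_SHIFT(e)  (sizeof((e).val) * 8 - MAX_SWAPFILES_SHIFT)
#define SWP_OFFSET_MASK(e) ((1UL << SWP_TYPE_SHIFT(e)) - 1)

static pgoff_t swp_offset(swp_entry_t entry)
{
	return entry.val & SWP_OFFSET_MASK(entry);
}

struct toy_page {
	bool swapcache;        /* stands in for PageSwapCache() */
	pgoff_t index;         /* page->index */
	unsigned long private; /* packed swp_entry_t for swapcache pages */
};

/* Mirrors page_file_index(): regular pages use ->index, swapcache pages
 * decode the swap offset from ->private. */
static pgoff_t toy_page_file_index(struct toy_page *page)
{
	if (page->swapcache) {
		swp_entry_t swap = { .val = page->private };
		return swp_offset(swap);
	}
	return page->index;
}
```

With this in place, `page_offset()` gives the byte offset into the *swap file* for swap cache pages, which is what lets the NFS write path treat them like ordinary file pages.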
[PATCH 22/29] mm: prepare swap entry methods for use in page methods
Move around the swap entry methods in preparation for use from page methods. Also provide a function to obtain the swap_info_struct backing a swap cache page. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/mm.h |8 +++ include/linux/swap.h| 49 include/linux/swapops.h | 44 --- mm/swapfile.c |1 4 files changed, 58 insertions(+), 44 deletions(-) Index: linux-2.6/include/linux/mm.h === --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -13,6 +13,7 @@ #include linux/debug_locks.h #include linux/mm_types.h #include linux/security.h +#include linux/swap.h struct mempolicy; struct anon_vma; @@ -600,6 +601,13 @@ static inline struct address_space *page return mapping; } +static inline struct swap_info_struct *page_swap_info(struct page *page) +{ + swp_entry_t swap = { .val = page_private(page) }; + BUG_ON(!PageSwapCache(page)); + return get_swap_info_struct(swp_type(swap)); +} + static inline int PageAnon(struct page *page) { return ((unsigned long)page-mapping PAGE_MAPPING_ANON) != 0; Index: linux-2.6/include/linux/swap.h === --- linux-2.6.orig/include/linux/swap.h +++ linux-2.6/include/linux/swap.h @@ -80,6 +80,50 @@ typedef struct { } swp_entry_t; /* + * swapcache pages are stored in the swapper_space radix tree. We want to + * get good packing density in that tree, so the index should be dense in + * the low-order bits. + * + * We arrange the `type' and `offset' fields so that `type' is at the five + * high-order bits of the swp_entry_t and `offset' is right-aligned in the + * remaining bits. + * + * swp_entry_t's are *never* stored anywhere in their arch-dependent format. 
+ */ +#define SWP_TYPE_SHIFT(e) (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT) +#define SWP_OFFSET_MASK(e) ((1UL SWP_TYPE_SHIFT(e)) - 1) + +/* + * Store a type+offset into a swp_entry_t in an arch-independent format + */ +static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset) +{ + swp_entry_t ret; + + ret.val = (type SWP_TYPE_SHIFT(ret)) | + (offset SWP_OFFSET_MASK(ret)); + return ret; +} + +/* + * Extract the `type' field from a swp_entry_t. The swp_entry_t is in + * arch-independent format + */ +static inline unsigned swp_type(swp_entry_t entry) +{ + return (entry.val SWP_TYPE_SHIFT(entry)); +} + +/* + * Extract the `offset' field from a swp_entry_t. The swp_entry_t is in + * arch-independent format + */ +static inline pgoff_t swp_offset(swp_entry_t entry) +{ + return entry.val SWP_OFFSET_MASK(entry); +} + +/* * current-reclaim_state points to one of these when a task is running * memory reclaim */ @@ -321,6 +365,11 @@ static inline struct page *lookup_swap_c return NULL; } +static inline struct swap_info_struct *get_swap_info_struct(unsigned type) +{ + return NULL; +} + #define can_share_swap_page(p) (page_mapcount(p) == 1) static inline int move_to_swap_cache(struct page *page, swp_entry_t entry) Index: linux-2.6/include/linux/swapops.h === --- linux-2.6.orig/include/linux/swapops.h +++ linux-2.6/include/linux/swapops.h @@ -1,47 +1,3 @@ -/* - * swapcache pages are stored in the swapper_space radix tree. We want to - * get good packing density in that tree, so the index should be dense in - * the low-order bits. - * - * We arrange the `type' and `offset' fields so that `type' is at the five - * high-order bits of the swp_entry_t and `offset' is right-aligned in the - * remaining bits. - * - * swp_entry_t's are *never* stored anywhere in their arch-dependent format. 
- */ -#define SWP_TYPE_SHIFT(e) (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT) -#define SWP_OFFSET_MASK(e) ((1UL SWP_TYPE_SHIFT(e)) - 1) - -/* - * Store a type+offset into a swp_entry_t in an arch-independent format - */ -static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset) -{ - swp_entry_t ret; - - ret.val = (type SWP_TYPE_SHIFT(ret)) | - (offset SWP_OFFSET_MASK(ret)); - return ret; -} - -/* - * Extract the `type' field from a swp_entry_t. The swp_entry_t is in - * arch-independent format - */ -static inline unsigned swp_type(swp_entry_t entry) -{ - return (entry.val SWP_TYPE_SHIFT(entry)); -} - -/* - * Extract the `offset' field from a swp_entry_t. The swp_entry_t is in - * arch-independent format - */ -static inline pgoff_t swp_offset(swp_entry_t entry) -{ - return entry.val SWP_OFFSET_MASK(entry); -} - /* check whether a pte points to a swap entry */ static inline int is_swap_pte(pte_t pte) { Index: linux-2.6/mm/swapfile.c
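The moved helpers pack `type` into the top MAX_SWAPFILES_SHIFT (five) bits of the entry and `offset` into the rest, keeping swap cache indices dense in the low-order bits. A quick userspace round-trip of the same expressions:

```c
#include <assert.h>

#define MAX_SWAPFILES_SHIFT 5

typedef unsigned long pgoff_t;
typedef struct { unsigned long val; } swp_entry_t;

#define SWP_TYPE_SHIFT(e)  (sizeof((e).val) * 8 - MAX_SWAPFILES_SHIFT)
#define SWP_OFFSET_MASK(e) ((1UL << SWP_TYPE_SHIFT(e)) - 1)

/* Pack type+offset into the arch-independent format: type occupies the
 * five high-order bits, offset is right-aligned in the remainder. */
static swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
{
	swp_entry_t ret;

	ret.val = (type << SWP_TYPE_SHIFT(ret)) |
		  (offset & SWP_OFFSET_MASK(ret));
	return ret;
}

static unsigned swp_type(swp_entry_t entry)
{
	return entry.val >> SWP_TYPE_SHIFT(entry);
}

static pgoff_t swp_offset(swp_entry_t entry)
{
	return entry.val & SWP_OFFSET_MASK(entry);
}
```

Moving these from swapops.h into swap.h is what lets `page_file_index()` in mm.h (patch 24) decode a swap cache page's offset without new header dependencies.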
[PATCH 21/29] netvm: skb processing
In order to make sure emergency packets receive all memory needed to proceed ensure processing of emergency SKBs happens under PF_MEMALLOC. Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing. Skip taps, since those are user-space again. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/net/sock.h |5 net/core/dev.c | 59 +++-- net/core/sock.c| 18 3 files changed, 76 insertions(+), 6 deletions(-) Index: linux-2.6/net/core/dev.c === --- linux-2.6.orig/net/core/dev.c +++ linux-2.6/net/core/dev.c @@ -2008,6 +2008,30 @@ out: } #endif +/* + * Filter the protocols for which the reserves are adequate. + * + * Before adding a protocol make sure that it is either covered by the existing + * reserves, or add reserves covering the memory need of the new protocol's + * packet processing. + */ +static int skb_emergency_protocol(struct sk_buff *skb) +{ + if (skb_emergency(skb)) + switch(skb-protocol) { + case __constant_htons(ETH_P_ARP): + case __constant_htons(ETH_P_IP): + case __constant_htons(ETH_P_IPV6): + case __constant_htons(ETH_P_8021Q): + break; + + default: + return 0; + } + + return 1; +} + /** * netif_receive_skb - process receive buffer from network * @skb: buffer to process @@ -2029,10 +2053,23 @@ int netif_receive_skb(struct sk_buff *sk struct net_device *orig_dev; int ret = NET_RX_DROP; __be16 type; + unsigned long pflags = current-flags; + + /* Emergency skb are special, they should +* - be delivered to SOCK_MEMALLOC sockets only +* - stay away from userspace +* - have bounded memory usage +* +* Use PF_MEMALLOC as a poor mans memory pool - the grouping kind. +* This saves us from propagating the allocation context down to all +* allocation sites. 
+*/ + if (skb_emergency(skb)) + current-flags |= PF_MEMALLOC; /* if we've gotten here through NAPI, check netpoll */ if (netpoll_receive_skb(skb)) - return NET_RX_DROP; + goto out; if (!skb-tstamp.tv64) net_timestamp(skb); @@ -2043,7 +2080,7 @@ int netif_receive_skb(struct sk_buff *sk orig_dev = skb_bond(skb); if (!orig_dev) - return NET_RX_DROP; + goto out; __get_cpu_var(netdev_rx_stat).total++; @@ -2062,6 +2099,9 @@ int netif_receive_skb(struct sk_buff *sk } #endif + if (skb_emergency(skb)) + goto skip_taps; + list_for_each_entry_rcu(ptype, ptype_all, list) { if (!ptype-dev || ptype-dev == skb-dev) { if (pt_prev) @@ -2070,19 +2110,23 @@ int netif_receive_skb(struct sk_buff *sk } } +skip_taps: #ifdef CONFIG_NET_CLS_ACT skb = handle_ing(skb, pt_prev, ret, orig_dev); if (!skb) - goto out; + goto unlock; ncls: #endif + if (!skb_emergency_protocol(skb)) + goto drop; + skb = handle_bridge(skb, pt_prev, ret, orig_dev); if (!skb) - goto out; + goto unlock; skb = handle_macvlan(skb, pt_prev, ret, orig_dev); if (!skb) - goto out; + goto unlock; type = skb-protocol; list_for_each_entry_rcu(ptype, @@ -2098,6 +2142,7 @@ ncls: if (pt_prev) { ret = pt_prev-func(skb, skb-dev, pt_prev, orig_dev); } else { +drop: kfree_skb(skb); /* Jamal, now you will not able to escape explaining * me how you were going to use this. 
:-) @@ -2105,8 +2150,10 @@ ncls: ret = NET_RX_DROP; } -out: +unlock: rcu_read_unlock(); +out: + tsk_restore_flags(current, pflags, PF_MEMALLOC); return ret; } Index: linux-2.6/include/net/sock.h === --- linux-2.6.orig/include/net/sock.h +++ linux-2.6/include/net/sock.h @@ -529,8 +529,13 @@ static inline void sk_add_backlog(struct skb-next = NULL; } +extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb); + static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb) { + if (skb_emergency(skb)) + return __sk_backlog_rcv(sk, skb); + return sk-sk_backlog_rcv(sk, skb); } Index: linux-2.6/net/core/sock.c === --- linux-2.6.orig/net/core/sock.c +++ linux-2.6/net/core/sock.c @@ -319,6 +319,24 @@ int sk_clear_memalloc(struct sock *sk) } EXPORT_SYMBOL_GPL(sk_clear_memalloc);
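The netif_receive_skb() changes above bracket emergency-skb processing with PF_MEMALLOC and undo it with tsk_restore_flags(), which puts back only the named flag bits rather than the whole flags word. Below is a hedged userspace sketch of that save/flip/restore pattern; the function names and the PF_MEMALLOC bit value are illustrative, not the kernel's definitions.

```c
/*
 * Sketch of the PF_MEMALLOC bracketing used in netif_receive_skb()
 * above.  restore_flags() models tsk_restore_flags(): only the bits in
 * `mask' are restored from the snapshot, so unrelated flags set during
 * processing survive.  PF_MEMALLOC's value here is an assumption.
 */
#define PF_MEMALLOC 0x00000800u

static unsigned int restore_flags(unsigned int cur, unsigned int saved,
				  unsigned int mask)
{
	/* clear the masked bits, then copy their saved values back in */
	return (cur & ~mask) | (saved & mask);
}

/* Returns the flags word as it stands after the receive path ran. */
static unsigned int receive_path(unsigned int flags, int emergency)
{
	unsigned int pflags = flags;	/* snapshot, like `pflags' above */

	if (emergency)
		flags |= PF_MEMALLOC;	/* allow dipping into reserves */

	/* ... packet processing would run here with `flags' in force ... */

	return restore_flags(flags, pflags, PF_MEMALLOC);
}
```

The point of restoring rather than clearing is that a task which already had PF_MEMALLOC set on entry keeps it on exit.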
[PATCH 08/29] mm: system wide ALLOC_NO_WATERMARK
Change ALLOC_NO_WATERMARK page allocation such that the reserves are system wide (which they are, per setup_per_zone_pages_min()): when we scrape the barrel, do it properly.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/page_alloc.c | 6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1638,6 +1638,12 @@ restart:
 rebalance:
 	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
+		/*
+		 * break out of mempolicy boundaries
+		 */
+		zonelist = NODE_DATA(numa_node_id())->node_zonelists +
+			gfp_zone(gfp_mask);
+
 		/* go through the zonelist yet again, ignoring mins */
 		page = get_page_from_freelist(gfp_mask, order, zonelist,
 				ALLOC_NO_WATERMARKS);
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 03/29] mm: slb: add knowledge of reserve pages
Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation contexts that are entitled to it. This is done to ensure reserve pages don't leak out and get consumed. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/linux/slub_def.h |1 mm/slab.c| 59 +++ mm/slub.c| 27 - 3 files changed, 72 insertions(+), 15 deletions(-) Index: linux-2.6/mm/slub.c === --- linux-2.6.orig/mm/slub.c +++ linux-2.6/mm/slub.c @@ -21,11 +21,12 @@ #include linux/ctype.h #include linux/kallsyms.h #include linux/memory.h +#include internal.h /* * Lock order: * 1. slab_lock(page) - * 2. slab-list_lock + * 2. node-list_lock * * The slab_lock protects operations on the object of a particular * slab and its metadata in the page struct. If the slab lock @@ -1071,7 +1072,7 @@ static void setup_object(struct kmem_cac } static noinline struct page *new_slab(struct kmem_cache *s, - gfp_t flags, int node) + gfp_t flags, int node, int *reserve) { struct page *page; struct kmem_cache_node *n; @@ -1087,6 +1088,7 @@ static noinline struct page *new_slab(st if (!page) goto out; + *reserve = page-reserve; n = get_node(s, page_to_nid(page)); if (n) atomic_long_inc(n-nr_slabs); @@ -1517,11 +1519,12 @@ static noinline unsigned long get_new_sl { struct kmem_cache_cpu *c = *pc; struct page *page; + int reserve; if (gfpflags __GFP_WAIT) local_irq_enable(); - page = new_slab(s, gfpflags, node); + page = new_slab(s, gfpflags, node, reserve); if (gfpflags __GFP_WAIT) local_irq_disable(); @@ -1530,6 +1533,7 @@ static noinline unsigned long get_new_sl return 0; *pc = c = get_cpu_slab(s, smp_processor_id()); + c-reserve = reserve; if (c-page) flush_slab(s, c); c-page = page; @@ -1564,6 +1568,16 @@ static void *__slab_alloc(struct kmem_ca local_irq_save(flags); preempt_enable_no_resched(); #endif + if (unlikely(c-reserve)) { + /* +* If the current slab is a reserve slab and the current +* allocation context does not allow access to the reserves we +* must force an allocation to test the current levels. 
+*/ + if (!(gfp_to_alloc_flags(gfpflags) ALLOC_NO_WATERMARKS)) + goto grow_slab; + } + if (likely(c-page)) { state = slab_lock(c-page); @@ -1586,7 +1600,7 @@ load_freelist: */ VM_BUG_ON(c-page-freelist == c-page-end); - if (unlikely(state SLABDEBUG)) + if (unlikely((state SLABDEBUG) || c-reserve)) goto debug; object = c-page-freelist; @@ -1615,7 +1629,7 @@ grow_slab: /* Perform debugging */ debug: object = c-page-freelist; - if (!alloc_debug_processing(s, c-page, object, addr)) + if ((state SLABDEBUG) !alloc_debug_processing(s, c-page, object, addr)) goto another_slab; c-page-inuse++; @@ -2156,10 +2170,11 @@ static struct kmem_cache_node *early_kme struct page *page; struct kmem_cache_node *n; unsigned long flags; + int reserve; BUG_ON(kmalloc_caches-size sizeof(struct kmem_cache_node)); - page = new_slab(kmalloc_caches, gfpflags, node); + page = new_slab(kmalloc_caches, gfpflags, node, reserve); BUG_ON(!page); if (page_to_nid(page) != node) { Index: linux-2.6/include/linux/slub_def.h === --- linux-2.6.orig/include/linux/slub_def.h +++ linux-2.6/include/linux/slub_def.h @@ -18,6 +18,7 @@ struct kmem_cache_cpu { unsigned int offset;/* Freepointer offset (in word units) */ unsigned int objsize; /* Size of an object (from kmem_cache) */ unsigned int objects; /* Objects per slab (from kmem_cache) */ + int reserve;/* Did the current page come from the reserve */ }; struct kmem_cache_node { Index: linux-2.6/mm/slab.c === --- linux-2.6.orig/mm/slab.c +++ linux-2.6/mm/slab.c @@ -115,6 +115,8 @@ #include asm/tlbflush.h #include asm/page.h +#include internal.h + /* * DEBUG - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE SLAB_POISON. * 0 for faster, smaller code (especially in the critical paths). @@ -265,7 +267,8 @@ struct array_cache { unsigned int avail; unsigned int limit; unsigned int batchcount; - unsigned int touched; +
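The __slab_alloc() hunk above refuses to hand out objects from a slab page that came from the page reserves unless the current allocation context is itself entitled to them, forcing a fresh slab allocation to re-test the watermarks instead. The tiny decision helper below models just that gate; the ALLOC_NO_WATERMARKS value and the helper name are illustrative, not the kernel's.

```c
/*
 * Decision helper modelling the reserve-slab gate in __slab_alloc()
 * above: a reserve slab may only serve contexts that themselves hold
 * ALLOC_NO_WATERMARKS rights; anyone else must grow a new slab so the
 * watermarks get re-checked.  The flag value is an assumption.
 */
#define ALLOC_NO_WATERMARKS 0x04

static int must_grow_slab(int slab_is_reserve, int alloc_flags)
{
	return slab_is_reserve && !(alloc_flags & ALLOC_NO_WATERMARKS);
}
```

This is what keeps reserve pages from leaking out to ordinary allocations once a reserve slab is the current cpu slab.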
[PATCH 27/29] nfs: disable data cache revalidation for swapfiles
Do as Trond suggested: http://lkml.org/lkml/2006/8/25/348 Disable NFS data cache revalidation on swap files since it doesn't really make sense to have other clients change the file while you are using it. Thereby we can stop setting PG_private on swap pages, since there ought to be no further races with invalidate_inode_pages2() to deal with. And since we cannot set PG_private we cannot use page-private (which is already used by PG_swapcache pages anyway) to store the nfs_page. Thus augment the new nfs_page_find_request logic. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- fs/nfs/inode.c |6 fs/nfs/write.c | 73 ++--- 2 files changed, 65 insertions(+), 14 deletions(-) Index: linux-2.6/fs/nfs/inode.c === --- linux-2.6.orig/fs/nfs/inode.c +++ linux-2.6/fs/nfs/inode.c @@ -758,6 +758,12 @@ int nfs_revalidate_mapping_nolock(struct struct nfs_inode *nfsi = NFS_I(inode); int ret = 0; + /* +* swapfiles are not supposed to be shared. +*/ + if (IS_SWAPFILE(inode)) + goto out; + if ((nfsi-cache_validity NFS_INO_REVAL_PAGECACHE) || nfs_attribute_timeout(inode) || NFS_STALE(inode)) { ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode); Index: linux-2.6/fs/nfs/write.c === --- linux-2.6.orig/fs/nfs/write.c +++ linux-2.6/fs/nfs/write.c @@ -112,25 +112,62 @@ static void nfs_context_set_write_error( set_bit(NFS_CONTEXT_ERROR_WRITE, ctx-flags); } -static struct nfs_page *nfs_page_find_request_locked(struct page *page) +static struct nfs_page * +__nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page, int get) { struct nfs_page *req = NULL; - if (PagePrivate(page)) { + if (PagePrivate(page)) req = (struct nfs_page *)page_private(page); - if (req != NULL) - kref_get(req-wb_kref); - } + else if (unlikely(PageSwapCache(page))) + req = radix_tree_lookup(nfsi-nfs_page_tree, page_file_index(page)); + + if (get req) + kref_get(req-wb_kref); + return req; } +static inline struct nfs_page * +nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page) +{ + 
return __nfs_page_find_request_locked(nfsi, page, 1); +} + +static int __nfs_page_has_request(struct page *page) +{ + struct inode *inode = page_file_mapping(page)-host; + struct nfs_page *req = NULL; + + spin_lock(inode-i_lock); + req = __nfs_page_find_request_locked(NFS_I(inode), page, 0); + spin_unlock(inode-i_lock); + + /* +* hole here plugged by the caller holding onto PG_locked +*/ + + return req != NULL; +} + +static inline int nfs_page_has_request(struct page *page) +{ + if (PagePrivate(page)) + return 1; + + if (unlikely(PageSwapCache(page))) + return __nfs_page_has_request(page); + + return 0; +} + static struct nfs_page *nfs_page_find_request(struct page *page) { struct inode *inode = page_file_mapping(page)-host; struct nfs_page *req = NULL; spin_lock(inode-i_lock); - req = nfs_page_find_request_locked(page); + req = nfs_page_find_request_locked(NFS_I(inode), page); spin_unlock(inode-i_lock); return req; } @@ -253,7 +290,7 @@ static int nfs_page_async_flush(struct n spin_lock(inode-i_lock); for(;;) { - req = nfs_page_find_request_locked(page); + req = nfs_page_find_request_locked(nfsi, page); if (req == NULL) { spin_unlock(inode-i_lock); return 0; @@ -370,8 +407,14 @@ static void nfs_inode_add_request(struct if (nfs_have_delegation(inode, FMODE_WRITE)) nfsi-change_attr++; } - SetPagePrivate(req-wb_page); - set_page_private(req-wb_page, (unsigned long)req); + /* +* Swap-space should not get truncated. Hence no need to plug the race +* with invalidate/truncate. 
+*/ + if (likely(!PageSwapCache(req-wb_page))) { + SetPagePrivate(req-wb_page); + set_page_private(req-wb_page, (unsigned long)req); + } nfsi-npages++; kref_get(req-wb_kref); } @@ -387,8 +430,10 @@ static void nfs_inode_remove_request(str BUG_ON (!NFS_WBACK_BUSY(req)); spin_lock(inode-i_lock); - set_page_private(req-wb_page, 0); - ClearPagePrivate(req-wb_page); + if (likely(!PageSwapCache(req-wb_page))) { + set_page_private(req-wb_page, 0); + ClearPagePrivate(req-wb_page); + } radix_tree_delete(nfsi-nfs_page_tree, req-wb_index);
[PATCH 19/29] netvm: prevent a TCP specific deadlock
It could happen that all !SOCK_MEMALLOC sockets have buffered so much data that we're over the global rmem limit. This will prevent SOCK_MEMALLOC buffers from receiving data, which will prevent userspace from running, which is needed to reduce the buffered data. Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- include/net/sock.h |7 --- net/core/stream.c |5 +++-- 2 files changed, 7 insertions(+), 5 deletions(-) Index: linux-2.6/include/net/sock.h === --- linux-2.6.orig/include/net/sock.h +++ linux-2.6/include/net/sock.h @@ -756,7 +756,8 @@ static inline struct inode *SOCK_INODE(s } extern void __sk_stream_mem_reclaim(struct sock *sk); -extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind); +extern int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb, + int size, int kind); #define SK_STREAM_MEM_QUANTUM ((int)PAGE_SIZE) @@ -774,13 +775,13 @@ static inline void sk_stream_mem_reclaim static inline int sk_stream_rmem_schedule(struct sock *sk, struct sk_buff *skb) { return (int)skb-truesize = sk-sk_forward_alloc || - sk_stream_mem_schedule(sk, skb-truesize, 1); + sk_stream_mem_schedule(sk, skb, skb-truesize, 1); } static inline int sk_stream_wmem_schedule(struct sock *sk, int size) { return size = sk-sk_forward_alloc || - sk_stream_mem_schedule(sk, size, 0); + sk_stream_mem_schedule(sk, NULL, size, 0); } /* Used by processes to lock a socket state, so that Index: linux-2.6/net/core/stream.c === --- linux-2.6.orig/net/core/stream.c +++ linux-2.6/net/core/stream.c @@ -207,7 +207,7 @@ void __sk_stream_mem_reclaim(struct sock EXPORT_SYMBOL(__sk_stream_mem_reclaim); -int sk_stream_mem_schedule(struct sock *sk, int size, int kind) +int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb, int size, int kind) { int amt = sk_stream_pages(size); struct proto *prot = sk-sk_prot; @@ -225,7 +225,8 @@ int sk_stream_mem_schedule(struct sock * /* Over hard limit. 
*/
 	if (atomic_read(&prot->memory_allocated) > prot->sysctl_mem[2]) {
 		prot->enter_memory_pressure();
-		goto suppress_allocation;
+		if (!skb || (skb && !skb_emergency(skb)))
+			goto suppress_allocation;
 	}

 	/* Under pressure. */
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
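The sk_stream_mem_schedule() change above exempts emergency skbs from the global hard limit: over the limit, allocation is normally suppressed, but a SOCK_MEMALLOC packet must still get through so userspace can run and drain the buffers. A minimal sketch of that decision, with illustrative names (the real function also handles soft limits, per-socket minimums, and more):

```c
/*
 * Sketch of the hard-limit decision in sk_stream_mem_schedule() above:
 * over the hard limit, suppress the allocation -- unless the skb being
 * charged is an emergency one, which is now exempted.  Names and the
 * simplified signature are assumptions for illustration.
 */
static int mem_schedule_allowed(long allocated, long hard_limit,
				int have_skb, int skb_emergency)
{
	if (allocated > hard_limit) {
		/* enter_memory_pressure() would be signalled here */
		if (!have_skb || !skb_emergency)
			return 0;	/* suppress the allocation */
	}
	return 1;
}
```

Note that the patch's condition `!skb || (skb && !skb_emergency(skb))` simplifies to `!skb || !skb_emergency(skb)`, which is what the sketch uses.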
Re: 2.6.24-rc5-mm1
From: Herbert Xu [EMAIL PROTECTED]
Date: Fri, 14 Dec 2007 10:08:07 +0800

[UDP]: Move udp_stats_in6 into net/ipv4/udp.c

Now that external users may increment the counters directly, we need to ensure that udp_stats_in6 is always available. Otherwise we'd either have to require the external users to be built as modules or ipv6 to be built-in.

This isn't too bad because udp_stats_in6 is just a pair of pointers plus an EXPORT, e.g., just 40 (16 + 24) bytes on x86-64.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Applied.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHES 0/3]: DCCP patches for 2.6.25
From: Arnaldo Carvalho de Melo [EMAIL PROTECTED] Date: Thu, 13 Dec 2007 23:41:59 -0200 Please consider pulling from: master.kernel.org:/pub/scm/linux/kernel/git/acme/net-2.6.25 Pulled, but could you please reformat Gerrit's changelog entries in the future? They have these 80+ long lines which are painful to read in ascii email clients and in terminal output. I'll do this by hand during my next rebase for this case, but I will push back when I see it again in future pull requests. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/4] Updates to nfsroot documentation
From: [EMAIL PROTECTED]
Date: Thu, 13 Dec 2007 16:02:33 -0800

From: Amos Waterland [EMAIL PROTECTED]

The difference between ip=off and ip=::off has been a cause of much confusion. Document how each behaves, and do not contradict ourselves by saying that off is the default when in fact any is the default and is described as being so lower in the file.

Signed-off-by: Amos Waterland [EMAIL PROTECTED]
Cc: Simon Horman [EMAIL PROTECTED]
Cc: Andi Kleen [EMAIL PROTECTED]
Cc: David S. Miller [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]

Applied.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()
From: [EMAIL PROTECTED] Date: Thu, 13 Dec 2007 16:02:36 -0800 From: Andrew Morton [EMAIL PROTECTED] ASSERT_RTNL() uses mutex_trylock(), but it's better to use mutex_is_locked(). Make that change, and remove rtnl_trylock() altogether. (not tested yet!) Cc: David S. Miller [EMAIL PROTECTED] Signed-off-by: Andrew Morton [EMAIL PROTECTED] NACK, as explained please remove this until the replacement doesn't remove valid checks which are done currently. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 3/4] tipc: fix semaphore handling
From: [EMAIL PROTECTED] Date: Thu, 13 Dec 2007 16:02:36 -0800 From: Andrew Morton [EMAIL PROTECTED] As noted by Kevin, tipc's release() does down_interruptible() and ignores the return value. So if signal_pending() we'll end up doing up() on a non-downed semaphore. Fix. Cc: Kevin Winchester [EMAIL PROTECTED] Cc: Per Liden [EMAIL PROTECTED] Cc: Jon Maloy [EMAIL PROTECTED] Cc: Allan Stephens [EMAIL PROTECTED] Cc: David S. Miller [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Signed-off-by: Andrew Morton [EMAIL PROTECTED] This is already in my net-2.6 tree, but thanks for resubmitting anyways :) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 4/4] PPP synchronous tty: convert dead_sem to completion
From: [EMAIL PROTECTED] Date: Thu, 13 Dec 2007 16:02:37 -0800 From: Matthias Kaehlcke [EMAIL PROTECTED] PPP synchronous tty channel driver: convert the semaphore dead_sem to a completion Signed-off-by: Matthias Kaehlcke [EMAIL PROTECTED] Cc: Paul Mackerras [EMAIL PROTECTED] Signed-off-by: Andrew Morton [EMAIL PROTECTED] Applied to net-2.6.25, thanks. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][XFRM] Fix potential race vs xfrm_state(only)_find and xfrm_hash_resize.
From: Pavel Emelyanov [EMAIL PROTECTED] Date: Thu, 13 Dec 2007 13:56:14 +0300 The _find calls calculate the hash value using the xfrm_state_hmask, without the xfrm_state_lock. But the value of this mask can change in the _resize call under the state_lock, so we risk to fail in finding the desired entry in hash. I think, that the hash value is better to calculate under the state lock. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Thanks for the bug fix. I know why I coded it this way, I wanted to give GCC more room to schedule the loads away from the uses in the hash calculation. Once you cram it after the spin lock acquire, it can't load unrelated values earlier to soften the load/use cost on cache misses. Of course it's invalid because the hash mask can change as you noticed. I wish there was a way to conditionally clobber memory, then we could tell GCC exactly what memory objects are protected by the lock and thus help in situations like this so much. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
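The race Pavel fixes comes down to this: the bucket a state hashes to depends on the table's current mask, and a concurrent resize changes that mask (under xfrm_state_lock), so a lookup that read the mask outside the lock can end up scanning the wrong chain. The toy helper below only makes the mask dependence explicit; the name and values are illustrative, though the final `hash & hmask` fold is the same shape the kernel uses.

```c
/*
 * Toy model of the race fixed above: the bucket index is a function of
 * the current hash mask, which xfrm_hash_resize() changes under
 * xfrm_state_lock.  Computing the index with a stale mask (read outside
 * the lock) yields an index into the old, smaller table.
 */
static unsigned int state_hash_bucket(unsigned int hash, unsigned int hmask)
{
	return hash & hmask;	/* same final fold as the kernel's hash */
}
```

The assertions below show that the same hash value lands in different buckets once the mask grows, which is exactly why the mask must be read under the lock that guards resizing.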
[PATCH 2/2] ixgb: enable sun hardware support for broadcom phy
From: Matheos Worku [EMAIL PROTECTED] Implement support for a SUN-specific PHY. SUN provides a modified 82597-based board with their own PHY that works with very little modification to the code. This patch implements this new PHY which is identified by the subvendor device ID. The device ID of the adapter remains the same. Signed-off-by: Matheos Worku [EMAIL PROTECTED] Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED] Signed-off-by: Auke Kok [EMAIL PROTECTED] --- drivers/net/ixgb/ixgb_hw.c | 82 +- drivers/net/ixgb/ixgb_hw.h |3 +- drivers/net/ixgb/ixgb_ids.h |4 ++ drivers/net/ixgb/ixgb_main.c | 10 +++-- 4 files changed, 91 insertions(+), 8 deletions(-) diff --git a/drivers/net/ixgb/ixgb_hw.c b/drivers/net/ixgb/ixgb_hw.c index 2c6367a..80a8b98 100644 --- a/drivers/net/ixgb/ixgb_hw.c +++ b/drivers/net/ixgb/ixgb_hw.c @@ -45,6 +45,8 @@ static boolean_t ixgb_link_reset(struct ixgb_hw *hw); static void ixgb_optics_reset(struct ixgb_hw *hw); +static void ixgb_optics_reset_bcm(struct ixgb_hw *hw); + static ixgb_phy_type ixgb_identify_phy(struct ixgb_hw *hw); static void ixgb_clear_hw_cntrs(struct ixgb_hw *hw); @@ -90,10 +92,20 @@ static uint32_t ixgb_mac_reset(struct ixgb_hw *hw) ASSERT(!(ctrl_reg IXGB_CTRL0_RST)); #endif - if (hw-phy_type == ixgb_phy_type_txn17401) { - ixgb_optics_reset(hw); + if (hw-subsystem_vendor_id == SUN_SUBVENDOR_ID) { + ctrl_reg = /* Enable interrupt from XFP and SerDes */ + IXGB_CTRL1_GPI0_EN | + IXGB_CTRL1_SDP6_DIR | + IXGB_CTRL1_SDP7_DIR | + IXGB_CTRL1_SDP6 | + IXGB_CTRL1_SDP7; + IXGB_WRITE_REG(hw, CTRL1, ctrl_reg); + ixgb_optics_reset_bcm(hw); } + if (hw-phy_type == ixgb_phy_type_txn17401) + ixgb_optics_reset(hw); + return ctrl_reg; } @@ -253,6 +265,10 @@ ixgb_identify_phy(struct ixgb_hw *hw) break; } + /* update phy type for sun specific board */ + if (hw-subsystem_vendor_id == SUN_SUBVENDOR_ID) + phy_type = ixgb_phy_type_bcm; + return (phy_type); } @@ -1225,3 +1241,65 @@ ixgb_optics_reset(struct ixgb_hw *hw) return; } + +/** + * Resets the 
10GbE optics module for Sun variant NIC. + * + * hw - Struct containing variables accessed by shared code + */ + +#define IXGB_BCM8704_USER_PMD_TX_CTRL_REG 0xC803 +#define IXGB_BCM8704_USER_PMD_TX_CTRL_REG_VAL 0x0164 +#define IXGB_BCM8704_USER_CTRL_REG0xC800 +#define IXGB_BCM8704_USER_CTRL_REG_VAL0x7FBF +#define IXGB_BCM8704_USER_DEV3_ADDR 0x0003 +#define IXGB_SUN_PHY_ADDRESS 0x +#define IXGB_SUN_PHY_RESET_DELAY 305 + +static void +ixgb_optics_reset_bcm(struct ixgb_hw *hw) +{ + u32 ctrl = IXGB_READ_REG(hw, CTRL0); + ctrl = ~IXGB_CTRL0_SDP2; + ctrl |= IXGB_CTRL0_SDP3; + IXGB_WRITE_REG(hw, CTRL0, ctrl); + + /* SerDes needs extra delay */ + msleep(IXGB_SUN_PHY_RESET_DELAY); + + /* Broadcom 7408L configuration */ + /* Reference clock config */ + ixgb_write_phy_reg(hw, + IXGB_BCM8704_USER_PMD_TX_CTRL_REG, + IXGB_SUN_PHY_ADDRESS, + IXGB_BCM8704_USER_DEV3_ADDR, + IXGB_BCM8704_USER_PMD_TX_CTRL_REG_VAL); + /* we must read the registers twice */ + ixgb_read_phy_reg(hw, + IXGB_BCM8704_USER_PMD_TX_CTRL_REG, + IXGB_SUN_PHY_ADDRESS, + IXGB_BCM8704_USER_DEV3_ADDR); + ixgb_read_phy_reg(hw, + IXGB_BCM8704_USER_PMD_TX_CTRL_REG, + IXGB_SUN_PHY_ADDRESS, + IXGB_BCM8704_USER_DEV3_ADDR); + + ixgb_write_phy_reg(hw, + IXGB_BCM8704_USER_CTRL_REG, + IXGB_SUN_PHY_ADDRESS, + IXGB_BCM8704_USER_DEV3_ADDR, + IXGB_BCM8704_USER_CTRL_REG_VAL); + ixgb_read_phy_reg(hw, + IXGB_BCM8704_USER_CTRL_REG, + IXGB_SUN_PHY_ADDRESS, + IXGB_BCM8704_USER_DEV3_ADDR); + ixgb_read_phy_reg(hw, + IXGB_BCM8704_USER_CTRL_REG, + IXGB_SUN_PHY_ADDRESS, + IXGB_BCM8704_USER_DEV3_ADDR); + + /* SerDes needs extra delay */ + msleep(IXGB_SUN_PHY_RESET_DELAY); + + return; +} diff --git a/drivers/net/ixgb/ixgb_hw.h b/drivers/net/ixgb/ixgb_hw.h index af56433..f4e0044 100644 ---
[PATCH 1/2] ixgb: make sure jumbos stay enabled after reset
From: Matheos Worku [EMAIL PROTECTED]

Currently a device reset (ethtool -r ethX) would cause the adapter to fall back to regular MTU sizes.

Signed-off-by: Matheos Worku [EMAIL PROTECTED]
Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---
 drivers/net/ixgb/ixgb_main.c | 16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index 3021234..bf9085f 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -320,10 +320,22 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t kill_watchdog)
 void
 ixgb_reset(struct ixgb_adapter *adapter)
 {
+	struct ixgb_hw *hw = &adapter->hw;
 
-	ixgb_adapter_stop(&adapter->hw);
-	if(!ixgb_init_hw(&adapter->hw))
+	ixgb_adapter_stop(hw);
+	if (!ixgb_init_hw(hw))
 		DPRINTK(PROBE, ERR, "ixgb_init_hw failed.\n");
+
+	/* restore frame size information */
+	IXGB_WRITE_REG(hw, MFS, hw->max_frame_size << IXGB_MFS_SHIFT);
+	if (hw->max_frame_size >
+	    IXGB_MAX_ENET_FRAME_SIZE_WITHOUT_FCS + ENET_FCS_LENGTH) {
+		u32 ctrl0 = IXGB_READ_REG(hw, CTRL0);
+		if (!(ctrl0 & IXGB_CTRL0_JFE)) {
+			ctrl0 |= IXGB_CTRL0_JFE;
+			IXGB_WRITE_REG(hw, CTRL0, ctrl0);
+		}
+	}
 }
 
 /**
--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
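The restored logic boils down to one threshold decision: after a reset, re-enable the jumbo-frame bit whenever the configured max frame exceeds a standard Ethernet frame plus FCS. The sketch below isolates that decision; the constant values and the register-bit value are assumptions mirroring the patch, not verified hardware definitions.

```c
/*
 * Sketch of the jumbo-frame decision restored in ixgb_reset() above:
 * set the JFE bit in CTRL0 when the max frame size exceeds a standard
 * 1514-byte frame plus the 4-byte FCS.  Constant values are assumptions
 * for illustration.
 */
#define IXGB_MAX_ENET_FRAME_SIZE_WITHOUT_FCS 1514
#define ENET_FCS_LENGTH 4
#define IXGB_CTRL0_JFE 0x400

static unsigned int restore_jumbo(unsigned int ctrl0,
				  unsigned int max_frame_size)
{
	if (max_frame_size >
	    IXGB_MAX_ENET_FRAME_SIZE_WITHOUT_FCS + ENET_FCS_LENGTH)
		ctrl0 |= IXGB_CTRL0_JFE;	/* jumbo frames back on */
	return ctrl0;
}
```

A standard 1518-byte frame (1514 + FCS) does not trip the bit; a 9000-byte jumbo MTU does.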
[PATCH net-2.6.25] Revert recent TCP work
On Fri, 14 Dec 2007, Ilpo Järvinen wrote:

So, I might soon prepare a revert patch for most of the questionable TCP parts and ask Dave to apply it (and drop them fully during next rebase) unless I suddenly figure something out soon which explains all/most of the problems, then return to drawing board. ...As it seems that the cumulative ACK processing problem discovered later on (having rather cumbersome solution with skbs only) will make part of the work that's currently in net-2.6.25 quite useless/duplicate effort. But thanks anyway for reporting these.

Hi Dave,

Could you either drop my recent patches (+ one fix to them from Herbert Xu == [TCP]: Fix crash in tcp_advance_send_head), all mine after [TCP]: Abstract tp->highest_sack accessing & point to next skb, from net-2.6.25, or just apply the revert from below and do the removal during next rebase. I think it could even be automated by something like this (untested):

for i in $(cat commits | cut -d ' ' -f 1); do git-rebase --onto $i^ $i; done

(I've attached the commits list). I'll resend small bits that are still useful but get removed in this kind of straightforward operation (I guess it's easier for you to track this way and makes conflicts a non-problem).

...It was buggy as well, I've tried to Cc all bug reporters that I've noticed so far... Related bugs include at least these cases:

These are completely removed by this revert:
  __tcp_rb_insert
  (__|)tcp_reset_fack_counts

May still trigger later due to other, genuine bugs:
  tcp_sacktag_one (I'll rework & resend this soon)
  tcp_fastretrans_alert (fackets_out trap)
  BUG_TRAP(packets <= tp->packets_out); in tcp_mark_head_lost

--
 i.

[PATCH net-2.6.25] Revert recent TCP work

It was recently discovered that there's yet another processing aspect to consider related to cumulative ACK processing. This solution wasn't enough to handle that, but (arguably) complex and intrusive changes were still necessary in addition to the complexity this already introduced.
Another approach is on the drawing board. This was somehow buggy as well, a lot of reports against it were filed already :-), but hunting the cause doesn't seem so beneficial anymore. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/skbuff.h |3 - include/linux/tcp.h |4 - include/net/tcp.h| 362 -- net/ipv4/tcp_input.c | 341 --- net/ipv4/tcp_ipv4.c |1 - net/ipv4/tcp_minisocks.c |1 - net/ipv4/tcp_output.c| 13 +- net/ipv6/tcp_ipv6.c |1 - 8 files changed, 196 insertions(+), 530 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index f21fee6..c618fbf 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -18,7 +18,6 @@ #include linux/compiler.h #include linux/time.h #include linux/cache.h -#include linux/rbtree.h #include asm/atomic.h #include asm/types.h @@ -254,8 +253,6 @@ struct sk_buff { struct sk_buff *next; struct sk_buff *prev; - struct rb_node rb; - struct sock *sk; ktime_t tstamp; struct net_device *dev; diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 56342c3..08027f1 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -174,7 +174,6 @@ struct tcp_md5sig { #include linux/skbuff.h #include linux/dmaengine.h -#include linux/rbtree.h #include net/sock.h #include net/inet_connection_sock.h #include net/inet_timewait_sock.h @@ -321,9 +320,6 @@ struct tcp_sock { u32 snd_cwnd_used; u32 snd_cwnd_stamp; - struct rb_root write_queue_rb; - struct rb_root sacked_queue_rb; - struct sk_buff_head sacked_queue; struct sk_buff_head out_of_order_queue; /* Out of order segments go here */ u32 rcv_wnd;/* Current receiver window */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 5e6c433..5ec1cac 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -555,7 +555,6 @@ struct tcp_skb_cb { __u32 seq;/* Starting sequence number */ __u32 end_seq;/* SEQ + FIN + SYN + datalen*/ __u32 when; /* used to compute rtt's*/ - unsigned intfack_count; /* speed up SACK processing */ __u8flags; /* TCP header flags.*/ 
/* NOTE: These must match up to the flags byte in a @@ -1191,112 +1190,29 @@ static inline void tcp_put_md5sig_pool(void) } /* write queue abstraction */ -#define TCP_WQ_SACKED 1 - -static inline struct sk_buff_head *__tcp_list_select(struct sock *sk, const int queue) -{ - if (queue == TCP_WQ_SACKED) - return tcp_sk(sk)-sacked_queue; - else -
Re: [PATCH 8/8] gianfar: Magic Packet and suspend/resume support.
Scott Wood wrote: Signed-off-by: Scott Wood [EMAIL PROTECTED] --- Jeff, can you ack this to go through Paul's tree (assuming nothing wrong with it)? drivers/net/gianfar.c | 137 - drivers/net/gianfar.h | 13 +++- drivers/net/gianfar_ethtool.c | 41 - 3 files changed, 185 insertions(+), 6 deletions(-) ACK -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] sky2: RX lockup fix
Stephen Hemminger wrote: I'm using a Marvell 88E8062 on a custom PPC64 blade and ran into RX lockups while validating the sky2 driver. The receive MAC FIFO would become stuck during testing with high traffic. One port of the 88E8062 would lockup, while the other port remained functional. Re-inserting the sky2 module would not fix the problem - only a power cycle would. I looked over Marvell's most recent sk98lin driver and it looks like they had a workaround for the Yukon XL that the sky2 doesn't have yet. The sk98lin driver disables the RX MAC FIFO flush feature for all revisions of the Yukon XL. According to skgeinit.c of the sk98lin driver, Flushing must be enabled (needed for ASF see dev. #4.29), but the flushing mask should be disabled (see dev. #4.115). Nice. I implemented this same change in the sky2 driver and verified that the RX lockup I was seeing was resolved. Signed-off-by: Peter Tyser [EMAIL PROTECTED] Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] --- Original patch reformatted to remove line wrap. applied #upstream-fixes -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] e100: free IRQ to remove warningwhenrebooting
Auke Kok wrote: Adapted from Ian Wienand [EMAIL PROTECTED] Explicitly free the IRQ before removing the device to remove a warning Destroying IRQ without calling free_irq Signed-off-by: Auke Kok [EMAIL PROTECTED] Cc: Ian Wienand [EMAIL PROTECTED] --- drivers/net/e100.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) applied #upstream-fixes -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [2.6 patch] drivers/net/sis190.c section fix
Adrian Bunk wrote: This patch fixes the following section mismatch with CONFIG_HOTPLUG=n: -- snip -- ... WARNING: vmlinux.o(.init.text.20+0x4cb25): Section mismatch: reference to .exit.text:sis190_mii_remove (between 'sis190_init_one' and 'read_eeprom') ... -- snip -- Signed-off-by: Adrian Bunk [EMAIL PROTECTED] --- 29fae057ba15a552a7cad1e731d3238d567032ba diff --git a/drivers/net/sis190.c b/drivers/net/sis190.c index 7200883..49f767b 100644 --- a/drivers/net/sis190.c +++ b/drivers/net/sis190.c @@ -1381,7 +1381,7 @@ out: return rc; } -static void __devexit sis190_mii_remove(struct net_device *dev) +static void sis190_mii_remove(struct net_device *dev) { struct sis190_private *tp = netdev_priv(dev); applied #upstream-fixes -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [2.6 patch] drivers/net/s2io.c section fixes
Adrian Bunk wrote: Code used by the non-__devinit s2io_open() mustn't be __devinit. This patch fixes the following section mismatch with CONFIG_HOTPLUG=n: -- snip -- ... WARNING: vmlinux.o(.text+0x6f6e3e): Section mismatch: reference to .init.text.20:s2io_test_intr (between 's2io_open' and 's2io_ethtool_sset') ... -- snip -- Signed-off-by: Adrian Bunk [EMAIL PROTECTED] --- drivers/net/s2io.c |4 ++-- applied #upstream-fixes -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NETFILTER] xt_hashlimit : speedups hash_dst()
Eric Dumazet wrote, On 12/14/2007 12:09 PM: ... + /* + * Instead of returning hash % ht->cfg.size (implying a divide) + * we return the high 32 bits of the (hash * ht->cfg.size) that will + * give results between [0 and cfg.size-1] and same hash distribution, + * but using a multiply, less expensive than a divide + */ + return ((u64)hash * ht->cfg.size) >> 32; Are we sure of the same hash distribution? Probably I miss something, but: if this 'hash' is well distributed on 32 bits, and ht->cfg.size is smaller than 32 bits, e.g. 256 (8 bits), then this multiplication moves to the higher 32 of u64 only max. 8 bits of the most significant byte, and the other three bytes are never used, while division is always affected by all four bytes... Regards, Jarek P. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel 2.6.23.8: KERNEL: assertion in net/ipv4/tcp_input.c
On Thu, 13 Dec 2007, Wolfgang Walter wrote: it happened again with your patch applied: WARNING: at net/ipv4/tcp_input.c:1018 tcp_sacktag_write_queue() Call Trace: IRQ [80549290] tcp_sacktag_write_queue+0x7d0/0xa60 [80283869] add_partial+0x19/0x60 [80549ac4] tcp_ack+0x5a4/0x1d70 [8054e625] tcp_rcv_established+0x485/0x7b0 [80554c3d] tcp_v4_do_rcv+0xed/0x3e0 [80556fe7] tcp_v4_rcv+0x947/0x970 [80538c6c] ip_local_deliver+0xac/0x290 [80538862] ip_rcv+0x362/0x6c0 [804fc5d3] netif_receive_skb+0x323/0x420 [8042ab40] tg3_poll+0x630/0xa50 [804fecba] net_rx_action+0x8a/0x140 [8023a269] __do_softirq+0x69/0xe0 [8020d47c] call_softirq+0x1c/0x30 [8020f315] do_softirq+0x35/0x90 [8023a105] irq_exit+0x55/0x60 [8020f3f0] do_IRQ+0x80/0x100 [8020c7d1] ret_from_intr+0x0/0xa EOI ...Yeah, as I suspected, left_out != 0 when sacked_out and lost_out are zero. I'll try to read the code again to see how that could happen (in any case this is annoying at best; no harm is being done other than the message itself). ...If nothing comes up I might ask you to run with another test patch, but it might take a week or so until I've enough time to dig into this fully because I must also become familiar with something as pre-historic as 2.6.23 (there is already a large number of related changes since then, both in the upcoming 2.6.24 and some in net-2.6.25)... :-) Any tweaking done to TCP related sysctls? net/core/somaxconn=2048 net/ipv4/tcp_syncookies=1 net/ipv4/tcp_max_syn_backlog=8192 net/ipv4/tcp_max_tw_buckets=180 net/ipv4/tcp_window_scaling=0 net/ipv4/tcp_timestamps=0 Thanks, these won't be that significant, though timestamps will exclude some possibilities :-). -- i. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 01/10] e1000e: make E1000E default to the same kconfig setting as E1000
[EMAIL PROTECTED] wrote: From: Randy Dunlap [EMAIL PROTECTED] Make E1000E default to the same kconfig setting as E1000, so people's machines don't stop working when they use oldconfig. Signed-off-by: Randy Dunlap [EMAIL PROTECTED] Cc: Jeff Garzik [EMAIL PROTECTED] Cc: Auke Kok [EMAIL PROTECTED] Signed-off-by: Andrew Morton [EMAIL PROTECTED] --- drivers/net/Kconfig | 1 + 1 file changed, 1 insertion(+) diff -puN drivers/net/Kconfig~e1000e-make-e1000e-default-to-the-same-kconfig-setting-as-e1000 drivers/net/Kconfig --- a/drivers/net/Kconfig~e1000e-make-e1000e-default-to-the-same-kconfig-setting-as-e1000 +++ a/drivers/net/Kconfig @@ -1986,6 +1986,7 @@ config E1000_DISABLE_PACKET_SPLIT config E1000E tristate "Intel(R) PRO/1000 PCI-Express Gigabit Ethernet support" depends on PCI + default E1000 I am not inclined to apply this one. This practice, applied over time, will tend to accumulate weird 'default' and 'select' statements. So I think the breakage that occurs is mitigated by two factors: 1) kernel hackers that do their own configs are expected to be able to figure this stuff out. 2) kernel builders (read: distros, mainly) are expected to have put thought into the Kconfig selection and driver migration strategies. PCI IDs move across drivers from time to time, and we don't want to apply these sorts of changes: viewed in the long term, the suggested patch is merely a temporary change to allow kernel experts to more easily deal with the PCI ID migration across drivers. I would prefer simply to communicate to kernel experts and builders about a Kconfig issue that could potentially break their booting/networking... because this patch is only needed if the kernel experts do not already know about a necessary config update. Jeff -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [patch 02/10] forcedeth: power down phy when interface is down
Ed, You mention that the phy will become 100Mbit half duplex, but during nv_close, the phy setting is not modified. This might be a separate issue. Ayaz -Original Message- From: Andrew Morton [mailto:[EMAIL PROTECTED] Sent: Thursday, December 13, 2007 5:07 PM To: Ed Swierk Cc: Ayaz Abdulla; [EMAIL PROTECTED]; netdev@vger.kernel.org Subject: Re: [patch 02/10] forcedeth: power down phy when interface is down On Thu, 13 Dec 2007 16:53:58 -0800 Ed Swierk [EMAIL PROTECTED] wrote: On 12/13/07, Andrew Morton [EMAIL PROTECTED] wrote: Does this patch actually fix any observeable problem? Without the patch, ifconfig down leaves the physical link up, which confuses datacenter users who expect the link lights both on the NIC and the switch to go out when they bring an interface down. Furthermore, even though the phy is powered on, autonegotiation stops working, so a normally gigabit link might suddenly become 100 Mbit half-duplex when the interface goes down, and become gigabit when it comes up again. OK, thanks, I added that text to the changelog along with Ayaz's objection and shall continue to bug people with it until we have a fix merged. --- This email message is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message. --- -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 04/10] ucc_geth-fix-build-break-introduced-by-commit-09f75cd7bf13720738e6a196cc0107ce9a5bd5a0-checkpatch-fixes
[EMAIL PROTECTED] wrote: From: Andrew Morton [EMAIL PROTECTED] Cc: David S. Miller [EMAIL PROTECTED] Cc: Emil Medve [EMAIL PROTECTED] Cc: Jeff Garzik [EMAIL PROTECTED] Cc: Kumar Gala [EMAIL PROTECTED] Cc: Li Yang [EMAIL PROTECTED] Cc: Paul Mackerras [EMAIL PROTECTED] Signed-off-by: Andrew Morton [EMAIL PROTECTED] --- drivers/net/ucc_geth.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -puN drivers/net/ucc_geth.c~ucc_geth-fix-build-break-introduced-by-commit-09f75cd7bf13720738e6a196cc0107ce9a5bd5a0-checkpatch-fixes drivers/net/ucc_geth.c --- a/drivers/net/ucc_geth.c~ucc_geth-fix-build-break-introduced-by-commit-09f75cd7bf13720738e6a196cc0107ce9a5bd5a0-checkpatch-fixes +++ a/drivers/net/ucc_geth.c @@ -3447,7 +3447,7 @@ static int ucc_geth_rx(struct ucc_geth_p u16 length, howmany = 0; u32 bd_status; u8 *bdBuffer; - struct net_device * dev; + struct net_device *dev; ugeth_vdbg(%s: IN, __FUNCTION__); applied this crucial fix to #upstream-fixes with a suitable changelog -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 06/10] Net: ibm_newemac, remove SPIN_LOCK_UNLOCKED
[EMAIL PROTECTED] wrote: From: Jiri Slaby [EMAIL PROTECTED] SPIN_LOCK_UNLOCKED is deprecated, use DEFINE_SPINLOCK instead Signed-off-by: Jiri Slaby [EMAIL PROTECTED] Cc: Jeff Garzik [EMAIL PROTECTED] Signed-off-by: Andrew Morton [EMAIL PROTECTED] --- drivers/net/ibm_newemac/debug.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -puN drivers/net/ibm_newemac/debug.c~net-ibm_newemac-remove-spin_lock_unlocked drivers/net/ibm_newemac/debug.c --- a/drivers/net/ibm_newemac/debug.c~net-ibm_newemac-remove-spin_lock_unlocked +++ a/drivers/net/ibm_newemac/debug.c @@ -21,7 +21,7 @@ #include core.h -static spinlock_t emac_dbg_lock = SPIN_LOCK_UNLOCKED; +static DEFINE_SPINLOCK(emac_dbg_lock); applied #upstream-fixes -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 08/10] net: smc911x: shut up compiler warnings
[EMAIL PROTECTED] wrote: From: Paul Mundt [EMAIL PROTECTED] Trivial fix to shut up gcc. Signed-off-by: Paul Mundt [EMAIL PROTECTED] Cc: Jeff Garzik [EMAIL PROTECTED] Signed-off-by: Andrew Morton [EMAIL PROTECTED] --- drivers/net/smc911x.h |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -puN drivers/net/smc911x.h~net-smc911x-shut-up-compiler-warnings drivers/net/smc911x.h --- a/drivers/net/smc911x.h~net-smc911x-shut-up-compiler-warnings +++ a/drivers/net/smc911x.h @@ -76,7 +76,7 @@ -#if SMC_USE_PXA_DMA +#ifdef SMC_USE_PXA_DMA #define SMC_USE_DMA /* _ applied #upstream-fixes -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHES 0/3]: DCCP patches for 2.6.25
On Fri, Dec 14, 2007 at 11:29:14AM -0800, David Miller wrote: From: Arnaldo Carvalho de Melo [EMAIL PROTECTED] Date: Thu, 13 Dec 2007 23:41:59 -0200 Please consider pulling from: master.kernel.org:/pub/scm/linux/kernel/git/acme/net-2.6.25 Pulled, but could you please reformat Gerrit's changelog entries in the future? They have these 80+ character lines which are painful to read in ascii email clients and in terminal output. I'll do this by hand during my next rebase for this case, but I will push back when I see it again in future pull requests. OK, will take that into account in future requests. Thanks a lot, - Arnaldo -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] HDLC driver: use unregister_netdev instead of unregister_netdevice
Wang Chen [EMAIL PROTECTED] writes: [PATCH] HDLC driver: use unregister_netdev instead of unregister_netdevice Since the caller and the upper caller don't hold the rtnl semaphore, we should use unregister_netdev instead of unregister_netdevice. NAK, not-a-bug. The caller actually holds rtnl; it goes through the netdev core ioctl dispatcher: (unregister_netdevice+0x0/0x24) from (fr_ioctl+0x688/0x75c) /* fr_del_pvc() and fr_add_pvc() optimized out by gcc */ (fr_ioctl+0x0/0x75c) from (hdlc_ioctl+0x4c/0x8c) (hdlc_ioctl+0x0/0x8c) from (hss_ioctl+0x3c/0x324) (hss_ioctl+0x0/0x324) from (dev_ifsioc+0x428/0x4e8) (dev_ifsioc+0x0/0x4e8) from (dev_ioctl+0x5d8/0x664) (dev_ioctl+0x0/0x664) from (sock_ioctl+0x90/0x254) (sock_ioctl+0x0/0x254) from (do_ioctl+0x34/0x78) (do_ioctl+0x0/0x78) from (vfs_ioctl+0x78/0x2a8) (vfs_ioctl+0x0/0x2a8) from (sys_ioctl+0x40/0x64) (sys_ioctl+0x0/0x64) from (ret_fast_syscall+0x0/0x2c) The patch would make it deadlock. Please note that the sister fr_add_pvc() uses register_netdevice(). The same applies to fr_destroy(). -- Krzysztof Halasa -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] [NETDEV] sky2: rtnl_lock out of loop will be faster
Wang Chen wrote: [PATCH 4/4] [NETDEV] sky2: rtnl_lock out of loop will be faster Before this patch, it gets and releases the lock at each iteration of the loop. Changing unregister_netdev to unregister_netdevice and locking outside of the loop will be faster for this approach. Signed-off-by: Wang Chen [EMAIL PROTECTED] --- sky2.c | 4 +++- 1 files changed, 3 insertions(+), 1 deletion(-) --- linux-2.6.24.rc5.org/drivers/net/sky2.c 2007-12-12 10:19:43.0 +0800 +++ linux-2.6.24.rc5/drivers/net/sky2.c 2007-12-12 15:23:37.0 +0800 @@ -4270,8 +4270,10 @@ static void __devexit sky2_remove(struct del_timer_sync(&hw->watchdog_timer); cancel_work_sync(&hw->restart_work); + rtnl_lock(); for (i = hw->ports-1; i >= 0; --i) - unregister_netdev(hw->dev[i]); + unregister_netdevice(hw->dev[i]); + rtnl_unlock(); While true and correct, I don't see the remove path as needing this type of micro-optimization. Removing and shutting down hardware is an operation that can take many seconds (an eternity, to a computer)... a very slow operation. Thus, given that speed is not a priority here, I place more value on smaller, more compact, easily reviewable code -- the existing unpatched code in this case. Jeff -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] ixgb: make sure jumbos stay enabled after reset
Auke Kok wrote: From: Matheos Worku [EMAIL PROTECTED] Currently a device reset (ethtool -r ethX) would cause the adapter to fall back to regular MTU sizes. Signed-off-by: Matheos Worku [EMAIL PROTECTED] Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED] Signed-off-by: Auke Kok [EMAIL PROTECTED] --- drivers/net/ixgb/ixgb_main.c | 16 ++-- 1 files changed, 14 insertions(+), 2 deletions(-) applied #upstream-fixes -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 16/29] netvm: INET reserves.
Hi Peter, sysctl_intvec_fragment, proc_dointvec_fragment, sysctl_intvec_fragment seem to suffer from cut-n-pastitis. Regards, Daniel -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Re: [patch 03/10] forcedeth: fix MAC address detection on network card (regression in 2.6.23)
[EMAIL PROTECTED] wrote: From: Michael Pyne [EMAIL PROTECTED] Partially revert a change to mac address detection introduced to the forcedeth driver. The change was intended to correct mac address detection for newer nVidia chipsets where the mac address was stored in reverse order. One of those chipsets appears to still have the mac address in reverse order (or at least, it does on my system). The change that broke mac address detection for my card was commit ef756b3e56c68a4d76d9d7b9a73fa8f4f739180f forcedeth: mac address correct My network card is an nVidia built-in Ethernet card, output from lspci as follows (with text and numeric ids): $ lspci | grep Ethernet 00:07.0 Bridge: nVidia Corporation MCP61 Ethernet (rev a2) $ lspci -n | grep 07.0 00:07.0 0680: 10de:03ef (rev a2) The vendor id is, of course, nVidia. The device id corresponds to the NVIDIA_NVENET_19 entry. The included patch fixes the MAC address detection on my system. Interestingly, the MAC address appears to be in the range reserved for my motherboard manufacturer (Gigabyte) and not nVidia. Signed-off-by: Michael J. Pyne [EMAIL PROTECTED] Cc: Jeff Garzik [EMAIL PROTECTED] Cc: Ayaz Abdulla [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] On Wed, 21 Nov 2007 15:34:52 -0800 Ayaz Abdulla [EMAIL PROTECTED] wrote: The solution is to get the OEM to update their BIOS (instead of integrating this patch) since the MCP61 specs indicate that the MAC Address should be in correct order from BIOS. By changing the feature DEV_HAS_CORRECT_MACADDR to all MCP61 boards, it could cause it to break on other OEM systems who have implemented it correctly. 
Signed-off-by: Andrew Morton [EMAIL PROTECTED] --- drivers/net/forcedeth.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -puN drivers/net/forcedeth.c~forcedeth-fix-mac-address-detection-on-network-card-regression-in-2623 drivers/net/forcedeth.c --- a/drivers/net/forcedeth.c~forcedeth-fix-mac-address-detection-on-network-card-regression-in-2623 +++ a/drivers/net/forcedeth.c @@ -5551,7 +5551,7 @@ static struct pci_device_id pci_tbl[] = }, { /* MCP61 Ethernet Controller */ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_19), - .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT|DEV_HAS_CORRECT_MACADDR, + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT, As discussed in the thread (and Michael did provide dmidecode output IIRC), one make everybody happy solution is to use a technique similar to that found in drivers/ata/ata_piix.c to match a list of BIOS that have incorrect mac addresses, and clear the feature bit DEV_HAS_CORRECT_MACADDR. I have attached an example patch of this approach -- someone merely needs to take the patch, fill in the blanks, and test it! 
:) Jeff diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c index a96583c..f7aab9b 100644 --- a/drivers/net/forcedeth.c +++ b/drivers/net/forcedeth.c @@ -147,6 +147,7 @@ #include <linux/init.h> #include <linux/if_vlan.h> #include <linux/dma-mapping.h> +#include <linux/dmi.h> #include <asm/irq.h> #include <asm/io.h> @@ -4987,6 +4988,26 @@ static int nv_close(struct net_device *dev) return 0; } +static int have_broken_macaddr(void) +{ + static const struct dmi_system_id brokenmac_sysids[] = { + { + .ident = "blahblah", + .matches = { + DMI_MATCH(DMI_SYS_VENDOR, "MY_VENDOR"), + DMI_MATCH(DMI_PRODUCT_NAME, "blahblah"), + }, + }, + + { } /* terminate list */ + }; + + if (dmi_check_system(brokenmac_sysids)) + return 1; + + return 0; +} + static int __devinit nv_probe(struct pci_dev *pci_dev, const struct pci_device_id *id) { struct net_device *dev; @@ -4997,6 +5018,7 @@ static int __devinit nv_probe(struct pci_dev *pci_dev, const struct pci_device_i u32 powerstate, txreg; u32 phystate_orig = 0, phystate; int phyinitialized = 0; + int broken_macaddr = 0; DECLARE_MAC_BUF(mac); static int printed_version; @@ -5180,10 +5202,14 @@ static int __devinit nv_probe(struct pci_dev *pci_dev, const struct pci_device_i np->orig_mac[0] = readl(base + NvRegMacAddrA); np->orig_mac[1] = readl(base + NvRegMacAddrB); + if (!(id->driver_data & DEV_HAS_CORRECT_MACADDR)) + broken_macaddr = 1; + else if (have_broken_macaddr()) + broken_macaddr = 1; + /* check the workaround bit for correct mac address order */ txreg = readl(base +
Re: [PATCH 00/29] Swap over NFS -v15
Hi Peter, A major feature of this patch set is the network receive deadlock avoidance, but there is quite a bit of stuff bundled with it, the NFS user accounting for a big part of the patch by itself. Is it possible to provide a before and after demonstration case for just the network receive deadlock part, given a subset of the patch set and a user space recipe that anybody can try? Regards, Daniel -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] HDLC driver: use unregister_netdev instead of unregister_netdevice
From: Krzysztof Halasa [EMAIL PROTECTED] Date: Fri, 14 Dec 2007 22:28:07 +0100 Wang Chen [EMAIL PROTECTED] writes: [PATCH] HDLC driver: use unregister_netdev instead of unregister_netdevice Since the caller and the upper caller don't hold the rtnl semaphore, we should use unregister_netdev instead of unregister_netdevice. NAK, not-a-bug. The caller actually holds rtnl; it goes through the netdev core ioctl dispatcher: (unregister_netdevice+0x0/0x24) from (fr_ioctl+0x688/0x75c) /* fr_del_pvc() and fr_add_pvc() optimized out by gcc */ (fr_ioctl+0x0/0x75c) from (hdlc_ioctl+0x4c/0x8c) (hdlc_ioctl+0x0/0x8c) from (hss_ioctl+0x3c/0x324) (hss_ioctl+0x0/0x324) from (dev_ifsioc+0x428/0x4e8) (dev_ifsioc+0x0/0x4e8) from (dev_ioctl+0x5d8/0x664) (dev_ioctl+0x0/0x664) from (sock_ioctl+0x90/0x254) (sock_ioctl+0x0/0x254) from (do_ioctl+0x34/0x78) (do_ioctl+0x0/0x78) from (vfs_ioctl+0x78/0x2a8) (vfs_ioctl+0x0/0x2a8) from (sys_ioctl+0x40/0x64) (sys_ioctl+0x0/0x64) from (ret_fast_syscall+0x0/0x2c) The patch would make it deadlock. Ok, I'll drop this patch, thanks for checking. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NETFILTER] xt_hashlimit : speedups hash_dst()
Jarek Poplawski wrote: Eric Dumazet wrote, On 12/14/2007 12:09 PM: ... + /* + * Instead of returning hash % ht->cfg.size (implying a divide) + * we return the high 32 bits of the (hash * ht->cfg.size) that will + * give results between [0 and cfg.size-1] and same hash distribution, + * but using a multiply, less expensive than a divide + */ + return ((u64)hash * ht->cfg.size) >> 32; Are we sure of the same hash distribution? Probably I miss something, but: if this 'hash' is well distributed on 32 bits, and ht->cfg.size is smaller than 32 bits, e.g. 256 (8 bits), then this multiplication moves to the higher 32 of u64 only max. 8 bits of the most significant byte, and the other three bytes are never used, while division is always affected by all four bytes... Not sure what you are saying... but if size=256, then, yes, we want a final result between 0 and 255, so three bytes are zero. 'size' is the size of the hashtable, it's not a random 32-bit value :) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NETFILTER] xt_hashlimit : speedups hash_dst()
Jarek Poplawski wrote, On 12/14/2007 09:59 PM: Eric Dumazet wrote, On 12/14/2007 12:09 PM: ... + /* + * Instead of returning hash % ht->cfg.size (implying a divide) + * we return the high 32 bits of the (hash * ht->cfg.size) that will + * give results between [0 and cfg.size-1] and same hash distribution, + * but using a multiply, less expensive than a divide + */ + return ((u64)hash * ht->cfg.size) >> 32; Are we sure of the same hash distribution? Probably I miss something, but: if this 'hash' is well distributed on 32 bits, and ht->cfg.size is smaller than 32 bits, e.g. 256 (8 bits), then this multiplication moves to the higher 32 of u64 only max. 8 bits of the most significant byte, and the other three bytes are never used, while division is always affected by all four bytes... OOPS! So, I've missed that this division here is also affected by only one byte, but from the other side - so, almost the same... It seems this could have been replaced with masking from the beginning... Sorry, Jarek P. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/29] mm: kmem_estimate_pages()
On Friday 14 December 2007 07:39, Peter Zijlstra wrote: Provide a method to get the upper bound on the pages needed to allocate a given number of objects from a given kmem_cache. This lays the foundation for a generic reserve framework as presented in a later patch in this series. This framework needs to convert object demand (kmalloc() bytes, kmem_cache_alloc() objects) to pages. Hence the big idea: all reserve accounting can be done in units of pages, allowing the use of a single global reserve that already exists. The other big idea here is that reserve accounting can be independent of the actual resource allocations. This is a powerful idea which we may not have explained clearly yet. Daniel -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)
On Fri, 14 Dec 2007, Andy Gospodarek wrote: On Fri, Dec 14, 2007 at 07:57:42PM +0100, Krzysztof Oledzki wrote: On Fri, 14 Dec 2007, Andy Gospodarek wrote: On Fri, Dec 14, 2007 at 05:14:57PM +0100, Krzysztof Oledzki wrote: On Wed, 12 Dec 2007, Jay Vosburgh wrote: Herbert Xu [EMAIL PROTECTED] wrote: diff -puN drivers/net/bonding/bond_sysfs.c~bonding-locking-fix drivers/net/bonding/bond_sysfs.c --- a/drivers/net/bonding/bond_sysfs.c~bonding-locking-fix +++ a/drivers/net/bonding/bond_sysfs.c @@ -,8 +,6 @@ static ssize_t bonding_store_primary(str out: write_unlock_bh(&bond->lock); - rtnl_unlock(); - Looking at the changeset that added this perhaps the intention is to hold the lock? If so we should add an rtnl_lock to the start of the function. Yes, this function needs to hold locks, and more than just what's there now. I believe the following should be correct; I haven't tested it, though (I'm supposedly on vacation right now). The following change should be correct for the bonding_store_primary case discussed in this thread, and also corrects the bonding_store_active case which performs similar functions. The bond_change_active_slave and bond_select_active_slave functions both require rtnl, bond->lock for read and curr_slave_lock for write_bh, and no other locks. This is so that the lower level mode-specific functions can release locks down to just rtnl in order to call, e.g., dev_set_mac_address with the locks it expects (rtnl only). 
Signed-off-by: Jay Vosburgh [EMAIL PROTECTED] diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c index 11b76b3..28a2d80 100644 --- a/drivers/net/bonding/bond_sysfs.c +++ b/drivers/net/bonding/bond_sysfs.c @@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d, struct slave *slave; struct bonding *bond = to_bond(d); - write_lock_bh(&bond->lock); + rtnl_lock(); + read_lock(&bond->lock); + write_lock_bh(&bond->curr_slave_lock); + if (!USES_PRIMARY(bond->params.mode)) { printk(KERN_INFO DRV_NAME ": %s: Unable to set primary slave; %s is in mode %d\n", @@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d, } } out: - write_unlock_bh(&bond->lock); - + write_unlock_bh(&bond->curr_slave_lock); + read_unlock(&bond->lock); rtnl_unlock(); return count; @@ -1190,7 +1193,8 @@ static ssize_t bonding_store_active_slave(struct device *d, struct bonding *bond = to_bond(d); rtnl_lock(); - write_lock_bh(&bond->lock); + read_lock(&bond->lock); + write_lock_bh(&bond->curr_slave_lock); if (!USES_PRIMARY(bond->params.mode)) { printk(KERN_INFO DRV_NAME @@ -1247,7 +1251,8 @@ static ssize_t bonding_store_active_slave(struct device *d, } } out: - write_unlock_bh(&bond->lock); + write_unlock_bh(&bond->curr_slave_lock); + read_unlock(&bond->lock); rtnl_unlock(); return count; Vanilla 2.6.24-rc5 plus this patch: = [ INFO: possible irq lock inversion dependency detected ] 2.6.24-rc5 #1 - events/0/9 just changed the state of lock: (mc->mca_lock){-+..}, at: [c0411c7a] mld_ifc_timer_expire+0x130/0x1fb but this lock took another, soft-read-irq-unsafe lock in the past: (bond->lock){-.--} and interrupts could create inverse lock ordering between them. Grrr, I should have seen that -- sorry. Try your luck with this instead: CUT No luck. bonding: bond0: setting mode to active-backup (1). bonding: bond0: Setting MII monitoring interval to 100. ADDRCONF(NETDEV_UP): bond0: link is not ready bonding: bond0: Adding slave eth0. 
e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX bonding: bond0: making interface eth0 the new active one. bonding: bond0: first active interface up! bonding: bond0: enslaving eth0 as an active interface with an up link. bonding: bond0: Adding slave eth1. ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready SNIP bonding: bond0: enslaving eth1 as a backup interface with a down link. bonding: bond0: Setting eth0 as primary slave. bond0: no IPv6 routers present Based on the console log, I'm guessing your initialization scripts use sysfs to set eth0 as the primary interface for bond0? Can you confirm? Yep, that's correct: postup() { if [[ ${IFACE} == bond0 ]] ; then echo -n +eth0 > /sys/class/net/${IFACE}/bonding/slaves echo -n +eth1 > /sys/class/net/${IFACE}/bonding/slaves echo -n eth0 > /sys/class/net/${IFACE}/bonding/primary fi } If you did somehow use sysfs to set the primary device as eth0, I'm guessing you never see this issue without that line or without this patch.
Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)
On Fri, Dec 14, 2007 at 07:57:42PM +0100, Krzysztof Oledzki wrote: On Fri, 14 Dec 2007, Andy Gospodarek wrote: On Fri, Dec 14, 2007 at 05:14:57PM +0100, Krzysztof Oledzki wrote: On Wed, 12 Dec 2007, Jay Vosburgh wrote: Herbert Xu [EMAIL PROTECTED] wrote: diff -puN drivers/net/bonding/bond_sysfs.c~bonding-locking-fix drivers/net/bonding/bond_sysfs.c --- a/drivers/net/bonding/bond_sysfs.c~bonding-locking-fix +++ a/drivers/net/bonding/bond_sysfs.c @@ -,8 +,6 @@ static ssize_t bonding_store_primary(str out: write_unlock_bh(&bond->lock); - rtnl_unlock(); - Looking at the changeset that added this perhaps the intention is to hold the lock? If so we should add an rtnl_lock to the start of the function. Yes, this function needs to hold locks, and more than just what's there now. I believe the following should be correct; I haven't tested it, though (I'm supposedly on vacation right now). The following change should be correct for the bonding_store_primary case discussed in this thread, and also corrects the bonding_store_active case which performs similar functions. The bond_change_active_slave and bond_select_active_slave functions both require rtnl, bond->lock for read and curr_slave_lock for write_bh, and no other locks. This is so that the lower level mode-specific functions can release locks down to just rtnl in order to call, e.g., dev_set_mac_address with the locks it expects (rtnl only). 
Signed-off-by: Jay Vosburgh [EMAIL PROTECTED] diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c index 11b76b3..28a2d80 100644 --- a/drivers/net/bonding/bond_sysfs.c +++ b/drivers/net/bonding/bond_sysfs.c @@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d, struct slave *slave; struct bonding *bond = to_bond(d); - write_lock_bh(&bond->lock); + rtnl_lock(); + read_lock(&bond->lock); + write_lock_bh(&bond->curr_slave_lock); if (!USES_PRIMARY(bond->params.mode)) { printk(KERN_INFO DRV_NAME ": %s: Unable to set primary slave; %s is in mode %d\n", @@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d, } } out: - write_unlock_bh(&bond->lock); - + write_unlock_bh(&bond->curr_slave_lock); + read_unlock(&bond->lock); rtnl_unlock(); return count; @@ -1190,7 +1193,8 @@ static ssize_t bonding_store_active_slave(struct device *d, struct bonding *bond = to_bond(d); rtnl_lock(); - write_lock_bh(&bond->lock); + read_lock(&bond->lock); + write_lock_bh(&bond->curr_slave_lock); if (!USES_PRIMARY(bond->params.mode)) { printk(KERN_INFO DRV_NAME @@ -1247,7 +1251,8 @@ static ssize_t bonding_store_active_slave(struct device *d, } } out: - write_unlock_bh(&bond->lock); + write_unlock_bh(&bond->curr_slave_lock); + read_unlock(&bond->lock); rtnl_unlock(); return count; Vanilla 2.6.24-rc5 plus this patch: = [ INFO: possible irq lock inversion dependency detected ] 2.6.24-rc5 #1 - events/0/9 just changed the state of lock: (mc->mca_lock){-+..}, at: [c0411c7a] mld_ifc_timer_expire+0x130/0x1fb but this lock took another, soft-read-irq-unsafe lock in the past: (bond->lock){-.--} and interrupts could create inverse lock ordering between them. Grrr, I should have seen that -- sorry. Try your luck with this instead: CUT No luck. bonding: bond0: setting mode to active-backup (1). bonding: bond0: Setting MII monitoring interval to 100. ADDRCONF(NETDEV_UP): bond0: link is not ready bonding: bond0: Adding slave eth0. 
e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: first active interface up!
bonding: bond0: enslaving eth0 as an active interface with an up link.
bonding: bond0: Adding slave eth1.
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready

SNIP

bonding: bond0: enslaving eth1 as a backup interface with a down link.
bonding: bond0: Setting eth0 as primary slave.
bond0: no IPv6 routers present

Based on the console log, I'm guessing your initialization scripts use sysfs to set eth0 as the primary interface for bond0? Can you confirm? If you did somehow use sysfs to set the primary device as eth0, I'm guessing you never see this issue without that line or without this patch. Please confirm this as well.

Thanks,
-andy

-- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)
On Fri, Dec 14, 2007 at 11:11:15PM +0100, Krzysztof Oledzki wrote:
 On Fri, 14 Dec 2007, Andy Gospodarek wrote:
 On Fri, Dec 14, 2007 at 07:57:42PM +0100, Krzysztof Oledzki wrote:

SNIP (quoted patch, lockdep report and console log trimmed)

 Based on the console log, I'm guessing your initialization scripts use sysfs to set eth0 as the primary interface for bond0? Can you confirm?

Yep, that's correct:

postup() {
	if [[ ${IFACE} == bond0 ]] ; then
		echo -n +eth0 > /sys/class/net/${IFACE}/bonding/slaves
		echo -n +eth1 > /sys/class/net/${IFACE}/bonding/slaves
		echo -n eth0 > /sys/class/net/${IFACE}/bonding/primary
	fi
}

Good. Thanks for the confirmation.

 If you did somehow use sysfs to
Re: [patch 01/10] e1000e: make E1000E default to the same kconfig setting as E1000
On Fri, Dec 14, 2007 at 03:39:26PM -0500, Jeff Garzik wrote:
 [EMAIL PROTECTED] wrote:
 From: Randy Dunlap [EMAIL PROTECTED]
 ...
 So I think the breakage that occurs is mitigated by two factors:
 1) kernel hackers that do their own configs are expected to be able to figure this stuff out.
 2) kernel builders (read: distros, mainly) are expected to have put thought into the Kconfig selection and driver migration strategies.
 ...
 I would prefer simply to communicate to kernel experts and builders about a Kconfig issue that could potentially break their booting/networking... because this patch is only needed if the kernel experts do not already know about a necessary config update.

You miss the vast majority of kconfig users:

3) system administrators etc. who for various reasons compile their own kernels but neither are nor want to be kernel developers

There's a reason why e.g. LPI requires you to be able to compile your own kernel even for getting a Junior Level Linux Professional certificate. Or that one of the authors of Linux Device Drivers has written a book covering only how to build and run your own kernel.

 Jeff

cu
Adrian

-- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed
Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)
On Fri, Dec 14, 2007 at 07:57:42PM +0100, Krzysztof Oledzki wrote:
 On Fri, 14 Dec 2007, Andy Gospodarek wrote:
 On Fri, Dec 14, 2007 at 05:14:57PM +0100, Krzysztof Oledzki wrote:

SNIP (quoted patch and lockdep report trimmed)

 Grrr, I should have seen that -- sorry. Try your luck with this instead:

 CUT

 No luck.

I'm guessing if we go back to using a write-lock for bond->lock this will go back to working again, but I'm not totally convinced, since there are plenty of places where we used a read-lock with it.
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 11b76b3..635b857 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d,
 	struct slave *slave;
 	struct bonding *bond = to_bond(d);
 
+	rtnl_lock();
 	write_lock_bh(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
+
 	if (!USES_PRIMARY(bond->params.mode)) {
 		printk(KERN_INFO DRV_NAME
 		       ": %s: Unable to set primary slave; %s is in mode %d\n",
@@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d,
 		}
 	}
 out:
+	write_unlock_bh(&bond->curr_slave_lock);
 	write_unlock_bh(&bond->lock);
-
 	rtnl_unlock();
 
 	return count;
@@ -1191,6 +1194,7 @@ static ssize_t bonding_store_active_slave(struct device *d,
 	rtnl_lock();
 	write_lock_bh(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
 
 	if (!USES_PRIMARY(bond->params.mode)) {
 		printk(KERN_INFO DRV_NAME
@@ -1247,6 +1251,7 @@ static ssize_t
Re: Packet per Second
On Fri, 2007-12-14 at 15:34 +, Flávio Pires wrote:
 Well, I work on an ISP and we have a linux box acting as a bridge+firewall. With this bridge+firewall we control the packet rate per second from each client and from our repeaters. But I can't measure the packet rate per IP. Is there any tool for this?

The usual approach is to generate NetFlow records -- there are a number of Linux tools for this. Collect them with a collector (flow-tools being a common choice). Then have a Perl script which reads the flow records, processes them whichever way you desire, and drops the result into a rrdtool file (there are modules for both reading the flow-tools data and outputting in the rrdtool format).

The rrdtool utilities have a limited range of graphs, but there is a huge selection of graphing packages from other authors for rrdtool-stored data (Drraw, etc). Flow-tools also has some third-party analysis tools, some of which have good top-talker statistics.

This is a lot of work, since you are really putting a complete measurement infrastructure in place to get the one statistic you desire. But I'd encourage you to do that, since knowing one statistic usually leads to further questions of the data.

-- Glen Turner, Senior Network Engineer
Australia's Academic Research Network  www.aarnet.edu.au
Re: [PATCH 03/29] mm: slb: add knowledge of reserve pages
On Friday 14 December 2007 07:39, Peter Zijlstra wrote:
 Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation contexts that are entitled to it. This is done to ensure reserve pages don't leak out and get consumed.

Tighter definitions of "leak out" and "get consumed" would be helpful here. As I see it, the chain of reasoning is:

* Any MEMALLOC mode allocation must have come from a properly throttled path and has a finite lifetime that must eventually produce writeout progress.

* Since the transaction that made the allocation was throttled and must have a finite lifetime, we know that it must eventually return the resources it consumed to the appropriate resource pool.

Now, I think what you mean by "get consumed" and "leak out" is: become pinned by false sharing with other allocations that do not guarantee that they will be returned to the resource pool. We can say "pinned" for short. So you are attempting to prevent slab pages from becoming pinned by users that do not obey the reserve management rules, which I think your approach achieves. However...

Note that false sharing of slab pages is still possible between two unrelated writeout processes, both of which obey the rules for their own writeout path, but the pinned combination does not. This still leaves a hole through which a deadlock may slip. My original solution was simply to allocate a full page when drawing from the memalloc reserve, which may use a tad more reserve, but makes it possible to prove the algorithm correct.

Regards,
Daniel
[PATCH] [ROSE] ax25_send_frame() called with a constant paclen = 260
Hi,

In rose_link.c, ax25_send_frame() was called with a constant paclen parameter of 260 bytes. This value looked odd to me, for it did not correspond to any defined or computed length. Replacing this value with 0 (zero) allows ax25_send_frame() to substitute the default AX.25 frame size, which in turn had a significant effect on AX.25 frame fragmentation and removed some garbage trailing characters from sent AX.25 frames.

Signed-off-by: Bernard Pidoux [EMAIL PROTECTED]

--- linux-2.6.24-rc5/net/rose/rose_link.c	2007-12-11 04:48:43.0 +0100
+++ b/net/rose/rose_link.c	2007-12-14 14:39:23.0 +0100
@@ -107,7 +107,7 @@
 	else
 		rose_call = &rose_callsign;
 
-	neigh->ax25 = ax25_send_frame(skb, 260, rose_call, &neigh->callsign, neigh->digipeat, neigh->dev);
+	neigh->ax25 = ax25_send_frame(skb, 0, rose_call, &neigh->callsign, neigh->digipeat, neigh->dev);
 
 	return (neigh->ax25 != NULL);
 }