Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()

2007-12-14 Thread Andrew Morton
On Fri, 14 Dec 2007 16:10:44 +0800 Herbert Xu [EMAIL PROTECTED] wrote:

 [EMAIL PROTECTED] wrote:
 
  diff -puN 
  drivers/net/cxgb3/cxgb3_main.c~net-use-mutex_is_locked-for-assert_rtnl 
  drivers/net/cxgb3/cxgb3_main.c
  --- a/drivers/net/cxgb3/cxgb3_main.c~net-use-mutex_is_locked-for-assert_rtnl
  +++ a/drivers/net/cxgb3/cxgb3_main.c
  @@ -2191,7 +2191,7 @@ static void check_t3b2_mac(struct adapte
  {
 int i;
  
  -   if (!rtnl_trylock())/* synchronize with ifdown */
  +   if (rtnl_is_locked())   /* synchronize with ifdown */
 return;
  
 for_each_port(adapter, i) {
  @@ -2219,7 +2219,6 @@ static void check_t3b2_mac(struct adapte
   p->mac.stats.num_resets++;
 }
 }
  -   rtnl_unlock();
 
 This doesn't look right.  It seems that they really want trylock
 here so we should just fix it by removing the bang.

doh.

 Also, does ASSERT_RTNL still warn when someone calls it from an
 atomic context? We definitely don't want to lose that check.

I don't see how it could warn about that.  Nor should it - one might want
to check that rtnl_lock is held inside preempt_disable() or spin_lock or
whatever.

It might make sense to warn if ASSERT_RTNL is called in in_interrupt()
contexts though.



Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()

2007-12-14 Thread Herbert Xu
[EMAIL PROTECTED] wrote:

 diff -puN 
 drivers/net/cxgb3/cxgb3_main.c~net-use-mutex_is_locked-for-assert_rtnl 
 drivers/net/cxgb3/cxgb3_main.c
 --- a/drivers/net/cxgb3/cxgb3_main.c~net-use-mutex_is_locked-for-assert_rtnl
 +++ a/drivers/net/cxgb3/cxgb3_main.c
 @@ -2191,7 +2191,7 @@ static void check_t3b2_mac(struct adapte
 {
int i;
 
 -   if (!rtnl_trylock())/* synchronize with ifdown */
 +   if (rtnl_is_locked())   /* synchronize with ifdown */
return;
 
for_each_port(adapter, i) {
 @@ -2219,7 +2219,6 @@ static void check_t3b2_mac(struct adapte
   p->mac.stats.num_resets++;
}
}
 -   rtnl_unlock();

This doesn't look right.  It seems that they really want trylock
here so we should just fix it by removing the bang.

Also, does ASSERT_RTNL still warn when someone calls it from an
atomic context? We definitely don't want to lose that check.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()

2007-12-14 Thread Herbert Xu
On Fri, Dec 14, 2007 at 12:22:09AM -0800, Andrew Morton wrote:

 I don't see how it could warn about that.  Nor should it - one might want
 to check that rtnl_lock is held inside preempt_disable() or spin_lock or
 whatever.
 
 It might make sense to warn if ASSERT_RTNL is called in in_interrupt()
 contexts though.

Well the paths where ASSERT_RTNL is used should never be in an
atomic context.  In the past it has been quite useful in pointing
out bogus locking practices.

There is currently one path where it's known to warn because of
this and it (promiscuous mode) is on my todo list.

Oh and it only warns when you have mutex debugging enabled.
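(For reference, the trylock-based assertion under discussion looks roughly
like this in the pre-patch tree; sketch from memory, not an exact copy.
The mutex_trylock() buried inside rtnl_trylock() is what the mutex
debugging code complains about when this runs in atomic context:)

#define ASSERT_RTNL()	do { \
	if (unlikely(rtnl_trylock())) { \
		rtnl_unlock(); \
		printk(KERN_ERR "RTNL: assertion failed at %s (%d)\n", \
		       __FILE__, __LINE__); \
		dump_stack(); \
	} \
} while (0)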

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [patch] add tcp congestion control relevant parts

2007-12-14 Thread Michael Kerrisk
Hello Linux networking folk,

I received the patch below for the tcp.7 man page.  Would anybody here be
prepared to review the new material / double check the details?

Cheers,

Michael

 Original Message 
Subject: [patch] add tcp congestion control relevant parts
Date: Wed, 12 Dec 2007 16:40:23 +0100
From: Thomas Egerer [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]

Hello *,

man-pages version : 2.70 from http://www.kernel.org/pub/linux/docs/man-pages/
All required information was obtained by reading the kernel
code/documentation.
I'm not sure whether it is completely bulletproof regarding when the sysctl
variables/socket options first appeared in the kernel, so you might as well
drop this information, but I'm pretty sure about how it works.
Here we go with my patch:

diff -ru man-pages-2.70/man7/tcp.7 man-pages-2.70.new/man7/tcp.7
--- man-pages-2.70/man7/tcp.7   2007-11-24 14:33:34.0 +0100
+++ man-pages-2.70.new/man7/tcp.7   2007-12-12 16:34:52.0 +0100
@@ -177,8 +177,6 @@
 .\" FIXME As at Sept 2006, kernel 2.6.18-rc5, the following are
 .\" not yet documented (shown with default values):
 .\"
-.\" /proc/sys/net/ipv4/tcp_congestion_control (since 2.6.13)
-.\"	bic
 .\" /proc/sys/net/ipv4/tcp_moderate_rcvbuf
 .\"	1
 .\" /proc/sys/net/ipv4/tcp_no_metrics_save
@@ -224,6 +222,20 @@
 are reserved for the application buffer.
 A value of 0
 implies that no amount is reserved.
+.TP
+.BR tcp_allowed_congestion_control \
+" (String; default: cubic reno) (since 2.6.13)"
+Show/set the congestion control choices available to non-privileged
+processes. The list is a subset of those listed in
+.IR tcp_available_congestion_control .
+Default is cubic reno and the default setting
+.RI ( tcp_congestion_control ).
+.TP
+.BR tcp_available_congestion_control \
+" (String; default: cubic reno) (since 2.6.13)"
+Lists the TCP congestion control algorithms available on the system.
+This value can only be changed by loading/unloading the modules responsible
+for congestion control.
 .\"
 .\" The following is from 2.6.12: Documentation/networking/ip-sysctl.txt
 .TP
@@ -257,6 +269,17 @@
 Allows two flows sharing the same connection to converge
 more rapidly.
 .TP
+.BR tcp_congestion_control " (String; default: cubic reno) (since 2.6.13)"
+Determines the congestion control algorithm used for newly created TCP
+sockets. By default Linux uses cubic with reno as fallback. If you want
+to have more control over the algorithm used, you must enable the symbol
+CONFIG_TCP_CONG_ADVANCED in your kernel config.
+You can use
+.BR setsockopt (2)
+to individually change the algorithm on a single socket.
+Requires CAP_NET_ADMIN or the congestion algorithm to be listed in
+.IR tcp_allowed_congestion_control .
+.TP
 .BR tcp_dsack " (Boolean; default: enabled)"
 Enable RFC\ 2883 TCP Duplicate SACK support.
 .TP
@@ -649,7 +672,21 @@
 socket options are valid on TCP sockets.
 For more information see
 .BR ip (7).
-.\ FIXME Document TCP_CONGESTION (new in 2.6.13)
+.TP
+.BR TCP_CONGESTION " (new since kernel version 2.6.13)"
+If set to the name of an available congestion control algorithm,
+it will henceforth be used for the socket. To get a list of
+available congestion control algorithms, consult the sysctl variable
+.IR net.ipv4.tcp_available_congestion_control .
+The algorithm that is used by default for all newly created
+TCP sockets can be viewed/changed via the sysctl variable
+.IR net.ipv4.tcp_congestion_control .
+If you feel you are missing an algorithm in the list,
+you may try to load the corresponding module using
+.BR modprobe (8),
+or if your kernel is built with module autoloading support
+.RI ( CONFIG_KMOD )
+and the algorithm has been compiled as a module, it will be autoloaded.
 .TP
 .B TCP_CORK
 If set, don't send out partial frames.
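(For concreteness, a minimal user-space sketch of the TCP_CONGESTION option
described above; "reno" is only an example and must be permitted by
tcp_allowed_congestion_control:)

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

#ifndef TCP_CONGESTION
#define TCP_CONGESTION 13	/* from linux/tcp.h, for older libc headers */
#endif

int main(void)
{
	char name[16] = "reno";	/* example algorithm */
	socklen_t len = sizeof(name);
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return 1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, strlen(name)) < 0)
		perror("setsockopt(TCP_CONGESTION)");
	/* read back the algorithm actually in use */
	if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, &len) == 0)
		printf("congestion control in use: %s\n", name);
	return 0;
}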


-- 
Michael Kerrisk
Maintainer of the Linux man-pages project
http://www.kernel.org/doc/man-pages/
Want to report a man-pages bug?  Look here:
http://www.kernel.org/doc/man-pages/reporting_bugs.html



Re: [PATCH 1/2] add driver for enc28j60 ethernet chip

2007-12-14 Thread Claudio Lanconelli

Hi Stephen,
thank you for your suggestions.
I already applied trivial fixes, but I have questions on some points, 
see inline.


Stephen Hemminger wrote:

General comments:
  * device driver does no carrier detection. This makes it useless
for bridging, bonding, or any form of failover.

  * use msglevel method (via ethtool) to control debug messages
rather than kernel configuration. This allows enabling debugging
without recompilation which is important in distributions.

  * Please add ethtool support

  * Consider using NAPI

  
Can you point me to a possibly simple driver that uses ethtool and NAPI?
Or another example that I can use for reference.

Maybe the skeleton should be updated.


 * use netdev_priv(netdev) rather than netdev->priv


I can't find where I used netdev->priv; maybe you mean priv->netdev?



My comments:

diff --git a/drivers/net/enc28j60.c b/drivers/net/enc28j60.c
new file mode 100644
index 000..6182473
--- /dev/null
+++ b/drivers/net/enc28j60.c
@@ -0,0 +1,1400 @@
+/*
+ * Microchip ENC28J60 ethernet driver (MAC + PHY)
+ *
+ * Copyright (C) 2007 Eurek srl
+ * Author: Claudio Lanconelli [EMAIL PROTECTED]
+ * based on enc28j60.c written by David Anders for 2.4 kernel version
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * $Id: enc28j60.c,v 1.10 2007/12/10 16:59:37 claudio Exp $
+ */
+
+#include <linux/autoconf.h>

Use msglvl instead; see netdevice.h
  

Ok

+
+#if CONFIG_ENC28J60_DBGLEVEL > 1
+# define VERBOSE_DEBUG
+#endif
+#if CONFIG_ENC28J60_DBGLEVEL > 0
+# define DEBUG
+#endif
+

...
+
+#define MY_TX_TIMEOUT  ((500*HZ)/1000)

That is a really short TX timeout; it should be at least 2 seconds, not 1/2 sec.
Having it less than a second causes increased wakeups.
  

Ok

+
+/* Max TX retries in case of collision as suggested by errata datasheet */
+#define MAX_TX_RETRYCOUNT  16
+
+/* Driver local data */
+struct enc28j60_net_local {

Rename to something shorter, like enc28j60_net or just enc28j60?
  

Ok, renamed enc28j60_net

+   struct net_device_stats stats;

net_device_stats are now in net_device.

+   struct net_device *netdev;
+   struct spi_device *spi;
+   struct semaphore semlock;   /* protect spi_transfer_buf */
Use mutex (or spin_lock) rather than semaphore
  

Ok

+   uint8_t *spi_transfer_buf;
+   struct sk_buff *tx_skb;
+   struct work_struct tx_work;
+   struct work_struct irq_work;

Not sure why you need to have workqueues for
tx_work and irq_work, rather than using a spin_lock
and doing the work directly.
  

I need irq_work for sure because it needs to go to sleep. Any
access to enc28j60 registers goes through a blocking SPI transaction,
spi_sync().
I'm not sure if hard_start_xmit() can go to sleep, so I used a work
queue for tx too.

+   int bank;   /* current register bank selected */
bank is really unsigned.

+   uint16_t next_pk_ptr;   /* next packet pointer within FIFO */
+   int max_pk_counter; /* statistics: max packet counter */
+   int tx_retry_count;
these are used as unsigned.

+   int hw_enable;
+};
+
+/* Selects Full duplex vs. Half duplex mode */
+static int full_duplex = 0;

Use ethtool for this.
  

Ok

+
+static int enc28j60_send_packet(struct sk_buff *skb, struct net_device *dev);
+static int enc28j60_net_close(struct net_device *dev);
+static struct net_device_stats *enc28j60_net_get_stats(struct net_device *dev);
+static void enc28j60_set_multicast_list(struct net_device *dev);
+static void enc28j60_net_tx_timeout(struct net_device *ndev);
+
+static int enc28j60_chipset_init(struct net_device *dev);
+static void enc28j60_hw_disable(struct enc28j60_net_local *priv);
+static void enc28j60_hw_enable(struct enc28j60_net_local *priv);
+static void enc28j60_hw_rx(struct enc28j60_net_local *priv);
+static void enc28j60_hw_tx(struct enc28j60_net_local *priv);

If you order functions correctly in code, you don't have to waste lots
of space with all these forward declarations.

...
  

Ok

+   const char *msg);
+
+/*
+ * SPI read buffer
+ * wait for the SPI transfer and copy received data to destination
+ */
+static int
+spi_read_buf(struct enc28j60_net_local *priv, int len, uint8_t *data)
+{
+   uint8_t *rx_buf;
+   uint8_t *tx_buf;
+   struct spi_transfer t;
+   struct spi_message msg;
+   int ret, slen;
+
+   slen = 1;
+   memset(&t, 0, sizeof(t));
+   t.tx_buf = tx_buf = priv->spi_transfer_buf;
+   t.rx_buf = rx_buf = priv->spi_transfer_buf + 4;
+   t.len = slen + len;

If you use a structure initializer you can avoid having to do
the memset
  

Ok
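(For the transfer being built above, the initializer form would look roughly
like this; sketch only:)

	struct spi_transfer t = {
		.tx_buf = priv->spi_transfer_buf,
		.rx_buf = priv->spi_transfer_buf + 4,
		.len	= slen + len,
	};	/* remaining fields are implicitly zeroed, so no memset() is needed */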


+
+   down(&priv->semlock);
+   tx_buf[0] = ENC28J60_READ_BUF_MEM;
+   tx_buf[1] = tx_buf[2] = tx_buf[3] = 0;  /* don't care 

Re: [NETFILTER] xt_hashlimit : speedups hash_dst()

2007-12-14 Thread Patrick McHardy

Eric Dumazet wrote:

1) Using jhash2() instead of jhash() is a little bit faster where applicable.

2) Thanks to jhash, the hash value uses the full 32 bits.
   Instead of returning hash % size (implying a divide)
   we return the high 32 bits of (hash * size), which gives
   results in [0, size-1] with the same hash distribution.

  On most cpus, a multiply is less expensive than a divide, by an order
  of magnitude.



Clever :) Applied, thanks Eric.



[patch] authorize some users to bind on specifics priv ports

2007-12-14 Thread Arnauld Michelizza


Much simpler and more usable than POSIX capabilities, and I think it
covers the basic needs of sysadmins... at least, it covers mine :-)


  www-data$ nc -l -p 80 -v
  Can't grab 0.0.0.0:80 with bind : Permission denied

  root# id -u www-data
  33
  root# port_acl_set +80 www-data
  root# cat /proc/net/port_acl
  80: 33

  www-data$ nc -l -p 80 -v
  listening on [any] 80 ...



diff -r --unidirectional-new-file -u 
linux-2.6.23/arch/i386/kernel/syscall_table.S 
linux-2.6.23-patched/arch/i386/kernel/syscall_table.S
--- linux-2.6.23/arch/i386/kernel/syscall_table.S   2007-10-09 
22:31:38.0 +0200
+++ linux-2.6.23-patched/arch/i386/kernel/syscall_table.S   2007-12-13 
14:29:40.0 +0100
@@ -324,3 +324,4 @@
.long sys_timerfd
.long sys_eventfd
.long sys_fallocate
+   .long sys_port_acl_set  /* 325 */
diff -r --unidirectional-new-file -u linux-2.6.23/include/asm-i386/unistd.h 
linux-2.6.23-patched/include/asm-i386/unistd.h
--- linux-2.6.23/include/asm-i386/unistd.h  2007-10-09 22:31:38.0 
+0200
+++ linux-2.6.23-patched/include/asm-i386/unistd.h  2007-12-13 
14:29:40.0 +0100
@@ -330,10 +330,11 @@
 #define __NR_timerfd   322
 #define __NR_eventfd   323
 #define __NR_fallocate 324
+#define __NR_port_acl_set  325

 #ifdef __KERNEL__

-#define NR_syscalls 325
+#define NR_syscalls 326

 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff -r --unidirectional-new-file -u linux-2.6.23/include/net/port_acl.h 
linux-2.6.23-patched/include/net/port_acl.h
--- linux-2.6.23/include/net/port_acl.h 1970-01-01 01:00:00.0 +0100
+++ linux-2.6.23-patched/include/net/port_acl.h 2007-12-13 15:17:40.0 
+0100
@@ -0,0 +1,15 @@
+#include <linux/types.h>
+
+struct port_acl {
+   uid_t   uid;
+   struct port_acl *next;
+};
+
+#ifdef __PORT_ACL__
+   struct port_acl *port_acl_list[1024];
+#else
+   extern struct port_acl *port_acl_list[1024];
+#endif
+
+extern int port_acl(short int);
+extern int port_acl_get_info(char *, char **, off_t, int); 
diff -r --unidirectional-new-file -u linux-2.6.23/kernel/sys.c linux-2.6.23-patched/kernel/sys.c

--- linux-2.6.23/kernel/sys.c   2007-10-09 22:31:38.0 +0200
+++ linux-2.6.23-patched/kernel/sys.c   2007-12-13 16:01:24.0 +0100
@@ -43,6 +43,9 @@
 #include <asm/io.h>
 #include <asm/unistd.h>

+#define __PORT_ACL__
+#include <net/port_acl.h>
+
 #ifndef SET_UNALIGN_CTL
 # define SET_UNALIGN_CTL(a,b)  (-EINVAL)
 #endif
@@ -2356,4 +2359,89 @@

return ret;
 }
+
+/*
+ * The following lines were added to implement the port_acl security
+ * mechanism:
+ * port_acl_add - grant a user authorisation to access a particular port
+ * port_acl_remove - revoke that authorisation from a user
+ * sys_port_acl_set - front end for port_acl_add and port_acl_remove
+ */
+long port_acl_add(short int snum, uid_t uid)
+{
+   struct port_acl *ptr, *new;
+
+   /* check whether the permission is already set for that user */
+   ptr = port_acl_list[snum];
+
+   while (ptr != NULL) {
+   if (ptr->uid == uid)
+   return -EBUSY;
+   if (ptr->next == NULL)
+   break;
+   ptr = ptr->next;
+   }
+
+   /* ok, we haven't found the user, and ptr now points to the
+   last structure */
+   new = kmalloc(sizeof(struct port_acl), GFP_KERNEL);
+   if (new == NULL)
+   return -ENOMEM;
+   new->next = NULL;
+   new->uid = uid;
+
+   if (ptr == NULL)
+   port_acl_list[snum] = new;
+   else
+   ptr->next = new;
+
+   return 0;
+}
+
+long port_acl_remove(short int snum, uid_t uid)
+{
+   struct port_acl *ptr, *prev = NULL;
+
+   /* check whether the permission is already set for that user */
+   ptr = port_acl_list[snum];
+
+   while (ptr != NULL) {
+   /* we found the user */
+   if (ptr->uid == uid) {
+   if (ptr == port_acl_list[snum]) {
+   port_acl_list[snum] = ptr->next;
+   }
+   else {
+   prev->next = ptr->next;
+   }
+   kfree(ptr);
+   return 0;
+   }
+   prev = ptr;
+   ptr = ptr->next;
+   }
+
+   return -ENODATA;
+}
+
+asmlinkage long sys_port_acl_set(short int snum, uid_t uid, int act)
+{
+   /* the owner of the process must be root */
+   if (current->uid != 0)
+   return -EACCES;
+
+   /* we verify that the port is valid */
+   if (snum < 0 || snum > 1023)
+   return -EINVAL;
+
+   if (uid < 1 || uid > 65534)
+   return -EINVAL;
+
+   if (act == 0)
+   return port_acl_remove(snum, uid);
+   else if (act == 1)
+   return port_acl_add(snum, uid);
+   else
+   return -EPERM;
+}
+
+
 

[NETFILTER] xt_hashlimit : speedups hash_dst()

2007-12-14 Thread Eric Dumazet

1) Using jhash2() instead of jhash() is a little bit faster where applicable.

2) Thanks to jhash, the hash value uses the full 32 bits.
   Instead of returning hash % size (implying a divide)
   we return the high 32 bits of (hash * size), which gives
   results in [0, size-1] with the same hash distribution.

  On most cpus, a multiply is less expensive than a divide, by an order
  of magnitude.
 
Signed-off-by: Eric Dumazet [EMAIL PROTECTED]



diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 033d448..7cc04e8 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -105,7 +105,16 @@ static inline bool dst_cmp(const struct dsthash_ent *ent,
 static u_int32_t
 hash_dst(const struct xt_hashlimit_htable *ht, const struct dsthash_dst *dst)
 {
-   return jhash(dst, sizeof(*dst), ht->rnd) % ht->cfg.size;
+   u_int32_t hash = jhash2((const u32 *)dst,
+   sizeof(*dst)/sizeof(u32),
+   ht->rnd);
+   /*
+* Instead of returning hash % ht->cfg.size (implying a divide)
+* we return the high 32 bits of (hash * ht->cfg.size) that will
+* give results between [0 and cfg.size-1] and same hash distribution,
+* but using a multiply, less expensive than a divide
+*/
+   return ((u64)hash * ht->cfg.size) >> 32;
 }
 
 static struct dsthash_ent *
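(A quick worked example of the multiply-high reduction, with illustrative
numbers:)

	u32 hash = 0x80000000;			/* halfway through the 32-bit hash range */
	u32 size = 1000;			/* table size, need not be a power of two */
	u32 idx  = ((u64)hash * size) >> 32;	/* == 500, halfway through the table */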


Re: [PATCH] PS3: gelic: Add wireless support for PS3

2007-12-14 Thread Dan Williams
On Fri, 2007-12-14 at 14:03 +0900, Masakazu Mokuno wrote:
 On Thu, 13 Dec 2007 16:13:38 -0500
 Dan Williams [EMAIL PROTECTED] wrote:
 
  One more question; does the driver work with wpa_supplicant for WPA, or
  does the firmware capture the EAPOL frames and handle the 4 way
  handshake internally?  Ideally the firmware would have the ability to
  pass those frames up unmodified so the driver would at least have a
  _hope_ of 802.1x capability.  Does the firmware handle Dynamic WEP at
  all?
  
  Basically, what happens when the AP you've just associated with starts
  sending you EAPOL traffic to start the 802.1x process?
 
 The PS3 wireless device does the association and 4-way handshake in its
 firmware/hypervisor.  No intervention by the guest OSes is allowed.
 All frames sent/received before the connection process has completed
 seem to be dropped by the hardware.  Only static WEP is supported.

That sort of sucks; but I guess there's not too much you can do about
it.  That probably means that using wpa_supplicant + WPA is completely
out of the picture, which unfortunately makes the PS3 wireless unlike
any other card and would require special-casing the PS3 in userspace
tools.

Dan




Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()

2007-12-14 Thread Johannes Berg

  I agree with this. IIRC I removed some ASSERT_RTNL()s in the wireless
  code (or maybe it was only during testing patches) where we had a
  function that required only the rtnl to be held but in certain contexts
  was called from within an RCU section.
 
 Please point me to the actual code so I can see if this is legit
 or not.

I don't think I have that case any more since now my interface list is
either protected by RCU or the rtnl.

johannes




Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()

2007-12-14 Thread Herbert Xu
On Fri, Dec 14, 2007 at 01:37:40PM +0100, Johannes Berg wrote:
 
 I agree with this. IIRC I removed some ASSERT_RTNL()s in the wireless
 code (or maybe it was only during testing patches) where we had a
 function that required only the rtnl to be held but in certain contexts
 was called from within an RCU section.

Please point me to the actual code so I can see if this is legit
or not.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: 2.6.24-rc5-mm1

2007-12-14 Thread Dhaval Giani
Hi Andrew,

I hit this just now. Not sure if I can reproduce it though.

WARNING: at net/ipv4/tcp_input.c:2533 tcp_fastretrans_alert()
Pid: 4624, comm: yield Not tainted 2.6.24-rc5-mm1 #5
 [c010582a] show_trace_log_lvl+0x12/0x22
 [c0105847] show_trace+0xd/0xf
 [c0105959] dump_stack+0x57/0x5e
 [c03db95b] tcp_fastretrans_alert+0xde/0x5bd
 [c03dcab2] tcp_ack+0x236/0x2e4
 [c03dea01] tcp_rcv_established+0x51e/0x5c0
 [c03e56f1] tcp_v4_do_rcv+0x22/0xc4
 [c03e5c49] tcp_v4_rcv+0x4b6/0x7f5
 [c03cd5ad] ip_local_deliver_finish+0xb9/0x169
 [c03cd68a] ip_local_deliver+0x2d/0x34
 [c03cd91d] ip_rcv_finish+0x28c/0x2ab
 [c03cdb16] ip_rcv+0x1da/0x204
 [c03b800a] netif_receive_skb+0x23c/0x26f
 [c02db326] tg3_rx+0x246/0x353
 [c02db4ac] tg3_poll_work+0x79/0x86
 [c02db4e8] tg3_poll+0x2f/0x16f
 [c03b822b] net_rx_action+0xbb/0x1a8
 [c0129596] __do_softirq+0x73/0xe6
 [c0129642] do_softirq+0x39/0x51
 [c01296c0] irq_exit+0x47/0x49
 [c01064f4] do_IRQ+0x55/0x69
 [c0105492] common_interrupt+0x2e/0x34
 ===

-- 
regards,
Dhaval


Re: [patch] add tcp congestion control relevant parts

2007-12-14 Thread Stephen Hemminger
On Fri, 14 Dec 2007 09:48:32 +0100
Michael Kerrisk [EMAIL PROTECTED] wrote:

 Hello Linux networking folk,
 
 I received the patch below for the tcp.7 man page.  Would anybody here be
 prepared to review the new material / double check the details?
 
 Cheers,
 
 Michael
 
  Original Message 
 Subject: [patch] add tcp congestion control relevant parts
 Date: Wed, 12 Dec 2007 16:40:23 +0100
 From: Thomas Egerer [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 CC: [EMAIL PROTECTED]
 
 Hello *,
 
 man-pages version : 2.70 from http://www.kernel.org/pub/linux/docs/man-pages/
 All required information was obtained by reading the kernel
 code/documentation.
 I'm not sure whether it is completely bulletproof regarding when the sysctl
 variables/socket options first appeared in the kernel, so you might as well
 drop this information, but I'm pretty sure about how it works.
 Here we go with my patch:
 
 diff -ru man-pages-2.70/man7/tcp.7 man-pages-2.70.new/man7/tcp.7
 --- man-pages-2.70/man7/tcp.7   2007-11-24 14:33:34.0 +0100
 +++ man-pages-2.70.new/man7/tcp.7   2007-12-12 16:34:52.0 +0100
 @@ -177,8 +177,6 @@
  .\" FIXME As at Sept 2006, kernel 2.6.18-rc5, the following are
  .\" not yet documented (shown with default values):
  .\"
 -.\" /proc/sys/net/ipv4/tcp_congestion_control (since 2.6.13)
 -.\"	bic
  .\" /proc/sys/net/ipv4/tcp_moderate_rcvbuf
  .\"	1
  .\" /proc/sys/net/ipv4/tcp_no_metrics_save
 @@ -224,6 +222,20 @@
  are reserved for the application buffer.
  A value of 0
  implies that no amount is reserved.
 +.TP
 +.BR tcp_allowed_congestion_control \
 +" (String; default: cubic reno) (since 2.6.13)"
 +Show/set the congestion control choices available to non-privileged
 +processes. The list is a subset of those listed in
 +.IR tcp_available_congestion_control .
 +Default is cubic reno and the default setting
 +.RI ( tcp_congestion_control ).
 +.TP
 +.BR tcp_available_congestion_control \
 +" (String; default: cubic reno) (since 2.6.13)"
 +Lists the TCP congestion control algorithms available on the system.
 +This value can only be changed by loading/unloading the modules responsible
 +for congestion control.
  .\"
  .\" The following is from 2.6.12: Documentation/networking/ip-sysctl.txt
  .TP
 @@ -257,6 +269,17 @@
  Allows two flows sharing the same connection to converge
  more rapidly.
  .TP
 +.BR tcp_congestion_control " (String; default: cubic reno) (since 2.6.13)"
 +Determines the congestion control algorithm used for newly created TCP
 +sockets. By default Linux uses cubic with reno as fallback. If you want
 +to have more control over the algorithm used, you must enable the symbol
 +CONFIG_TCP_CONG_ADVANCED in your kernel config.

You can choose the default congestion control as well as part of the kernel
configuration.
 


-- 
Stephen Hemminger [EMAIL PROTECTED]


Packet per Second

2007-12-14 Thread Flávio Pires
Hi all, 

It's my first time using usenet...

Well, I work on an ISP and we have a linux box acting as a
bridge+firewall. With this bridge+firewall we control the packet rate
per second from each client and from our repeaters. But I can't
measure the packet rate per IP. Is there any tool for this?

Actually, what I want is to measure the packet rate per IP and
generate graphics with mrtg or rrdtool, but for this I must have the
number of packets per second of each client :)

Thank you all
--
Flávio


-- 

I'm trying a new usenet client for Mac, Nemo OS X.
You can download it at http://www.malcom-mac.com/nemo




Re: [RFC] mac80211: clean up frame receive handling

2007-12-14 Thread Johannes Berg

 Is there any way for an user space application to figure out whether a
 received EAPOL frame was encrypted? In theory, WPA/WPA2 Authenticators
 (e.g., hostapd) should verify that the frame was encrypted if pairwise
 keys are set (whereas IEEE 802.1X Authenticator accepts unencrypted
 EAPOL frames).

Unfortunately not. Does that really matter? It seems the choice would
either be to always require encryption when pairwise keys are in use
(which this patch doesn't do right now but could trivially be done) or
to simply not care, since it doesn't really matter.

 Did you/someone already verify that the Linux bridge code does not
 bridge EAPOL frames? The use of a separate interface for this removed
 the need for doing such filtering based on ethertype, but with EAPOL
 frames using the same netdev with other data frames, the bridge code
 should filter these out (mainly the PAE group addressed ones, but if I
 remember correctly, IEEE 802.1X specified all frames using EAPOL
 ethertype not to be bridged).

Actually, 802.1X doesn't specify that, as I said previously it
*recommends* it in C.3.3 (not C.1.1 as the 802.11 specs lead you to
believe). Also, a patch to do this was rejected by Stephen Hemminger, so
I decided to only pass up EAPOL frames that are either for our own
unicast address or the link-local eapol address, both of which won't be
bridged.

 I haven't looked into the current implementations and/or proposed
 patches on for TX part, but I would assume that it is possible to select
 whether an EAPOL frame will be encrypted when injecting it(?).

Yes, by setting the F_WEP flag on any frame you decide whether it will
be encrypted (if possible) or not. Right now, the corresponding hostapd
patch always sets that flag.

johannes




Re: [RFC] mac80211: clean up frame receive handling

2007-12-14 Thread Johannes Berg

  +static bool ieee80211_frame_allowed(struct ieee80211_txrx_data *rx)
  +{
  +   static const u8 pae_group_addr[ETH_ALEN]
  +   = { 0x01, 0x80, 0xC2, 0x00, 0x00, 0x03 };
  +   struct ethhdr *ehdr = (struct ethhdr *)rx->skb->data;
  +
  +   if (rx->skb->protocol == htons(ETH_P_PAE) &&
  +   (compare_ether_addr(ehdr->h_dest, pae_group_addr) == 0 ||
  +compare_ether_addr(ehdr->h_dest, rx->dev->dev_addr) == 0))
  +   return true;
 
 Should you reverse these two compare_ether_addr calls?
 rx->dev->dev_addr seems more likely for any given packet.  It probably
 makes little difference but it seems like checking for that first
 would still be better.

I think in theory all eapol frames are sent to the PAE group address,
but I have no idea which of the checks would be more efficient. It seems
that the first could be optimised a lot because it's constant too...

johannes




Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()

2007-12-14 Thread Johannes Berg

 I don't see how it could warn about that.  Nor should it - one might want
 to check that rtnl_lock is held inside preempt_disable() or spin_lock or
 whatever.

I agree with this. IIRC I removed some ASSERT_RTNL()s in the wireless
code (or maybe it was only during testing patches) where we had a
function that required only the rtnl to be held but in certain contexts
was called from within an RCU section.

johannes




Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)

2007-12-14 Thread Krzysztof Oledzki



On Wed, 12 Dec 2007, Jay Vosburgh wrote:


Herbert Xu [EMAIL PROTECTED] wrote:


diff -puN drivers/net/bonding/bond_sysfs.c~bonding-locking-fix 
drivers/net/bonding/bond_sysfs.c
--- a/drivers/net/bonding/bond_sysfs.c~bonding-locking-fix
+++ a/drivers/net/bonding/bond_sysfs.c
@@ -,8 +,6 @@ static ssize_t bonding_store_primary(str
out:
   write_unlock_bh(&bond->lock);

-   rtnl_unlock();
-


Looking at the changeset that added this, perhaps the intention
is to hold the lock? If so we should add an rtnl_lock to the start
of the function.


Yes, this function needs to hold locks, and more than just
what's there now.  I believe the following should be correct; I haven't
tested it, though (I'm supposedly on vacation right now).

The following change should be correct for the
bonding_store_primary case discussed in this thread, and also corrects
the bonding_store_active case which performs similar functions.

The bond_change_active_slave and bond_select_active_slave
functions both require rtnl, bond-lock for read and curr_slave_lock for
write_bh, and no other locks.  This is so that the lower level
mode-specific functions can release locks down to just rtnl in order to
call, e.g., dev_set_mac_address with the locks it expects (rtnl only).

Signed-off-by: Jay Vosburgh [EMAIL PROTECTED]

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 11b76b3..28a2d80 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d,
struct slave *slave;
struct bonding *bond = to_bond(d);

-   write_lock_bh(&bond->lock);
+   rtnl_lock();
+   read_lock(&bond->lock);
+   write_lock_bh(&bond->curr_slave_lock);
+
	if (!USES_PRIMARY(bond->params.mode)) {
printk(KERN_INFO DRV_NAME
   : %s: Unable to set primary slave; %s is in mode %d\n,
@@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d,
}
}
out:
-   write_unlock_bh(&bond->lock);
-
+   write_unlock_bh(&bond->curr_slave_lock);
+   read_unlock(&bond->lock);
rtnl_unlock();

return count;
@@ -1190,7 +1193,8 @@ static ssize_t bonding_store_active_slave(struct device 
*d,
struct bonding *bond = to_bond(d);

rtnl_lock();
-   write_lock_bh(&bond->lock);
+   read_lock(&bond->lock);
+   write_lock_bh(&bond->curr_slave_lock);

	if (!USES_PRIMARY(bond->params.mode)) {
printk(KERN_INFO DRV_NAME
@@ -1247,7 +1251,8 @@ static ssize_t bonding_store_active_slave(struct device 
*d,
}
}
out:
-   write_unlock_bh(&bond->lock);
+   write_unlock_bh(&bond->curr_slave_lock);
+   read_unlock(&bond->lock);
rtnl_unlock();

return count;


Vanilla 2.6.24-rc5 plus this patch:

=
[ INFO: possible irq lock inversion dependency detected ]
2.6.24-rc5 #1
-
events/0/9 just changed the state of lock:
 (&mc->mca_lock){-+..}, at: [c0411c7a] mld_ifc_timer_expire+0x130/0x1fb
but this lock took another, soft-read-irq-unsafe lock in the past:
 (&bond->lock){-.--}

and interrupts could create inverse lock ordering between them.


other info that might help us debug this:
4 locks held by events/0/9:
 #0:  (events){--..}, at: [c0133c57] run_workqueue+0x87/0x1b6
 #1:  ((linkwatch_work).work){--..}, at: [c0133c57] 
run_workqueue+0x87/0x1b6

 #2:  (rtnl_mutex){--..}, at: [c03abd50] linkwatch_event+0x5/0x22
 #3:  (&ndev->lock){-.-+}, at: [c0411b61] 
mld_ifc_timer_expire+0x17/0x1fb


the first lock's dependencies:
- (&mc->mca_lock){-+..} ops: 10 {
   initial-use  at:
[c0104ee2] dump_trace+0x83/0x8d
[c014289c] __lock_acquire+0x4ba/0xc07
[c0109ef2] save_stack_trace+0x20/0x3a
[c0142fa1] __lock_acquire+0xbbf/0xc07
[c0412452] ipv6_dev_mc_inc+0x24d/0x31c
[c0143062] lock_acquire+0x79/0x93
[c04120d6] igmp6_group_added+0x18/0x11d
[c0439d62] _spin_lock_bh+0x3b/0x64
[c04120d6] igmp6_group_added+0x18/0x11d
[c04120d6] igmp6_group_added+0x18/0x11d
[c0141f9f] trace_hardirqs_on+0x122/0x14c
[c04124a8] ipv6_dev_mc_inc+0x2a3/0x31c
[c0412452] ipv6_dev_mc_inc+0x24d/0x31c
[c04124dd] ipv6_dev_mc_inc+0x2d8/0x31c
[c0412205] ipv6_dev_mc_inc+0x0/0x31c
[c0401834] ipv6_add_dev+0x21c/0x24b
[c040b07d] ndisc_ifinfo_sysctl_change+0x0/0x1ef
[c05c5b40] addrconf_init+0x13/0x193
[c0199f63] proc_net_fops_create+0x10/0x21

Re: [Bridge] Packet per Second

2007-12-14 Thread Stephen Hemminger
On Fri, 14 Dec 2007 15:34:10 + (UTC)
Flávio Pires [EMAIL PROTECTED] wrote:

 Hi all, 
 
 It's my first time using usenet...
 
 Well, I work on an ISP and we have a linux box acting as a
 bridge+firewall. With this bridge+firewall we control the packet rate
 per second from each client and from our repeaters. But I can't
 measure the packet rate per IP. Is there any tool for this?
 
 Actually, what I want is to measure the packet rate per IP and
 generate graphics with mrtg or rrdtool, but for this I must have the
 number of packets per second of each client :)
 
 Thank you all
 --
 Flávio
 
 


Not that I know of, but you might look at:
http://www.bandwidtharbitrator.com/
-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [Bridge] Packet per Second

2007-12-14 Thread Flávio Pires
In article [EMAIL PROTECTED] Stephen
Hemminger[EMAIL PROTECTED] wrote:
  On Fri, 14 Dec 2007 15:34:10 + (UTC)
Flávio Pires [EMAIL PROTECTED] wrote:
   Hi all, 
   
   It's my first time using usenet...
   
   Well, I work on an ISP and we have a linux box acting as a
 bridge+firewall. With this bridge+firewall we control the packet
  rate per second from each client and from our repeaters. But I can't
 measure the packet rate per IP. Is there any tool for this?
   
   Actually, what I want is to measure the packet rate per IP and
 generate graphics with mrtg or rrdtool, but for this I must have the
 number of packets per second of each client :)
   
   Thank you all
   --
   Flávio
   
 

  Not that I know of, but you might look at:
  http://www.bandwidtharbitrator.com/

Yeah, we have a proprietary solution from etinc; it does bandwidth
control and firewalling... but using its firewall made this machine too
slow, so we created a box just for the firewall... Now we need a way to
measure pps per host so we can determine which limits fit our clients
and our own needs better.



-- 

I'm trying a new usenet client for Mac, Nemo OS X.
You can download it at http://www.malcom-mac.com/nemo




RE: What was the reason for 2.6.22 SMP kernels to change how sendmsg is called?

2007-12-14 Thread Kevin Wilson
"do not express your frustration ..."

It wasn't frustration but rather playful sarcasm.

"This was not a bug report at all..."

Wasn't really meant to be a true blue bug report (my bad I guess). Anywho, I 
know you guys have big fish to fry so I tried to keep it short and to the 
point. I knew something had changed and I was truly stumped in trying to figure 
out what it was so I decided to ask for some general guidance. 

"Without having your code it is virtually impossible to say."

I know this is partly my own fault for not stating so explicitly in my first 
email. However, as I stated in my second email, I would have been happy to send 
it to anyone that expressed an interest (even though the issue wasn't 
interesting in itself) in my post. I just thought, being the experts you are, 
that having only the small subset of code that is actually involved in the 
offending call, you'd be able to say go take a look at commit such-n-such 
which you have now done. Thanks a million!

I'm not trying to start a nag war here, I know you guys are busy. Having both 
(mostly me) made pitiful assumptions I think we have reached an understanding 
on this (now dead) topic. I'm just starting out in the drivers arena with hopes 
of being a decent contributor to the kernel ecosystem some day so there will be 
some growing pains and this thread was one of them.

Thanks for the help and suggestions, I appreciate them immensely.

Kevin

 
-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Evgeniy Polyakov
Sent: Friday, December 14, 2007 00:33
To: Kevin Wilson
Cc: David Miller; netdev@vger.kernel.org
Subject: Re: What was the reason for 2.6.22 SMP kernels to change how
sendmsg is called?


Hi Kevin.

On Thu, Dec 13, 2007 at 04:00:02PM -0600, Kevin Wilson ([EMAIL PROTECTED]) 
wrote:
 I see your point but it just so happens it is a GPL'd driver, as is all of 
 our Linux code we produce for our hardware. Granted it is out of tree, and 
 after you saw it you would want it to stay that way. However, I would have 
 sent you the whole thing if that is a pre-req to cordial exchanges on this 
 list.
 
 Nonetheless, a somewhat recent change in your tree, that I could not pinpoint 
 on my own, caused the driver to stop functioning properly. So after much 
 searching in git/google/sources with no luck, I decided to ask for a little 
 assistance, maybe just a hint as to where the culprit may be in the tree so I 
 could investigate for myself. For SNGs I tried the method that now works but 
 I am still at a loss as to (can't find) what changes in the tree caused it to 
 fail.

Without having your code it is virtually impossible to say why you have
a bug. And do not express your frustration by telling us 'zero people
responded to my bug report'. This was not a bug report at all, but an empty
message saying 'my code stopped working after some network changes which
broke the stuff'.

Now in 2.6.22 and later kernels you must use the higher level SOCKET to
make a call to PROTO_OPS then to sendmsg(), e.g. socket->ops->sendmsg().

It was done because of a bug found in inet_sendmsg(), which tried to
autobind sockets it should not have.
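(A minimal sketch of the post-2.6.22 pattern; kernel_sendmsg() is the
convenience wrapper that ends up in sock->ops->sendmsg(), and buf/len/sock
here are illustrative names:)

	struct kvec vec = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
	int err;

	/* goes through the socket layer instead of calling the
	   protocol's sendmsg directly */
	err = kernel_sendmsg(sock, &msg, &vec, 1, len);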

-- 
Evgeniy Polyakov


Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)

2007-12-14 Thread Andy Gospodarek
On Fri, Dec 14, 2007 at 05:14:57PM +0100, Krzysztof Oledzki wrote:
 
 
 On Wed, 12 Dec 2007, Jay Vosburgh wrote:
 
 Herbert Xu [EMAIL PROTECTED] wrote:
 
 diff -puN drivers/net/bonding/bond_sysfs.c~bonding-locking-fix 
 drivers/net/bonding/bond_sysfs.c
 --- a/drivers/net/bonding/bond_sysfs.c~bonding-locking-fix
 +++ a/drivers/net/bonding/bond_sysfs.c
 @@ -,8 +,6 @@ static ssize_t bonding_store_primary(str
 out:
    write_unlock_bh(&bond->lock);
 
 -   rtnl_unlock();
 -
 
 Looking at the changeset that added this, perhaps the intention
 is to hold the lock? If so we should add an rtnl_lock to the start
 of the function.
 
  Yes, this function needs to hold locks, and more than just
 what's there now.  I believe the following should be correct; I haven't
 tested it, though (I'm supposedly on vacation right now).
 
  The following change should be correct for the
 bonding_store_primary case discussed in this thread, and also corrects
 the bonding_store_active case which performs similar functions.
 
  The bond_change_active_slave and bond_select_active_slave
 functions both require rtnl, bond->lock for read and curr_slave_lock for
 write_bh, and no other locks.  This is so that the lower level
 mode-specific functions can release locks down to just rtnl in order to
 call, e.g., dev_set_mac_address with the locks it expects (rtnl only).
 
 Signed-off-by: Jay Vosburgh [EMAIL PROTECTED]
 
 diff --git a/drivers/net/bonding/bond_sysfs.c 
 b/drivers/net/bonding/bond_sysfs.c
 index 11b76b3..28a2d80 100644
 --- a/drivers/net/bonding/bond_sysfs.c
 +++ b/drivers/net/bonding/bond_sysfs.c
 @@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device 
 *d,
  struct slave *slave;
  struct bonding *bond = to_bond(d);
 
 -write_lock_bh(&bond->lock);
 +rtnl_lock();
 +read_lock(&bond->lock);
 +write_lock_bh(&bond->curr_slave_lock);
 +
  if (!USES_PRIMARY(bond->params.mode)) {
  printk(KERN_INFO DRV_NAME
 : %s: Unable to set primary slave; %s is in mode 
 %d\n,
 @@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device 
 *d,
  }
  }
 out:
 -write_unlock_bh(&bond->lock);
 -
 +write_unlock_bh(&bond->curr_slave_lock);
 +read_unlock(&bond->lock);
  rtnl_unlock();
 
  return count;
 @@ -1190,7 +1193,8 @@ static ssize_t bonding_store_active_slave(struct 
 device *d,
  struct bonding *bond = to_bond(d);
 
  rtnl_lock();
 -write_lock_bh(&bond->lock);
 +read_lock(&bond->lock);
 +write_lock_bh(&bond->curr_slave_lock);
 
  if (!USES_PRIMARY(bond->params.mode)) {
  printk(KERN_INFO DRV_NAME
 @@ -1247,7 +1251,8 @@ static ssize_t bonding_store_active_slave(struct 
 device *d,
  }
  }
 out:
 -write_unlock_bh(&bond->lock);
 +write_unlock_bh(&bond->curr_slave_lock);
 +read_unlock(&bond->lock);
  rtnl_unlock();
 
  return count;
 
 Vanilla 2.6.24-rc5 plus this patch:
 
 =
 [ INFO: possible irq lock inversion dependency detected ]
 2.6.24-rc5 #1
 -
 events/0/9 just changed the state of lock:
  (&mc->mca_lock){-+..}, at: [c0411c7a] mld_ifc_timer_expire+0x130/0x1fb
 but this lock took another, soft-read-irq-unsafe lock in the past:
  (&bond->lock){-.--}
 
 and interrupts could create inverse lock ordering between them.
 
 

Grrr, I should have seen that -- sorry.  Try your luck with this instead:

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 11b76b3..0694254 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d,
struct slave *slave;
struct bonding *bond = to_bond(d);
 
-   write_lock_bh(&bond->lock);
+   rtnl_lock();
+   read_lock_bh(&bond->lock);
+   write_lock_bh(&bond->curr_slave_lock);
+
	if (!USES_PRIMARY(bond->params.mode)) {
printk(KERN_INFO DRV_NAME
   : %s: Unable to set primary slave; %s is in mode %d\n,
@@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d,
}
}
 out:
-   write_unlock_bh(&bond->lock);
-
+   write_unlock_bh(&bond->curr_slave_lock);
+   read_unlock_bh(&bond->lock);
rtnl_unlock();
 
return count;
@@ -1190,7 +1193,8 @@ static ssize_t bonding_store_active_slave(struct device 
*d,
struct bonding *bond = to_bond(d);
 
rtnl_lock();
-   write_lock_bh(&bond->lock);
+   read_lock_bh(&bond->lock);
+   write_lock_bh(&bond->curr_slave_lock);
 
	if (!USES_PRIMARY(bond->params.mode)) {
printk(KERN_INFO DRV_NAME
@@ -1247,7 +1251,8 @@ static ssize_t bonding_store_active_slave(struct device 
*d,
}
}
 out:
-   write_unlock_bh(&bond->lock);
+   

Re: [NETFILTER] xt_hashlimit : speedups hash_dst()

2007-12-14 Thread David Miller
From: Eric Dumazet [EMAIL PROTECTED]
Date: Fri, 14 Dec 2007 12:09:31 +0100

 1) Using jhash2() instead of jhash() is a little bit faster where applicable.
 
 2) Thanks to jhash, the hash value uses the full 32 bits.
 Instead of returning hash % size (implying a divide)
 we return the high 32 bits of (hash * size), which gives
 results in [0, size-1] with the same hash distribution.
 
On most cpus, a multiply is less expensive than a divide, by an order
of magnitude.
   
 Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

As a side note, Jenkins performs nearly optimally (unlike
most traditional hash functions) with power of two hash
table sizes.

Using a pow2 hash table size would completely obviate the
issues solved by #2.

I don't know if that is feasible here in xt_hashlimit, but
if it is, that is how we should solve this expensive
modulo.
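(A sketch of that variant, assuming cfg.size were rounded up to a power of
two during htable setup; hash_dst_pow2 is a hypothetical name:)

static u_int32_t
hash_dst_pow2(const struct xt_hashlimit_htable *ht, const struct dsthash_dst *dst)
{
	/* with a power-of-two table size the reduction is a single AND */
	return jhash2((const u32 *)dst, sizeof(*dst) / sizeof(u32), ht->rnd) &
	       (ht->cfg.size - 1);
}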


Re: [PATCH 1/2] add driver for enc28j60 ethernet chip

2007-12-14 Thread Stephen Hemminger
On Fri, 14 Dec 2007 10:21:51 +0100
Claudio Lanconelli [EMAIL PROTECTED] wrote:

 Hi Stephen,
 thank you for your suggestions.
 I already applied trivial fixes, but I have questions on some points, 
 see inline.
 
 Stephen Hemminger wrote:
  General comments:
* device driver does no carrier detection. This makes it useless
  for bridging, bonding, or any form of failover.
 
* use msglevel method (via ethtool) to control debug messages
  rather than kernel configuration. This allows enabling debugging
  without recompilation which is important in distributions.
 
* Please add ethtool support
 
* Consider using NAPI
 

 Can you point me to a possibly simple driver that uses ethtool and NAPI? 

No driver stays simple! But look at tg3, sky2, or r8169 for examples.

 Or other example that I can use for reference.
 May be the skeleton should be updated.
 
   * use netdev_priv(netdev) rather than netdev->priv
 
 I can't find where I used netdev->priv; maybe you mean priv->netdev?

yes

(skipping other comments)

 
  +static int __devinit enc28j60_probe(struct spi_device *spi)
  +{
  +   struct net_device *dev;
  +   struct enc28j60_net_local *priv;
  +   int ret = 0;
  +
  +   dev_dbg(&spi->dev, "%s() start\n", __FUNCTION__);
  +
  +   dev = alloc_etherdev(sizeof(struct enc28j60_net_local));
  +   if (!dev) {
  +   ret = -ENOMEM;
  +   goto error_alloc;
  +   }
  +   priv = netdev_priv(dev);
  +
  +   priv->netdev = dev; /* priv to netdev reference */
  +   priv->spi = spi;    /* priv to spi reference */
  +   priv->spi_transfer_buf = kmalloc(SPI_TRANSFER_BUF_LEN, GFP_KERNEL);
 
  Why not declare the transfer buffer as an array in spi?

 I don't understand exactly what you mean here.
 The spi field points to struct spi_device from the SPI subsystem.
 Other SPI client drivers use an allocated buffer too.

I just noticed that you alloc an ether device and then do an additional
allocation for the buffer.  It makes sense if there are other uses.
You do need to be careful about cases where transfer_buf might be used
after free: module unload (you're probably safe), and the client driver
using it during shutdown.



-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [PATCH net-2.6.25 6/8] sctp: Use ipv4_is_type

2007-12-14 Thread Vlad Yasevich
Joe Perches wrote:
 Signed-off-by: Joe Perches [EMAIL PROTECTED]

Thanks Joe.  I've put this into my tree.

-vlad


Re: [PATCH net-2.6.25 6/8] sctp: Use ipv4_is_type

2007-12-14 Thread David Miller
From: Vlad Yasevich [EMAIL PROTECTED]
Date: Fri, 14 Dec 2007 14:11:44 -0500

 Joe Perches wrote:
  Signed-off-by: Joe Perches [EMAIL PROTECTED]
 
 Thanks Joe.  I've put this into my tree.

You can't, because your tree won't build without his first patch and I
have to approve that and stick it into my tree first.

Just let me review and potentially pick up all of
Joe's stuff since we have this dependency.


Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()

2007-12-14 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Fri, 14 Dec 2007 16:30:37 +0800

 On Fri, Dec 14, 2007 at 12:22:09AM -0800, Andrew Morton wrote:
 
  I don't see how it could warn about that.  Nor should it - one might want
  to check that rtnl_lock is held inside preempt_disable() or spin_lock or
  whatever.
  
  It might make sense to warn if ASSERT_RTNL is called in in_interrupt()
  contexts though.
 
 Well the paths where ASSERT_RTNL is used should never be in an
 atomic context.  In the past it has been quite useful in pointing
 out bogus locking practices.
 
 There is currently one path where it's known to warn because of
 this and it (promiscuous mode) is on my todo list.
 
 Oh and it only warns when you have mutex debugging enabled.

Right, this change is just totally bogus.

I'm all for using existing facilities to replace hand-crafted copies,
but this case is removing useful debugging functionality so it's
wrong.


Re: [PATCH net-2.6.25 6/8] sctp: Use ipv4_is_type

2007-12-14 Thread Vlad Yasevich
David Miller wrote:
 From: Vlad Yasevich [EMAIL PROTECTED]
 Date: Fri, 14 Dec 2007 14:11:44 -0500
 
 Joe Perches wrote:
 Signed-off-by: Joe Perches [EMAIL PROTECTED]
 Thanks Joe.  I've put this into my tree.
 
 You can't, because your tree won't build without his first patch and I
 have to approve that and stick it into my tree first.
 
 Just let me review and potentially pick up all of
 Joe's stuff since we have this dependency.
 

Ok.  In that case, you can have my ACK for the SCTP stuff.

-vlad


Re: [GIT PULL] [NET]: Use {hton{s,l},cpu_to_be{16,32}}() where appropriate.

2007-12-14 Thread David Miller
From: YOSHIFUJI Hideaki / 吉藤英明 [EMAIL PROTECTED]
Date: Fri, 14 Dec 2007 16:28:35 +0900 (JST)

 Please consider pulling the following changes from the branch
   net-2.6-dev-20071214
 available at
   git://git.linux-ipv6.org/gitroot/yoshfuji/linux-2.6-dev.git
 which is on top of your net-2.6-devel tree.

Pulled, thank you.

Could you please provide the full pull URL all in one
line, instead of splitting the base URL and the HEAD
name onto separate lines?

I have to cut and paste from multiple places in your mail in order to
compose the pull command line and I wish I didn't have to do that
every time.  I should be able to just cut and paste one line, which
should be the fully specified URL, for maximum efficiency.

Thanks.


[PATCH 20/29] netfilter: NF_QUEUE vs emergency skbs

2007-12-14 Thread Peter Zijlstra
Avoid memory getting stuck waiting for userspace, drop all emergency packets.
This of course requires the regular storage route to not include an NF_QUEUE
target ;-)

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 net/netfilter/core.c |3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/net/netfilter/core.c
===
--- linux-2.6.orig/net/netfilter/core.c
+++ linux-2.6/net/netfilter/core.c
@@ -176,9 +176,12 @@ next_hook:
ret = 1;
goto unlock;
} else if (verdict == NF_DROP) {
+drop:
kfree_skb(skb);
ret = -EPERM;
 	} else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
+   if (skb_emergency(*pskb))
+   goto drop;
if (!nf_queue(skb, elem, pf, hook, indev, outdev, okfn,
   verdict >> NF_VERDICT_BITS))
goto next_hook;

--



[PATCH 00/29] Swap over NFS -v15

2007-12-14 Thread Peter Zijlstra
Hi,

Another posting of the full swap over NFS series. 

Andrew/Linus, could we start thinking of sticking this in -mm?

[ patches against 2.6.24-rc5-mm1, also to be found online at:
  http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v2.6.24-rc5-mm1/ ]

The patch-set can be split into roughly 5 parts, for each of which I shall give
a description.


  Part 1, patches 1-11

The problem with swap over network is the generic swap problem: needing memory
to free memory. Normally this is solved using mempools, as can be seen in the
BIO layer.

Swap over network has the problem that the network subsystem does not use fixed
sized allocations, but heavily relies on kmalloc(). This makes mempools
unusable.
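(For contrast, a sketch of the fixed-size pattern mempools assume, as used
in the BIO layer; the cache name and object type are illustrative only:)

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>

struct example_obj { int data[32]; };	/* ONE fixed object size */

static struct kmem_cache *cache;
static mempool_t *pool;

static int __init example_init(void)
{
	cache = kmem_cache_create("example_objs",
				  sizeof(struct example_obj), 0, 0, NULL);
	if (!cache)
		return -ENOMEM;
	/* guarantees 16 preallocated objects, but only of this one size;
	   the net stack's variable-length kmalloc() buffers don't fit
	   this model */
	pool = mempool_create(16, mempool_alloc_slab, mempool_free_slab, cache);
	return pool ? 0 : -ENOMEM;
}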

This first part provides a generic reserve framework.

Care is taken to only affect the slow paths - when we're low on memory.

Caveats: it currently doesn't do SLOB.

 1 - mm: gfp_to_alloc_flags()
 2 - mm: tag reserve pages
 3 - mm: sl[au]b: add knowledge of reserve pages
 4 - mm: kmem_estimate_pages()
 5 - mm: allow PF_MEMALLOC from softirq context
 6 - mm: serialize access to min_free_kbytes
 7 - mm: emergency pool
 8 - mm: system wide ALLOC_NO_WATERMARK
 9 - mm: __GFP_MEMALLOC
10 - mm: memory reserve management
11 - selinux: tag avc cache alloc as non-critical


  Part 2, patches 12-14

Provide some generic network infrastructure needed later on.

12 - net: wrap sk->sk_backlog_rcv()
13 - net: packet split receive api
14 - net: sk_allocation() - concentrate socket related allocations


  Part 3, patches 15-21

Now that we have a generic memory reserve system, use it on the network stack.
The thing that makes this interesting is that, contrary to BIO, both the
transmit and receive path require memory allocations. 

That is, in the BIO layer write back completion is usually just an ISR flipping
a bit and waking stuff up. A network write back completion involves receiving
packets, which, when there is no memory, is rather hard. And even when there is
memory there is no guarantee that the required packet comes in within the window
that that memory buys us.

The solution to this problem is found in the fact that the network is to be
assumed lossy. Even now, when there is no memory to receive packets, the
network card will have to discard packets. What we do is move this into the
network stack.

So we reserve a little pool to act as a receive buffer; this allows us to
inspect packets before tossing them. This way, we can keep those packets that
ensure progress (writeback completion) and disregard the others (which would
have been dropped anyway). [ NOTE: this is a stable mode of operation with
limited memory usage, exactly the kind of thing we need ]

Again, care is taken to confine most of the overhead of this to the slow path.
Only packets allocated from the reserves will suffer the extra atomic overhead
needed for accounting.
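
A rough user-space sketch of the filtering idea (all names and numbers here
are illustrative, not the kernel API): packets paid for out of the emergency
reserve are only kept when they service writeback completion; everything else
is tossed, which is no worse than the NIC dropping it.

#include <stdbool.h>
#include <stdio.h>

#define RESERVE_PKTS 4

static int reserve_used;                /* packets currently charged */

struct pkt {
	bool from_reserve;              /* paid for by the emergency pool */
	bool is_writeback_completion;   /* ensures progress */
};

/* Charge one packet against the emergency reserve, if room is left. */
static bool reserve_get(void)
{
	if (reserve_used >= RESERVE_PKTS)
		return false;
	reserve_used++;
	return true;
}

static void reserve_put(void)
{
	reserve_used--;
}

/* Under memory pressure only packets that guarantee progress survive. */
static bool rx_filter(struct pkt *p)
{
	if (!p->from_reserve)
		return true;            /* normal path: keep */
	if (p->is_writeback_completion)
		return true;            /* progress: keep */
	reserve_put();                  /* toss and refund the reserve */
	return false;
}

int main(void)
{
	struct pkt ack = { reserve_get(), true };   /* writeback completion */
	struct pkt web = { reserve_get(), false };  /* random traffic */

	printf("ack kept: %d\n", rx_filter(&ack));
	printf("web kept: %d\n", rx_filter(&web));
	return 0;
}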

15 - netvm: network reserve infrastructure
16 - netvm: INET reserves.
17 - netvm: hook skb allocation to reserves
18 - netvm: filter emergency skbs.
19 - netvm: prevent a TCP specific deadlock
20 - netfilter: NF_QUEUE vs emergency skbs
21 - netvm: skb processing


  Part 4, patches 22-24

Generic vm infrastructure to handle swapping to a filesystem instead of a block
device.

This provides new a_ops to handle swapcache pages and could be used to obsolete
the bmap usage for swapfiles.

22 - mm: prepare swap entry methods for use in page methods
23 - mm: add support for non block device backed swap files
24 - mm: methods for teaching filesystems about PG_swapcache pages


  Part 5, patches 25-29

Finally, convert NFS to make use of the new network and vm infrastructure to
provide swap over NFS.

25 - nfs: remove mempools
26 - nfs: teach the NFS client how to treat PG_swapcache pages
27 - nfs: disable data cache revalidation for swapfiles
28 - nfs: enable swap on NFS
29 - nfs: fix various memory recursions possible with swap over NFS.


Changes since -v14:
 - SLAB support
 - a_ops rework
 - various bug fixes and cleanups




[PATCH 07/29] mm: emergency pool

2007-12-14 Thread Peter Zijlstra
Provide means to reserve a specific amount of pages.

The emergency pool is separated from the min watermark because ALLOC_HARDER
and ALLOC_HIGH modify the watermark in a relative way and thus do not ensure
a strict minimum.
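
To make the point concrete, a small compilable calculation (numbers made up):
the relative adjustments mirrored from zone_watermark_ok() shrink the floor
depending on who is asking, while pages_emerg is added unconditionally and
thus is a strict minimum.

#include <stdio.h>

/* ALLOC_HIGH takes half off the watermark, ALLOC_HARDER another quarter;
 * pages_emerg is added on top regardless, so those pages stay untouched
 * by everything short of ALLOC_NO_WATERMARKS. */
static long effective_min(long pages_min, long pages_emerg, int high, int harder)
{
	long min = pages_min;

	if (high)
		min -= min / 2;
	if (harder)
		min -= min / 4;
	return min + pages_emerg;
}

int main(void)
{
	long pages_min = 1024, pages_emerg = 256;

	printf("normal alloc floor: %ld pages\n",
	       effective_min(pages_min, pages_emerg, 0, 0));
	printf("GFP_ATOMIC floor:   %ld pages\n",
	       effective_min(pages_min, pages_emerg, 1, 1));
	/* without pages_emerg the GFP_ATOMIC floor would drop to 384 pages;
	 * with it, 256 pages always remain in reserve */
	return 0;
}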

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/mmzone.h |3 +
 mm/page_alloc.c|   82 +++--
 mm/vmstat.c|6 +--
 3 files changed, 78 insertions(+), 13 deletions(-)

Index: linux-2.6/include/linux/mmzone.h
===
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -213,7 +213,7 @@ enum zone_type {
 
 struct zone {
/* Fields commonly accessed by the page allocator */
-   unsigned long   pages_min, pages_low, pages_high;
+   unsigned long   pages_emerg, pages_min, pages_low, pages_high;
/*
	 * We don't know if the memory that we're going to allocate will be freeable
	 * or/and it will be released eventually, so to avoid totally wasting several
@@ -682,6 +682,7 @@ int sysctl_min_unmapped_ratio_sysctl_han
struct file *, void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
+int adjust_memalloc_reserve(int pages);
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -118,6 +118,8 @@ static char * const zone_names[MAX_NR_ZO
 
 static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
+static DEFINE_MUTEX(var_free_mutex);
+int var_free_kbytes;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
@@ -1252,7 +1254,7 @@ int zone_watermark_ok(struct zone *z, in
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;

-	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
+	if (free_pages <= min + z->lowmem_reserve[classzone_idx] + z->pages_emerg)
		return 0;
	for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
@@ -1733,8 +1735,8 @@ nofail_alloc:
 nopage:
	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
		printk(KERN_WARNING "%s: page allocation failure."
-			" order:%d, mode:0x%x\n",
-			p->comm, order, gfp_mask);
+			" order:%d, mode:0x%x, alloc_flags:0x%x, pflags:0x%x\n",
+			p->comm, order, gfp_mask, alloc_flags, p->flags);
		dump_stack();
		show_mem();
	}
@@ -1952,9 +1954,9 @@ void show_free_areas(void)
\n,
zone-name,
K(zone_page_state(zone, NR_FREE_PAGES)),
-		K(zone->pages_min),
-		K(zone->pages_low),
-		K(zone->pages_high),
+		K(zone->pages_emerg + zone->pages_min),
+		K(zone->pages_emerg + zone->pages_low),
+		K(zone->pages_emerg + zone->pages_high),
K(zone_page_state(zone, NR_ACTIVE)),
K(zone_page_state(zone, NR_INACTIVE)),
K(zone-present_pages),
@@ -4113,7 +4115,7 @@ static void calculate_totalreserve_pages
}
 
/* we treat pages_high as reserved pages. */
-	max += zone->pages_high;
+	max += zone->pages_high + zone->pages_emerg;
 
	if (max > zone->present_pages)
		max = zone->present_pages;
@@ -4170,7 +4172,8 @@ static void setup_per_zone_lowmem_reserv
  */
 static void __setup_per_zone_pages_min(void)
 {
-	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+	unsigned pages_emerg = var_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
struct zone *zone;
unsigned long flags;
@@ -4182,11 +4185,13 @@ static void __setup_per_zone_pages_min(v
}
 
for_each_zone(zone) {
-   u64 tmp;
+   u64 tmp, tmp_emerg;
 
		spin_lock_irqsave(&zone->lru_lock, flags);
		tmp = (u64)pages_min * zone->present_pages;
		do_div(tmp, lowmem_pages);
+		tmp_emerg = (u64)pages_emerg * zone->present_pages;
+		do_div(tmp_emerg, lowmem_pages);
if (is_highmem(zone)) {
/*
 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ 

[PATCH 28/29] nfs: enable swap on NFS

2007-12-14 Thread Peter Zijlstra
Implement all the new swapfile a_ops for NFS. This will set the NFS socket to
SOCK_MEMALLOC and run socket reconnect under PF_MEMALLOC, as well as reset
SOCK_MEMALLOC before engaging the protocol ->connect() method.

PF_MEMALLOC should allow the allocation of struct socket and related objects
and the early (re)setting of SOCK_MEMALLOC should allow us to receive the
packets required for the TCP connection buildup.

(swapping continues over a server reset during heavy network traffic)
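
The nfs_swapfile() hook below just forwards to xs_swapper() on the transport.
A rough user-space model of what that enable/disable presumably amounts to
(the bookkeeping and names here are illustrative, not the actual sunrpc code):

#include <stdbool.h>
#include <stdio.h>

#define TX_RESERVE_PAGES 32             /* per-transport TX bound, cf. patch 15 */

struct xprt { bool memalloc; };
static long reserved_pages;             /* global reserve, in pages */

/* Flip a transport in or out of swap duty and adjust the reserve once. */
static int xs_swapper_model(struct xprt *x, int enable)
{
	if (enable && !x->memalloc) {
		reserved_pages += TX_RESERVE_PAGES;  /* cf. sk_adjust_memalloc() */
		x->memalloc = true;                  /* cf. sk_set_memalloc() */
	} else if (!enable && x->memalloc) {
		reserved_pages -= TX_RESERVE_PAGES;
		x->memalloc = false;
	}
	return 0;
}

int main(void)
{
	struct xprt x = { false };

	xs_swapper_model(&x, 1);
	printf("swapon:  reserved %ld pages\n", reserved_pages);
	xs_swapper_model(&x, 0);
	printf("swapoff: reserved %ld pages\n", reserved_pages);
	return 0;
}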

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/Kconfig  |   18 
 fs/nfs/file.c   |   12 
 fs/nfs/write.c  |   19 +
 include/linux/nfs_fs.h  |2 +
 include/linux/sunrpc/xprt.h |5 ++-
 net/sunrpc/sched.c  |9 --
 net/sunrpc/xprtsock.c   |   63 
 7 files changed, 125 insertions(+), 3 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -371,6 +371,13 @@ static int nfs_launder_page(struct page 
	return nfs_wb_page(page_file_mapping(page)->host, page);
 }
 
+#ifdef CONFIG_NFS_SWAP
+static int nfs_swapfile(struct address_space *mapping, int enable)
+{
+	return xs_swapper(NFS_CLIENT(mapping->host)->cl_xprt, enable);
+}
+#endif
+
 const struct address_space_operations nfs_file_aops = {
.readpage = nfs_readpage,
.readpages = nfs_readpages,
@@ -385,6 +392,11 @@ const struct address_space_operations nf
.direct_IO = nfs_direct_IO,
 #endif
.launder_page = nfs_launder_page,
+#ifdef CONFIG_NFS_SWAP
+   .swapfile = nfs_swapfile,
+   .swap_out = nfs_swap_out,
+   .swap_in = nfs_readpage,
+#endif
 };
 
 static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page)
Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -365,6 +365,25 @@ int nfs_writepage(struct page *page, str
return ret;
 }
 
+int nfs_swap_out(struct file *file, struct page *page,
+struct writeback_control *wbc)
+{
+   struct nfs_open_context *ctx = nfs_file_open_context(file);
+   int status;
+
+   status = nfs_writepage_setup(ctx, page, 0, nfs_page_length(page));
+	if (status < 0) {
+   nfs_set_pageerror(page);
+   goto out;
+   }
+
+   status = nfs_writepage_locked(page, wbc);
+
+out:
+   unlock_page(page);
+   return status;
+}
+
 static int nfs_writepages_callback(struct page *page, struct writeback_control 
*wbc, void *data)
 {
int ret;
Index: linux-2.6/include/linux/nfs_fs.h
===
--- linux-2.6.orig/include/linux/nfs_fs.h
+++ linux-2.6/include/linux/nfs_fs.h
@@ -413,6 +413,8 @@ extern int  nfs_flush_incompatible(struc
 extern int  nfs_updatepage(struct file *, struct page *, unsigned int, 
unsigned int);
 extern int nfs_writeback_done(struct rpc_task *, struct nfs_write_data *);
 extern void nfs_writedata_release(void *);
+extern int  nfs_swap_out(struct file *file, struct page *page,
+struct writeback_control *wbc);
 
 /*
  * Try to write back everything synchronously (but check the
Index: linux-2.6/fs/Kconfig
===
--- linux-2.6.orig/fs/Kconfig
+++ linux-2.6/fs/Kconfig
@@ -1692,6 +1692,18 @@ config NFS_DIRECTIO
  causes open() to return EINVAL if a file residing in NFS is
  opened with the O_DIRECT flag.
 
+config NFS_SWAP
+	bool "Provide swap over NFS support"
+   default n
+   depends on NFS_FS
+   select SUNRPC_SWAP
+   help
+ This option enables swapon to work on files located on NFS mounts.
+
+ For more details, see Documentation/vm_deadlock.txt
+
+ If unsure, say N.
+
 config NFSD
	tristate "NFS server support"
depends on INET
@@ -1835,6 +1847,12 @@ config SUNRPC_BIND34
  If unsure, say N to get traditional behavior (version 2 rpcbind
  requests only).
 
+config SUNRPC_SWAP
+   def_bool n
+   depends on SUNRPC
+   select NETVM
+   select SWAP_FILE
+
 config RPCSEC_GSS_KRB5
	tristate "Secure RPC: Kerberos V mechanism (EXPERIMENTAL)"
	depends on SUNRPC && EXPERIMENTAL
Index: linux-2.6/include/linux/sunrpc/xprt.h
===
--- linux-2.6.orig/include/linux/sunrpc/xprt.h
+++ linux-2.6/include/linux/sunrpc/xprt.h
@@ -143,7 +143,9 @@ struct rpc_xprt {
	unsigned int		max_reqs;	/* total slots */
unsigned long   state;  /* transport state */
unsigned char   shutdown   : 1, /* being shut down */
-   resvport   : 1; /* use a reserved 

[PATCH 01/29] mm: gfp_to_alloc_flags()

2007-12-14 Thread Peter Zijlstra
Factor out the gfp to alloc_flags mapping so it can be used in other places.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/internal.h   |   11 ++
 mm/page_alloc.c |   98 
 2 files changed, 67 insertions(+), 42 deletions(-)

Index: linux-2.6/mm/internal.h
===
--- linux-2.6.orig/mm/internal.h
+++ linux-2.6/mm/internal.h
@@ -47,4 +47,15 @@ static inline unsigned long page_order(s
VM_BUG_ON(!PageBuddy(page));
return page_private(page);
 }
+
+#define ALLOC_HARDER   0x01 /* try to alloc harder */
+#define ALLOC_HIGH 0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_MIN0x04 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW0x08 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH   0x10 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS0x20 /* don't check watermarks at all */
+#define ALLOC_CPUSET   0x40 /* check for correct cpuset */
+
+int gfp_to_alloc_flags(gfp_t gfp_mask);
+
 #endif
Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1139,14 +1139,6 @@ failed:
return NULL;
 }
 
-#define ALLOC_NO_WATERMARKS0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH   0x08 /* use pages_high watermark */
-#define ALLOC_HARDER   0x10 /* try to alloc harder */
-#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET   0x40 /* check for correct cpuset */
-
 #ifdef CONFIG_FAIL_PAGE_ALLOC
 
 static struct fail_page_alloc_attr {
@@ -1535,6 +1527,44 @@ static void set_page_owner(struct page *
 #endif /* CONFIG_PAGE_OWNER */
 
 /*
+ * get the deepest reaching allocation flags for the given gfp_mask
+ */
+int gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+   struct task_struct *p = current;
+   int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+   const gfp_t wait = gfp_mask  __GFP_WAIT;
+
+   /*
+* The caller may dip into page reserves a bit more if the caller
+* cannot run direct reclaim, or if the caller has realtime scheduling
+* policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
+* set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+*/
+   if (gfp_mask  __GFP_HIGH)
+   alloc_flags |= ALLOC_HIGH;
+
+   if (!wait) {
+   alloc_flags |= ALLOC_HARDER;
+   /*
+* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+*/
+   alloc_flags = ~ALLOC_CPUSET;
+   } else if (unlikely(rt_task(p))  !in_interrupt())
+   alloc_flags |= ALLOC_HARDER;
+
+   if (likely(!(gfp_mask  __GFP_NOMEMALLOC))) {
+   if (!in_interrupt() 
+   ((p-flags  PF_MEMALLOC) ||
+unlikely(test_thread_flag(TIF_MEMDIE
+   alloc_flags |= ALLOC_NO_WATERMARKS;
+   }
+
+   return alloc_flags;
+}
+
+/*
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page * fastcall
@@ -1589,48 +1619,28 @@ restart:
 * OK, we're below the kswapd watermark and have kicked background
 * reclaim. Now things get more complex, so set up alloc_flags according
 * to how we want to proceed.
-*
-* The caller may dip into page reserves a bit more if the caller
-* cannot run direct reclaim, or if the caller has realtime scheduling
-* policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-* set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
 */
-   alloc_flags = ALLOC_WMARK_MIN;
-   if ((unlikely(rt_task(p))  !in_interrupt()) || !wait)
-   alloc_flags |= ALLOC_HARDER;
-   if (gfp_mask  __GFP_HIGH)
-   alloc_flags |= ALLOC_HIGH;
-   if (wait)
-   alloc_flags |= ALLOC_CPUSET;
+   alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-   /*
-* Go through the zonelist again. Let __GFP_HIGH and allocations
-* coming from realtime tasks go deeper into reserves.
-*
-* This is the last chance, in general, before the goto nopage.
-* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
-* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
-*/
-   page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+   /* This is the last chance, in general, before the goto nopage. */
+   page = get_page_from_freelist(gfp_mask, order, zonelist,
+   

[PATCH 23/29] mm: add support for non block device backed swap files

2007-12-14 Thread Peter Zijlstra
New address_space_operations methods are added:
  int swapfile(struct address_space *, int);
  int swap_out(struct file *, struct page *, struct writeback_control *);
  int swap_in(struct file *, struct page *);

When during sys_swapon() the swapfile() method is found and returns no error,
the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, and
make use of swap_{out,in}() to write/read swapcache pages.

The swapfile method will be used to communicate to the address_space that the
VM relies on it, and the address_space should take adequate measures (like 
reserving memory for mempools or the like).

This new interface can be used to obviate the need for ->bmap in the swapfile
code. A filesystem would need to load (and maybe even allocate) the full block
map for a file into memory and pin it there on ->swapfile(,1) so that
->swap_{out,in}() have instant access to it. It can be released on
->swapfile(,0).

The reason to provide ->swap_{out,in}() over using {write,read}page() is to
 1) make a distinction between swapcache and pagecache pages, and
 2) to provide a struct file * for credential context (normally not needed
    in the context of writepage, as the page content is normally dirtied
    using either of the following interfaces:
      write_{begin,end}()
      {prepare,commit}_write()
      page_mkwrite()
    which do have the file context).
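
A compilable toy of the proxying described above (structures cut down to the
bare minimum): swap areas flagged SWP_FILE dispatch to the filesystem's
swap_out(), everything else keeps the existing BIO path.

#include <stdio.h>

#define SWP_FILE 0x4                    /* mirrors the new swap flag */

struct page;                            /* opaque here */
struct aops { int (*swap_out)(struct page *); };

static int fs_swap_out(struct page *p)  { (void)p; puts("a_ops->swap_out()"); return 0; }
static int bio_swap_out(struct page *p) { (void)p; puts("BIO write");         return 0; }

struct swap_info { unsigned flags; struct aops *a_ops; };

/* Model of the swap_writepage() hunk: file-backed areas proxy to the
 * backing filesystem, block-backed areas use the BIO path. */
static int swap_writepage_model(struct swap_info *sis, struct page *page)
{
	if (sis->flags & SWP_FILE)
		return sis->a_ops->swap_out(page);
	return bio_swap_out(page);
}

int main(void)
{
	struct aops nfs_aops = { fs_swap_out };
	struct swap_info file_area = { SWP_FILE, &nfs_aops };
	struct swap_info blk_area  = { 0, NULL };

	swap_writepage_model(&file_area, NULL);
	swap_writepage_model(&blk_area, NULL);
	return 0;
}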

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 Documentation/filesystems/Locking |   19 
 Documentation/filesystems/vfs.txt |   17 +++
 include/linux/buffer_head.h   |2 -
 include/linux/fs.h|8 +
 include/linux/swap.h  |3 +
 mm/Kconfig|3 +
 mm/page_io.c  |   58 ++
 mm/swap_state.c   |5 +++
 mm/swapfile.c |   22 +-
 9 files changed, 135 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/swap.h
===
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -164,6 +164,7 @@ enum {
	SWP_USED	= (1 << 0),	/* is slot in swap_info[] used? */
	SWP_WRITEOK	= (1 << 1),	/* ok to write to this swap?	*/
	SWP_ACTIVE	= (SWP_USED | SWP_WRITEOK),
+	SWP_FILE	= (1 << 2),	/* file swap area */
	/* add others here before... */
	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
@@ -261,6 +262,8 @@ extern void swap_unplug_io_fn(struct bac
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct file *, struct page *);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
+extern void swap_sync_page(struct page *page);
+extern int swap_set_page_dirty(struct page *page);
 extern void end_swap_bio_read(struct bio *bio, int err);
 
 /* linux/mm/swap_state.c */
Index: linux-2.6/mm/page_io.c
===
--- linux-2.6.orig/mm/page_io.c
+++ linux-2.6/mm/page_io.c
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/buffer_head.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -102,6 +103,18 @@ int swap_writepage(struct page *page, st
unlock_page(page);
goto out;
}
+#ifdef CONFIG_SWAP_FILE
+   {
+   struct swap_info_struct *sis = page_swap_info(page);
+	if (sis->flags & SWP_FILE) {
+		ret = sis->swap_file->f_mapping->
+			a_ops->swap_out(sis->swap_file, page, wbc);
+   if (!ret)
+   count_vm_event(PSWPOUT);
+   return ret;
+   }
+   }
+#endif
bio = get_swap_bio(GFP_NOIO, page_private(page), page,
end_swap_bio_write);
if (bio == NULL) {
@@ -120,6 +133,39 @@ out:
return ret;
 }
 
+#ifdef CONFIG_SWAP_FILE
+void swap_sync_page(struct page *page)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		const struct address_space_operations *a_ops =
+			sis->swap_file->f_mapping->a_ops;
+		if (a_ops->sync_page)
+			a_ops->sync_page(page);
+	} else
+		block_sync_page(page);
+}
+
+int swap_set_page_dirty(struct page *page)
+{
+   struct swap_info_struct *sis = page_swap_info(page);
+
+	if (sis->flags & SWP_FILE) {
+		const struct address_space_operations *a_ops =
+			sis->swap_file->f_mapping->a_ops;
+		int (*spd)(struct page *) = a_ops->set_page_dirty;
+#ifdef CONFIG_BLOCK
+   if (!spd)
+   spd = __set_page_dirty_buffers;
+#endif
+ 

[PATCH 10/29] mm: memory reserve management

2007-12-14 Thread Peter Zijlstra
Generic reserve management code. 

It provides methods to reserve and charge. Upon this, generic alloc/free style
reserve pools could be built, which could fully replace mempool_t
functionality.

It should also allow for a Banker's algorithm replacement of __GFP_NOFAIL.
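
A toy user-space model of the scheme (the real code works in page units and
under mem_reserve_mutex; this is only meant to show the shape): page demand
propagates up to the root, charges are checked against the local limit.

#include <stdio.h>

struct mem_reserve {
	struct mem_reserve *parent;
	const char *name;
	long pages;                     /* demand, aggregated towards the root */
	long limit, usage;              /* local accounting */
};

static struct mem_reserve root = { .name = "reserve root" };

/* Grow a reserve; the extra demand is visible all the way up the tree. */
static void reserve_pages_add(struct mem_reserve *res, long pages)
{
	res->limit += pages;
	for (; res; res = res->parent)
		res->pages += pages;
}

/* Charge usage; on failure the caller must not allocate. */
static int reserve_charge(struct mem_reserve *res, long pages)
{
	if (res->usage + pages > res->limit)
		return -1;
	res->usage += pages;
	return 0;
}

int main(void)
{
	struct mem_reserve rx = { .parent = &root, .name = "network RX" };
	int ok, fail;

	reserve_pages_add(&rx, 64);
	printf("root sees %ld pages of demand\n", root.pages);
	ok = reserve_charge(&rx, 32);
	fail = reserve_charge(&rx, 64);     /* would exceed the limit */
	printf("charge 32: %d, charge 64: %d\n", ok, fail);
	return 0;
}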

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/reserve.h |   54 +
 mm/Makefile |2 
 mm/reserve.c|  438 
 3 files changed, 493 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/reserve.h
===
--- /dev/null
+++ linux-2.6/include/linux/reserve.h
@@ -0,0 +1,54 @@
+/*
+ * Memory reserve management.
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra [EMAIL PROTECTED]
+ *
+ * This file contains the public data structure and API definitions.
+ */
+
+#ifndef _LINUX_RESERVE_H
+#define _LINUX_RESERVE_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+struct mem_reserve {
+   struct mem_reserve *parent;
+   struct list_head children;
+   struct list_head siblings;
+
+   const char *name;
+
+   long pages;
+   long limit;
+   long usage;
+   spinlock_t lock;/* protects limit and usage */
+};
+
+extern struct mem_reserve mem_reserve_root;
+
+void mem_reserve_init(struct mem_reserve *res, const char *name,
+ struct mem_reserve *parent);
+int mem_reserve_connect(struct mem_reserve *new_child,
+   struct mem_reserve *node);
+int mem_reserve_disconnect(struct mem_reserve *node);
+
+int mem_reserve_pages_set(struct mem_reserve *res, long pages);
+int mem_reserve_pages_add(struct mem_reserve *res, long pages);
+int mem_reserve_pages_charge(struct mem_reserve *res, long pages,
+int overcommit);
+
+int mem_reserve_kmalloc_set(struct mem_reserve *res, long bytes);
+int mem_reserve_kmalloc_charge(struct mem_reserve *res, long bytes,
+  int overcommit);
+
+struct kmem_cache;
+
+int mem_reserve_kmem_cache_set(struct mem_reserve *res,
+  struct kmem_cache *s,
+  int objects);
+int mem_reserve_kmem_cache_charge(struct mem_reserve *res,
+ long objs,
+ int overcommit);
+
+#endif /* _LINUX_RESERVE_H */
Index: linux-2.6/mm/Makefile
===
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o
   page_alloc.o page-writeback.o pdflush.o \
   readahead.o swap.o truncate.o vmscan.o \
   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
-  page_isolation.o $(mmu-y)
+  page_isolation.o reserve.o $(mmu-y)
 
 obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
 obj-$(CONFIG_BOUNCE)   += bounce.o
Index: linux-2.6/mm/reserve.c
===
--- /dev/null
+++ linux-2.6/mm/reserve.c
@@ -0,0 +1,438 @@
+/*
+ * Memory reserve management.
+ *
+ *  Copyright (C) 2007, Red Hat, Inc., Peter Zijlstra [EMAIL PROTECTED]
+ *
+ * Description:
+ *
+ * Manage a set of memory reserves.
+ *
+ * A memory reserve is a reserve for a specified number of objects of a
+ * specified size. Since memory is managed in pages, this reserve demand is
+ * then translated into a page unit.
+ *
+ * So each reserve has a specified object limit, an object usage count and a
+ * number of pages required to back these objects.
+ *
+ * Usage is charged against a reserve, if the charge fails, the resource must
+ * not be allocated/used.
+ *
+ * The reserves are managed in a tree, and the resource demands (pages and
+ * limit) are propagated up the tree. Obviously the object limit will be
+ * meaningless as soon as the units start mixing, but the required page reserve
+ * (being of one unit) is still valid at the root.
+ *
+ * It is the page demand of the root node that is used to set the global
+ * reserve (adjust_memalloc_reserve() which sets zone-pages_emerg).
+ *
+ * As long as a subtree has the same usage unit, an aggregate node can be used
+ * to charge against, instead of the leaf nodes. However, do be consistent with
+ * who is charged; resource usage is not propagated up the tree (for
+ * performance reasons).
+ */
+
+#include <linux/reserve.h>
+#include <linux/mutex.h>
+#include <linux/mmzone.h>
+#include <linux/log2.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+
+static DEFINE_MUTEX(mem_reserve_mutex);
+
+/**
+ * @mem_reserve_root - the global reserve root
+ *
+ * The global reserve is empty, and has no limit unit, it merely
+ * acts as an aggregation point for reserves and an interface to
+ * adjust_memalloc_reserve().
+ */

[PATCH 17/29] netvm: hook skb allocation to reserves

2007-12-14 Thread Peter Zijlstra
Change the skb allocation api to indicate RX usage and use this to fall back to
the reserve when needed. SKBs allocated from the reserve are tagged in
skb->emergency.

Teach all other skb ops about emergency skbs and the reserve accounting.

Use the (new) packet split API to allocate and track fragment pages from the
emergency reserve. Do this using an atomic counter in page->index. This is
needed because the fragments have a different sharing semantic than that
indicated by skb_shinfo()->dataref.

Note that the decision to distinguish between regular and emergency SKBs allows
the accounting overhead to be limited to the latter kind.
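
A user-space model of the allocation strategy (purely illustrative): the
first attempt is barred from the reserves via __GFP_NOMEMALLOC, and only when
that fails do we charge the emergency pool and tag the skb.

#include <stdbool.h>
#include <stdlib.h>
#include <stdio.h>

struct skb { bool emergency; char *data; };

static long reserve_left = 2;           /* emergency skbs still allowed */

/* "pressure" stands in for the normal allocation failing. */
static struct skb *alloc_skb_model(size_t size, bool pressure)
{
	struct skb *skb = calloc(1, sizeof(*skb));

	if (!skb)
		return NULL;
	if (!pressure) {                /* the NOMEMALLOC attempt succeeded */
		skb->data = malloc(size);
		return skb;
	}
	if (reserve_left > 0) {         /* retry from the emergency reserve */
		reserve_left--;
		skb->emergency = true;  /* remember to refund on free */
		skb->data = malloc(size);
		return skb;
	}
	free(skb);
	return NULL;                    /* reserve exhausted too */
}

int main(void)
{
	struct skb *skb = alloc_skb_model(1500, true);

	if (skb) {
		printf("emergency=%d, reserve_left=%ld\n",
		       skb->emergency, reserve_left);
		free(skb->data);
		free(skb);
	}
	return 0;
}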

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/mm_types.h |1 
 include/linux/skbuff.h   |   26 +--
 net/core/skbuff.c|  173 +--
 3 files changed, 174 insertions(+), 26 deletions(-)

Index: linux-2.6/include/linux/skbuff.h
===
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -314,7 +314,9 @@ struct sk_buff {
__u16   tc_verd;/* traffic control verdict */
 #endif
 #endif
-   /* 2 byte hole */
+	__u8			emergency:1;
+   /* 7 bit hole */
+   /* 1 byte hole */
 
 #ifdef CONFIG_NET_DMA
dma_cookie_tdma_cookie;
@@ -345,10 +347,22 @@ struct sk_buff {
 
 #include <asm/system.h>
 
+#define SKB_ALLOC_FCLONE   0x01
+#define SKB_ALLOC_RX   0x02
+
+static inline bool skb_emergency(const struct sk_buff *skb)
+{
+#ifdef CONFIG_NETVM
+	return unlikely(skb->emergency);
+#else
+   return false;
+#endif
+}
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void   __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-  gfp_t priority, int fclone, int node);
+  gfp_t priority, int flags, int node);
 static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
 {
@@ -358,7 +372,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
   gfp_t priority)
 {
-   return __alloc_skb(size, priority, 1, -1);
+   return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, -1);
 }
 
 extern struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
@@ -1303,7 +1317,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
  gfp_t gfp_mask)
 {
-   struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+   struct sk_buff *skb =
+   __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, -1);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
@@ -1349,6 +1364,7 @@ static inline struct sk_buff *netdev_all
 }
 
 extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t 
gfp_mask);
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
 
 /**
  * netdev_alloc_page - allocate a page for ps-rx on a specific device
@@ -1365,7 +1381,7 @@ static inline struct page *netdev_alloc_
 
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-   __free_page(page);
+   __netdev_free_page(dev, page);
 }
 
 /**
Index: linux-2.6/net/core/skbuff.c
===
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -179,21 +179,28 @@ EXPORT_SYMBOL(skb_truesize_bug);
  * %GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-   int fclone, int node)
+   int flags, int node)
 {
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
+   int emergency = 0, memalloc = sk_memalloc_socks();
 
-   cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+   size = SKB_DATA_ALIGN(size);
+	cache = (flags & SKB_ALLOC_FCLONE)
+		? skbuff_fclone_cache : skbuff_head_cache;
+#ifdef CONFIG_NETVM
+	if (memalloc && (flags & SKB_ALLOC_RX))
+   gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN;
 
+retry_alloc:
+#endif
/* Get the HEAD */
	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
if (!skb)
-   goto out;
+   goto noskb;
 
-   size = SKB_DATA_ALIGN(size);
data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
gfp_mask, node);
if (!data)
@@ -203,6 +210,7 @@ struct sk_buff *__alloc_skb(unsigned int
 * See comment in sk_buff definition, just before the 'tail' 

[PATCH 11/29] selinux: tag avc cache alloc as non-critical

2007-12-14 Thread Peter Zijlstra
Failing to allocate a cache entry will only harm performance not correctness.
Do not consume valuable reserve pages for something like that.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
Acked-by: James Morris [EMAIL PROTECTED]
---
 security/selinux/avc.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-2/security/selinux/avc.c
===
--- linux-2.6-2.orig/security/selinux/avc.c
+++ linux-2.6-2/security/selinux/avc.c
@@ -334,7 +334,7 @@ static struct avc_node *avc_alloc_node(v
 {
struct avc_node *node;
 
-   node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC);
+   node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
if (!node)
goto out;
 

--



[PATCH 05/29] mm: allow PF_MEMALLOC from softirq context

2007-12-14 Thread Peter Zijlstra
Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
a borrowed context, save current->flags; ksoftirqd will have its own
task_struct.

This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.
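
The save/clear/restore dance, condensed into a compilable demo (the
PF_MEMALLOC value is the one from the kernel headers; everything else is
illustrative):

#include <stdio.h>

#define PF_MEMALLOC 0x00000800UL

/* The patch's tsk_restore_flags(): clear the masked bits, then put back
 * whatever the saved copy had - so a softirq run from a task that already
 * had PF_MEMALLOC set does not lose it afterwards. */
#define tsk_restore_flags(flags, pflags, mask) \
	do { (flags) &= ~(mask); (flags) |= ((pflags) & (mask)); } while (0)

int main(void)
{
	unsigned long flags = PF_MEMALLOC;      /* borrowed context had it set */
	unsigned long pflags = flags;           /* __do_softirq() saves ...    */

	flags &= ~PF_MEMALLOC;                  /* ... and clears for itself   */
	/* softirq handlers run here without inheriting PF_MEMALLOC */
	tsk_restore_flags(flags, pflags, PF_MEMALLOC);
	printf("PF_MEMALLOC restored: %d\n", !!(flags & PF_MEMALLOC));
	return 0;
}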

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/sched.h |4 
 kernel/softirq.c  |3 +++
 mm/page_alloc.c   |7 ---
 3 files changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1557,9 +1557,10 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
alloc_flags |= ALLOC_HARDER;
 
	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_interrupt() &&
-		    ((p->flags & PF_MEMALLOC) ||
-		     unlikely(test_thread_flag(TIF_MEMDIE))))
+		if (!in_irq() && (p->flags & PF_MEMALLOC))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_interrupt() &&
+				unlikely(test_thread_flag(TIF_MEMDIE)))
alloc_flags |= ALLOC_NO_WATERMARKS;
}
 
Index: linux-2.6/kernel/softirq.c
===
--- linux-2.6.orig/kernel/softirq.c
+++ linux-2.6/kernel/softirq.c
@@ -211,6 +211,8 @@ asmlinkage void __do_softirq(void)
__u32 pending;
int max_restart = MAX_SOFTIRQ_RESTART;
int cpu;
+	unsigned long pflags = current->flags;
+	current->flags &= ~PF_MEMALLOC;
 
pending = local_softirq_pending();
account_system_vtime(current);
@@ -249,6 +251,7 @@ restart:
 
account_system_vtime(current);
_local_bh_enable();
+   tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 #ifndef __ARCH_HAS_DO_SOFTIRQ
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1389,6 +1389,10 @@ static inline void put_task_struct(struc
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+#define tsk_restore_flags(p, pflags, mask) \
+	do {	(p)->flags &= ~(mask); \
+		(p)->flags |= ((pflags) & (mask)); } while (0)
+
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask);
 #else

--



[PATCH 04/29] mm: kmem_estimate_pages()

2007-12-14 Thread Peter Zijlstra
Provide a method to get the upper bound on the pages needed to allocate
a given number of objects from a given kmem_cache.

This lays the foundation for a generic reserve framework as presented in
a later patch in this series. This framework needs to convert object demand
(kmalloc() bytes, kmem_cache_alloc() objects) to pages.
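
A worked example of the estimate (numbers made up), matching the SLUB variant
below: full slabs for the demand, plus one possibly-empty slab per cpu,
converted to pages via the slab order.

#include <stdio.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

static unsigned estimate_pages(unsigned objects, unsigned objs_per_slab,
			       unsigned order, unsigned ncpus)
{
	unsigned slabs = DIV_ROUND_UP(objects, objs_per_slab);

	if (objs_per_slab > 1)
		slabs += ncpus;         /* worst case: empty per-cpu slabs */
	return slabs << order;          /* pages per slab = 1 << order */
}

int main(void)
{
	/* 1000 objects, 32 objects per slab, order-1 slabs, 4 cpus:
	 * ceil(1000/32) = 32 slabs, +4 per-cpu, 36 slabs * 2 pages = 72 */
	printf("upper bound: %u pages\n", estimate_pages(1000, 32, 1, 4));
	return 0;
}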

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/slab.h |3 +
 mm/slab.c|   74 ++
 mm/slub.c|   82 +++
 3 files changed, 159 insertions(+)

Index: linux-2.6/include/linux/slab.h
===
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -60,6 +60,7 @@ void kmem_cache_free(struct kmem_cache *
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+unsigned kmem_estimate_pages(struct kmem_cache *cachep, gfp_t flags, int 
objects);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -94,6 +95,8 @@ int kmem_ptr_validate(struct kmem_cache 
 void * __must_check krealloc(const void *, size_t, gfp_t);
 void kfree(const void *);
 size_t ksize(const void *);
+unsigned kestimate_single(size_t, gfp_t, int);
+unsigned kestimate(gfp_t, size_t);
 
 /*
  * Allocator specific definitions. These are mainly used to establish optimized
Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -2446,6 +2446,37 @@ const char *kmem_cache_name(struct kmem_
 EXPORT_SYMBOL(kmem_cache_name);
 
 /*
 * return the max number of pages required to allocate count
 * objects from the given cache
+ */
+unsigned kmem_estimate_pages(struct kmem_cache *s, gfp_t flags, int objects)
+{
+   unsigned long slabs;
+
+	if (WARN_ON(!s) || WARN_ON(!s->objects))
+		return 0;
+
+	slabs = DIV_ROUND_UP(objects, s->objects);
+
+	/*
+	 * Account the possible additional overhead if the slab holds more than
+	 * one object.
+	 */
+	if (s->objects > 1) {
+		/*
+		 * Account the possible additional overhead if per cpu slabs
+		 * are currently empty and have to be allocated. This is very
+		 * unlikely but a possible scenario immediately after
+		 * kmem_cache_shrink.
+		 */
+		slabs += num_online_cpus();
+	}
+
+	return slabs << s->order;
+}
+EXPORT_SYMBOL_GPL(kmem_estimate_pages);
+
+/*
  * Attempt to free all slabs on a node. Return the number of slabs we
  * were unable to free.
  */
@@ -2800,6 +2831,57 @@ static unsigned long count_partial(struc
 }
 
 /*
+ * return the max number of pages required to allocate @count objects
+ * of @size bytes from kmalloc given @flags.
+ */
+unsigned kestimate_single(size_t size, gfp_t flags, int count)
+{
+   struct kmem_cache *s = get_slab(size, flags);
+   if (!s)
+   return 0;
+
+   return kmem_estimate_pages(s, flags, count);
+
+}
+EXPORT_SYMBOL_GPL(kestimate_single);
+
+/*
+ * return the max number of pages required to allocate @bytes from kmalloc
+ * in an unspecified number of allocations of heterogeneous size.
+ */
+unsigned kestimate(gfp_t flags, size_t bytes)
+{
+   int i;
+   unsigned long pages;
+
+   /*
+* multiply by two, in order to account the worst case slack space
+* due to the power-of-two allocation sizes.
+*/
+   pages = DIV_ROUND_UP(2 * bytes, PAGE_SIZE);
+
+   /*
+* add the kmem_cache overhead of each possible kmalloc cache
+*/
+	for (i = 1; i < PAGE_SHIFT; i++) {
+		struct kmem_cache *s;
+
+#ifdef CONFIG_ZONE_DMA
+		if (unlikely(flags & SLUB_DMA))
+   s = dma_kmalloc_cache(i, flags);
+   else
+#endif
+   s = kmalloc_caches[i];
+
+   if (s)
+   pages += kmem_estimate_pages(s, flags, 0);
+   }
+
+   return pages;
+}
+EXPORT_SYMBOL_GPL(kestimate);
+
+/*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
  * most items in use come first. New allocations will then fill those up
Index: linux-2.6/mm/slab.c
===
--- linux-2.6.orig/mm/slab.c
+++ linux-2.6/mm/slab.c
@@ -3844,6 +3844,80 @@ const char *kmem_cache_name(struct kmem_
 EXPORT_SYMBOL_GPL(kmem_cache_name);
 
 /*
 * return the max number of pages required to allocate count
 * objects from the given cache
+ */
+unsigned kmem_estimate_pages(struct kmem_cache *cachep, gfp_t flags, int 
objects)
+{
+   /*
+* (1) memory for objects,
+*/
+   

[PATCH 14/29] net: sk_allocation() - concentrate socket related allocations

2007-12-14 Thread Peter Zijlstra
Introduce sk_allocation(); this function allows injecting socket-specific
flags into each socket-related allocation.
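
A compilable sketch of why the choke point is useful (flag values
illustrative): today the helper is a pass-through, but patch 15 then makes it
inject __GFP_MEMALLOC for SOCK_MEMALLOC sockets, and all the converted call
sites pick that up for free.

#include <stdio.h>

#define __GFP_MEMALLOC 0x2000u          /* from patch 09 */
#define GFP_ATOMIC     0x20u            /* illustrative value */

struct sock { unsigned sk_allocation; };

/* The patch-15 form of the helper introduced here. */
static unsigned sk_allocation(struct sock *sk, unsigned gfp_mask)
{
	return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
}

int main(void)
{
	struct sock plain = { GFP_ATOMIC };
	struct sock swap  = { GFP_ATOMIC | __GFP_MEMALLOC };

	printf("plain socket: %#x\n", sk_allocation(&plain, GFP_ATOMIC));
	printf("swap socket:  %#x\n", sk_allocation(&swap, GFP_ATOMIC));
	return 0;
}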

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h|5 +
 net/ipv4/tcp.c|2 +-
 net/ipv4/tcp_output.c |   11 ++-
 net/ipv6/tcp_ipv6.c   |   14 +-
 4 files changed, 21 insertions(+), 11 deletions(-)

Index: linux-2.6/net/ipv4/tcp_output.c
===
--- linux-2.6.orig/net/ipv4/tcp_output.c
+++ linux-2.6/net/ipv4/tcp_output.c
@@ -2063,7 +2063,7 @@ void tcp_send_fin(struct sock *sk)
} else {
/* Socket is locked, keep trying until memory is available. */
for (;;) {
-   skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
+			skb = alloc_skb_fclone(MAX_TCP_HEADER,
+					       sk->sk_allocation);
if (skb)
break;
yield();
@@ -2096,7 +2096,7 @@ void tcp_send_active_reset(struct sock *
struct sk_buff *skb;
 
/* NOTE: No TCP options attached and we never retransmit this. */
-   skb = alloc_skb(MAX_TCP_HEADER, priority);
+   skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, priority));
if (!skb) {
NET_INC_STATS(LINUX_MIB_TCPABORTFAILED);
return;
@@ -2169,7 +2169,8 @@ struct sk_buff * tcp_make_synack(struct 
__u8 *md5_hash_location;
 #endif
 
-   skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC);
+   skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1,
+   sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return NULL;
 
@@ -2428,7 +2429,7 @@ void tcp_send_ack(struct sock *sk)
 * tcp_transmit_skb() will set the ownership to this
 * sock.
 */
-   buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+   buff = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (buff == NULL) {
inet_csk_schedule_ack(sk);
		inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
@@ -2470,7 +2471,7 @@ static int tcp_xmit_probe_skb(struct soc
struct sk_buff *skb;
 
/* We don't queue it, tcp_transmit_skb() sets ownership. */
-   skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+   skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return -1;
 
Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -425,6 +425,11 @@ static inline int sock_flag(struct sock 
	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
+{
+   return gfp_mask;
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
	sk->sk_ack_backlog--;
Index: linux-2.6/net/ipv6/tcp_ipv6.c
===
--- linux-2.6.orig/net/ipv6/tcp_ipv6.c
+++ linux-2.6/net/ipv6/tcp_ipv6.c
@@ -574,7 +574,8 @@ static int tcp_v6_md5_do_add(struct sock
} else {
/* reallocate new list if current one is full. */
		if (!tp->md5sig_info) {
-			tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info), GFP_ATOMIC);
+			tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info),
+					sk_allocation(sk, GFP_ATOMIC));
			if (!tp->md5sig_info) {
kfree(newkey);
return -ENOMEM;
@@ -587,7 +588,8 @@ static int tcp_v6_md5_do_add(struct sock
}
	if (tp->md5sig_info->alloced6 == tp->md5sig_info->entries6) {
		keys = kmalloc((sizeof (tp->md5sig_info->keys6[0]) *
-			       (tp->md5sig_info->entries6 + 1)), GFP_ATOMIC);
+			       (tp->md5sig_info->entries6 + 1)),
+			       sk_allocation(sk, GFP_ATOMIC));
 
if (!keys) {
tcp_free_md5sig_pool();
@@ -711,7 +713,7 @@ static int tcp_v6_parse_md5_keys (struct
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_md5sig_info *p;
 
-	p = kzalloc(sizeof(struct tcp_md5sig_info), GFP_KERNEL);
+	p = kzalloc(sizeof(struct tcp_md5sig_info), sk->sk_allocation);
if (!p)
return -ENOMEM;
 
@@ -1012,7 +1014,7 @@ static void tcp_v6_send_reset(struct soc
 */
 
buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
-GFP_ATOMIC);
+sk_allocation(sk, GFP_ATOMIC));
if (buff == NULL)
return;
 
@@ -1091,10 +1093,12 @@ 

[PATCH 09/29] mm: __GFP_MEMALLOC

2007-12-14 Thread Peter Zijlstra
__GFP_MEMALLOC will allow the allocation to disregard the watermarks, 
much like PF_MEMALLOC.
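
The resulting precedence in gfp_to_alloc_flags(), condensed into a compilable
check (flag values copied from the patches where given, otherwise
illustrative): an explicit __GFP_NOMEMALLOC always wins, then __GFP_MEMALLOC,
then the task's PF_MEMALLOC outside hard irqs.

#include <stdio.h>

#define __GFP_MEMALLOC      0x2000u
#define __GFP_NOMEMALLOC    0x10000u
#define PF_MEMALLOC         0x00000800u
#define ALLOC_NO_WATERMARKS 0x20

static int no_watermarks(unsigned gfp, unsigned pflags, int in_irq)
{
	if (gfp & __GFP_NOMEMALLOC)
		return 0;               /* never touch the reserves */
	if (gfp & __GFP_MEMALLOC)
		return ALLOC_NO_WATERMARKS;
	if (!in_irq && (pflags & PF_MEMALLOC))
		return ALLOC_NO_WATERMARKS;
	return 0;
}

int main(void)
{
	printf("%d %d %d\n",
	       no_watermarks(__GFP_MEMALLOC, 0, 0),
	       no_watermarks(__GFP_MEMALLOC | __GFP_NOMEMALLOC, 0, 0),
	       no_watermarks(0, PF_MEMALLOC, 1));   /* hard irq: denied */
	return 0;
}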

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/gfp.h |3 ++-
 mm/page_alloc.c |4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -43,6 +43,7 @@ struct vm_area_struct;
 #define __GFP_REPEAT	((__force gfp_t)0x400u)	/* Retry the allocation.  Might fail */
 #define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* Retry for ever.  Cannot fail */
 #define __GFP_NORETRY	((__force gfp_t)0x1000u)/* Do not retry.  Might fail */
+#define __GFP_MEMALLOC  ((__force gfp_t)0x2000u)/* Use emergency reserves */
 #define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
@@ -88,7 +89,7 @@ struct vm_area_struct;
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
-   __GFP_NORETRY|__GFP_NOMEMALLOC)
+   __GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* Control allocation constraints */
 #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1560,7 +1560,9 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
alloc_flags |= ALLOC_HARDER;
 
	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (!in_irq() && (p->flags & PF_MEMALLOC))
+		if (gfp_mask & __GFP_MEMALLOC)
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else if (!in_irq() && (p->flags & PF_MEMALLOC))
			alloc_flags |= ALLOC_NO_WATERMARKS;
		else if (!in_interrupt() &&
				unlikely(test_thread_flag(TIF_MEMDIE)))

--



[PATCH 02/29] mm: tag reserve pages

2007-12-14 Thread Peter Zijlstra
Tag pages allocated from the reserves with a non-zero page->reserve.
This allows us to distinguish and account reserve pages.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/mm_types.h |1 +
 mm/page_alloc.c  |4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm_types.h
===
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -70,6 +70,7 @@ struct page {
union {
pgoff_t index;  /* Our offset within mapping. */
void *freelist; /* SLUB: freelist req. slab lock */
+   int reserve;/* page_alloc: page is a reserve page */
};
struct list_head lru;   /* Pageout list, eg. active_list
				 * protected by zone->lru_lock !
Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1448,8 +1448,10 @@ zonelist_scan:
}
 
page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
-   if (page)
+   if (page) {
+		page->reserve = !!(alloc_flags & ALLOC_NO_WATERMARKS);
break;
+   }
 this_zone_full:
if (NUMA_BUILD)
zlc_mark_zone_full(zonelist, z);

--



[PATCH 06/29] mm: serialize access to min_free_kbytes

2007-12-14 Thread Peter Zijlstra
There is a small race between the procfs caller and the memory hotplug caller
of setup_per_zone_pages_min(). Not a big deal, but the next patch will add yet
another caller. Time to close the gap.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/page_alloc.c |   16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -116,6 +116,7 @@ static char * const zone_names[MAX_NR_ZO
	 "Movable",
 };
 
+static DEFINE_SPINLOCK(min_free_lock);
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
@@ -4162,12 +4163,12 @@ static void setup_per_zone_lowmem_reserv
 }
 
 /**
- * setup_per_zone_pages_min - called when min_free_kbytes changes.
+ * __setup_per_zone_pages_min - called when min_free_kbytes changes.
  *
  * Ensures that the pages_{min,low,high} values for each zone are set correctly
  * with respect to min_free_kbytes.
  */
-void setup_per_zone_pages_min(void)
+static void __setup_per_zone_pages_min(void)
 {
	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
@@ -4222,6 +4223,15 @@ void setup_per_zone_pages_min(void)
calculate_totalreserve_pages();
 }
 
+void setup_per_zone_pages_min(void)
+{
+   unsigned long flags;
+
+	spin_lock_irqsave(&min_free_lock, flags);
+	__setup_per_zone_pages_min();
+	spin_unlock_irqrestore(&min_free_lock, flags);
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -4257,7 +4267,7 @@ static int __init init_per_zone_pages_mi
min_free_kbytes = 128;
	if (min_free_kbytes > 65536)
min_free_kbytes = 65536;
-   setup_per_zone_pages_min();
+   __setup_per_zone_pages_min();
setup_per_zone_lowmem_reserve();
return 0;
 }

--



[PATCH 13/29] net: packet split receive api

2007-12-14 Thread Peter Zijlstra
Add some packet-split receive hooks.

For one, this allows NUMA-node-affine page allocations. Later on these hooks
will be extended to do emergency reserve allocations for fragments.
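
A toy rendering of the helper's bookkeeping (no real pages involved):
skb_add_rx_frag() bundles the three length updates the drivers previously
open-coded after skb_fill_page_desc().

#include <stdio.h>

struct skb { unsigned len, data_len, truesize, nr_frags; };

static void skb_add_rx_frag_model(struct skb *skb, unsigned size)
{
	skb->nr_frags++;                /* stands in for skb_fill_page_desc() */
	skb->len      += size;
	skb->data_len += size;
	skb->truesize += size;
}

int main(void)
{
	struct skb skb = { 0, 0, 0, 0 };

	skb_add_rx_frag_model(&skb, 4096);
	skb_add_rx_frag_model(&skb, 2048);
	printf("frags=%u len=%u truesize=%u\n",
	       skb.nr_frags, skb.len, skb.truesize);
	return 0;
}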

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 drivers/net/e1000/e1000_main.c |8 ++--
 drivers/net/sky2.c |   16 ++--
 include/linux/skbuff.h |   23 +++
 net/core/skbuff.c  |   20 
 4 files changed, 51 insertions(+), 16 deletions(-)

Index: linux-2.6/drivers/net/e1000/e1000_main.c
===
--- linux-2.6.orig/drivers/net/e1000/e1000_main.c
+++ linux-2.6/drivers/net/e1000/e1000_main.c
@@ -4392,12 +4392,8 @@ e1000_clean_rx_irq_ps(struct e1000_adapt
			pci_unmap_page(pdev, ps_page_dma->ps_page_dma[j],
					PAGE_SIZE, PCI_DMA_FROMDEVICE);
			ps_page_dma->ps_page_dma[j] = 0;
-			skb_fill_page_desc(skb, j, ps_page->ps_page[j], 0,
-					   length);
+			skb_add_rx_frag(skb, j, ps_page->ps_page[j], 0, length);
			ps_page->ps_page[j] = NULL;
-			skb->len += length;
-			skb->data_len += length;
-			skb->truesize += length;
}
 
/* strip the ethernet crc, problem is we're using pages now so
@@ -4605,7 +4601,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
		if (j < adapter->rx_ps_pages) {
			if (likely(!ps_page->ps_page[j])) {
				ps_page->ps_page[j] =
-					alloc_page(GFP_ATOMIC);
+					netdev_alloc_page(netdev);
				if (unlikely(!ps_page->ps_page[j])) {
					adapter->alloc_rx_buff_failed++;
goto no_buffers;
Index: linux-2.6/include/linux/skbuff.h
===
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -851,6 +851,9 @@ static inline void skb_fill_page_desc(st
	skb_shinfo(skb)->nr_frags = i + 1;
 }
 
+extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
+   int off, int size);
+
 #define SKB_PAGE_ASSERT(skb)	BUG_ON(skb_shinfo(skb)->nr_frags)
 #define SKB_FRAG_ASSERT(skb)	BUG_ON(skb_shinfo(skb)->frag_list)
 #define SKB_LINEAR_ASSERT(skb)	BUG_ON(skb_is_nonlinear(skb))
@@ -1344,6 +1347,26 @@ static inline struct sk_buff *netdev_all
return __netdev_alloc_skb(dev, length, GFP_ATOMIC);
 }
 
+extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t 
gfp_mask);
+
+/**
+ * netdev_alloc_page - allocate a page for ps-rx on a specific device
+ * @dev: network device to receive on
+ *
+ * Allocate a new page node local to the specified device.
+ *
+ * %NULL is returned if there is no free memory.
+ */
+static inline struct page *netdev_alloc_page(struct net_device *dev)
+{
+   return __netdev_alloc_page(dev, GFP_ATOMIC);
+}
+
+static inline void netdev_free_page(struct net_device *dev, struct page *page)
+{
+   __free_page(page);
+}
+
 /**
  * skb_clone_writable - is the header of a clone writable
  * @skb: buffer to check
Index: linux-2.6/net/core/skbuff.c
===
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -263,6 +263,24 @@ struct sk_buff *__netdev_alloc_skb(struc
return skb;
 }
 
+struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
+   struct page *page;
+
+   page = alloc_pages_node(node, gfp_mask, 0);
+   return page;
+}
+
+void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
+   int size)
+{
+   skb_fill_page_desc(skb, i, page, off, size);
+	skb->len += size;
+	skb->data_len += size;
+	skb->truesize += size;
+}
+
 static void skb_drop_list(struct sk_buff **listp)
 {
struct sk_buff *list = *listp;
@@ -2466,6 +2484,8 @@ EXPORT_SYMBOL(kfree_skb);
 EXPORT_SYMBOL(__pskb_pull_tail);
 EXPORT_SYMBOL(__alloc_skb);
 EXPORT_SYMBOL(__netdev_alloc_skb);
+EXPORT_SYMBOL(__netdev_alloc_page);
+EXPORT_SYMBOL(skb_add_rx_frag);
 EXPORT_SYMBOL(pskb_copy);
 EXPORT_SYMBOL(pskb_expand_head);
 EXPORT_SYMBOL(skb_checksum);
Index: linux-2.6/drivers/net/sky2.c
===
--- linux-2.6.orig/drivers/net/sky2.c
+++ linux-2.6/drivers/net/sky2.c
@@ -1198,7 +1198,7 @@ static struct sk_buff *sky2_rx_alloc(str
}
 
	for (i = 0; i < sky2->rx_nfrags; i++) {
-  

[PATCH 12/29] net: wrap sk-sk_backlog_rcv()

2007-12-14 Thread Peter Zijlstra
Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h   |5 +
 net/core/sock.c  |4 ++--
 net/ipv4/tcp.c   |2 +-
 net/ipv4/tcp_timer.c |2 +-
 4 files changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -485,6 +485,11 @@ static inline void sk_add_backlog(struct
	skb->next = NULL;
 }
 
+static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	return sk->sk_backlog_rcv(sk, skb);
+}
+
 #define sk_wait_event(__sk, __timeo, __condition)  \
({  int __rc;   \
release_sock(__sk); \
Index: linux-2.6/net/core/sock.c
===
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -320,7 +320,7 @@ int sk_receive_skb(struct sock *sk, stru
 */
	mutex_acquire(&sk->sk_lock.dep_map, 0, 1, _RET_IP_);
 
-	rc = sk->sk_backlog_rcv(sk, skb);
+   rc = sk_backlog_rcv(sk, skb);
 
	mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
} else
@@ -1312,7 +1312,7 @@ static void __release_sock(struct sock *
		struct sk_buff *next = skb->next;

		skb->next = NULL;
-		sk->sk_backlog_rcv(sk, skb);
+   sk_backlog_rcv(sk, skb);
 
/*
 * We are in process context here with softirqs
Index: linux-2.6/net/ipv4/tcp.c
===
--- linux-2.6.orig/net/ipv4/tcp.c
+++ linux-2.6/net/ipv4/tcp.c
@@ -1134,7 +1134,7 @@ static void tcp_prequeue_process(struct 
 * necessary */
local_bh_disable();
	while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-		sk->sk_backlog_rcv(sk, skb);
+   sk_backlog_rcv(sk, skb);
local_bh_enable();
 
/* Clear memory counter. */
Index: linux-2.6/net/ipv4/tcp_timer.c
===
--- linux-2.6.orig/net/ipv4/tcp_timer.c
+++ linux-2.6/net/ipv4/tcp_timer.c
@@ -196,7 +196,7 @@ static void tcp_delack_timer(unsigned lo
NET_INC_STATS_BH(LINUX_MIB_TCPSCHEDULERFAILED);
 
		while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
-			sk->sk_backlog_rcv(sk, skb);
+   sk_backlog_rcv(sk, skb);
 
		tp->ucopy.memory = 0;
}

--



[PATCH 16/29] netvm: INET reserves.

2007-12-14 Thread Peter Zijlstra
Add reserves for INET.

The two big users seem to be the route cache and ip-fragment cache.

Reserve the route cache under generic RX reserve, its usage is bounded by
the high reclaim watermark, and thus does not need further accounting.

Reserve the ip-fragment caches under SKB data reserve; these add to the
SKB RX limit. By ensuring we can at least receive as much data as fits in
the reassembly line we avoid fragment-attack deadlocks.

Use proc conv() routines to update these limits and return -ENOMEM to user
space.

Adds to the reserve tree:

  total network reserve  
network TX reserve   
  protocol TX pages  
network RX reserve   
+ IPv6 route cache   
+ IPv4 route cache   
  SKB data reserve   
+   IPv6 fragment cache  
+   IPv4 fragment cache  
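
A toy user-space rendering of how that tree is strung together with the
patch-10 primitives (mem_reserve_init()/mem_reserve_connect() collapsed into
a parent pointer; the printing is only for illustration):

#include <stdio.h>

struct mem_reserve { struct mem_reserve *parent; const char *name; };

static void mem_reserve_init(struct mem_reserve *r, const char *name,
			     struct mem_reserve *parent)
{
	r->name = name;
	r->parent = parent;
}

/* Walk to the root and print the path down to this reserve. */
static void print_path(struct mem_reserve *r)
{
	if (r->parent) {
		print_path(r->parent);
		printf(" / ");
	}
	printf("%s", r->name);
}

int main(void)
{
	struct mem_reserve total, tx, rx, skb, ipv4_route, ipv4_frag;

	mem_reserve_init(&total, "total network reserve", NULL);
	mem_reserve_init(&tx, "network TX reserve", &total);
	mem_reserve_init(&rx, "network RX reserve", &total);
	mem_reserve_init(&ipv4_route, "IPv4 route cache", &rx);
	mem_reserve_init(&skb, "SKB data reserve", &rx);
	mem_reserve_init(&ipv4_frag, "IPv4 fragment cache", &skb);
	print_path(&ipv4_frag);
	printf("\n");
	return 0;
}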

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 net/ipv4/ip_fragment.c |7 
 net/ipv4/route.c   |   64 +++--
 net/ipv4/sysctl_net_ipv4.c |   57 ++--
 net/ipv6/reassembly.c  |7 
 net/ipv6/route.c   |   64 +++--
 net/ipv6/sysctl_net_ipv6.c |   57 ++--
 6 files changed, 248 insertions(+), 8 deletions(-)

Index: linux-2.6/net/ipv4/sysctl_net_ipv4.c
===
--- linux-2.6.orig/net/ipv4/sysctl_net_ipv4.c
+++ linux-2.6/net/ipv4/sysctl_net_ipv4.c
@@ -21,6 +21,7 @@
 #include <net/tcp.h>
 #include <net/cipso_ipv4.h>
 #include <net/inet_frag.h>
+#include <linux/reserve.h>
 
 static int zero;
 static int tcp_retr1_max = 255;
@@ -192,6 +193,57 @@ static int strategy_allowed_congestion_c
 
 }
 
+static int ipv4_frag_bytes;
+extern struct mem_reserve ipv4_frag_reserve;
+
+static int proc_dointvec_fragment(struct ctl_table *table, int write,
+   struct file *filp, void __user *buffer, size_t *lenp,
+   loff_t *ppos)
+{
+   int old_bytes, ret;
+
+   if (!write)
+   ipv4_frag_bytes = ip4_frags_ctl.high_thresh;
+   old_bytes = ipv4_frag_bytes;
+
+   ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmalloc_set(&ipv4_frag_reserve,
+					      ipv4_frag_bytes);
+   if (!ret)
+   ip4_frags_ctl.high_thresh = ipv4_frag_bytes;
+   else
+   ipv4_frag_bytes = old_bytes;
+   }
+
+   return ret;
+}
+
+static int sysctl_intvec_fragment(struct ctl_table *table,
+   int __user *name, int nlen,
+   void __user *oldval, size_t __user *oldlenp,
+   void __user *newval, size_t newlen)
+{
+   int old_bytes, ret;
+	int write = (newval && newlen);
+
+   if (!write)
+   ipv4_frag_bytes = ip4_frags_ctl.high_thresh;
+   old_bytes = ipv4_frag_bytes;
+
+   ret = sysctl_intvec(table, name, nlen, oldval, oldlenp, newval, newlen);
+
+	if (!ret && write) {
+		ret = mem_reserve_kmalloc_set(&ipv4_frag_reserve,
+					      ipv4_frag_bytes);
+   if (!ret)
+   ip4_frags_ctl.high_thresh = ipv4_frag_bytes;
+   else
+   ipv4_frag_bytes = old_bytes;
+   }
+
+   return ret;
+}
+
 static struct ctl_table ipv4_table[] = {
{
.ctl_name   = NET_IPV4_TCP_TIMESTAMPS,
@@ -285,10 +337,11 @@ static struct ctl_table ipv4_table[] = {
{
.ctl_name   = NET_IPV4_IPFRAG_HIGH_THRESH,
		.procname	= "ipfrag_high_thresh",
-		.data		= &ip4_frags_ctl.high_thresh,
+		.data		= &ipv4_frag_bytes,
.maxlen = sizeof(int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec
+   .proc_handler   = proc_dointvec_fragment,
+   .strategy   = sysctl_intvec_fragment,
},
{
.ctl_name   = NET_IPV4_IPFRAG_LOW_THRESH,
Index: linux-2.6/net/ipv6/sysctl_net_ipv6.c
===
--- linux-2.6.orig/net/ipv6/sysctl_net_ipv6.c
+++ linux-2.6/net/ipv6/sysctl_net_ipv6.c
@@ -13,6 +13,58 @@
 #include <net/ipv6.h>
 #include <net/addrconf.h>
 #include <net/inet_frag.h>
+#include <linux/reserve.h>
+
+static int ipv6_frag_bytes;
+extern struct mem_reserve ipv6_frag_reserve;
+
+static int proc_dointvec_fragment(struct ctl_table *table, int write,
+   struct file *filp, void __user *buffer, size_t *lenp,
+   loff_t *ppos)
+{
+   int old_bytes, ret;
+
+   if (!write)
+   ipv6_frag_bytes = ip6_frags_ctl.high_thresh;
+   old_bytes = ipv6_frag_bytes;
+
+   ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+
+	if (!ret && write) {
+   ret = 

[PATCH 15/29] netvm: network reserve infrastructure

2007-12-14 Thread Peter Zijlstra
Provide the basic infrastructure to reserve and charge/account network memory.

We provide the following reserve tree:

1)  total network reserve
2)network TX reserve
3)  protocol TX pages
4)network RX reserve
5)  SKB data reserve

[1] is used to make all the network reserves a single subtree, for easy
manipulation.

[2] and [4] are merely for aesthetic reasons.

The TX pages reserve [3] is assumed to be bounded, since it is the upper bound
of memory that can be used for sending pages (not quite true, but good enough).

The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data
against in the fallback path.

The consumers for these reserves are sockets marked with:
  SOCK_MEMALLOC

Such sockets are to be used to service the VM (iow. to swap over). They
must be handled kernel side; exposing such a socket to user-space is a BUG.
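
A compilable model of the socket-side bookkeeping (the real
sk_set_memalloc()/sk_clear_memalloc() also adjust the reserves under a mutex;
the flag bit here is made up): the point is that the hot RX path only has to
read one counter.

#include <stdio.h>

#define SOCK_MEMALLOC (1UL << 3)        /* illustrative bit */

struct sock { unsigned long flags; };

static int memalloc_socks;              /* atomic_t in the kernel */

static int sk_set_memalloc(struct sock *sk)
{
	sk->flags |= SOCK_MEMALLOC;
	return ++memalloc_socks == 1;   /* first swap socket? */
}

static int sk_clear_memalloc(struct sock *sk)
{
	sk->flags &= ~SOCK_MEMALLOC;
	return --memalloc_socks == 0;   /* last one gone? */
}

/* hot path: one read decides whether RX needs reserve accounting */
static int sk_memalloc_socks(void)
{
	return memalloc_socks;
}

int main(void)
{
	struct sock nfs_sk = { 0 };

	sk_set_memalloc(&nfs_sk);
	printf("reserve mode: %d\n", sk_memalloc_socks());
	sk_clear_memalloc(&nfs_sk);
	printf("reserve mode: %d\n", sk_memalloc_socks());
	return 0;
}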

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h |   35 +++-
 net/Kconfig|3 +
 net/core/sock.c|  113 +
 3 files changed, 150 insertions(+), 1 deletion(-)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -51,6 +51,7 @@
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/mm.h>
 #include <linux/security.h>
+#include <linux/reserve.h>

 #include <linux/filter.h>
 
@@ -403,6 +404,7 @@ enum sock_flags {
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+   SOCK_MEMALLOC, /* the VM depends on us - make sure we're serviced */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -425,9 +427,40 @@ static inline int sock_flag(struct sock 
	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_has_memalloc(struct sock *sk)
+{
+   return sock_flag(sk, SOCK_MEMALLOC);
+}
+
+/*
+ * Guestimate the per request queue TX upper bound.
+ *
+ * Max packet size is 64k, and we need to reserve that much since the data
+ * might need to bounce it. Double it to be on the safe side.
+ */
+#define TX_RESERVE_PAGES DIV_ROUND_UP(2*65536, PAGE_SIZE)
+
+extern atomic_t memalloc_socks;
+
+extern struct mem_reserve net_rx_reserve;
+extern struct mem_reserve net_skb_reserve;
+
+static inline int sk_memalloc_socks(void)
+{
+   return atomic_read(&memalloc_socks);
+}
+
+extern int rx_emergency_get(int bytes);
+extern int rx_emergency_get_overcommit(int bytes);
+extern void rx_emergency_put(int bytes);
+
+extern int sk_adjust_memalloc(int socks, long tx_reserve_pages);
+extern int sk_set_memalloc(struct sock *sk);
+extern int sk_clear_memalloc(struct sock *sk);
+
 static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
 {
-   return gfp_mask;
+   return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
 }
 
 static inline void sk_acceptq_removed(struct sock *sk)
Index: linux-2.6/net/core/sock.c
===
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -112,6 +112,7 @@
 #include linux/tcp.h
 #include linux/init.h
 #include linux/highmem.h
+#include linux/reserve.h
 
 #include asm/uaccess.h
 #include asm/system.h
@@ -213,6 +214,111 @@ __u32 sysctl_rmem_default __read_mostly 
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 
+atomic_t memalloc_socks;
+
+static struct mem_reserve net_reserve;
+struct mem_reserve net_rx_reserve;
+struct mem_reserve net_skb_reserve;
+static struct mem_reserve net_tx_reserve;
+static struct mem_reserve net_tx_pages;
+
+EXPORT_SYMBOL_GPL(net_rx_reserve); /* modular ipv6 only */
+EXPORT_SYMBOL_GPL(net_skb_reserve); /* modular ipv6 only */
+
+/*
+ * is there room for another emergency packet?
+ */
+static int __rx_emergency_get(int bytes, bool overcommit)
+{
+   return mem_reserve_kmalloc_charge(&net_skb_reserve, bytes, overcommit);
+}
+
+int rx_emergency_get(int bytes)
+{
+   return __rx_emergency_get(bytes, false);
+}
+
+int rx_emergency_get_overcommit(int bytes)
+{
+   return __rx_emergency_get(bytes, true);
+}
+
+void rx_emergency_put(int bytes)
+{
+   mem_reserve_kmalloc_charge(&net_skb_reserve, -bytes, 0);
+}
+
+/**
+ * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ * @socks: number of new %SOCK_MEMALLOC sockets
+ * @tx_reserve_pages: number of pages to (un)reserve for TX
+ *
+ * This function adjusts the memalloc reserve based on system demand.
+ * The RX reserve is a limit, and only added once, not for each socket.
+ *
+ * NOTE:
+ *    @tx_reserve_pages is an upper bound of memory used for TX, hence
+ *    we need not account the pages like we do for RX pages.

[PATCH 29/29] nfs: fix various memory recursions possible with swap over NFS.

2007-12-14 Thread Peter Zijlstra
GFP_NOFS is not enough, since swap traffic is IO, hence fall back to GFP_NOIO.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/pagelist.c |2 +-
 fs/nfs/write.c|6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -44,7 +44,7 @@ static struct kmem_cache *nfs_wdata_cach
 
 struct nfs_write_data *nfs_commit_alloc(void)
 {
-   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -68,7 +68,7 @@ void nfs_commit_free(struct nfs_write_da
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
+   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOIO);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -77,7 +77,7 @@ struct nfs_write_data *nfs_writedata_all
if (pagecount <= ARRAY_SIZE(p->page_array))
p->pagevec = p->page_array;
else {
-   p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
+   p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOIO);
if (!p->pagevec) {
kmem_cache_free(nfs_wdata_cachep, p);
p = NULL;
Index: linux-2.6/fs/nfs/pagelist.c
===
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -27,7 +27,7 @@ static inline struct nfs_page *
 nfs_page_alloc(void)
 {
struct nfs_page *p;
-   p = kmem_cache_alloc(nfs_page_cachep, GFP_KERNEL);
+   p = kmem_cache_alloc(nfs_page_cachep, GFP_NOIO);
if (p) {
memset(p, 0, sizeof(*p));
INIT_LIST_HEAD(&p->wb_list);

--

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 25/29] nfs: remove mempools

2007-12-14 Thread Peter Zijlstra
With the introduction of the shared dirty page accounting in .19, NFS should
not be able to surprise the VM with all dirty pages. Thus it should always be
able to free some memory. Hence there is no more need for mempools.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/read.c  |   15 +++
 fs/nfs/write.c |   27 +--
 2 files changed, 8 insertions(+), 34 deletions(-)

Index: linux-2.6/fs/nfs/read.c
===
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -33,13 +33,10 @@ static const struct rpc_call_ops nfs_rea
 static const struct rpc_call_ops nfs_read_full_ops;
 
 static struct kmem_cache *nfs_rdata_cachep;
-static mempool_t *nfs_rdata_mempool;
-
-#define MIN_POOL_READ  (32)
 
 struct nfs_read_data *nfs_readdata_alloc(unsigned int pagecount)
 {
-   struct nfs_read_data *p = mempool_alloc(nfs_rdata_mempool, GFP_NOFS);
+   struct nfs_read_data *p = kmem_cache_alloc(nfs_rdata_cachep, GFP_NOFS);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -50,7 +47,7 @@ struct nfs_read_data *nfs_readdata_alloc
else {
p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
if (!p->pagevec) {
-   mempool_free(p, nfs_rdata_mempool);
+   kmem_cache_free(nfs_rdata_cachep, p);
p = NULL;
}
}
@@ -63,7 +60,7 @@ static void nfs_readdata_rcu_free(struct
struct nfs_read_data *p = container_of(head, struct nfs_read_data, task.u.tk_rcu);
if (p && (p->pagevec != &p->page_array[0]))
kfree(p->pagevec);
-   mempool_free(p, nfs_rdata_mempool);
+   kmem_cache_free(nfs_rdata_cachep, p);
 }
 
 static void nfs_readdata_free(struct nfs_read_data *rdata)
@@ -597,16 +594,10 @@ int __init nfs_init_readpagecache(void)
if (nfs_rdata_cachep == NULL)
return -ENOMEM;
 
-   nfs_rdata_mempool = mempool_create_slab_pool(MIN_POOL_READ,
-nfs_rdata_cachep);
-   if (nfs_rdata_mempool == NULL)
-   return -ENOMEM;
-
return 0;
 }
 
 void nfs_destroy_readpagecache(void)
 {
-   mempool_destroy(nfs_rdata_mempool);
kmem_cache_destroy(nfs_rdata_cachep);
 }
Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -28,9 +28,6 @@
 
 #define NFSDBG_FACILITYNFSDBG_PAGECACHE
 
-#define MIN_POOL_WRITE (32)
-#define MIN_POOL_COMMIT(4)
-
 /*
  * Local function declarations
  */
@@ -44,12 +41,10 @@ static const struct rpc_call_ops nfs_wri
 static const struct rpc_call_ops nfs_commit_ops;
 
 static struct kmem_cache *nfs_wdata_cachep;
-static mempool_t *nfs_wdata_mempool;
-static mempool_t *nfs_commit_mempool;
 
 struct nfs_write_data *nfs_commit_alloc(void)
 {
-   struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
+   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -63,7 +58,7 @@ static void nfs_commit_rcu_free(struct r
struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu);
if (p && (p->pagevec != &p->page_array[0]))
kfree(p->pagevec);
-   mempool_free(p, nfs_commit_mempool);
+   kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 void nfs_commit_free(struct nfs_write_data *wdata)
@@ -73,7 +68,7 @@ void nfs_commit_free(struct nfs_write_da
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-   struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS);
+   struct nfs_write_data *p = kmem_cache_alloc(nfs_wdata_cachep, GFP_NOFS);
 
if (p) {
memset(p, 0, sizeof(*p));
@@ -84,7 +79,7 @@ struct nfs_write_data *nfs_writedata_all
else {
p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
if (!p->pagevec) {
-   mempool_free(p, nfs_wdata_mempool);
+   kmem_cache_free(nfs_wdata_cachep, p);
p = NULL;
}
}
@@ -97,7 +92,7 @@ static void nfs_writedata_rcu_free(struc
struct nfs_write_data *p = container_of(head, struct nfs_write_data, task.u.tk_rcu);
if (p && (p->pagevec != &p->page_array[0]))
kfree(p->pagevec);
-   mempool_free(p, nfs_wdata_mempool);
+   kmem_cache_free(nfs_wdata_cachep, p);
 }
 
 static void nfs_writedata_free(struct nfs_write_data *wdata)
@@ -1474,16 +1469,6 @@ int __init nfs_init_writepagecache(void)
if (nfs_wdata_cachep == NULL)
return -ENOMEM;
 
-  

[PATCH 18/29] netvm: filter emergency skbs.

2007-12-14 Thread Peter Zijlstra
Toss all emergency packets not for a SOCK_MEMALLOC socket. This ensures our
precious memory reserve doesn't get stuck waiting for user-space.

The correctness of this approach relies on the fact that networks must be
assumed lossy.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h |3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -930,6 +930,9 @@ static inline int sk_filter(struct sock 
 {
int err;
struct sk_filter *filter;
+
+   if (skb_emergency(skb) && !sk_has_memalloc(sk))
+   return -ENOMEM;

err = security_sock_rcv_skb(sk, skb);
if (err)

--

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 26/29] nfs: teach the NFS client how to treat PG_swapcache pages

2007-12-14 Thread Peter Zijlstra
Replace all relevant occurrences of page->index and page->mapping in the NFS
client with the new page_file_index() and page_file_mapping() functions.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/file.c |8 
 fs/nfs/internal.h |7 ---
 fs/nfs/pagelist.c |6 +++---
 fs/nfs/read.c |6 +++---
 fs/nfs/write.c|   49 +
 5 files changed, 39 insertions(+), 37 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -357,7 +357,7 @@ static void nfs_invalidate_page(struct p
if (offset != 0)
return;
/* Cancel any unstarted writes on this page */
-   nfs_wb_page_cancel(page->mapping->host, page);
+   nfs_wb_page_cancel(page_file_mapping(page)->host, page);
 }
 
 static int nfs_release_page(struct page *page, gfp_t gfp)
@@ -368,7 +368,7 @@ static int nfs_release_page(struct page 
 
 static int nfs_launder_page(struct page *page)
 {
-   return nfs_wb_page(page->mapping->host, page);
+   return nfs_wb_page(page_file_mapping(page)->host, page);
 }
 
 const struct address_space_operations nfs_file_aops = {
@@ -397,13 +397,13 @@ static int nfs_vm_page_mkwrite(struct vm
loff_t offset;
 
lock_page(page);
-   mapping = page->mapping;
+   mapping = page_file_mapping(page);
if (mapping != vma->vm_file->f_path.dentry->d_inode->i_mapping) {
unlock_page(page);
return -EINVAL;
}
pagelen = nfs_page_length(page);
-   offset = (loff_t)page->index << PAGE_CACHE_SHIFT;
+   offset = (loff_t)page_file_index(page) << PAGE_CACHE_SHIFT;
unlock_page(page);
 
/*
Index: linux-2.6/fs/nfs/pagelist.c
===
--- linux-2.6.orig/fs/nfs/pagelist.c
+++ linux-2.6/fs/nfs/pagelist.c
@@ -77,11 +77,11 @@ nfs_create_request(struct nfs_open_conte
 * update_nfs_request below if the region is not locked. */
req-wb_page= page;
atomic_set(req-wb_complete, 0);
-   req->wb_index   = page->index;
+   req->wb_index   = page_file_index(page);
page_cache_get(page);
BUG_ON(PagePrivate(page));
BUG_ON(!PageLocked(page));
-   BUG_ON(page->mapping->host != inode);
+   BUG_ON(page_file_mapping(page)->host != inode);
req->wb_offset  = offset;
req->wb_pgbase  = offset;
req->wb_bytes   = count;
@@ -383,7 +383,7 @@ void nfs_pageio_cond_complete(struct nfs
  * nfs_scan_list - Scan a list for matching requests
  * @nfsi: NFS inode
  * @dst: Destination list
- * @idx_start: lower bound of page->index to scan
+ * @idx_start: lower bound of page_file_index(page) to scan
  * @npages: idx_start + npages sets the upper bound to scan.
  * @tag: tag to scan for
  *
Index: linux-2.6/fs/nfs/read.c
===
--- linux-2.6.orig/fs/nfs/read.c
+++ linux-2.6/fs/nfs/read.c
@@ -460,11 +460,11 @@ static const struct rpc_call_ops nfs_rea
 int nfs_readpage(struct file *file, struct page *page)
 {
struct nfs_open_context *ctx;
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = page_file_mapping(page)->host;
int error;
 
dprintk(NFS: nfs_readpage (%p [EMAIL PROTECTED])\n,
-   page, PAGE_CACHE_SIZE, page->index);
+   page, PAGE_CACHE_SIZE, page_file_index(page));
nfs_inc_stats(inode, NFSIOS_VFSREADPAGE);
nfs_add_stats(inode, NFSIOS_READPAGES, 1);
 
@@ -511,7 +511,7 @@ static int
 readpage_async_filler(void *data, struct page *page)
 {
struct nfs_readdesc *desc = (struct nfs_readdesc *)data;
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *new;
unsigned int len;
int error;
Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -126,7 +126,7 @@ static struct nfs_page *nfs_page_find_re
 
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *req = NULL;
 
spin_lock(inode-i_lock);
@@ -138,13 +138,13 @@ static struct nfs_page *nfs_page_find_re
 /* Adjust the file length if we're writing beyond the end */
 static void nfs_grow_file(struct page *page, unsigned int offset, unsigned int count)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = page_file_mapping(page)->host;
loff_t end, i_size = i_size_read(inode);
pgoff_t end_index = (i_size - 1)  PAGE_CACHE_SHIFT;
 
-   if (i_size > 0 && page->index < end_index)
+   if (i_size > 0 && 

[PATCH 24/29] mm: methods for teaching filesystems about PG_swapcache pages

2007-12-14 Thread Peter Zijlstra
In order to teach filesystems to handle swap cache pages, two new page
functions are introduced:

  pgoff_t page_file_index(struct page *);
  struct address_space *page_file_mapping(struct page *);

page_file_index - gives the offset of this page in the file in PAGE_CACHE_SIZE
blocks. Like page->index does for mapped pages, this function also gives the
correct index for PG_swapcache pages.

page_file_mapping - gives the mapping backing the actual page; that is, for
swap cache pages it will give swap_file->f_mapping.

page_offset() is modified to use page_file_index(), so that it will give the
expected result, even for PG_swapcache pages.
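As an illustration of what a converted call site looks like (a sketch only;
example_writepage() and do_write() are hypothetical stand-ins, not functions
from this patch):

/* For a regular pagecache page these helpers are equivalent to
 * page->mapping / page->index; for a PG_swapcache page only the
 * page_file_*() forms give the filesystem usable answers, since
 * ->mapping/->index are owned by the swap code there. */
static int example_writepage(struct page *page)
{
	struct address_space *mapping = page_file_mapping(page); /* not page->mapping */
	struct inode *inode = mapping->host;
	loff_t offset = (loff_t)page_file_index(page) << PAGE_CACHE_SHIFT; /* not page->index */

	return do_write(inode, offset);	/* hypothetical I/O helper */
}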

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/mm.h  |   26 ++
 include/linux/pagemap.h |2 +-
 2 files changed, 27 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -14,6 +14,7 @@
 #include <linux/mm_types.h>
 #include <linux/security.h>
 #include <linux/swap.h>
+#include <linux/fs.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -608,6 +609,16 @@ static inline struct swap_info_struct *p
return get_swap_info_struct(swp_type(swap));
 }
 
+static inline
+struct address_space *page_file_mapping(struct page *page)
+{
+#ifdef CONFIG_SWAP_FILE
+   if (unlikely(PageSwapCache(page)))
+   return page_swap_info(page)->swap_file->f_mapping;
+#endif
+   return page->mapping;
+}
+
 static inline int PageAnon(struct page *page)
 {
return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
@@ -625,6 +636,21 @@ static inline pgoff_t page_index(struct 
 }
 
 /*
+ * Return the file index of the page. Regular pagecache pages use ->index
+ * whereas swapcache pages use swp_offset(->private)
+ */
+static inline pgoff_t page_file_index(struct page *page)
+{
+#ifdef CONFIG_SWAP_FILE
+   if (unlikely(PageSwapCache(page))) {
+   swp_entry_t swap = { .val = page_private(page) };
+   return swp_offset(swap);
+   }
+#endif
+   return page->index;
+}
+
+/*
  * The atomic page-_mapcount, like _count, starts from -1:
  * so that transitions both from it and to it can be tracked,
  * using atomic_inc_and_test and atomic_add_negative(-1).
Index: linux-2.6/include/linux/pagemap.h
===
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -145,7 +145,7 @@ extern void __remove_from_page_cache(str
  */
 static inline loff_t page_offset(struct page *page)
 {
-   return ((loff_t)page->index) << PAGE_CACHE_SHIFT;
+   return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
 }
 
 static inline pgoff_t linear_page_index(struct vm_area_struct *vma,

--

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 22/29] mm: prepare swap entry methods for use in page methods

2007-12-14 Thread Peter Zijlstra
Move around the swap entry methods in preparation for use from
page methods.

Also provide a function to obtain the swap_info_struct backing
a swap cache page.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/mm.h  |8 +++
 include/linux/swap.h|   49 
 include/linux/swapops.h |   44 ---
 mm/swapfile.c   |1 
 4 files changed, 58 insertions(+), 44 deletions(-)

Index: linux-2.6/include/linux/mm.h
===
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -13,6 +13,7 @@
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
 #include <linux/security.h>
+#include <linux/swap.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -600,6 +601,13 @@ static inline struct address_space *page
return mapping;
 }
 
+static inline struct swap_info_struct *page_swap_info(struct page *page)
+{
+   swp_entry_t swap = { .val = page_private(page) };
+   BUG_ON(!PageSwapCache(page));
+   return get_swap_info_struct(swp_type(swap));
+}
+
 static inline int PageAnon(struct page *page)
 {
return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
Index: linux-2.6/include/linux/swap.h
===
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -80,6 +80,50 @@ typedef struct {
 } swp_entry_t;
 
 /*
+ * swapcache pages are stored in the swapper_space radix tree.  We want to
+ * get good packing density in that tree, so the index should be dense in
+ * the low-order bits.
+ *
+ * We arrange the `type' and `offset' fields so that `type' is at the five
+ * high-order bits of the swp_entry_t and `offset' is right-aligned in the
+ * remaining bits.
+ *
+ * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
+ */
+#define SWP_TYPE_SHIFT(e)  (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
+#define SWP_OFFSET_MASK(e) ((1UL << SWP_TYPE_SHIFT(e)) - 1)
+
+/*
+ * Store a type+offset into a swp_entry_t in an arch-independent format
+ */
+static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
+{
+   swp_entry_t ret;
+
+   ret.val = (type << SWP_TYPE_SHIFT(ret)) |
+   (offset & SWP_OFFSET_MASK(ret));
+   return ret;
+}
+
+/*
+ * Extract the `type' field from a swp_entry_t.  The swp_entry_t is in
+ * arch-independent format
+ */
+static inline unsigned swp_type(swp_entry_t entry)
+{
+   return (entry.val >> SWP_TYPE_SHIFT(entry));
+}
+
+/*
+ * Extract the `offset' field from a swp_entry_t.  The swp_entry_t is in
+ * arch-independent format
+ */
+static inline pgoff_t swp_offset(swp_entry_t entry)
+{
+   return entry.val & SWP_OFFSET_MASK(entry);
+}
+
+/*
* current->reclaim_state points to one of these when a task is running
  * memory reclaim
  */
@@ -321,6 +365,11 @@ static inline struct page *lookup_swap_c
return NULL;
 }
 
+static inline struct swap_info_struct *get_swap_info_struct(unsigned type)
+{
+   return NULL;
+}
+
 #define can_share_swap_page(p) (page_mapcount(p) == 1)
 
 static inline int move_to_swap_cache(struct page *page, swp_entry_t entry)
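A worked example of the encoding above, as a stand-alone user-space program.
MAX_SWAPFILES_SHIFT == 5 and a 64-bit long are assumptions for illustration;
on such a system the `type' lives in the top 5 bits and the `offset' in the
low 59 bits:

#include <assert.h>

typedef unsigned long pgoff_t;
typedef struct { unsigned long val; } swp_entry_t;

#define MAX_SWAPFILES_SHIFT 5	/* assumed value, for illustration */
#define SWP_TYPE_SHIFT(e)   (sizeof((e).val) * 8 - MAX_SWAPFILES_SHIFT)
#define SWP_OFFSET_MASK(e)  ((1UL << SWP_TYPE_SHIFT(e)) - 1)

static swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
{
	swp_entry_t ret;

	ret.val = (type << SWP_TYPE_SHIFT(ret)) | (offset & SWP_OFFSET_MASK(ret));
	return ret;
}

int main(void)
{
	swp_entry_t e = swp_entry(3, 0x1234);	/* swap device 3, slot 0x1234 */

	assert((e.val >> SWP_TYPE_SHIFT(e)) == 3);	/* swp_type() */
	assert((e.val & SWP_OFFSET_MASK(e)) == 0x1234);	/* swp_offset() */
	return 0;
}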
Index: linux-2.6/include/linux/swapops.h
===
--- linux-2.6.orig/include/linux/swapops.h
+++ linux-2.6/include/linux/swapops.h
@@ -1,47 +1,3 @@
-/*
- * swapcache pages are stored in the swapper_space radix tree.  We want to
- * get good packing density in that tree, so the index should be dense in
- * the low-order bits.
- *
- * We arrange the `type' and `offset' fields so that `type' is at the five
- * high-order bits of the swp_entry_t and `offset' is right-aligned in the
- * remaining bits.
- *
- * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
- */
-#define SWP_TYPE_SHIFT(e)  (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
-#define SWP_OFFSET_MASK(e) ((1UL << SWP_TYPE_SHIFT(e)) - 1)
-
-/*
- * Store a type+offset into a swp_entry_t in an arch-independent format
- */
-static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
-{
-   swp_entry_t ret;
-
-   ret.val = (type << SWP_TYPE_SHIFT(ret)) |
-   (offset & SWP_OFFSET_MASK(ret));
-   return ret;
-}
-
-/*
- * Extract the `type' field from a swp_entry_t.  The swp_entry_t is in
- * arch-independent format
- */
-static inline unsigned swp_type(swp_entry_t entry)
-{
-   return (entry.val >> SWP_TYPE_SHIFT(entry));
-}
-
-/*
- * Extract the `offset' field from a swp_entry_t.  The swp_entry_t is in
- * arch-independent format
- */
-static inline pgoff_t swp_offset(swp_entry_t entry)
-{
-   return entry.val & SWP_OFFSET_MASK(entry);
-}
-
 /* check whether a pte points to a swap entry */
 static inline int is_swap_pte(pte_t pte)
 {
Index: linux-2.6/mm/swapfile.c

[PATCH 21/29] netvm: skb processing

2007-12-14 Thread Peter Zijlstra
In order to make sure emergency packets receive all the memory needed to
proceed, ensure processing of emergency SKBs happens under PF_MEMALLOC.

Use the (new) sk_backlog_rcv() wrapper to ensure this for backlog processing.

Skip taps, since those are user-space again.
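The PF_MEMALLOC scoping pattern the patch applies to netif_receive_skb(),
reduced to its essentials (a sketch only; tsk_restore_flags() comes from
elsewhere in this series and do_protocol_processing() is a hypothetical
stand-in):

static int process_one_skb(struct sk_buff *skb)
{
	unsigned long pflags = current->flags;
	int ret;

	if (skb_emergency(skb))
		current->flags |= PF_MEMALLOC;	/* allocations below may dip into reserves */

	ret = do_protocol_processing(skb);

	/* clears PF_MEMALLOC only if it was not already set on entry */
	tsk_restore_flags(current, pflags, PF_MEMALLOC);
	return ret;
}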

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h |5 
 net/core/dev.c |   59 +++--
 net/core/sock.c|   18 
 3 files changed, 76 insertions(+), 6 deletions(-)

Index: linux-2.6/net/core/dev.c
===
--- linux-2.6.orig/net/core/dev.c
+++ linux-2.6/net/core/dev.c
@@ -2008,6 +2008,30 @@ out:
 }
 #endif
 
+/*
+ * Filter the protocols for which the reserves are adequate.
+ *
+ * Before adding a protocol make sure that it is either covered by the existing
+ * reserves, or add reserves covering the memory need of the new protocol's
+ * packet processing.
+ */
+static int skb_emergency_protocol(struct sk_buff *skb)
+{
+   if (skb_emergency(skb))
+   switch (skb->protocol) {
+   case __constant_htons(ETH_P_ARP):
+   case __constant_htons(ETH_P_IP):
+   case __constant_htons(ETH_P_IPV6):
+   case __constant_htons(ETH_P_8021Q):
+   break;
+
+   default:
+   return 0;
+   }
+
+   return 1;
+}
+
 /**
  * netif_receive_skb - process receive buffer from network
  * @skb: buffer to process
@@ -2029,10 +2053,23 @@ int netif_receive_skb(struct sk_buff *sk
struct net_device *orig_dev;
int ret = NET_RX_DROP;
__be16 type;
+   unsigned long pflags = current->flags;
+
+   /* Emergency skbs are special, they should
+*  - be delivered to SOCK_MEMALLOC sockets only
+*  - stay away from userspace
+*  - have bounded memory usage
+*
+* Use PF_MEMALLOC as a poor man's memory pool - the grouping kind.
+* This saves us from propagating the allocation context down to all
+* allocation sites.
+*/
+   if (skb_emergency(skb))
+   current->flags |= PF_MEMALLOC;
 
/* if we've gotten here through NAPI, check netpoll */
if (netpoll_receive_skb(skb))
-   return NET_RX_DROP;
+   goto out;
 
if (!skb->tstamp.tv64)
net_timestamp(skb);
@@ -2043,7 +2080,7 @@ int netif_receive_skb(struct sk_buff *sk
orig_dev = skb_bond(skb);
 
if (!orig_dev)
-   return NET_RX_DROP;
+   goto out;
 
__get_cpu_var(netdev_rx_stat).total++;
 
@@ -2062,6 +2099,9 @@ int netif_receive_skb(struct sk_buff *sk
}
 #endif
 
+   if (skb_emergency(skb))
+   goto skip_taps;
+
list_for_each_entry_rcu(ptype, &ptype_all, list) {
if (!ptype->dev || ptype->dev == skb->dev) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
}
}
}
 
+skip_taps:
 #ifdef CONFIG_NET_CLS_ACT
skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
if (!skb)
-   goto out;
+   goto unlock;
 ncls:
 #endif
 
+   if (!skb_emergency_protocol(skb))
+   goto drop;
+
skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
if (!skb)
-   goto out;
+   goto unlock;
skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
if (!skb)
-   goto out;
+   goto unlock;
 
type = skb->protocol;
list_for_each_entry_rcu(ptype,
@@ -2098,6 +2142,7 @@ ncls:
if (pt_prev) {
ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
} else {
+drop:
kfree_skb(skb);
/* Jamal, now you will not able to escape explaining
 * me how you were going to use this. :-)
@@ -2105,8 +2150,10 @@ ncls:
ret = NET_RX_DROP;
}
 
-out:
+unlock:
rcu_read_unlock();
+out:
+   tsk_restore_flags(current, pflags, PF_MEMALLOC);
return ret;
 }
 
Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -529,8 +529,13 @@ static inline void sk_add_backlog(struct
skb->next = NULL;
 }
 
+extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+
 static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
 {
+   if (skb_emergency(skb))
+   return __sk_backlog_rcv(sk, skb);
+
return sk->sk_backlog_rcv(sk, skb);
 }
 
Index: linux-2.6/net/core/sock.c
===
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -319,6 +319,24 @@ int sk_clear_memalloc(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_clear_memalloc);

[PATCH 08/29] mm: system wide ALLOC_NO_WATERMARK

2007-12-14 Thread Peter Zijlstra
Change ALLOC_NO_WATERMARKS page allocation such that the reserves are system
wide - which they are per setup_per_zone_pages_min() - so when we scrape the
barrel, we do it properly.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/page_alloc.c |6 ++
 1 file changed, 6 insertions(+)

Index: linux-2.6/mm/page_alloc.c
===
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1638,6 +1638,12 @@ restart:
 rebalance:
if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
+   /*
+* break out of mempolicy boundaries
+*/
+   zonelist = NODE_DATA(numa_node_id())->node_zonelists +
+   gfp_zone(gfp_mask);
+
/* go through the zonelist yet again, ignoring mins */
page = get_page_from_freelist(gfp_mask, order, zonelist,
ALLOC_NO_WATERMARKS);

--

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/29] mm: slb: add knowledge of reserve pages

2007-12-14 Thread Peter Zijlstra
Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to them. This is done to ensure reserve pages don't
leak out and get consumed.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/slub_def.h |1 
 mm/slab.c|   59 +++
 mm/slub.c|   27 -
 3 files changed, 72 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -21,11 +21,12 @@
 #include <linux/ctype.h>
 #include <linux/kallsyms.h>
 #include <linux/memory.h>
+#include "internal.h"
 
 /*
  * Lock order:
  *   1. slab_lock(page)
- *   2. slab->list_lock
+ *   2. node->list_lock
  *
  *   The slab_lock protects operations on the object of a particular
  *   slab and its metadata in the page struct. If the slab lock
@@ -1071,7 +1072,7 @@ static void setup_object(struct kmem_cac
 }
 
 static noinline struct page *new_slab(struct kmem_cache *s,
-   gfp_t flags, int node)
+   gfp_t flags, int node, int *reserve)
 {
struct page *page;
struct kmem_cache_node *n;
@@ -1087,6 +1088,7 @@ static noinline struct page *new_slab(st
if (!page)
goto out;
 
+   *reserve = page->reserve;
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
@@ -1517,11 +1519,12 @@ static noinline unsigned long get_new_sl
 {
struct kmem_cache_cpu *c = *pc;
struct page *page;
+   int reserve;
 
if (gfpflags & __GFP_WAIT)
local_irq_enable();

-   page = new_slab(s, gfpflags, node);
+   page = new_slab(s, gfpflags, node, &reserve);

if (gfpflags & __GFP_WAIT)
local_irq_disable();
@@ -1530,6 +1533,7 @@ static noinline unsigned long get_new_sl
return 0;
 
*pc = c = get_cpu_slab(s, smp_processor_id());
+   c->reserve = reserve;
if (c->page)
flush_slab(s, c);
c->page = page;
@@ -1564,6 +1568,16 @@ static void *__slab_alloc(struct kmem_ca
local_irq_save(flags);
preempt_enable_no_resched();
 #endif
+   if (unlikely(c->reserve)) {
+   /*
+* If the current slab is a reserve slab and the current
+* allocation context does not allow access to the reserves we
+* must force an allocation to test the current levels.
+*/
+   if (!(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+   goto grow_slab;
+   }
+
if (likely(c->page)) {
state = slab_lock(c->page);
 
@@ -1586,7 +1600,7 @@ load_freelist:
 */
VM_BUG_ON(c->page->freelist == c->page->end);
 
-   if (unlikely(state & SLABDEBUG))
+   if (unlikely((state & SLABDEBUG) || c->reserve))
goto debug;
 
object = c->page->freelist;
@@ -1615,7 +1629,7 @@ grow_slab:
 /* Perform debugging */
 debug:
object = c->page->freelist;
-   if (!alloc_debug_processing(s, c->page, object, addr))
+   if ((state & SLABDEBUG) && !alloc_debug_processing(s, c->page, object, addr))
goto another_slab;
 
c->page->inuse++;
@@ -2156,10 +2170,11 @@ static struct kmem_cache_node *early_kme
struct page *page;
struct kmem_cache_node *n;
unsigned long flags;
+   int reserve;
 
BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
 
-   page = new_slab(kmalloc_caches, gfpflags, node);
+   page = new_slab(kmalloc_caches, gfpflags, node, &reserve);
 
BUG_ON(!page);
if (page_to_nid(page) != node) {
Index: linux-2.6/include/linux/slub_def.h
===
--- linux-2.6.orig/include/linux/slub_def.h
+++ linux-2.6/include/linux/slub_def.h
@@ -18,6 +18,7 @@ struct kmem_cache_cpu {
unsigned int offset;/* Freepointer offset (in word units) */
unsigned int objsize;   /* Size of an object (from kmem_cache) */
unsigned int objects;   /* Objects per slab (from kmem_cache) */
+   int reserve;/* Did the current page come from the reserve */
 };
 
 struct kmem_cache_node {
Index: linux-2.6/mm/slab.c
===
--- linux-2.6.orig/mm/slab.c
+++ linux-2.6/mm/slab.c
@@ -115,6 +115,8 @@
 #include   <asm/tlbflush.h>
 #include   <asm/page.h>
 
+#include   "internal.h"
+
 /*
 * DEBUG   - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON.
  *   0 for faster, smaller code (especially in the critical paths).
@@ -265,7 +267,8 @@ struct array_cache {
unsigned int avail;
unsigned int limit;
unsigned int batchcount;
-   unsigned int touched;
+   

[PATCH 27/29] nfs: disable data cache revalidation for swapfiles

2007-12-14 Thread Peter Zijlstra
Do as Trond suggested:
  http://lkml.org/lkml/2006/8/25/348

Disable NFS data cache revalidation on swap files since it doesn't really 
make sense to have other clients change the file while you are using it.

Thereby we can stop setting PG_private on swap pages, since there ought to
be no further races with invalidate_inode_pages2() to deal with.

And since we cannot set PG_private we cannot use page->private (which is
already used by PG_swapcache pages anyway) to store the nfs_page. Thus
augment the new nfs_page_find_request logic.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 fs/nfs/inode.c |6 
 fs/nfs/write.c |   73 ++---
 2 files changed, 65 insertions(+), 14 deletions(-)

Index: linux-2.6/fs/nfs/inode.c
===
--- linux-2.6.orig/fs/nfs/inode.c
+++ linux-2.6/fs/nfs/inode.c
@@ -758,6 +758,12 @@ int nfs_revalidate_mapping_nolock(struct
struct nfs_inode *nfsi = NFS_I(inode);
int ret = 0;
 
+   /*
+* swapfiles are not supposed to be shared.
+*/
+   if (IS_SWAPFILE(inode))
+   goto out;
+
if ((nfsi->cache_validity & NFS_INO_REVAL_PAGECACHE)
|| nfs_attribute_timeout(inode) || NFS_STALE(inode)) {
ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
Index: linux-2.6/fs/nfs/write.c
===
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -112,25 +112,62 @@ static void nfs_context_set_write_error(
set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
 }
 
-static struct nfs_page *nfs_page_find_request_locked(struct page *page)
+static struct nfs_page *
+__nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page, int get)
 {
struct nfs_page *req = NULL;
 
-   if (PagePrivate(page)) {
+   if (PagePrivate(page))
req = (struct nfs_page *)page_private(page);
-   if (req != NULL)
-   kref_get(&req->wb_kref);
-   }
+   else if (unlikely(PageSwapCache(page)))
+   req = radix_tree_lookup(&nfsi->nfs_page_tree, page_file_index(page));
+
+   if (get && req)
+   kref_get(&req->wb_kref);
+
return req;
 }
 
+static inline struct nfs_page *
+nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page)
+{
+   return __nfs_page_find_request_locked(nfsi, page, 1);
+}
+
+static int __nfs_page_has_request(struct page *page)
+{
+   struct inode *inode = page_file_mapping(page)->host;
+   struct nfs_page *req = NULL;
+
+   spin_lock(&inode->i_lock);
+   req = __nfs_page_find_request_locked(NFS_I(inode), page, 0);
+   spin_unlock(&inode->i_lock);
+
+   /*
+* hole here plugged by the caller holding onto PG_locked
+*/
+
+   return req != NULL;
+}
+
+static inline int nfs_page_has_request(struct page *page)
+{
+   if (PagePrivate(page))
+   return 1;
+
+   if (unlikely(PageSwapCache(page)))
+   return __nfs_page_has_request(page);
+
+   return 0;
+}
+
 static struct nfs_page *nfs_page_find_request(struct page *page)
 {
struct inode *inode = page_file_mapping(page)->host;
struct nfs_page *req = NULL;

spin_lock(&inode->i_lock);
-   req = nfs_page_find_request_locked(page);
+   req = nfs_page_find_request_locked(NFS_I(inode), page);
spin_unlock(&inode->i_lock);
return req;
 }
@@ -253,7 +290,7 @@ static int nfs_page_async_flush(struct n
 
spin_lock(&inode->i_lock);
for (;;) {
-   req = nfs_page_find_request_locked(page);
+   req = nfs_page_find_request_locked(nfsi, page);
if (req == NULL) {
spin_unlock(&inode->i_lock);
return 0;
@@ -370,8 +407,14 @@ static void nfs_inode_add_request(struct
if (nfs_have_delegation(inode, FMODE_WRITE))
nfsi->change_attr++;
}
-   SetPagePrivate(req->wb_page);
-   set_page_private(req->wb_page, (unsigned long)req);
+   /*
+* Swap-space should not get truncated. Hence no need to plug the race
+* with invalidate/truncate.
+*/
+   if (likely(!PageSwapCache(req->wb_page))) {
+   SetPagePrivate(req->wb_page);
+   set_page_private(req->wb_page, (unsigned long)req);
+   }
nfsi->npages++;
kref_get(&req->wb_kref);
 }
@@ -387,8 +430,10 @@ static void nfs_inode_remove_request(str
BUG_ON (!NFS_WBACK_BUSY(req));
 
spin_lock(&inode->i_lock);
-   set_page_private(req->wb_page, 0);
-   ClearPagePrivate(req->wb_page);
+   if (likely(!PageSwapCache(req->wb_page))) {
+   set_page_private(req->wb_page, 0);
+   ClearPagePrivate(req->wb_page);
+   }
radix_tree_delete(&nfsi->nfs_page_tree, req->wb_index);

[PATCH 19/29] netvm: prevent a TCP specific deadlock

2007-12-14 Thread Peter Zijlstra
It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
that we're over the global rmem limit. This will prevent SOCK_MEMALLOC buffers
from receiving data, which will prevent userspace from running, which is needed
to reduce the buffered data.

Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/net/sock.h |7 ---
 net/core/stream.c  |5 +++--
 2 files changed, 7 insertions(+), 5 deletions(-)

Index: linux-2.6/include/net/sock.h
===
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -756,7 +756,8 @@ static inline struct inode *SOCK_INODE(s
 }
 
 extern void __sk_stream_mem_reclaim(struct sock *sk);
-extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
+extern int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb,
+   int size, int kind);
 
 #define SK_STREAM_MEM_QUANTUM ((int)PAGE_SIZE)
 
@@ -774,13 +775,13 @@ static inline void sk_stream_mem_reclaim
 static inline int sk_stream_rmem_schedule(struct sock *sk, struct sk_buff *skb)
 {
return (int)skb->truesize <= sk->sk_forward_alloc ||
-   sk_stream_mem_schedule(sk, skb->truesize, 1);
+   sk_stream_mem_schedule(sk, skb, skb->truesize, 1);
 }
 
 static inline int sk_stream_wmem_schedule(struct sock *sk, int size)
 {
return size <= sk->sk_forward_alloc ||
-  sk_stream_mem_schedule(sk, size, 0);
+  sk_stream_mem_schedule(sk, NULL, size, 0);
 }
 
 /* Used by processes to lock a socket state, so that
Index: linux-2.6/net/core/stream.c
===
--- linux-2.6.orig/net/core/stream.c
+++ linux-2.6/net/core/stream.c
@@ -207,7 +207,7 @@ void __sk_stream_mem_reclaim(struct sock
 
 EXPORT_SYMBOL(__sk_stream_mem_reclaim);
 
-int sk_stream_mem_schedule(struct sock *sk, int size, int kind)
+int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb, int size, int kind)
 {
int amt = sk_stream_pages(size);
struct proto *prot = sk->sk_prot;
@@ -225,7 +225,8 @@ int sk_stream_mem_schedule(struct sock *
/* Over hard limit. */
if (atomic_read(prot->memory_allocated) > prot->sysctl_mem[2]) {
prot->enter_memory_pressure();
-   goto suppress_allocation;
+   if (!skb || (skb && !skb_emergency(skb)))
+   goto suppress_allocation;
}
 
/* Under pressure. */

--

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc5-mm1

2007-12-14 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Fri, 14 Dec 2007 10:08:07 +0800

 [UDP]: Move udp_stats_in6 into net/ipv4/udp.c
 
 Now that external users may increment the counters directly, we need to
 ensure that udp_stats_in6 is always available.  Otherwise we'd either
 have to require the external users to be built as modules or ipv6 to be
 built-in.
 
 This isn't too bad because udp_stats_in6 is just a pair of pointers plus
 an EXPORT, e.g., just 40 (16 + 24) bytes on x86-64.
 
 Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Applied.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHES 0/3]: DCCP patches for 2.6.25

2007-12-14 Thread David Miller
From: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
Date: Thu, 13 Dec 2007 23:41:59 -0200

   Please consider pulling from:
 
 master.kernel.org:/pub/scm/linux/kernel/git/acme/net-2.6.25

Pulled, but could you please reformat Gerrit's changelog entries in
the future?  They have these 80+ long lines which are painful to read
in ascii email clients and in terminal output.

I'll do this by hand during my next rebase for this case, but I will
push back when I see it again in future pull requests.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/4] Updates to nfsroot documentation

2007-12-14 Thread David Miller
From: [EMAIL PROTECTED]
Date: Thu, 13 Dec 2007 16:02:33 -0800

 From: Amos Waterland [EMAIL PROTECTED]
 
 The difference between ip=off and ip=::off has been a cause of much
 confusion.  Document how each behaves, and do not contradict ourselves by
 saying that off is the default when in fact any is the default and is
 described as being so lower in the file.
 
 Signed-off-by: Amos Waterland [EMAIL PROTECTED]
 Cc: Simon Horman [EMAIL PROTECTED]
 Cc: Andi Kleen [EMAIL PROTECTED]
 Cc: David S. Miller [EMAIL PROTECTED]
 Signed-off-by: Andrew Morton [EMAIL PROTECTED]

Applied.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/4] net: use mutex_is_locked() for ASSERT_RTNL()

2007-12-14 Thread David Miller
From: [EMAIL PROTECTED]
Date: Thu, 13 Dec 2007 16:02:36 -0800

 From: Andrew Morton [EMAIL PROTECTED]
 
 ASSERT_RTNL() uses mutex_trylock(), but it's better to use mutex_is_locked().
 
 Make that change, and remove rtnl_trylock() altogether.
 
 (not tested yet!)
 
 Cc: David S. Miller [EMAIL PROTECTED]
 Signed-off-by: Andrew Morton [EMAIL PROTECTED]

NACK, as explained please remove this until the replacement
doesn't remove valid checks which are done currently.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 3/4] tipc: fix semaphore handling

2007-12-14 Thread David Miller
From: [EMAIL PROTECTED]
Date: Thu, 13 Dec 2007 16:02:36 -0800

 From: Andrew Morton [EMAIL PROTECTED]
 
 As noted by Kevin, tipc's release() does down_interruptible() and ignores the
 return value.  So if signal_pending() we'll end up doing up() on a non-downed
 semaphore.  Fix.
 
 Cc: Kevin Winchester [EMAIL PROTECTED]
 Cc: Per Liden [EMAIL PROTECTED]
 Cc: Jon Maloy [EMAIL PROTECTED]
 Cc: Allan Stephens [EMAIL PROTECTED]
 Cc: David S. Miller [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Signed-off-by: Andrew Morton [EMAIL PROTECTED]

This is already in my net-2.6 tree, but thanks for resubmitting
anyways :)
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 4/4] PPP synchronous tty: convert dead_sem to completion

2007-12-14 Thread David Miller
From: [EMAIL PROTECTED]
Date: Thu, 13 Dec 2007 16:02:37 -0800

 From: Matthias Kaehlcke [EMAIL PROTECTED]
 
 PPP synchronous tty channel driver: convert the semaphore dead_sem to a
 completion
 
 Signed-off-by: Matthias Kaehlcke [EMAIL PROTECTED]
 Cc: Paul Mackerras [EMAIL PROTECTED]
 Signed-off-by: Andrew Morton [EMAIL PROTECTED]

Applied to net-2.6.25, thanks.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][XFRM] Fix potential race vs xfrm_state(only)_find and xfrm_hash_resize.

2007-12-14 Thread David Miller
From: Pavel Emelyanov [EMAIL PROTECTED]
Date: Thu, 13 Dec 2007 13:56:14 +0300

 The _find calls calculate the hash value using the
 xfrm_state_hmask, without the xfrm_state_lock. But the
 value of this mask can change in the _resize call under
 the state_lock, so we risk failing to find the desired
 entry in the hash.
 
 I think the hash value is better calculated
 under the state lock.
 
 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

Thanks for the bug fix.

I know why I coded it this way, I wanted to give GCC more
room to schedule the loads away from the uses in the hash
calculation.

Once you cram it after the spin lock acquire, it can't load unrelated
values earlier to soften the load/use cost on cache misses.

Of course it's invalid because the hash mask can change as you
noticed.

I wish there was a way to conditionally clobber memory, then we could
tell GCC exactly what memory objects are protected by the lock and
thus help in situations like this so much.
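For reference, what exists today is only the unconditional form: a full
compiler barrier that makes GCC discard every cached memory value, not just
the objects the lock protects (a sketch of the standard idiom, which
spin_lock()/spin_unlock() already imply; the conditional, per-object variant
wished for above does not exist):

/* unconditional: GCC must assume *all* memory may have changed */
#define barrier() __asm__ __volatile__("" : : : "memory")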
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] ixgb: enable sun hardware support for broadcom phy

2007-12-14 Thread Auke Kok
From: Matheos Worku [EMAIL PROTECTED]

Implement support for a SUN-specific PHY.

SUN provides a modified 82597-based board with their own
PHY that works with very little modification to the code. This
patch implements this new PHY which is identified by the
subvendor device ID. The device ID of the adapter remains
the same.

Signed-off-by: Matheos Worku [EMAIL PROTECTED]
Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/ixgb/ixgb_hw.c   |   82 +-
 drivers/net/ixgb/ixgb_hw.h   |3 +-
 drivers/net/ixgb/ixgb_ids.h  |4 ++
 drivers/net/ixgb/ixgb_main.c |   10 +++--
 4 files changed, 91 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ixgb/ixgb_hw.c b/drivers/net/ixgb/ixgb_hw.c
index 2c6367a..80a8b98 100644
--- a/drivers/net/ixgb/ixgb_hw.c
+++ b/drivers/net/ixgb/ixgb_hw.c
@@ -45,6 +45,8 @@ static boolean_t ixgb_link_reset(struct ixgb_hw *hw);
 
 static void ixgb_optics_reset(struct ixgb_hw *hw);
 
+static void ixgb_optics_reset_bcm(struct ixgb_hw *hw);
+
 static ixgb_phy_type ixgb_identify_phy(struct ixgb_hw *hw);
 
 static void ixgb_clear_hw_cntrs(struct ixgb_hw *hw);
@@ -90,10 +92,20 @@ static uint32_t ixgb_mac_reset(struct ixgb_hw *hw)
ASSERT(!(ctrl_reg & IXGB_CTRL0_RST));
 #endif
 
-   if (hw->phy_type == ixgb_phy_type_txn17401) {
-   ixgb_optics_reset(hw);
+   if (hw->subsystem_vendor_id == SUN_SUBVENDOR_ID) {
+   ctrl_reg =  /* Enable interrupt from XFP and SerDes */
+  IXGB_CTRL1_GPI0_EN |
+  IXGB_CTRL1_SDP6_DIR |
+  IXGB_CTRL1_SDP7_DIR |
+  IXGB_CTRL1_SDP6 |
+  IXGB_CTRL1_SDP7;
+   IXGB_WRITE_REG(hw, CTRL1, ctrl_reg);
+   ixgb_optics_reset_bcm(hw);
}
 
+   if (hw->phy_type == ixgb_phy_type_txn17401)
+   ixgb_optics_reset(hw);
+
return ctrl_reg;
 }
 
@@ -253,6 +265,10 @@ ixgb_identify_phy(struct ixgb_hw *hw)
break;
}
 
+   /* update phy type for sun specific board */
+   if (hw->subsystem_vendor_id == SUN_SUBVENDOR_ID)
+   phy_type = ixgb_phy_type_bcm;
+
return (phy_type);
 }
 
@@ -1225,3 +1241,65 @@ ixgb_optics_reset(struct ixgb_hw *hw)
 
return;
 }
+
+/**
+ * Resets the 10GbE optics module for Sun variant NIC.
+ *
+ * hw - Struct containing variables accessed by shared code
+ */
+
+#define   IXGB_BCM8704_USER_PMD_TX_CTRL_REG 0xC803
+#define   IXGB_BCM8704_USER_PMD_TX_CTRL_REG_VAL 0x0164
+#define   IXGB_BCM8704_USER_CTRL_REG0xC800
+#define   IXGB_BCM8704_USER_CTRL_REG_VAL0x7FBF
+#define   IXGB_BCM8704_USER_DEV3_ADDR   0x0003
+#define   IXGB_SUN_PHY_ADDRESS  0x
+#define   IXGB_SUN_PHY_RESET_DELAY 305
+
+static void
+ixgb_optics_reset_bcm(struct ixgb_hw *hw)
+{
+   u32 ctrl = IXGB_READ_REG(hw, CTRL0);
+   ctrl &= ~IXGB_CTRL0_SDP2;
+   ctrl |= IXGB_CTRL0_SDP3;
+   IXGB_WRITE_REG(hw, CTRL0, ctrl);
+
+   /* SerDes needs extra delay */
+   msleep(IXGB_SUN_PHY_RESET_DELAY);
+
+   /* Broadcom 7408L configuration */
+   /* Reference clock config */
+   ixgb_write_phy_reg(hw,
+  IXGB_BCM8704_USER_PMD_TX_CTRL_REG,
+  IXGB_SUN_PHY_ADDRESS,
+  IXGB_BCM8704_USER_DEV3_ADDR,
+  IXGB_BCM8704_USER_PMD_TX_CTRL_REG_VAL);
+   /*  we must read the registers twice */
+   ixgb_read_phy_reg(hw,
+ IXGB_BCM8704_USER_PMD_TX_CTRL_REG,
+ IXGB_SUN_PHY_ADDRESS,
+ IXGB_BCM8704_USER_DEV3_ADDR);
+   ixgb_read_phy_reg(hw,
+ IXGB_BCM8704_USER_PMD_TX_CTRL_REG,
+ IXGB_SUN_PHY_ADDRESS,
+ IXGB_BCM8704_USER_DEV3_ADDR);
+
+   ixgb_write_phy_reg(hw,
+  IXGB_BCM8704_USER_CTRL_REG,
+  IXGB_SUN_PHY_ADDRESS,
+  IXGB_BCM8704_USER_DEV3_ADDR,
+  IXGB_BCM8704_USER_CTRL_REG_VAL);
+   ixgb_read_phy_reg(hw,
+ IXGB_BCM8704_USER_CTRL_REG,
+ IXGB_SUN_PHY_ADDRESS,
+ IXGB_BCM8704_USER_DEV3_ADDR);
+   ixgb_read_phy_reg(hw,
+ IXGB_BCM8704_USER_CTRL_REG,
+ IXGB_SUN_PHY_ADDRESS,
+ IXGB_BCM8704_USER_DEV3_ADDR);
+
+   /* SerDes needs extra delay */
+   msleep(IXGB_SUN_PHY_RESET_DELAY);
+
+   return;
+}
diff --git a/drivers/net/ixgb/ixgb_hw.h b/drivers/net/ixgb/ixgb_hw.h
index af56433..f4e0044 100644
--- 

[PATCH 1/2] ixgb: make sure jumbos stay enabled after reset

2007-12-14 Thread Auke Kok
From: Matheos Worku [EMAIL PROTECTED]

Currently a device reset (ethtool -r ethX) would cause the
adapter to fall back to regular MTU sizes.

Signed-off-by: Matheos Worku [EMAIL PROTECTED]
Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/ixgb/ixgb_main.c |   16 ++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index 3021234..bf9085f 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -320,10 +320,22 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t 
kill_watchdog)
 void
 ixgb_reset(struct ixgb_adapter *adapter)
 {
+   struct ixgb_hw *hw = &adapter->hw;
 
-   ixgb_adapter_stop(&adapter->hw);
-   if(!ixgb_init_hw(&adapter->hw))
+   ixgb_adapter_stop(hw);
+   if (!ixgb_init_hw(hw))
DPRINTK(PROBE, ERR, "ixgb_init_hw failed.\n");
+
+   /* restore frame size information */
+   IXGB_WRITE_REG(hw, MFS, hw->max_frame_size << IXGB_MFS_SHIFT);
+   if (hw->max_frame_size >
+   IXGB_MAX_ENET_FRAME_SIZE_WITHOUT_FCS + ENET_FCS_LENGTH) {
+   u32 ctrl0 = IXGB_READ_REG(hw, CTRL0);
+   if (!(ctrl0 & IXGB_CTRL0_JFE)) {
+   ctrl0 |= IXGB_CTRL0_JFE;
+   IXGB_WRITE_REG(hw, CTRL0, ctrl0);
+   }
+   }
 }
 
 /**

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-2.6.25] Revert recent TCP work

2007-12-14 Thread Ilpo Järvinen
On Fri, 14 Dec 2007, Ilpo Järvinen wrote:

 So, I might soon prepare a revert patch for most of the questionable 
 TCP parts and ask Dave to apply it (and drop them fully during next 
 rebase) unless I suddenly figure something out soon which explains 
 all/most of the problems, then return to drawing board. ...As it seems 
 that the cumulative ACK processing problem discovered later on (having 
 rather cumbersome solution with skbs only) will make part of the work 
 that's currently in net-2.6.25 quite useless/duplicate effort. But thanks 
 anyway for reporting these.

Hi Dave,

Could you either drop my recent patches (+one fix to them from Herbert
Xu == [TCP]: Fix crash in tcp_advance_send_head), all mine after [TCP]: 
Abstract tp->highest_sack accessing & point to next skb from net-2.6.25 
or just apply the revert from below and do the removal during next rebase. 
I think it could even be automated by something like this (untested):
for i in $(cat commits | cut -d ' ' -f 1); do git-rebase --onto $i^ $i; done 
(I've attached the commits list).

I'll resend small bits that are still useful but get removed in this kind 
of straightforward operation (I guess it's easier for you to track this 
way and makes conflicts a non-problem).

...It was buggy as well, I've tried to Cc all bug reporters that I've
noticed so far... Related bugs include at least these cases:

These are completely removed by this revert:
__tcp_rb_insert
(__|)tcp_reset_fack_counts
May still trigger later due to other, genuine bugs:
tcp_sacktag_one (I'll rework  resend this soon)
tcp_fastretrans_alert (fackets_out trap)
BUG_TRAP(packets <= tp->packets_out); in tcp_mark_head_lost

-- 
 i.


[PATCH net-2.6.25] Revert recent TCP work

It was recently discovered that there's yet another processing
aspect to consider related to cumulative ACK processing. This
solution wasn't enough to handle that, and (arguably) complex
and intrusive changes would still be necessary on top of the
complexity this already introduced. Another approach is on the
drawing board.

This was somewhat buggy as well - a lot of reports against it
were filed already :-) - but hunting the cause doesn't seem so
beneficial anymore.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 include/linux/skbuff.h   |3 -
 include/linux/tcp.h  |4 -
 include/net/tcp.h|  362 --
 net/ipv4/tcp_input.c |  341 ---
 net/ipv4/tcp_ipv4.c  |1 -
 net/ipv4/tcp_minisocks.c |1 -
 net/ipv4/tcp_output.c|   13 +-
 net/ipv6/tcp_ipv6.c  |1 -
 8 files changed, 196 insertions(+), 530 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f21fee6..c618fbf 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -18,7 +18,6 @@
 #include <linux/compiler.h>
 #include <linux/time.h>
 #include <linux/cache.h>
-#include <linux/rbtree.h>
 
 #include <asm/atomic.h>
 #include <asm/types.h>
@@ -254,8 +253,6 @@ struct sk_buff {
struct sk_buff  *next;
struct sk_buff  *prev;
 
-   struct rb_node  rb;
-
struct sock *sk;
ktime_t tstamp;
struct net_device   *dev;
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 56342c3..08027f1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -174,7 +174,6 @@ struct tcp_md5sig {
 
 #include <linux/skbuff.h>
 #include <linux/dmaengine.h>
-#include <linux/rbtree.h>
 #include <net/sock.h>
 #include <net/inet_connection_sock.h>
 #include <net/inet_timewait_sock.h>
@@ -321,9 +320,6 @@ struct tcp_sock {
u32 snd_cwnd_used;
u32 snd_cwnd_stamp;
 
-   struct rb_root  write_queue_rb;
-   struct rb_root  sacked_queue_rb;
-   struct sk_buff_head sacked_queue;
struct sk_buff_head out_of_order_queue; /* Out of order segments go here */
 
u32 rcv_wnd;/* Current receiver window  */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5e6c433..5ec1cac 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -555,7 +555,6 @@ struct tcp_skb_cb {
__u32   seq;/* Starting sequence number */
__u32   end_seq;/* SEQ + FIN + SYN + datalen*/
__u32   when;   /* used to compute rtt's*/
-   unsigned intfack_count; /* speed up SACK processing */
__u8flags;  /* TCP header flags.*/
 
/* NOTE: These must match up to the flags byte in a
@@ -1191,112 +1190,29 @@ static inline void tcp_put_md5sig_pool(void)
 }
 
 /* write queue abstraction */
-#define TCP_WQ_SACKED  1
-
-static inline struct sk_buff_head *__tcp_list_select(struct sock *sk, const int queue)
-{
-   if (queue == TCP_WQ_SACKED)
-   return &tcp_sk(sk)->sacked_queue;
-   else
-   

Re: [PATCH 8/8] gianfar: Magic Packet and suspend/resume support.

2007-12-14 Thread Jeff Garzik

Scott Wood wrote:

Signed-off-by: Scott Wood [EMAIL PROTECTED]
---
Jeff, can you ack this to go through Paul's tree
(assuming nothing wrong with it)?

 drivers/net/gianfar.c |  137 -
 drivers/net/gianfar.h |   13 +++-
 drivers/net/gianfar_ethtool.c |   41 -
 3 files changed, 185 insertions(+), 6 deletions(-)


ACK


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] sky2: RX lockup fix

2007-12-14 Thread Jeff Garzik

Stephen Hemminger wrote:

I'm using a Marvell 88E8062 on a custom PPC64 blade and ran into RX
lockups while validating the sky2 driver.  The receive MAC FIFO would
become stuck during testing with high traffic.  One port of the 88E8062
would lockup, while the other port remained functional.  Re-inserting
the sky2 module would not fix the problem - only a power cycle would.

I looked over Marvell's most recent sk98lin driver and it looks like
they had a workaround for the Yukon XL that the sky2 doesn't have yet.
The sk98lin driver disables the RX MAC FIFO flush feature for all
revisions of the Yukon XL.

According to skgeinit.c of the sk98lin driver, Flushing must be enabled
(needed for ASF see dev. #4.29), but the flushing mask should be
disabled (see dev. #4.115).  Nice. I implemented this same change in
the sky2 driver and verified that the RX lockup I was seeing was
resolved.

Signed-off-by: Peter Tyser [EMAIL PROTECTED]
Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

---
Original patch reformatted to remove line wrap.


applied #upstream-fixes


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] e100: free IRQ to remove warningwhenrebooting

2007-12-14 Thread Jeff Garzik

Auke Kok wrote:

Adapted from Ian Wienand [EMAIL PROTECTED]

Explicitly free the IRQ before removing the device to remove a
warning Destroying IRQ without calling free_irq

Signed-off-by: Auke Kok [EMAIL PROTECTED]
Cc: Ian Wienand [EMAIL PROTECTED]
---

 drivers/net/e100.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)


applied #upstream-fixes


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [2.6 patch] drivers/net/sis190.c section fix

2007-12-14 Thread Jeff Garzik

Adrian Bunk wrote:

This patch fixes the following section mismatch with CONFIG_HOTPLUG=n:

--  snip  --

...
WARNING: vmlinux.o(.init.text.20+0x4cb25): Section mismatch: reference to 
.exit.text:sis190_mii_remove (between 'sis190_init_one' and 'read_eeprom')
...

--  snip  --

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

---
29fae057ba15a552a7cad1e731d3238d567032ba 
diff --git a/drivers/net/sis190.c b/drivers/net/sis190.c

index 7200883..49f767b 100644
--- a/drivers/net/sis190.c
+++ b/drivers/net/sis190.c
@@ -1381,7 +1381,7 @@ out:
return rc;
 }
 
-static void __devexit sis190_mii_remove(struct net_device *dev)

+static void sis190_mii_remove(struct net_device *dev)
 {
struct sis190_private *tp = netdev_priv(dev);
 



applied #upstream-fixes


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [2.6 patch] drivers/net/s2io.c section fixes

2007-12-14 Thread Jeff Garzik

Adrian Bunk wrote:

Code used by the non-__devinit s2io_open() mustn't be __devinit.

This patch fixes the following section mismatch with CONFIG_HOTPLUG=n:

--  snip  --

...
WARNING: vmlinux.o(.text+0x6f6e3e): Section mismatch: reference to 
.init.text.20:s2io_test_intr (between 's2io_open' and 's2io_ethtool_sset')
...

--  snip  --

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

---

 drivers/net/s2io.c |4 ++--


applied #upstream-fixes




Re: [NETFILTER] xt_hashlimit : speedups hash_dst()

2007-12-14 Thread Jarek Poplawski
Eric Dumazet wrote, On 12/14/2007 12:09 PM:
...

 + /*
 +  * Instead of returning hash % ht->cfg.size (implying a divide)
 +  * we return the high 32 bits of the (hash * ht->cfg.size) that will
 +  * give results between [0 and cfg.size-1] and same hash distribution,
 +  * but using a multiply, less expensive than a divide
 +  */
 + return ((u64)hash * ht->cfg.size) >> 32;

Are we sure of the same hash distribution? Probably I miss something,
but: if this 'hash' is well distributed on 32 bits, and ht->cfg.size
is smaller than 32 bits, e.g. 256 (8 bits), then this multiplication
moves to the higher 32 of u64 only max. 8 bits of the most significant
byte, and the other three bytes are never used, while division is
always affected by all four bytes...

Regards,
Jarek P.
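A small self-contained demo of the two mappings being discussed (plain
userspace C; the table size and sample hashes are made up):

#include <stdint.h>
#include <stdio.h>

/* The divide the old code used. */
static uint32_t by_modulo(uint32_t hash, uint32_t size)
{
	return hash % size;
}

/* Eric's multiply-shift replacement: scales a 32-bit hash into
 * [0, size-1] using only the high bits of a 64-bit product. */
static uint32_t by_mulshift(uint32_t hash, uint32_t size)
{
	return ((uint64_t)hash * size) >> 32;
}

int main(void)
{
	uint32_t size = 256;	/* example hash-table size */
	uint32_t h;

	/* For size = 256, modulo keeps the LOW 8 bits of the hash while
	 * multiply-shift keeps the HIGH 8 bits; either is uniform when
	 * the hash itself is uniformly distributed over 32 bits. */
	for (h = 0xdeadbeefu; h < 0xdeadbef3u; h++)
		printf("%08x -> mod=%u mul=%u\n", (unsigned)h,
		       (unsigned)by_modulo(h, size),
		       (unsigned)by_mulshift(h, size));
	return 0;
}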


Re: kernel 2.6.23.8: KERNEL: assertion in net/ipv4/tcp_input.c

2007-12-14 Thread Ilpo Järvinen
On Thu, 13 Dec 2007, Wolfgang Walter wrote:

 it happened again with your patch applied:
 
 WARNING: at net/ipv4/tcp_input.c:1018 tcp_sacktag_write_queue()
 
 Call Trace:
 IRQ  [80549290] tcp_sacktag_write_queue+0x7d0/0xa60
 [80283869] add_partial+0x19/0x60
 [80549ac4] tcp_ack+0x5a4/0x1d70
 [8054e625] tcp_rcv_established+0x485/0x7b0
 [80554c3d] tcp_v4_do_rcv+0xed/0x3e0
 [80556fe7] tcp_v4_rcv+0x947/0x970
 [80538c6c] ip_local_deliver+0xac/0x290
 [80538862] ip_rcv+0x362/0x6c0
 [804fc5d3] netif_receive_skb+0x323/0x420
 [8042ab40] tg3_poll+0x630/0xa50
 [804fecba] net_rx_action+0x8a/0x140
 [8023a269] __do_softirq+0x69/0xe0
 [8020d47c] call_softirq+0x1c/0x30
 [8020f315] do_softirq+0x35/0x90
 [8023a105] irq_exit+0x55/0x60
 [8020f3f0] do_IRQ+0x80/0x100
 [8020c7d1] ret_from_intr+0x0/0xa
 EOI

...Yeah, as I suspected, left_out != 0 when sacked_out and lost_out are 
zero. I'll try to read the code again to see how that could happen (in 
any case this is just annoying at best; no harm other than the 
message is being done). ...If nothing comes up I might ask you to run
with another test patch, but it might take a week or so until I've enough
time to dig into this fully, because I must also become familiar with 
something as pre-historic as 2.6.23 (there is already a large number
of related changes since then, both in upcoming 2.6.24 and some in 
net-2.6.25)... :-)

  Any tweaking done to TCP related sysctls?
 
 net/core/somaxconn=2048
 net/ipv4/tcp_syncookies=1
 net/ipv4/tcp_max_syn_backlog=8192
 net/ipv4/tcp_max_tw_buckets=180
 net/ipv4/tcp_window_scaling=0
 net/ipv4/tcp_timestamps=0

Thanks, these won't be that significant, though timestamps will exclude 
some possibilities :-).


-- 
 i.


Re: [patch 01/10] e1000e: make E1000E default to the same kconfig setting as E1000

2007-12-14 Thread Jeff Garzik

[EMAIL PROTECTED] wrote:

From: Randy Dunlap [EMAIL PROTECTED]

Make E1000E default to the same kconfig setting as E1000.  So people's
machines don't stop working when they use oldconfig.

Signed-off-by: Randy Dunlap [EMAIL PROTECTED]
Cc: Jeff Garzik [EMAIL PROTECTED]
Cc: Auke Kok [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
---

 drivers/net/Kconfig |1 +
 1 file changed, 1 insertion(+)

diff -puN 
drivers/net/Kconfig~e1000e-make-e1000e-default-to-the-same-kconfig-setting-as-e1000
 drivers/net/Kconfig
--- 
a/drivers/net/Kconfig~e1000e-make-e1000e-default-to-the-same-kconfig-setting-as-e1000
+++ a/drivers/net/Kconfig
@@ -1986,6 +1986,7 @@ config E1000_DISABLE_PACKET_SPLIT
 config E1000E
	tristate "Intel(R) PRO/1000 PCI-Express Gigabit Ethernet support"
depends on PCI
+   default E1000


I am not inclined to apply this one.  This practice, applied over time, 
will tend to accumulate weird 'default' and 'select' statements.


So I think the breakage that occurs is mitigated by two factors:
1) kernel hackers that do their own configs are expected to be able to 
figure this stuff out.
2) kernel builders (read: distros, mainly) are expected to have put 
thought into the Kconfig selection and driver migration strategies.


PCI IDs move across drivers from time to time, and we don't want to apply 
these sorts of changes:  Viewed in the long term, the suggested patch is 
merely a temporary change to allow kernel experts to more easily deal with 
the PCI ID migration across drivers.


I would prefer simply to communicate to kernel experts and builders 
about a Kconfig issue that could potentially affect their 
booting/networking...  because this patch is only needed if the kernel 
experts do not already know about a necessary config update.


Jeff




RE: [patch 02/10] forcedeth: power down phy when interface is down

2007-12-14 Thread Ayaz Abdulla
Ed,

You mention that the phy will become 100Mbit half duplex, but during
nv_close, the phy setting is not modified. This might be a separate
issue.

Ayaz

-Original Message-
From: Andrew Morton [mailto:[EMAIL PROTECTED] 
Sent: Thursday, December 13, 2007 5:07 PM
To: Ed Swierk
Cc: Ayaz Abdulla; [EMAIL PROTECTED]; netdev@vger.kernel.org
Subject: Re: [patch 02/10] forcedeth: power down phy when interface is
down


On Thu, 13 Dec 2007 16:53:58 -0800
Ed Swierk [EMAIL PROTECTED] wrote:

 On 12/13/07, Andrew Morton [EMAIL PROTECTED] wrote:
  Does this patch actually fix any observeable problem?
 
 Without the patch, ifconfig down leaves the physical link up, which
 confuses datacenter users who expect the link lights both on the NIC
 and the switch to go out when they bring an interface down.
 
 Furthermore, even though the phy is powered on, autonegotiation stops
 working, so a normally gigabit link might suddenly become 100 Mbit
 half-duplex when the interface goes down, and become gigabit when it
 comes up again.
 

OK, thanks, I added that text to the changelog along with Ayaz's
objection
and shall continue to bug people with it until we have a fix merged.  
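For context, the behaviour Ed describes would be addressed by something of
this shape (a hedged sketch, not the posted patch; mii_rw(), MII_READ and
np->phyaddr are forcedeth helpers and BMCR_PDOWN is the standard MII
power-down bit, but the exact hook in nv_close() is an assumption):

/* Power the PHY down on ifdown so the link drops and the port LEDs
 * go out; nv_open() would need a matching power-up and renegotiation. */
static void nv_power_down_phy(struct net_device *dev)
{
	struct fe_priv *np = netdev_priv(dev);
	u32 bmcr;

	bmcr = mii_rw(dev, np->phyaddr, MII_BMCR, MII_READ);
	mii_rw(dev, np->phyaddr, MII_BMCR, bmcr | BMCR_PDOWN);
}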


Re: [patch 04/10] ucc_geth-fix-build-break-introduced-by-commit-09f75cd7bf13720738e6a196cc0107ce9a5bd5a0-checkpatch-fixes

2007-12-14 Thread Jeff Garzik

[EMAIL PROTECTED] wrote:

From: Andrew Morton [EMAIL PROTECTED]

Cc: David S. Miller [EMAIL PROTECTED]
Cc: Emil Medve [EMAIL PROTECTED]
Cc: Jeff Garzik [EMAIL PROTECTED]
Cc: Kumar Gala [EMAIL PROTECTED]
Cc: Li Yang [EMAIL PROTECTED]
Cc: Paul Mackerras [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
---

 drivers/net/ucc_geth.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN 
drivers/net/ucc_geth.c~ucc_geth-fix-build-break-introduced-by-commit-09f75cd7bf13720738e6a196cc0107ce9a5bd5a0-checkpatch-fixes
 drivers/net/ucc_geth.c
--- 
a/drivers/net/ucc_geth.c~ucc_geth-fix-build-break-introduced-by-commit-09f75cd7bf13720738e6a196cc0107ce9a5bd5a0-checkpatch-fixes
+++ a/drivers/net/ucc_geth.c
@@ -3447,7 +3447,7 @@ static int ucc_geth_rx(struct ucc_geth_p
u16 length, howmany = 0;
u32 bd_status;
u8 *bdBuffer;
-   struct net_device * dev;
+   struct net_device *dev;
 
	ugeth_vdbg("%s: IN", __FUNCTION__);


applied this crucial fix to #upstream-fixes with a suitable changelog




Re: [patch 06/10] Net: ibm_newemac, remove SPIN_LOCK_UNLOCKED

2007-12-14 Thread Jeff Garzik

[EMAIL PROTECTED] wrote:

From: Jiri Slaby [EMAIL PROTECTED]

SPIN_LOCK_UNLOCKED is deprecated, use DEFINE_SPINLOCK instead

Signed-off-by: Jiri Slaby [EMAIL PROTECTED]
Cc: Jeff Garzik [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
---

 drivers/net/ibm_newemac/debug.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN 
drivers/net/ibm_newemac/debug.c~net-ibm_newemac-remove-spin_lock_unlocked 
drivers/net/ibm_newemac/debug.c
--- a/drivers/net/ibm_newemac/debug.c~net-ibm_newemac-remove-spin_lock_unlocked
+++ a/drivers/net/ibm_newemac/debug.c
@@ -21,7 +21,7 @@
 
 #include core.h
 
-static spinlock_t emac_dbg_lock = SPIN_LOCK_UNLOCKED;

+static DEFINE_SPINLOCK(emac_dbg_lock);
 


applied #upstream-fixes




Re: [patch 08/10] net: smc911x: shut up compiler warnings

2007-12-14 Thread Jeff Garzik

[EMAIL PROTECTED] wrote:

From: Paul Mundt [EMAIL PROTECTED]

Trivial fix to shut up gcc.

Signed-off-by: Paul Mundt [EMAIL PROTECTED]
Cc: Jeff Garzik [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
---

 drivers/net/smc911x.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN drivers/net/smc911x.h~net-smc911x-shut-up-compiler-warnings 
drivers/net/smc911x.h
--- a/drivers/net/smc911x.h~net-smc911x-shut-up-compiler-warnings
+++ a/drivers/net/smc911x.h
@@ -76,7 +76,7 @@
 
 
 
-#if	 SMC_USE_PXA_DMA

+#ifdef SMC_USE_PXA_DMA
 #define SMC_USE_DMA
 
 /*



applied #upstream-fixes
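Why the one-character change silences gcc: the kernel builds with -Wundef,
and the two preprocessor forms differ for a macro that was never defined.

#if SMC_USE_PXA_DMA	/* -Wundef warns: "SMC_USE_PXA_DMA" is not defined */
#define SMC_USE_DMA
#endif

#ifdef SMC_USE_PXA_DMA	/* merely tests definedness: no warning */
#define SMC_USE_DMA
#endif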




Re: [PATCHES 0/3]: DCCP patches for 2.6.25

2007-12-14 Thread Arnaldo Carvalho de Melo
On Fri, Dec 14, 2007 at 11:29:14AM -0800, David Miller wrote:
 From: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
 Date: Thu, 13 Dec 2007 23:41:59 -0200
 
  Please consider pulling from:
  
  master.kernel.org:/pub/scm/linux/kernel/git/acme/net-2.6.25
 
 Pulled, but could you please reformat Gerrit's changelog entries in
 the future?  They have these 80+ character lines which are painful to read
 in ascii email clients and in terminal output.
 
 I'll do this by hand during my next rebase for this case, but I will
 push back when I see it again in future pull requests.

OK, will take that into account in future requests,

Thanks a lot,

- Arnaldo


Re: [PATCH] HDLC driver: use unregister_netdev instead of unregister_netdevice

2007-12-14 Thread Krzysztof Halasa
Wang Chen [EMAIL PROTECTED] writes:

 [PATCH] HDLC driver: use unregister_netdev instead of unregister_netdevice

 Since the caller and the upper caller don't hold the rtnl semaphore,
 we should use unregister_netdev instead of unregister_netdevice.

NAK, not-a-bug. The caller actually holds rtnl, it goes through
the netdev core ioctl dispatcher:

(unregister_netdevice+0x0/0x24) from (fr_ioctl+0x688/0x75c)
/* fr_del_pvc() and fr_add_pvc() optimized out by gcc */
(fr_ioctl+0x0/0x75c) from (hdlc_ioctl+0x4c/0x8c)
(hdlc_ioctl+0x0/0x8c) from (hss_ioctl+0x3c/0x324)
(hss_ioctl+0x0/0x324) from (dev_ifsioc+0x428/0x4e8)
(dev_ifsioc+0x0/0x4e8) from (dev_ioctl+0x5d8/0x664)
(dev_ioctl+0x0/0x664) from (sock_ioctl+0x90/0x254)
(sock_ioctl+0x0/0x254) from (do_ioctl+0x34/0x78)
(do_ioctl+0x0/0x78) from (vfs_ioctl+0x78/0x2a8)
(vfs_ioctl+0x0/0x2a8) from (sys_ioctl+0x40/0x64)
(sys_ioctl+0x0/0x64) from (ret_fast_syscall+0x0/0x2c)

The patch would make it deadlock.

Please note that sister fr_add_pvc() uses register_netdevice().
The same applies to fr_destroy().
-- 
Krzysztof Halasa
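The locking rule Krzysztof applies here, in miniature (illustrative of the
API contract, not a patch):

/* Path already under RTNL (e.g. the ioctl dispatch above): use the
 * bare primitive, which expects the caller to hold the lock. */
rtnl_lock();
unregister_netdevice(dev);
rtnl_unlock();

/* Path holding no locks: use the wrapper, which takes RTNL itself.
 * Calling it with RTNL already held deadlocks - hence the NAK. */
unregister_netdev(dev);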


Re: [PATCH 4/4] [NETDEV] sky2: rtnl_lock out of loop will be faster

2007-12-14 Thread Jeff Garzik

Wang Chen wrote:

[PATCH 4/4] [NETDEV] sky2: rtnl_lock out of loop will be faster

Before this patch, it gets and releases the lock at each
iteration of the loop. Changing unregister_netdev to
unregister_netdevice and taking the lock once outside the loop
is faster.

Signed-off-by: Wang Chen [EMAIL PROTECTED]
---
 sky2.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletion(-)

--- linux-2.6.24.rc5.org/drivers/net/sky2.c 2007-12-12 10:19:43.0 
+0800
+++ linux-2.6.24.rc5/drivers/net/sky2.c 2007-12-12 15:23:37.0 +0800
@@ -4270,8 +4270,10 @@ static void __devexit sky2_remove(struct
	del_timer_sync(&hw->watchdog_timer);
	cancel_work_sync(&hw->restart_work);
 
+	rtnl_lock();

	for (i = hw->ports-1; i >= 0; --i)
-		unregister_netdev(hw->dev[i]);
+		unregister_netdevice(hw->dev[i]);
+	rtnl_unlock();


while true and correct, I don't see the remove path as needing this type 
of micro-optimization.


Removing and shutting down hardware is an operation that can take many 
seconds (an eternity, to a computer)... a very slow operation.


Thus, given that speed is not a priority here, I place more value on 
smaller, more compact, easily reviewable code -- the existing unpatched 
code in this case.


Jeff







Re: [PATCH 1/2] ixgb: make sure jumbos stay enabled after reset

2007-12-14 Thread Jeff Garzik

Auke Kok wrote:

From: Matheos Worku [EMAIL PROTECTED]

Currently a device reset (ethtool -r ethX) would cause the
adapter to fall back to regular MTU sizes.

Signed-off-by: Matheos Worku [EMAIL PROTECTED]
Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/ixgb/ixgb_main.c |   16 ++--
 1 files changed, 14 insertions(+), 2 deletions(-)


applied #upstream-fixes




Re: [PATCH 16/29] netvm: INET reserves.

2007-12-14 Thread Daniel Phillips
Hi Peter,

sysctl_intvec_fragment, proc_dointvec_fragment, sysctl_intvec_fragment 
seem to suffer from cut-n-pastitis.

Regards,

Daniel


[PATCH] Re: [patch 03/10] forcedeth: fix MAC address detection on network card (regression in 2.6.23)

2007-12-14 Thread Jeff Garzik

[EMAIL PROTECTED] wrote:

From: Michael Pyne [EMAIL PROTECTED]

Partially revert a change to mac address detection introduced to the forcedeth
driver.  The change was intended to correct mac address detection for newer
nVidia chipsets where the mac address was stored in reverse order.  One of
those chipsets appears to still have the mac address in reverse order (or at
least, it does on my system).

The change that broke mac address detection for my card was commit
ef756b3e56c68a4d76d9d7b9a73fa8f4f739180f "forcedeth: mac address correct"

My network card is an nVidia built-in Ethernet card, output from lspci as
follows (with text and numeric ids):
$ lspci | grep Ethernet
00:07.0 Bridge: nVidia Corporation MCP61 Ethernet (rev a2)
$ lspci -n | grep 07.0
00:07.0 0680: 10de:03ef (rev a2)

The vendor id is, of course, nVidia.  The device id corresponds to the
NVIDIA_NVENET_19 entry.

The included patch fixes the MAC address detection on my system.
Interestingly, the MAC address appears to be in the range reserved for my
motherboard manufacturer (Gigabyte) and not nVidia.

Signed-off-by: Michael J. Pyne [EMAIL PROTECTED]
Cc: Jeff Garzik [EMAIL PROTECTED]
Cc: Ayaz Abdulla [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]

On Wed, 21 Nov 2007 15:34:52 -0800
Ayaz Abdulla [EMAIL PROTECTED] wrote:


The solution is to get the OEM to update their BIOS (instead of
integrating this patch) since the MCP61 specs indicate that the MAC
Address should be in correct order from BIOS.

Changing the DEV_HAS_CORRECT_MACADDR feature for all MCP61 boards
could cause breakage on other OEM systems which have implemented it
correctly.



Signed-off-by: Andrew Morton [EMAIL PROTECTED]
---

 drivers/net/forcedeth.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN 
drivers/net/forcedeth.c~forcedeth-fix-mac-address-detection-on-network-card-regression-in-2623
 drivers/net/forcedeth.c
--- 
a/drivers/net/forcedeth.c~forcedeth-fix-mac-address-detection-on-network-card-regression-in-2623
+++ a/drivers/net/forcedeth.c
@@ -5551,7 +5551,7 @@ static struct pci_device_id pci_tbl[] = 
 	},

{   /* MCP61 Ethernet Controller */
PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 
PCI_DEVICE_ID_NVIDIA_NVENET_19),
-   .driver_data = 
DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT|DEV_HAS_CORRECT_MACADDR,
+   .driver_data = 
DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,


As discussed in the thread (and Michael did provide dmidecode output 
IIRC), one "make everybody happy" solution is to use a technique similar 
to that found in drivers/ata/ata_piix.c to match a list of BIOSes that 
have incorrect mac addresses, and clear the feature bit 
DEV_HAS_CORRECT_MACADDR.


I have attached an example patch of this approach -- someone merely 
needs to take the patch, fill in the blanks, and test it!  :)


Jeff




diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index a96583c..f7aab9b 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -147,6 +147,7 @@
 #include linux/init.h
 #include linux/if_vlan.h
 #include linux/dma-mapping.h
+#include linux/dmi.h
 
 #include asm/irq.h
 #include asm/io.h
@@ -4987,6 +4988,26 @@ static int nv_close(struct net_device *dev)
return 0;
 }
 
+static int have_broken_macaddr(void)
+{
+   static const struct dmi_system_id brokenmac_sysids[] = {
+   {
+			.ident = "blahblah",
+			.matches = {
+				DMI_MATCH(DMI_SYS_VENDOR, "MY_VENDOR"),
+				DMI_MATCH(DMI_PRODUCT_NAME, "blahblah"),
+   },
+   },
+
+   { } /* terminate list */
+   };
+
+   if (dmi_check_system(brokenmac_sysids))
+   return 1;
+   
+   return 0;
+}
+
 static int __devinit nv_probe(struct pci_dev *pci_dev, const struct 
pci_device_id *id)
 {
struct net_device *dev;
@@ -4997,6 +5018,7 @@ static int __devinit nv_probe(struct pci_dev *pci_dev, 
const struct pci_device_i
u32 powerstate, txreg;
u32 phystate_orig = 0, phystate;
int phyinitialized = 0;
+   int broken_macaddr = 0;
DECLARE_MAC_BUF(mac);
static int printed_version;
 
@@ -5180,10 +5202,14 @@ static int __devinit nv_probe(struct pci_dev *pci_dev, 
const struct pci_device_i
	np->orig_mac[0] = readl(base + NvRegMacAddrA);
	np->orig_mac[1] = readl(base + NvRegMacAddrB);
 
+	if (!(id->driver_data & DEV_HAS_CORRECT_MACADDR))
+   broken_macaddr = 1;
+   else if (have_broken_macaddr())
+   broken_macaddr = 1;
+
/* check the workaround bit for correct mac address order */
txreg = readl(base + 

Re: [PATCH 00/29] Swap over NFS -v15

2007-12-14 Thread Daniel Phillips
Hi Peter,

A major feature of this patch set is the network receive deadlock 
avoidance, but there is quite a bit of stuff bundled with it, the NFS 
user accounting for a big part of the patch by itself.

Is it possible to provide a before and after demonstration case for just 
the network receive deadlock part, given a subset of the patch set and 
a user space recipe that anybody can try?

Regards,

Daniel



Re: [PATCH] HDLC driver: use unregister_netdev instead of unregister_netdevice

2007-12-14 Thread David Miller
From: Krzysztof Halasa [EMAIL PROTECTED]
Date: Fri, 14 Dec 2007 22:28:07 +0100

 Wang Chen [EMAIL PROTECTED] writes:
 
  [PATCH] HDLC driver: use unregister_netdev instead of unregister_netdevice
 
  Since the caller and the upper caller don't hold the rtnl semaphore,
  we should use unregister_netdev instead of unregister_netdevice.
 
 NAK, not-a-bug. The caller actually holds rtnl, it goes through
 the netdev core ioctl dispatcher:
 
 (unregister_netdevice+0x0/0x24) from (fr_ioctl+0x688/0x75c)
 /* fr_del_pvc() and fr_add_pvc() optimized out by gcc */
 (fr_ioctl+0x0/0x75c) from (hdlc_ioctl+0x4c/0x8c)
 (hdlc_ioctl+0x0/0x8c) from (hss_ioctl+0x3c/0x324)
 (hss_ioctl+0x0/0x324) from (dev_ifsioc+0x428/0x4e8)
 (dev_ifsioc+0x0/0x4e8) from (dev_ioctl+0x5d8/0x664)
 (dev_ioctl+0x0/0x664) from (sock_ioctl+0x90/0x254)
 (sock_ioctl+0x0/0x254) from (do_ioctl+0x34/0x78)
 (do_ioctl+0x0/0x78) from (vfs_ioctl+0x78/0x2a8)
 (vfs_ioctl+0x0/0x2a8) from (sys_ioctl+0x40/0x64)
 (sys_ioctl+0x0/0x64) from (ret_fast_syscall+0x0/0x2c)
 
 The patch would make it deadlock.

Ok, I'll drop this patch, thanks for checking.


Re: [NETFILTER] xt_hashlimit : speedups hash_dst()

2007-12-14 Thread Eric Dumazet

Jarek Poplawski wrote:

Eric Dumazet wrote, On 12/14/2007 12:09 PM:
...


+   /*
+    * Instead of returning hash % ht->cfg.size (implying a divide)
+    * we return the high 32 bits of the (hash * ht->cfg.size) that will
+    * give results between [0 and cfg.size-1] and same hash distribution,
+    * but using a multiply, less expensive than a divide
+    */
+   return ((u64)hash * ht->cfg.size) >> 32;


Are we sure of the same hash distribution? Probably I miss something,
but: if this 'hash' is well distributed on 32 bits, and ht->cfg.size
is smaller than 32 bits, e.g. 256 (8 bits), then this multiplication
moves to the higher 32 of u64 only max. 8 bits of the most significant
byte, and the other three bytes are never used, while division is
always affected by all four bytes...


Not sure what you are saying... but if size=256, then, yes, we want a final 
result between 0 and 255, so the top three bytes are null.



'size' is the size of the hashtable, it's not a random 32-bit value :)



Re: [NETFILTER] xt_hashlimit : speedups hash_dst()

2007-12-14 Thread Jarek Poplawski
Jarek Poplawski wrote, On 12/14/2007 09:59 PM:

 Eric Dumazet wrote, On 12/14/2007 12:09 PM:
 ...
 
 +/*
  + * Instead of returning hash % ht->cfg.size (implying a divide)
  + * we return the high 32 bits of the (hash * ht->cfg.size) that will
  + * give results between [0 and cfg.size-1] and same hash distribution,
  + * but using a multiply, less expensive than a divide
  + */
  +return ((u64)hash * ht->cfg.size) >> 32;
 
 Are we sure of the same hash distribution? Probably I miss something,
 but: if this 'hash' is well distributed on 32 bits, and ht->cfg.size
 is smaller than 32 bits, e.g. 256 (8 bits), then this multiplication
 moves to the higher 32 of u64 only max. 8 bits of the most significant
 byte, and the other three bytes are never used, while division is
 always affected by all four bytes...


OOPS! So I missed that the division here is also affected by only one
byte, but from the other side - so, almost the same... It seems this
could have been replaced with masking from the beginning...

Sorry,
Jarek P.


Re: [PATCH 04/29] mm: kmem_estimate_pages()

2007-12-14 Thread Daniel Phillips
On Friday 14 December 2007 07:39, Peter Zijlstra wrote:
 Provide a method to get the upper bound on the pages needed to
 allocate a given number of objects from a given kmem_cache.

 This lays the foundation for a generic reserve framework as presented
 in a later patch in this series. This framework needs to convert
 object demand (kmalloc() bytes, kmem_cache_alloc() objects) to pages.

And hence the big idea that all reserve accounting can be done in units
of pages, allowing the use of a single global reserve that already 
exists.

The other big idea here is that reserve accounting can be independent of 
the actual resource allocations.  This is a powerful idea which we may 
not have explained clearly yet.
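A toy version of that conversion (a hedged sketch of the concept only; the
real kmem_estimate_pages() must also account for slab metadata and per-cpu
slack):

/* Upper-bound the pages needed to satisfy 'nr' objects of 'obj_size'
 * bytes, assuming worst-case packing into whole pages. */
static unsigned long toy_estimate_pages(size_t obj_size, unsigned long nr)
{
	unsigned long per_page = PAGE_SIZE / obj_size;

	if (per_page == 0)	/* a single object spans several pages */
		return nr * DIV_ROUND_UP(obj_size, PAGE_SIZE);
	return DIV_ROUND_UP(nr, per_page);
}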

Daniel


Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)

2007-12-14 Thread Krzysztof Oledzki



On Fri, 14 Dec 2007, Andy Gospodarek wrote:


On Fri, Dec 14, 2007 at 07:57:42PM +0100, Krzysztof Oledzki wrote:



On Fri, 14 Dec 2007, Andy Gospodarek wrote:


On Fri, Dec 14, 2007 at 05:14:57PM +0100, Krzysztof Oledzki wrote:



On Wed, 12 Dec 2007, Jay Vosburgh wrote:


Herbert Xu [EMAIL PROTECTED] wrote:


diff -puN drivers/net/bonding/bond_sysfs.c~bonding-locking-fix
drivers/net/bonding/bond_sysfs.c
--- a/drivers/net/bonding/bond_sysfs.c~bonding-locking-fix
+++ a/drivers/net/bonding/bond_sysfs.c
@@ -,8 +,6 @@ static ssize_t bonding_store_primary(str
out:
 write_unlock_bh(&bond->lock);

-   rtnl_unlock();
-


Looking at the changeset that added this, perhaps the intention
is to hold the lock? If so we should add an rtnl_lock to the start
of the function.


Yes, this function needs to hold locks, and more than just
what's there now.  I believe the following should be correct; I haven't
tested it, though (I'm supposedly on vacation right now).

The following change should be correct for the
bonding_store_primary case discussed in this thread, and also corrects
the bonding_store_active case which performs similar functions.

The bond_change_active_slave and bond_select_active_slave
functions both require rtnl, bond->lock for read and curr_slave_lock for
write_bh, and no other locks.  This is so that the lower level
mode-specific functions can release locks down to just rtnl in order to
call, e.g., dev_set_mac_address with the locks it expects (rtnl only).

Signed-off-by: Jay Vosburgh [EMAIL PROTECTED]

diff --git a/drivers/net/bonding/bond_sysfs.c
b/drivers/net/bonding/bond_sysfs.c
index 11b76b3..28a2d80 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device
*d,
struct slave *slave;
struct bonding *bond = to_bond(d);

-	write_lock_bh(&bond->lock);
+	rtnl_lock();
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
+
	if (!USES_PRIMARY(bond->params.mode)) {
		printk(KERN_INFO DRV_NAME
		       ": %s: Unable to set primary slave; %s is in mode %d\n",
@@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device
*d,
	}
	}
out:
-	write_unlock_bh(&bond->lock);
-
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
	rtnl_unlock();

	return count;
@@ -1190,7 +1193,8 @@ static ssize_t bonding_store_active_slave(struct
device *d,
	struct bonding *bond = to_bond(d);

	rtnl_lock();
-	write_lock_bh(&bond->lock);
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);

	if (!USES_PRIMARY(bond->params.mode)) {
		printk(KERN_INFO DRV_NAME
@@ -1247,7 +1251,8 @@ static ssize_t bonding_store_active_slave(struct
device *d,
	}
	}
out:
-	write_unlock_bh(&bond->lock);
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
	rtnl_unlock();

	return count;


Vanilla 2.6.24-rc5 plus this patch:

=
[ INFO: possible irq lock inversion dependency detected ]
2.6.24-rc5 #1
-
events/0/9 just changed the state of lock:
(mc->mca_lock){-+..}, at: [c0411c7a] mld_ifc_timer_expire+0x130/0x1fb
but this lock took another, soft-read-irq-unsafe lock in the past:
(bond->lock){-.--}

and interrupts could create inverse lock ordering between them.




Grrr, I should have seen that -- sorry.  Try your luck with this instead:

CUT

No luck.

bonding: bond0: setting mode to active-backup (1).
bonding: bond0: Setting MII monitoring interval to 100.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: Adding slave eth0.
e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow
Control: RX/TX
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: first active interface up!
bonding: bond0: enslaving eth0 as an active interface with an up link.
bonding: bond0: Adding slave eth1.
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready


SNIP


bonding: bond0: enslaving eth1 as a backup interface with a down link.
bonding: bond0: Setting eth0 as primary slave.
bond0: no IPv6 routers present



Based on the console log, I'm guessing your initialization scripts use
sysfs to set eth0 as the primary interface for bond0?  Can you confirm?


Yep, that's correct:

postup() {
if [[ ${IFACE} == bond0 ]] ; then
echo -n "+eth0" > /sys/class/net/${IFACE}/bonding/slaves
echo -n "+eth1" > /sys/class/net/${IFACE}/bonding/slaves
echo -n "eth0" > /sys/class/net/${IFACE}/bonding/primary
fi
}


If you did somehow use sysfs to set the primary device as eth0, I'm
guessing you never see this issue without that line or without this
patch.  

Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)

2007-12-14 Thread Andy Gospodarek
On Fri, Dec 14, 2007 at 07:57:42PM +0100, Krzysztof Oledzki wrote:
 
 
 SNIP
 No luck.
 
 bonding: bond0: setting mode to active-backup (1).
 bonding: bond0: Setting MII monitoring interval to 100.
 ADDRCONF(NETDEV_UP): bond0: link is not ready
 bonding: bond0: Adding slave eth0.
 e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow 
 Control: RX/TX
 bonding: bond0: making interface eth0 the new active one.
 bonding: bond0: first active interface up!
 bonding: bond0: enslaving eth0 as an active interface with an up link.
 bonding: bond0: Adding slave eth1.
 ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready

SNIP

 bonding: bond0: enslaving eth1 as a backup interface with a down link.
 bonding: bond0: Setting eth0 as primary slave.
 bond0: no IPv6 routers present
 
 
Based on the console log, I'm guessing your initialization scripts use
sysfs to set eth0 as the primary interface for bond0?  Can you confirm?

If you did somehow use sysfs to set the primary device as eth0, I'm
guessing you never see this issue without that line or without this
patch.  Please confirm this as well.

Thanks,

-andy



Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)

2007-12-14 Thread Andy Gospodarek
On Fri, Dec 14, 2007 at 11:11:15PM +0100, Krzysztof Oledzki wrote:
 
 
 SNIP
 Based on the console log, I'm guessing your initialization scripts use
 sysfs to set eth0 as the primary interface for bond0?  Can you confirm?
 
 Yep, that's correct:
 
 postup() {
 if [[ ${IFACE} == bond0 ]] ; then
  echo -n "+eth0" > /sys/class/net/${IFACE}/bonding/slaves
  echo -n "+eth1" > /sys/class/net/${IFACE}/bonding/slaves
  echo -n "eth0" > /sys/class/net/${IFACE}/bonding/primary
 fi
 }
 

Good. Thanks for the confirmation.

 If you did somehow use sysfs to 

Re: [patch 01/10] e1000e: make E1000E default to the same kconfig setting as E1000

2007-12-14 Thread Adrian Bunk
On Fri, Dec 14, 2007 at 03:39:26PM -0500, Jeff Garzik wrote:
 [EMAIL PROTECTED] wrote:
 From: Randy Dunlap [EMAIL PROTECTED]
...
 So I think the breakage that occurs is mitigated by two factors:
 1) kernel hackers that do their own configs are expected to be able to 
 figure this stuff out.
 2) kernel builders (read: distros, mainly) are expected to have put thought 
 into the Kconfig selection and driver migration strategies.
...
 I would prefer simply to communicate to kernel experts and builders about a 
 Kconfig issue that could potentially affect their booting/networking...  because 
 this patch is only needed if the kernel experts do not already know about a 
 necessary config update.

You miss the vast majority of kconfig users:

3) system administrators etc. who for different reasons compile their 
own kernels but neither are nor want to be kernel developers

There's a reason why e.g. LPI requires you to be able to compile your 
own kernel even for getting a Junior Level Linux Professional 
certificate.

Or that one of the authors of Linux Device Drivers has written a book 
covering only how to build and run your own kernel.

   Jeff

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: [Bugme-new] [Bug 9543] New: RTNL: assertion failed at net/ipv6/addrconf.c (2164)/RTNL: assertion failed at net/ipv4/devinet.c (1055)

2007-12-14 Thread Andy Gospodarek
On Fri, Dec 14, 2007 at 07:57:42PM +0100, Krzysztof Oledzki wrote:
 
 
 SNIP
 Grrr, I should have seen that -- sorry.  Try your luck with this instead:
 CUT
 
 No luck.
 


I'm guessing if we go back to using a write-lock for bond->lock this
will go back to working again, but I'm not totally convinced since there
are plenty of places where we used a read-lock with it.


diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 11b76b3..635b857 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d,
struct slave *slave;
struct bonding *bond = to_bond(d);
 
+   rtnl_lock();
	write_lock_bh(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
+
	if (!USES_PRIMARY(bond->params.mode)) {
		printk(KERN_INFO DRV_NAME
		       ": %s: Unable to set primary slave; %s is in mode %d\n",
@@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d,
}
}
 out:
+	write_unlock_bh(&bond->curr_slave_lock);
	write_unlock_bh(&bond->lock);
-
rtnl_unlock();
 
return count;
@@ -1191,6 +1194,7 @@ static ssize_t bonding_store_active_slave(struct device 
*d,
 
rtnl_lock();
	write_lock_bh(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
 
	if (!USES_PRIMARY(bond->params.mode)) {
printk(KERN_INFO DRV_NAME
@@ -1247,6 +1251,7 @@ static ssize_t 

Re: Packet per Second

2007-12-14 Thread Glen Turner

On Fri, 2007-12-14 at 15:34 +, Flávio Pires wrote:

 Well, I work on an ISP and we have a linux box acting as a
 bridge+firewall. With this bridge+firewall we control the packet rate
 per second from each client and from our repeaters. But I can't
 measure the packet rate per IP. Is there any tool for this?

The usual approach is to generate NetFlow records -- there are
a number of Linux tools for this. Collect them with a collector
(flow-tools being a common choice). Then have a Perl script
which reads the flow records, processes them whichever way you
desire, and drops the result into a rrdtool file (there are modules
for both reading the flow-tools data and outputting in the rrdtool
format). The rrdtool utilities have a limited range of graphs,
but there is a huge selection of graphing packages from other
authors for rrdtool-stored data (Drraw, etc).  Flow-tools also
has some third-party analysis tools, some of those have good
top talker statistics.

This is a lot of work, since you are really putting a complete
measurement infrastructure in place to get the one statistic
you desire.  But I'd encourage you to do that, since knowing
one statistic usually leads to further questions of the data.

-- 
Glen Turner, Senior Network Engineer
Australia's Academic & Research Network    www.aarnet.edu.au



Re: [PATCH 03/29] mm: slb: add knowledge of reserve pages

2007-12-14 Thread Daniel Phillips
On Friday 14 December 2007 07:39, Peter Zijlstra wrote:
 Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to
 allocation contexts that are entitled to it. This is done to ensure
 reserve pages don't leak out and get consumed.

Tighter definitions of "leak out" and "get consumed" would be helpful 
here.  As I see it, the chain of reasoning is:

  * Any MEMALLOC mode allocation must have come from a properly
    throttled path and has a finite lifetime that must eventually
    produce writeout progress.

  * Since the transaction that made the allocation was throttled and
    must have a finite lifetime, we know that it must eventually return
    the resources it consumed to the appropriate resource pool.

Now, I think what you mean by "get consumed" and "leak out" is: become 
pinned by false sharing with other allocations that do not guarantee 
that they will be returned to the resource pool.  We can say "pinned" 
for short.

So you are attempting to prevent slab pages from becoming pinned by 
users that do not obey the reserve management rules, which I think your 
approach achieves.  However...

Note that false sharing of slab pages is still possible between two 
unrelated writeout processes, both of which obey rules for their own 
writeout path, but the pinned combination does not.  This still leaves 
a hole through which a deadlock may slip.

My original solution was simply to allocate a full page when drawing 
from the memalloc reserve, which may use a tad more reserve, but makes 
it possible to prove the algorithm correct.
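A sketch of that alternative (hedged; IS_RESERVE_ALLOC() is a hypothetical
predicate standing in for however the series marks reserve-backed
allocations, and the matching free path would need to know which allocator
was used):

/* Satisfy reserve-mode allocations from whole pages so a reserve
 * object can never be pinned by false sharing with an unrelated
 * object in the same slab page. Costs up to a page per object,
 * but makes the reserve accounting provable. */
void *reserve_kmalloc(size_t size, gfp_t gfp)
{
	if (IS_RESERVE_ALLOC(gfp) && size <= PAGE_SIZE)
		return (void *)__get_free_page(gfp);
	return kmalloc(size, gfp);
}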

Regards,

Daniel


[PATCH] [ROSE] ax25_send_frame() called with a constant paclen = 260

2007-12-14 Thread Bernard Pidoux

Hi,

In rose_link.c, ax25_send_frame() was called with a constant paclen 
parameter of 260 bytes.
This value looked odd to me, for it did not correspond to any defined or 
possible computed length. Replacing this value by 0 (zero) allowed 
ax25_send_frame() to substitute the default AX25 frame size, which in 
turn had a significant effect on AX25 frame fragmentation and removed 
some garbage trailing characters in AX25 frames sent.
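The convention relied on here is the paclen == 0 case inside
ax25_send_frame(), which looks roughly like this (paraphrased from
net/ax25; treat the exact field names as assumptions):

/* paclen == 0 means "use the device's configured AX.25 packet
 * length" instead of a caller-supplied constant such as 260. */
if (paclen == 0) {
	if ((ax25_dev = ax25_dev_ax25dev(dev)) == NULL)
		return NULL;
	paclen = ax25_dev->values[AX25_VALUES_PACLEN];
}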



Signed-off-by: Bernard Pidoux [EMAIL PROTECTED]
--- linux-2.6.24-rc5/net/rose/rose_link.c   2007-12-11 04:48:43.0 
+0100
+++ b/net/rose/rose_link.c  2007-12-14 14:39:23.0 +0100
@@ -107,7 +107,7 @@
else
rose_call = rose_callsign;
 
-	neigh->ax25 = ax25_send_frame(skb, 260, rose_call, &neigh->callsign, 
neigh->digipeat, neigh->dev);
+	neigh->ax25 = ax25_send_frame(skb, 0, rose_call, &neigh->callsign, 
neigh->digipeat, neigh->dev);
 
	return (neigh->ax25 != NULL);
 }


  1   2   >