Re: Pull request for 'jg-20061103-00' tag
Francois Romieu wrote:

Please pull from tag 'jg-20061103-00' in repository

git://electric-eye.fr.zoreil.com/home/romieu/linux-2.6.git jg-20061103-00

to get the changes below.

Distance from 'upstream-fixes' - 17fddc34b36fc26aa8b5f130fe32b446d9d88fa2

Diffstat
 drivers/net/r8169.c |   22 ++++++++++++++++++++--
 1 files changed, 20 insertions(+), 2 deletions(-)

Shortlog
Francois Romieu:
      r8169: perform a PHY reset before any other operation at boot time

This warrants much more testing than pushing into 2.6.19-rc4 would give us, so I'm pulling it into #upstream.

In the past, with 10/100 hubs or ancient Cisco switches, we really didn't want to reset the phy and restart autonegotiation, because that might be problematic.

In any case, this is a behavior change that may solve problems... but also needs testing to ensure that it doesn't also cause problems.

Jeff

-
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/18] e1000: features, updates, documentation
pulled. still waiting on those changes to better modularize the feature detection, etc.
Re: [PATCH] s2io ppc64 fix for readq/writeq
This seems a bit ugly. Could you add #define readq readq to your platform instead?

That's ugly too imho but I suppose I can do it :-)

I generally think it's a bug in the kernel-wide API, if use of said API requires arch-specific ifdefs.

Yes. I agree. In that specific case, I suppose what you propose is the least ugly of the solutions. HAVE_ARCH_* is pretty much out of fashion (and I tend to agree with Linus that it's not pretty anyway). Actually, I tend to think in that specific case that the driver defining something called readq and writeq based on a pair of readl's and writel's is fairly bogus though.

Or maybe the problem could be solved another way, by guaranteeing that a "good enough for drivers" readq() and writeq() exist on all platforms, even 32-bit platforms where the operation isn't inherently atomic.

I'd rather not provide readq/writeq if they aren't atomic.

Ben.
Re: [PATCH 2/2] forcedeth: add recoverable error support
Ayaz Abdulla wrote:
This patch adds support to recover from a previously fatal MAC error. In the past the MAC would be hung on an internal fatal error. On new chipsets, the MAC has the ability to enter a non-fatal state and allow the driver to re-init it.
Signed-Off-By: Ayaz Abdulla [EMAIL PROTECTED]

applied
Re: [PATCH 1/2] forcedeth: add new NVIDIA pci ids
Ayaz Abdulla wrote:
This patch adds pci device ids for the NVIDIA MCP67 chip.
Signed-Off-By: Ayaz Abdulla [EMAIL PROTECTED]

ACK, but please rediff and resend against netdev-2.6.git#upstream
Re: [PATCH 2.6.18] defxx: Big-endian hosts support
Maciej W. Rozycki wrote:
The PDQ DMA engine requires a different byte-swapping mode for big-endian hosts; also the MAC address which is read from a register through PIO has to be byte-swapped. These changes have been verified with DEFPA-DC (PCI) boards and a Broadcom BCM91250A (MIPS CPU based) host.
Signed-off-by: Maciej W. Rozycki [EMAIL PROTECTED]

applied
Re: [PATCH 1/2] forcedeth: add mgmt unit support
Ayaz Abdulla wrote:
This patch adds support for the mgmt unit in certain chipsets. The MAC and the mgmt unit share the PHY and therefore proper initialization procedures are needed for them to maintain coexistence.
Signed-Off-By: Ayaz Abdulla [EMAIL PROTECTED]

applied
Re: [PATCH 2/2] forcedeth: add support for new mcp67 device
Ayaz Abdulla wrote:
This patch adds support for the new mcp67 device into forcedeth.
Signed-Off-By: Ayaz Abdulla [EMAIL PROTECTED]

ACK, but please rediff and resend against latest netdev-2.6.git#upstream
Re: [PATCH] s2io ppc64 fix for readq/writeq
Benjamin Herrenschmidt wrote:

This seems a bit ugly. Could you add #define readq readq to your platform instead?

That's ugly too imho but I suppose I can do it :-)

I generally think it's a bug in the kernel-wide API, if use of said API requires arch-specific ifdefs.

Yes. I agree. In that specific case, I suppose what you propose is the least ugly of the solutions. HAVE_ARCH_* is pretty much out of fashion (and I tend to agree with Linus that it's not pretty anyway). Actually, I tend to think in that specific case that the driver defining something called readq and writeq based on a pair of readl's and writel's is fairly bogus though.

Or maybe the problem could be solved another way, by guaranteeing that a "good enough for drivers" readq() and writeq() exist on all platforms, even 32-bit platforms where the operation isn't inherently atomic.

I'd rather not provide readq/writeq if they aren't atomic.

This is why I said "good enough for drivers". This is _key_. I have run into several [PCI] devices with 64-bit registers, and __none__ of them had requirements such that the Linux platform code -must- provide an atomic readq/writeq. Probably because everybody wants to support 32-bit platforms with their devices.

What you call "fairly bogus" is precisely what drivers need. These devices with 64-bit registers just don't need the atomicity that arch developers harp about :)

Jeff
Re: [PATCH] s2io ppc64 fix for readq/writeq
This is why I said "good enough for drivers". This is _key_. I have run into several [PCI] devices with 64-bit registers, and __none__ of them had requirements such that the Linux platform code -must- provide an atomic readq/writeq. Probably because everybody wants to support 32-bit platforms with their devices.

What you call "fairly bogus" is precisely what drivers need. These devices with 64-bit registers just don't need the atomicity that arch developers harp about :)

Is there any consistency in that case in which half need to be read/written first ? Or none of these ever had side effects ?

Ben.
[git patches] net driver fixes
Please pull from 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git upstream-linus

to receive the following updates:

 drivers/net/Kconfig                         |    4 ++--
 drivers/net/ehea/ehea.h                     |    5 +
 drivers/net/ehea/ehea_ethtool.c             |    2 +-
 drivers/net/ehea/ehea_main.c                |   26 +-
 drivers/net/ehea/ehea_phyp.c                |    2 +-
 drivers/net/ehea/ehea_phyp.h                |    6 --
 drivers/net/ehea/ehea_qmr.c                 |   17 +
 drivers/net/wireless/bcm43xx/bcm43xx_leds.c |    7 ++-
 drivers/net/wireless/bcm43xx/bcm43xx_leds.h |    6 ++
 drivers/net/wireless/bcm43xx/bcm43xx_main.c |   16 +++-
 drivers/net/wireless/hostap/hostap_plx.c    |    4 ++--
 net/ieee80211/ieee80211_rx.c                |   12 ++--
 12 files changed, 66 insertions(+), 41 deletions(-)

Jiri Benc:
      ieee80211: don't flood log with errors

Larry Finger:
      bcm43xx: fix unexpected LED control values in BCM4303 sprom

Michael Buesch:
      bcm43xx: Fix low-traffic netdev watchdog TX timeouts

Pavel Roskin:
      hostap_plx: fix CIS verification

Randy Dunlap:
      Kconfig: remove redundant NETDEVICES depends

Thomas Klein:
      ehea: Nullpointer dereferencation fix
      ehea: Removed redundant define
      ehea: 64K page support fix

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 28c17d1..9cb3ca5 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -486,7 +486,7 @@ config SGI_IOC3_ETH_HW_TX_CSUM
 
 config MIPS_SIM_NET
 	tristate "MIPS simulator Network device (EXPERIMENTAL)"
-	depends on NETDEVICES && MIPS_SIM && EXPERIMENTAL
+	depends on MIPS_SIM && EXPERIMENTAL
 	help
 	  The MIPSNET device is a simple Ethernet network device which is
 	  emulated by the MIPS Simulator.
@@ -2467,7 +2467,7 @@ config ISERIES_VETH
 
 config RIONET
 	tristate "RapidIO Ethernet over messaging driver support"
-	depends on NETDEVICES && RAPIDIO
+	depends on RAPIDIO
 
 config RIONET_TX_SIZE
 	int "Number of outbound queue entries"
diff --git a/drivers/net/ehea/ehea.h b/drivers/net/ehea/ehea.h
index b40724f..39ad9f7 100644
--- a/drivers/net/ehea/ehea.h
+++ b/drivers/net/ehea/ehea.h
@@ -39,7 +39,7 @@
 #include <asm/abs_addr.h>
 #include <asm/io.h>
 #define DRV_NAME	"ehea"
-#define DRV_VERSION	"EHEA_0034"
+#define DRV_VERSION	"EHEA_0043"
 
 #define EHEA_MSG_DEFAULT (NETIF_MSG_LINK | NETIF_MSG_TIMER \
 	| NETIF_MSG_RX_ERR | NETIF_MSG_TX_ERR)
@@ -105,9 +105,6 @@
 #define EHEA_BCMC_TAGGED	0x00
 #define EHEA_BCMC_VLANID_ALL	0x01
 #define EHEA_BCMC_VLANID_SINGLE	0x00
-/* Use this define to kmallocate pHYP control blocks */
-#define H_CB_ALIGNMENT		4096
-
 #define EHEA_CACHE_LINE		128
 
 /* Memory Regions */
diff --git a/drivers/net/ehea/ehea_ethtool.c b/drivers/net/ehea/ehea_ethtool.c
index 82eb2fb..9f57c2e 100644
--- a/drivers/net/ehea/ehea_ethtool.c
+++ b/drivers/net/ehea/ehea_ethtool.c
@@ -238,7 +238,7 @@ static void ehea_get_ethtool_stats(struc
 	data[i++] = port->port_res[0].swqe_refill_th;
 	data[i++] = port->resets;
 
-	cb6 = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL);
+	cb6 = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!cb6) {
 		ehea_error("no mem for cb6");
 		return;
diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index 4538c99..6ad6961 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -92,7 +92,7 @@ static struct net_device_stats *ehea_get
 
 	memset(stats, 0, sizeof(*stats));
 
-	cb2 = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL);
+	cb2 = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!cb2) {
 		ehea_error("no mem for cb2");
 		goto out;
@@ -586,8 +586,8 @@ int ehea_sense_port_attr(struct ehea_por
 	u64 hret;
 	struct hcp_ehea_port_cb0 *cb0;
 
-	cb0 = kzalloc(H_CB_ALIGNMENT, GFP_ATOMIC);   /* May be called via */
-	if (!cb0) {                                  /* ehea_neq_tasklet() */
+	cb0 = kzalloc(PAGE_SIZE, GFP_ATOMIC);   /* May be called via */
+	if (!cb0) {                             /* ehea_neq_tasklet() */
 		ehea_error("no mem for cb0");
 		ret = -ENOMEM;
 		goto out;
@@ -670,7 +670,7 @@ int ehea_set_portspeed(struct ehea_port
 	u64 hret;
 	int ret = 0;
 
-	cb4 = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL);
+	cb4 = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!cb4) {
 		ehea_error("no mem for cb4");
 		ret = -ENOMEM;
@@ -985,7 +985,7 @@ static int ehea_configure_port(struct eh
 	struct hcp_ehea_port_cb0 *cb0;
 
 	ret = -ENOMEM;
-	cb0 = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL);
+	cb0 = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!cb0)
 		goto out;
 
@@ -1443,7 +1443,7 @@ static
Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets
2006/11/6, Eric Dumazet [EMAIL PROTECTED]:

We dont know. You might post some data so that we can have some ideas. Also, these kind of question is better handled by linux netdev mailing list, so I added a CC to this list.

cat /proc/slabinfo
cat /proc/meminfo
cat /proc/net/sockstat
cat /proc/buddyinfo

TCP stack is one thing, but other things may consume ram on your kernel. Also, kernel memory allocation might use twice the ram you intend to use because of power of two alignments.

Are you using iptables connection tracking ?

If you plan to use a lot of RAM in kernel, why dont you use a 64 bits kernel, so that all ram is available for kernel, not only 900 MB ?

Eric

Thank you again for your help. To have more detailed statistic data, I did another round of test and gathered some data. I give the overall description here and detailed /proc/net/sockstat, /proc/meminfo, /proc/slabinfo and /proc/buddyinfo follows.

=
                  slab mem cost   tcp mem pages   lowmem free
with traffic:     254668KB        34693           38772KB
without traffic:  104080KB        1               702652KB
=

detailed info:

during the test (with traffic):

[EMAIL PROTECTED] ~]# cat /proc/net/sockstat
sockets: used 12058
TCP: inuse 4007 orphan 0 tw 0 alloc 4010 mem 34693
UDP: inuse 4
RAW: inuse 0
FRAG: inuse 0 memory 0

[EMAIL PROTECTED] ~]# cat /proc/meminfo
MemTotal:      4136580 kB
MemFree:       3169160 kB
Buffers:         42092 kB
Cached:          20048 kB
SwapCached:          0 kB
Active:         146808 kB
Inactive:        35492 kB
HighTotal:     3276160 kB
HighFree:      3130388 kB
LowTotal:       860420 kB
LowFree:         38772 kB
SwapTotal:     2031608 kB
SwapFree:      2031608 kB
Dirty:               0 kB
Writeback:           0 kB
Mapped:         127720 kB
Slab:           254668 kB
CommitLimit:   4099896 kB
Committed_AS:   367784 kB
PageTables:       1696 kB
VmallocTotal:   116728 kB
VmallocUsed:      3876 kB
VmallocChunk:   110548 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

[EMAIL PROTECTED] ~]# cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ip_conntrack_expect 0 0 92 42 1 : tunables 120 60 8 : slabdata 0 0 0
ip_conntrack 4049 4352 228 17 1 : tunables 120 60 8 : slabdata 256 256 0
bridge_fdb_cache 6 59 64 59 1 : tunables 120 60 8 : slabdata 1 1 0
fib6_nodes 7 113 32 113 1 : tunables 120 60 8 : slabdata 1 1 0
ip6_dst_cache 10 30 256 15 1 : tunables 120 60 8 : slabdata 2 2 0
ndisc_cache 1 20 192 20 1 : tunables 120 60 8 : slabdata 1 1 0
RAWv6 7 10 768 5 1 : tunables 54 27 8 : slabdata 2 2 0
UDPv6 0 0 704 11 2 : tunables 54 27 8 : slabdata 0 0 0
tw_sock_TCPv6 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
request_sock_TCPv6 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
TCPv6 3 3 1344 3 1 : tunables 24 12 8 : slabdata 1 1 0
cifs_small_rq 30 36 448 9 1 : tunables 54 27 8 : slabdata 4 4 0
cifs_request 4 4 16512 1 8 : tunables 8 4 0 : slabdata 4 4 0
cifs_oplock_structs 0 0 32 113 1 : tunables 120 60 8 : slabdata 0 0 0
cifs_mpx_ids 3 59 64 59 1 : tunables 120 60 8 : slabdata 1 1 0
cifs_inode_cache 0 0 496 8 1 : tunables 54 27 8 : slabdata 0 0 0
rpc_buffers 8 8 2048 2 1 : tunables 24 12 8 : slabdata 4 4 0
rpc_tasks 8 20 192 20 1 : tunables 120 60 8 : slabdata 1 1 0
rpc_inode_cache 6 7 576 7 1 : tunables 54 27 8 : slabdata 1 1 0
ip_fib_alias 9 113 32 113 1 : tunables 120 60 8 : slabdata 1 1 0
ip_fib_hash 9 113 32 113 1 : tunables 120 60 8 : slabdata 1 1 0
uhci_urb_priv 0 0 40 92 1 : tunables 120 60 8 : slabdata 0 0 0
dm-snapshot-in 128 134 56 67 1 : tunables 120 60 8 : slabdata 2 2 0
dm-snapshot-ex 0 0 24 145 1 : tunables 120 60 8 : slabdata 0 0 0
ext3_inode_cache 8275 18378 640 6 1 : tunables
Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets
2006/11/6, Eric Dumazet [EMAIL PROTECTED]:

Slab: 293952 kB

So 292 MB used by slab for 2000 sessions. Expect 600 MB used by slab for 4000 sessions. So your precious LOWMEM is not gone at all. It *IS* used by SLAB.

You forgot to send cat /proc/slabinfo

sorry I didn't make myself clear enough. 2000 sessions means 4000 sockets, 2000 for the server, 2000 for the client.
Re: !! SPAM Suspect : SPAM-URL-DBL !! Re: (usagi-core 31424) Re: [PATCH 7/13] [RFC] [IPV6] Move source address selection into route lookup.
The host testlab.linux-ipv6.org doesn't seem to be visible to the outside world so could you post the results somewhere where I could take a closer look at the results?

It is visible world-wide, assuming you have IPv6 connection. With IPv4-only connection, one can try to append .ipv4.sixxs.org:

http://testlab.linux-ipv6.org.ipv4.sixxs.org/tahi-autorun.2/net-2.6_20061018/

Jean-Mickael
Re: [PATCH] s2io ppc64 fix for readq/writeq
Benjamin Herrenschmidt wrote:

This is why I said "good enough for drivers". This is _key_. I have run into several [PCI] devices with 64-bit registers, and __none__ of them had requirements such that the Linux platform code -must- provide an atomic readq/writeq. Probably because everybody wants to support 32-bit platforms with their devices.

What you call "fairly bogus" is precisely what drivers need. These devices with 64-bit registers just don't need the atomicity that arch developers harp about :)

Is there any consistency in that case in which half need to be read/written first ? Or none of these ever had side effects ?

Generally the kernel code should write the two 32-bit chunks to the memory-mapped region in order (low dword first), and let things take care of themselves from there.

That's pretty much the implementation that -every- driver copies, when they need readq/writeq to work on a 32-bit platform.

Jeff
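A minimal userspace sketch of the "two 32-bit chunks, low dword first" approach described above. Everything prefixed sketch_ is hypothetical: sketch_readl() and sketch_writel() merely stand in for the kernel's readl()/writel(), and plain memory stands in for a memory-mapped register pair. No byte swapping or bus ordering is modelled, so this only illustrates the access order and the lack of atomicity, not a real platform implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's 32-bit MMIO accessors. */
static inline uint32_t sketch_readl(const volatile void *addr)
{
	return *(const volatile uint32_t *)addr;
}

static inline void sketch_writel(uint32_t val, volatile void *addr)
{
	*(volatile uint32_t *)addr = val;
}

/* Non-atomic 64-bit write: low dword at addr first, high dword at addr + 4. */
static inline void sketch_writeq(uint64_t val, volatile void *addr)
{
	volatile uint8_t *p = addr;

	assert(((uintptr_t)addr & 3) == 0);	/* both halves must be dword aligned */
	sketch_writel((uint32_t)val, p);
	sketch_writel((uint32_t)(val >> 32), p + 4);
}

/* Non-atomic 64-bit read, in the same low-then-high order. */
static inline uint64_t sketch_readq(const volatile void *addr)
{
	const volatile uint8_t *p = addr;
	uint32_t lo = sketch_readl(p);
	uint32_t hi = sketch_readl(p + 4);

	return ((uint64_t)hi << 32) | lo;
}
```

Between the two halves another CPU or the device itself can observe a half-updated value, which is exactly the atomicity concern raised in the thread. The sketch is only appropriate for registers that tolerate that.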
Re: [PATCH] s2io ppc64 fix for readq/writeq
On Mon, 6 Nov 2006, Jeff Garzik wrote:

This seems a bit ugly. Could you add #define readq readq to your platform instead?

Heartily agreed. MUCH better than adding unrelated #if defined() stuff, whether arch-related or otherwise.

Linus
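The "#define readq readq" idiom can be sketched in one file as follows. The split into an "arch header" part and a "driver" part is only marked by comments here, the function bodies are illustrative userspace stand-ins rather than real platform code, and SKETCH_USING_FALLBACK is a hypothetical marker added so the outcome can be checked.

```c
#include <assert.h>
#include <stdint.h>

/* --- What a platform header might provide (illustrative body only) --- */
static inline uint64_t readq(const volatile void *addr)
{
	return *(const volatile uint64_t *)addr;
}
/* Self-referential define: advertises "readq exists" to the preprocessor. */
#define readq readq

/* --- What a driver can then do without any arch-specific #ifdef --- */
#ifndef readq
/* Fallback built from two 32-bit reads, low dword first. */
static inline uint64_t readq(const volatile void *addr)
{
	const volatile uint32_t *p = addr;

	return ((uint64_t)p[1] << 32) | p[0];
}
#define SKETCH_USING_FALLBACK 1
#else
#define SKETCH_USING_FALLBACK 0
#endif
```

Because the macro expands to the function's own name, calls are unaffected. The only visible effect is that "#ifndef readq" becomes a portable feature test, so the driver needs no list of architectures.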
Re: (usagi-core 31424) Re: [PATCH 7/13] [RFC] [IPV6] Move source address selection into route lookup.
[ reposted, with better subject ]

http://testlab.linux-ipv6.org/tahi-autorun.2/net-2.6_20061018/

The host testlab.linux-ipv6.org doesn't seem to be visible to the outside world so could you post the results somewhere where I could take a closer look at the results?

It is visible world-wide, assuming you have IPv6 connection. With IPv4-only connection, one can try to append .ipv4.sixxs.org:

http://testlab.linux-ipv6.org.ipv4.sixxs.org/tahi-autorun.2/net-2.6_20061018/

Jean-Mickael
Re: [PATCH] s2io ppc64 fix for readq/writeq
On Mon, 2006-11-06 at 01:37 -0800, Linus Torvalds wrote:

On Mon, 6 Nov 2006, Jeff Garzik wrote:

This seems a bit ugly. Could you add #define readq readq to your platform instead?

Heartily agreed. MUCH better than adding unrelated #if defined() stuff, whether arch-related or otherwise.

I agree it's less ugly, though I still don't like it much :-) Anyway, what do you think of Jeff proposal to just implement them as two 32 bits operations ? My arch guy side screams at the idea, but if, indeed, drivers generally cope fine with it, I suppose that's ok.

Ben.
Re: [PATCH] s2io ppc64 fix for readq/writeq
On Mon, 6 Nov 2006, Benjamin Herrenschmidt wrote:

Anyway, what do you think of Jeff proposal to just implement them as two 32 bits operations ? My arch guy side screams at the idea, but if, indeed, drivers generally cope fine with it, I suppose that's ok.

Last I saw, that's how normal PCI will split the IO anyway, so I guess it makes sense.

Linus
Re: [PATCH] s2io ppc64 fix for readq/writeq
On Mon, 2006-11-06 at 01:50 -0800, Linus Torvalds wrote:

On Mon, 6 Nov 2006, Benjamin Herrenschmidt wrote:

Anyway, what do you think of Jeff proposal to just implement them as two 32 bits operations ? My arch guy side screams at the idea, but if, indeed, drivers generally cope fine with it, I suppose that's ok.

Last I saw, that's how normal PCI will split the IO anyway, so I guess it makes sense.

Hrm.. true indeed. I'll implement them that way for ppc32 then.

Ben.
Re: [PATCH] s2io ppc64 fix for readq/writeq
Benjamin Herrenschmidt wrote:

On Mon, 2006-11-06 at 01:50 -0800, Linus Torvalds wrote:

On Mon, 6 Nov 2006, Benjamin Herrenschmidt wrote:

Anyway, what do you think of Jeff proposal to just implement them as two 32 bits operations ? My arch guy side screams at the idea, but if, indeed, drivers generally cope fine with it, I suppose that's ok.

Last I saw, that's how normal PCI will split the IO anyway, so I guess it makes sense.

Hrm.. true indeed. I'll implement them that way for ppc32 then.

Bonus points if you want to find-and-kill where individual drivers did

	#ifndef readq
	implement readq and writeq by hand...
	#endif

:)
Re: [PATCH] s2io ppc64 fix for readq/writeq
On Mon, 2006-11-06 at 04:55 -0500, Jeff Garzik wrote:

Benjamin Herrenschmidt wrote:

On Mon, 2006-11-06 at 01:50 -0800, Linus Torvalds wrote:

On Mon, 6 Nov 2006, Benjamin Herrenschmidt wrote:

Anyway, what do you think of Jeff proposal to just implement them as two 32 bits operations ? My arch guy side screams at the idea, but if, indeed, drivers generally cope fine with it, I suppose that's ok.

Last I saw, that's how normal PCI will split the IO anyway, so I guess it makes sense.

Hrm.. true indeed. I'll implement them that way for ppc32 then.

Bonus points if you want to find-and-kill where individual drivers did

	#ifndef readq
	implement readq and writeq by hand...
	#endif

Yes, well, we would have to make sure all archs have them defined first though, but I suppose I can have a look later this week, maybe tomorrow. Shouldn't be too hard :)

Ben.
[DECNET] Endian bug fixes
Hi,

Here is a patch which fixes some endianness problems.

Patrick: since you have both big & little endian machines at your disposal, can you test to ensure this is ok? Thanks,

Steve.

From ed3de950e89f8b02302308a2bedd59123ff3b88e Mon Sep 17 00:00:00 2001
From: Steven Whitehouse [EMAIL PROTECTED]
Date: Mon, 6 Nov 2006 10:30:30 -0500
Subject: [PATCH] [DECNET] Endianess fixes

Here are some fixes to endianess problems spotted by Al Viro.

Cc: Al Viro [EMAIL PROTECTED]
Cc: Patrick Caulfield [EMAIL PROTECTED]
Signed-off-by: Steven Whitehouse [EMAIL PROTECTED]
---
 net/decnet/af_decnet.c |   21 ++++++++++-----------
 net/decnet/dn_rules.c  |    4 ++--
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 3456cd3..37b4720 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -166,7 +166,7 @@ static struct hlist_head *dn_find_list(s
 	if (scp->addr.sdn_flags & SDF_WILD)
 		return hlist_empty(&dn_wild_sk) ? &dn_wild_sk : NULL;
 
-	return &dn_sk_hash[scp->addrloc & DN_SK_HASH_MASK];
+	return &dn_sk_hash[dn_ntohs(scp->addrloc) & DN_SK_HASH_MASK];
 }
 
 /*
@@ -180,7 +180,7 @@ static int check_port(__le16 port)
 	if (port == 0)
 		return -1;
 
-	sk_for_each(sk, node, &dn_sk_hash[port & DN_SK_HASH_MASK]) {
+	sk_for_each(sk, node, &dn_sk_hash[dn_ntohs(port) & DN_SK_HASH_MASK]) {
 		struct dn_scp *scp = DN_SK(sk);
 		if (scp->addrloc == port)
 			return -1;
@@ -194,12 +194,12 @@ static unsigned short port_alloc(struct
 	static unsigned short port = 0x2000;
 	unsigned short i_port = port;
 
-	while(check_port(++port) != 0) {
+	while(check_port(dn_htons(++port)) != 0) {
 		if (port == i_port)
 			return 0;
 	}
 
-	scp->addrloc = port;
+	scp->addrloc = dn_htons(port);
 
 	return 1;
 }
@@ -418,7 +418,7 @@ struct sock *dn_find_by_skb(struct sk_bu
 	struct dn_scp *scp;
 
 	read_lock(&dn_hash_lock);
-	sk_for_each(sk, node, &dn_sk_hash[cb->dst_port & DN_SK_HASH_MASK]) {
+	sk_for_each(sk, node, &dn_sk_hash[dn_ntohs(cb->dst_port) & DN_SK_HASH_MASK]) {
 		scp = DN_SK(sk);
 		if (cb->src != dn_saddr2dn(&scp->peer))
 			continue;
@@ -1016,13 +1016,12 @@ static void dn_access_copy(struct sk_buf
 
 static void dn_user_copy(struct sk_buff *skb, struct optdata_dn *opt)
 {
-        unsigned char *ptr = skb->data;
-
-        opt->opt_optl = *ptr++;
-        opt->opt_status = 0;
-        memcpy(opt->opt_data, ptr, opt->opt_optl);
-        skb_pull(skb, dn_ntohs(opt->opt_optl) + 1);
+	unsigned char *ptr = skb->data;
+	opt->opt_optl = dn_htons((__u16)*ptr++);
+	opt->opt_status = 0;
+	memcpy(opt->opt_data, ptr, dn_ntohs(opt->opt_optl));
+	skb_pull(skb, dn_ntohs(opt->opt_optl) + 1);
 }
 
 static struct sk_buff *dn_wait_for_connect(struct sock *sk, long *timeo)
diff --git a/net/decnet/dn_rules.c b/net/decnet/dn_rules.c
index 3e0c882..590e0a7 100644
--- a/net/decnet/dn_rules.c
+++ b/net/decnet/dn_rules.c
@@ -124,8 +124,8 @@ static struct nla_policy dn_fib_rule_pol
 static int dn_fib_rule_match(struct fib_rule *rule, struct flowi *fl, int flags)
 {
 	struct dn_fib_rule *r = (struct dn_fib_rule *)rule;
-	u16 daddr = fl->fld_dst;
-	u16 saddr = fl->fld_src;
+	__le16 daddr = fl->fld_dst;
+	__le16 saddr = fl->fld_src;
 
 	if (((saddr ^ r->src) & r->srcmask) ||
 	    ((daddr ^ r->dst) & r->dstmask))
-- 
1.4.1
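Why the hash lookups in this patch convert the port before masking can be shown with a small userspace sketch. The names are hypothetical, the mask value 0xff is an assumption made for illustration, and the port is modelled as the two bytes it occupies on the wire. A raw little-endian value loaded on a big-endian host has its bytes swapped, so masking it would pick the wrong byte and hash the socket into the wrong bucket.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HASH_MASK 0xffu	/* assumed bucket mask, for illustration only */

/* Decode a 16-bit little-endian wire value, independent of host byte order. */
static inline uint16_t sketch_le16_to_cpu(const uint8_t wire[2])
{
	return (uint16_t)(wire[0] | (wire[1] << 8));
}

/* Bucket index computed from the host-order value, as the patch does. */
static inline unsigned int sketch_hash_index(const uint8_t wire_port[2])
{
	return sketch_le16_to_cpu(wire_port) & SKETCH_HASH_MASK;
}
```

With port 0x2001 (wire bytes 01 20), the converted value lands in bucket 0x01 on every host. Masking the unconverted value would give 0x01 on little-endian hosts but 0x20 on big-endian ones.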
Re: [DECNET] Endian bug fixes
On Mon, Nov 06, 2006 at 10:32:43AM +0000, Al Viro wrote:

On Mon, Nov 06, 2006 at 10:31:02AM +0000, Steven Whitehouse wrote:

+	opt->opt_optl = dn_htons((__u16)*ptr++);

Lose that cast; it's only confusing the things...

+	memcpy(opt->opt_data, ptr, dn_ntohs(opt->opt_optl));
+	skb_pull(skb, dn_ntohs(opt->opt_optl) + 1);

... and I'd actually do

	u16 len = *ptr++; /* yes, it's 8bit on the wire */
	opt->opt_optl = dn_htons(len);
	BUG_ON(len > 16); /* we've checked the contents earlier */
	memcpy(opt->opt_data, ptr, len);
	skb_pull(skb, len + 1);

BTW, why the hell do we keep ->opt_optl __le16 internally? If we ever pass it to userland, fine, but let's convert to __le16 *then*...
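A userspace sketch of the shape suggested above: read the one-byte length once into a host-order variable, check it against the option buffer, then copy. All names are hypothetical, the 16-byte buffer size is an assumption, and the sketch returns an error where the kernel snippet asserts, since it cannot rely on an earlier validation pass. The length is kept in host order internally, in the spirit of the closing remark.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_OPT_MAX 16	/* assumed size of the option data buffer */

struct sketch_optdata {
	uint16_t optl;			/* kept in host order internally */
	uint8_t  data[SKETCH_OPT_MAX];
};

/*
 * Parse a length-prefixed option: one length byte followed by that many
 * data bytes. Returns the number of bytes consumed (len + 1), or -1 if
 * the length exceeds the option buffer or the input.
 */
static int sketch_user_copy(const uint8_t *buf, size_t buflen,
			    struct sketch_optdata *opt)
{
	uint16_t len;

	if (buflen < 1)
		return -1;
	len = buf[0];			/* 8 bits on the wire */
	if (len > SKETCH_OPT_MAX || (size_t)len + 1 > buflen)
		return -1;
	opt->optl = len;
	memcpy(opt->data, buf + 1, len);
	return len + 1;
}
```

Using one local variable for the length means the copy size and the consumed size cannot drift apart through repeated byte-order conversions.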
Re: [DECNET] Endian bug fixes
On Mon, Nov 06, 2006 at 10:31:02AM +0000, Steven Whitehouse wrote:

+	opt->opt_optl = dn_htons((__u16)*ptr++);

Lose that cast; it's only confusing the things...

+	memcpy(opt->opt_data, ptr, dn_ntohs(opt->opt_optl));
+	skb_pull(skb, dn_ntohs(opt->opt_optl) + 1);

... and I'd actually do

	u16 len = *ptr++; /* yes, it's 8bit on the wire */
	opt->opt_optl = dn_htons(len);
	BUG_ON(len > 16); /* we've checked the contents earlier */
	memcpy(opt->opt_data, ptr, len);
	skb_pull(skb, len + 1);
Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets
On Monday 06 November 2006 09:59, Zhao Xiaoming wrote:

Thank you again for your help. To have more detailed statistic data, I did another round of test and gathered some data. I give the overall description here and detailed /proc/net/sockstat, /proc/meminfo, /proc/slabinfo and /proc/buddyinfo follows.

=
                  slab mem cost   tcp mem pages   lowmem free
with traffic:     254668KB        34693           38772KB
without traffic:  104080KB        1               702652KB
=

Thank you for detailed infos.

It appears you have an extensive use of threads (about 10000), since :

task_struct 10095 10095 1360 3 1 : tunables 24 12 8 : slabdata 3365 3365 0

Each thread has a kernel stack, 8KB (ie 2 pages, order-1 allocation), plus a user vma

vm_area_struct 21346 21504 92 42 1 : tunables 120 60 8 : slabdata 512 512 0

Most likely you dont need that much threads. A program with fewer threads will perform better and use less ram.
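The per-thread cost argued above can be checked with a little arithmetic. The sketch below assumes 4 KB pages (typical for 32-bit x86, not stated in the thread) and reads the slabinfo columns as objsize 1360, 3 objects per slab, 1 page per slab for task_struct, which is an interpretation of the flattened output that happens to agree with the slab count (10095 / 3 = 3365). The helper names are hypothetical.

```c
#include <assert.h>

#define SKETCH_PAGE_KB 4UL	/* assumption: 4 KB pages */

/* Memory held by a slab cache, in KB: slabs * pages per slab * page size. */
static unsigned long sketch_slab_kb(unsigned long num_slabs,
				    unsigned long pages_per_slab)
{
	return num_slabs * pages_per_slab * SKETCH_PAGE_KB;
}

/* Memory held by kernel stacks, in KB. */
static unsigned long sketch_stack_kb(unsigned long threads,
				     unsigned long stack_kb)
{
	return threads * stack_kb;
}
```

Under those assumptions, 10095 threads with 8 KB stacks hold 80760 KB of lowmem, the task_struct cache holds another 13460 KB, and the vm_area_struct cache 2048 KB. That is roughly 94 MB attributable to threads alone, independent of any TCP traffic.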
tg3_read_partno(): possible array overrun
The Coverity checker noted the following in drivers/net/tg3.c:

-- snip --
...
static void __devinit tg3_read_partno(struct tg3 *tp)
{
	unsigned char vpd_data[256];
	int i;
...
	/* Now parse and find the part number. */
	for (i = 0; i < 256; ) {
		unsigned char val = vpd_data[i];
		int block_end;

		if (val == 0x82 || val == 0x91) {
			i = (i + 3 +
			     (vpd_data[i + 1] +
			      (vpd_data[i + 2] << 8)));
			continue;
		}

		if (val != 0x90)
			goto out_not_found;

		block_end = (i + 3 +
			     (vpd_data[i + 1] +
			      (vpd_data[i + 2] << 8)));
		i += 3;
...
-- snip --

The problem is that vpd_data[i + 2] could be vpd_data[255 + 2].

cu
Adrian

-- 
"Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
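One way to close the overrun is to validate the three-byte block header before reading the length bytes. The sketch below is a userspace illustration with hypothetical names, not the fix that went into tg3. It only mirrors the tag values and the length encoding quoted above.

```c
#include <assert.h>

/*
 * Total size of the block starting at offset i (3-byte header plus
 * payload), or -1 if the header itself would run past the buffer.
 */
static int sketch_vpd_block_size(const unsigned char *vpd, int len, int i)
{
	if (i < 0 || i + 2 >= len)
		return -1;
	return 3 + (vpd[i + 1] + (vpd[i + 2] << 8));
}

/* Offset of the first 0x90 block, or -1 if not found or malformed. */
static int sketch_find_ro_block(const unsigned char *vpd, int len)
{
	int i = 0;

	while (i < len) {
		int size = sketch_vpd_block_size(vpd, len, i);
		unsigned char tag;

		if (size < 0)
			return -1;	/* header would cross the end */
		tag = vpd[i];
		if (tag == 0x82 || tag == 0x91) {
			i += size;	/* skip this block */
			continue;
		}
		if (tag != 0x90)
			return -1;
		return i;
	}
	return -1;
}
```

The loop condition alone only guarantees that vpd[i] is in range. The extra check guarantees vpd[i + 1] and vpd[i + 2] as well, which is the case the checker flagged.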
Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets
2006/11/6, Zhao Xiaoming [EMAIL PROTECTED]:

2006/11/6, Eric Dumazet [EMAIL PROTECTED]:

On Monday 06 November 2006 09:59, Zhao Xiaoming wrote:

Thank you again for your help. To have more detailed statistic data, I did another round of test and gathered some data. I give the overall description here and detailed /proc/net/sockstat, /proc/meminfo, /proc/slabinfo and /proc/buddyinfo follows.

=
                  slab mem cost   tcp mem pages   lowmem free
with traffic:     254668KB        34693           38772KB
without traffic:  104080KB        1               702652KB
=

Thank you for detailed infos.

It appears you have an extensive use of threads (about 10000), since :

task_struct 10095 10095 1360 3 1 : tunables 24 12 8 : slabdata 3365 3365 0

Each thread has a kernel stack, 8KB (ie 2 pages, order-1 allocation), plus a user vma

vm_area_struct 21346 21504 92 42 1 : tunables 120 60 8 : slabdata 512 512 0

Most likely you dont need that much threads. A program with fewer threads will perform better and use less ram.

Thanks for the comments. I know the threads may cost a lot of memory. However, I already excluded them from the statistics. The 'after test' info was taken while the 10000 threads were running but no traffic was relayed. If you look at the meminfo of 'after test', there is still 104080 kB of slab memory, which should already include the thread kernel memory cost (8K*10000=80MB). I know 10000 threads are not necessary; I just used the simple logic to do some tests.

And I just tried 2500 threads. The results are the same.
Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets
2006/11/6, Arjan van de Ven [EMAIL PROTECTED]:

On Mon, 2006-11-06 at 14:07 +0800, Zhao Xiaoming wrote:

Dears, I'm running a linux box with kernel version 2.6.16. The hardware has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC cards is Intel 82571 on PCI-e bus.

are you using a 32 bit or a 64 bit OS?
Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets
2006/11/6, Eric Dumazet [EMAIL PROTECTED]:

On Monday 06 November 2006 09:59, Zhao Xiaoming wrote:

Thank you again for your help. To have more detailed statistic data, I did another round of test and gathered some data. I give the overall description here and detailed /proc/net/sockstat, /proc/meminfo, /proc/slabinfo and /proc/buddyinfo follows.

=
                  slab mem cost   tcp mem pages   lowmem free
with traffic:     254668KB        34693           38772KB
without traffic:  104080KB        1               702652KB
=

Thank you for detailed infos.

It appears you have an extensive use of threads (about 10000), since :

task_struct 10095 10095 1360 3 1 : tunables 24 12 8 : slabdata 3365 3365 0

Each thread has a kernel stack, 8KB (ie 2 pages, order-1 allocation), plus a user vma

vm_area_struct 21346 21504 92 42 1 : tunables 120 60 8 : slabdata 512 512 0

Most likely you dont need that much threads. A program with fewer threads will perform better and use less ram.

Thanks for the comments. I know the threads may cost a lot of memory. However, I already excluded them from the statistics. The 'after test' info was taken while the 10000 threads were running but no traffic was relayed. If you look at the meminfo of 'after test', there is still 104080 kB of slab memory, which should already include the thread kernel memory cost (8K*10000=80MB). I know 10000 threads are not necessary; I just used the simple logic to do some tests.
Re: [PATCH] ieee80211softmac: fix verbosity when debug disabled
On Sat, 2006-11-04 at 13:29 -0600, Larry Finger wrote: SoftMAC contains a number of debug-type messages that continue to print even when debugging is turned off. This patch substitutes dprintkl for printkl for those lines. Signed-off-by: Larry Finger [EMAIL PROTECTED] Fine with me. Acked-by: Johannes Berg [EMAIL PROTECTED] Index: wireless-2.6/net/ieee80211/softmac/ieee80211softmac_auth.c === --- wireless-2.6.orig/net/ieee80211/softmac/ieee80211softmac_auth.c +++ wireless-2.6/net/ieee80211/softmac/ieee80211softmac_auth.c @@ -158,7 +158,7 @@ ieee80211softmac_auth_resp(struct net_de /* Make sure that we've got an auth queue item for this request */ if(aq == NULL) { - printkl(KERN_DEBUG PFX Authentication response received from MAC_FMT but no queue item exists.\n, MAC_ARG(auth-header.addr2)); + dprintkl(KERN_DEBUG PFX Authentication response received from MAC_FMT but no queue item exists.\n, MAC_ARG(auth-header.addr2)); /* Error #? */ return -1; } @@ -166,7 +166,7 @@ ieee80211softmac_auth_resp(struct net_de /* Check for out of order authentication */ if(!net-authenticating) { - printkl(KERN_DEBUG PFX Authentication response received from MAC_FMT but did not request authentication.\n,MAC_ARG(auth-header.addr2)); + dprintkl(KERN_DEBUG PFX Authentication response received from MAC_FMT but did not request authentication.\n,MAC_ARG(auth-header.addr2)); return -1; } @@ -342,7 +342,7 @@ ieee80211softmac_deauth_req(struct ieee8 /* Make sure the network is authenticated */ if (!net-authenticated) { - printkl(KERN_DEBUG PFX Can't send deauthentication packet, network is not authenticated.\n); + dprintkl(KERN_DEBUG PFX Can't send deauthentication packet, network is not authenticated.\n); /* Error okay? 
*/ return -EPERM; } @@ -376,7 +376,7 @@ ieee80211softmac_deauth_resp(struct net_ net = ieee80211softmac_get_network_by_bssid(mac, deauth-header.addr2); if (net == NULL) { - printkl(KERN_DEBUG PFX Received deauthentication packet from MAC_FMT, but that network is unknown.\n, + dprintkl(KERN_DEBUG PFX Received deauthentication packet from MAC_FMT, but that network is unknown.\n, MAC_ARG(deauth-header.addr2)); return 0; } @@ -384,7 +384,7 @@ ieee80211softmac_deauth_resp(struct net_ /* Make sure the network is authenticated */ if(!net-authenticated) { - printkl(KERN_DEBUG PFX Can't perform deauthentication, network is not authenticated.\n); + dprintkl(KERN_DEBUG PFX Can't perform deauthentication, network is not authenticated.\n); /* Error okay? */ return -EPERM; } - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Source address selection + multicast
Hello,

Preferred source address selection in the routing table (src field) currently does not work properly with multicast destination addresses: it leads packets to be routed through the wrong network device (see http://bugzilla.kernel.org/show_bug.cgi?id=7398). It seems to me that the main reason for this is compatibility with old multicast applications, and I can see no fundamental reason preventing the use of these two features together. Why not find a way to let them coexist?

What about a sysctl option that lets users who really want to disable the compatibility hack do so, and restore normal behavior? I am thinking about something like the patch below. Or does another simple way to do it come to mind? What do you think about it?

diff -urNp linux-2.6.18/Documentation/filesystems/proc.txt linux-2.6.18/Documentation/filesystems/proc.txt
--- linux-2.6.18/Documentation/filesystems/proc.txt 2006-09-19 20:42:06.0 -0700
+++ linux-2.6.18/Documentation/filesystems/proc.txt 2006-10-26 05:13:15.0 -0700
@@ -1758,6 +1758,15 @@ max_delay, min_delay
 Delays for flushing the routing cache.
+mc_src_strict
+-
+
+There is a hack in the kernel router which provides compatibility for old
+multicast applications such as vic, vat and friends. Unfortunately, this
+hack also breaks normal behavior of preferred source address selection
+(iproute2 src field) with multicast and limited broadcast. Enabling this
+option disables this hack and restores normal (strict) behavior.
+ redirect_load, redirect_number -- diff -urNp linux-2.6.18/include/linux/sysctl.h linux-2.6.18/include/linux/sysctl.h --- linux-2.6.18/include/linux/sysctl.h 2006-09-19 20:42:06.0 -0700 +++ linux-2.6.18/include/linux/sysctl.h 2006-10-26 04:25:00.0 -0700 @@ -433,6 +433,7 @@ enum { NET_IPV4_ROUTE_MIN_ADVMSS=17, NET_IPV4_ROUTE_SECRET_INTERVAL=18, NET_IPV4_ROUTE_GC_MIN_INTERVAL_MS=19, + NET_IPV4_ROUTE_MC_SRC_STRICT=20, }; enum diff -urNp linux-2.6.18/net/ipv4/route.c linux-2.6.18/net/ipv4/route.c --- linux-2.6.18/net/ipv4/route.c 2006-09-19 20:42:06.0 -0700 +++ linux-2.6.18/net/ipv4/route.c 2006-10-26 05:11:00.0 -0700 @@ -132,6 +132,7 @@ static int ip_rt_mtu_expires= 10 * 60 static int ip_rt_min_pmtu = 512 + 20 + 20; static int ip_rt_min_advmss= 256; static int ip_rt_secret_interval = 10 * 60 * HZ; +static int ip_rt_mc_src_strict = 0; static unsigned long rt_deadline; #define RTprint(a...) printk(KERN_DEBUG a) @@ -2416,7 +2417,7 @@ static int ip_route_output_slow(struct r of another iface. --ANK */ - if (oldflp-oif == 0 + if (!ip_rt_mc_src_strict oldflp-oif == 0 (MULTICAST(oldflp-fl4_dst) || oldflp-fl4_dst == 0x)) { /* Special hack: user can direct multicasts and limited broadcast via necessary interface @@ -2431,6 +2432,12 @@ static int ip_route_output_slow(struct r cannot know, that ttl is zero, so that packet will not leave this host and route is valid). Luckily, this hack is good workaround. + + Unfortunately, it also breaks normal behavior of + source address preference, so I added a sysctl option + to let the user disable this hack and restore normal + behavior. By default, the hack is still enabled (old + compatibility behavior). 
-- PY */ fl.oif = dev_out-ifindex; @@ -3057,6 +3064,15 @@ ctl_table ipv4_route_table[] = { .proc_handler = proc_dointvec_jiffies, .strategy = sysctl_jiffies, }, + { + .ctl_name = NET_IPV4_ROUTE_MC_SRC_STRICT, + .procname = mc_src_strict, + .data = ip_rt_mc_src_strict, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = ipv4_doint_and_flush, + .strategy = ipv4_doint_and_flush_strategy, + }, { .ctl_name = 0 } }; #endif -- Pierre Ynard ___ Découvrez une nouvelle façon d'obtenir des réponses à toutes vos questions ! Profitez des connaissances, des opinions et des expériences des internautes sur Yahoo! Questions/Réponses http://fr.answers.yahoo.com - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2.6.19-rc4-git10][PKT_SCHED] sch_htb: INIT_HLIST_NODE after hlist_del()
After hlist_del() next and pprev pointers are not NULL so hlist_unhashed() doesn't work properly.

Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]
---

diff -Nurp linux-2.6.19-rc4-git10-/net/sched/sch_htb.c linux-2.6.19-rc4-git10/net/sched/sch_htb.c
--- linux-2.6.19-rc4-git10-/net/sched/sch_htb.c 2006-11-06 11:42:41.0 +0100
+++ linux-2.6.19-rc4-git10/net/sched/sch_htb.c 2006-11-06 11:53:15.0 +0100
@@ -1284,8 +1284,10 @@ static void htb_destroy_class(struct Qdi
 						  struct htb_class, sibling));
 
 	/* note: this delete may happen twice (see htb_delete) */
-	if (!hlist_unhashed(&cl->hlist))
+	if (!hlist_unhashed(&cl->hlist)) {
 		hlist_del(&cl->hlist);
+		INIT_HLIST_NODE(&cl->hlist);
+	}
 	list_del(&cl->sibling);
 
 	if (cl->prio_activity)
@@ -1333,8 +1335,10 @@ static int htb_delete(struct Qdisc *sch,
 	sch_tree_lock(sch);
 
 	/* delete from hash and active; remainder in destroy_class */
-	if (!hlist_unhashed(&cl->hlist))
+	if (!hlist_unhashed(&cl->hlist)) {
 		hlist_del(&cl->hlist);
+		INIT_HLIST_NODE(&cl->hlist);
+	}
 
 	if (cl->prio_activity)
 		htb_deactivate(q, cl);
Re: [DECNET] Endian bug fixes
Hi, On Mon, 2006-11-06 at 10:32 +, Al Viro wrote: On Mon, Nov 06, 2006 at 10:31:02AM +, Steven Whitehouse wrote: + opt-opt_optl = dn_htons((__u16)*ptr++); Lose that cast; it's only confusing the things... + memcpy(opt-opt_data, ptr, dn_ntohs(opt-opt_optl)); + skb_pull(skb, dn_ntohs(opt-opt_optl) + 1); ... and I'd actually do u16 len = *ptr++; /* yes, it's 8bit on the wire */ opt-opt_optl = dn_htons(len); BUG_ON(len 16); /* we've checked the contents earlier */ memcpy(opt-opt_data, ptr, len); skb_pull(skb, len + 1); Ok, and I've also made the same change in the other places too, so far as its relevant in those cases. New patch attached, Steve. From a184f89a13fa292589f309057cc0775a8256a89e Mon Sep 17 00:00:00 2001 From: Steven Whitehouse [EMAIL PROTECTED] Date: Mon, 6 Nov 2006 11:51:00 -0500 Subject: [DECNET] Endianess fixes (try #2) Here are some fixes to endianess problems spotted by Al Viro. Cc: Al Viro [EMAIL PROTECTED] Cc: Patrick Caulfield [EMAIL PROTECTED] Signed-off-by: Steven Whitehouse [EMAIL PROTECTED] --- net/decnet/af_decnet.c | 25 + net/decnet/dn_nsp_in.c |8 net/decnet/dn_nsp_out.c |2 +- net/decnet/dn_rules.c |4 ++-- 4 files changed, 20 insertions(+), 19 deletions(-) diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c index 3456cd3..21f20f2 100644 --- a/net/decnet/af_decnet.c +++ b/net/decnet/af_decnet.c @@ -166,7 +166,7 @@ static struct hlist_head *dn_find_list(s if (scp-addr.sdn_flags SDF_WILD) return hlist_empty(dn_wild_sk) ? 
dn_wild_sk : NULL; - return dn_sk_hash[scp-addrloc DN_SK_HASH_MASK]; + return dn_sk_hash[dn_ntohs(scp-addrloc) DN_SK_HASH_MASK]; } /* @@ -180,7 +180,7 @@ static int check_port(__le16 port) if (port == 0) return -1; - sk_for_each(sk, node, dn_sk_hash[port DN_SK_HASH_MASK]) { + sk_for_each(sk, node, dn_sk_hash[dn_ntohs(port) DN_SK_HASH_MASK]) { struct dn_scp *scp = DN_SK(sk); if (scp-addrloc == port) return -1; @@ -194,12 +194,12 @@ static unsigned short port_alloc(struct static unsigned short port = 0x2000; unsigned short i_port = port; - while(check_port(++port) != 0) { + while(check_port(dn_htons(++port)) != 0) { if (port == i_port) return 0; } - scp-addrloc = port; + scp-addrloc = dn_htons(port); return 1; } @@ -418,7 +418,7 @@ struct sock *dn_find_by_skb(struct sk_bu struct dn_scp *scp; read_lock(dn_hash_lock); - sk_for_each(sk, node, dn_sk_hash[cb-dst_port DN_SK_HASH_MASK]) { + sk_for_each(sk, node, dn_sk_hash[dn_ntohs(cb-dst_port) DN_SK_HASH_MASK]) { scp = DN_SK(sk); if (cb-src != dn_saddr2dn(scp-peer)) continue; @@ -1016,13 +1016,14 @@ static void dn_access_copy(struct sk_buf static void dn_user_copy(struct sk_buff *skb, struct optdata_dn *opt) { -unsigned char *ptr = skb-data; - -opt-opt_optl = *ptr++; -opt-opt_status = 0; -memcpy(opt-opt_data, ptr, opt-opt_optl); -skb_pull(skb, dn_ntohs(opt-opt_optl) + 1); - + unsigned char *ptr = skb-data; + u16 len = *ptr++; /* yes, it's 8bit on the wire */ + + BUG_ON(len 16); /* we've checked the contents earlier */ + opt-opt_optl = dn_htons(len); + opt-opt_status = 0; + memcpy(opt-opt_data, ptr, len); + skb_pull(skb, len + 1); } static struct sk_buff *dn_wait_for_connect(struct sock *sk, long *timeo) diff --git a/net/decnet/dn_nsp_in.c b/net/decnet/dn_nsp_in.c index 72ecc6e..7683d4f 100644 --- a/net/decnet/dn_nsp_in.c +++ b/net/decnet/dn_nsp_in.c @@ -360,9 +360,9 @@ static void dn_nsp_conn_conf(struct sock scp-max_window = decnet_no_fc_max_cwnd; if (skb-len 0) { - unsigned char dlen = *skb-data; + u16 dlen = *skb-data; 
if ((dlen = 16) (dlen = skb-len)) { - scp-conndata_in.opt_optl = dn_htons((__u16)dlen); + scp-conndata_in.opt_optl = dn_htons(dlen); memcpy(scp-conndata_in.opt_data, skb-data + 1, dlen); } } @@ -404,9 +404,9 @@ static void dn_nsp_disc_init(struct sock memset(scp-discdata_in.opt_data, 0, 16); if (skb-len 0) { - unsigned char dlen = *skb-data; + u16 dlen = *skb-data; if ((dlen = 16) (dlen = skb-len)) { - scp-discdata_in.opt_optl = dn_htons((__u16)dlen); + scp-discdata_in.opt_optl = dn_htons(dlen); memcpy(scp-discdata_in.opt_data, skb-data + 1, dlen); } }
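The length handling Al Viro suggests can be sketched in userspace as follows. This is a minimal sketch, not the kernel code: `le16_t`, `host_to_le16()`, `le16_to_host()`, `struct optdata` and `user_copy()` are hypothetical stand-ins for `__le16`, `dn_htons()`, `dn_ntohs()`, `struct optdata_dn` and `dn_user_copy()`, and the sketch returns an error where the patch uses BUG_ON(). The point it illustrates is that the one-byte wire length is kept in a host-order variable for the copy and converted only once when it is stored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's __le16: two bytes, low byte first. */
typedef struct { uint8_t b[2]; } le16_t;

le16_t host_to_le16(uint16_t v)
{
    le16_t r;
    r.b[0] = (uint8_t)(v & 0xff);
    r.b[1] = (uint8_t)(v >> 8);
    return r;
}

uint16_t le16_to_host(le16_t v)
{
    return (uint16_t)(v.b[0] | (v.b[1] << 8));
}

#define OPT_DATA_MAX 16

struct optdata {
    le16_t  optl;                   /* stored little-endian, as in the patch */
    uint8_t data[OPT_DATA_MAX];
};

/*
 * Copy a length-prefixed option out of a packet buffer.
 * Returns the number of bytes consumed, or -1 if the length is invalid.
 */
int user_copy(const uint8_t *ptr, size_t avail, struct optdata *opt)
{
    uint16_t len;

    if (avail < 1)
        return -1;
    len = *ptr++;                   /* the length is 8 bits on the wire */
    if (len > OPT_DATA_MAX || (size_t)len + 1 > avail)
        return -1;
    opt->optl = host_to_le16(len);  /* convert once, for storage only */
    memcpy(opt->data, ptr, len);    /* use the host-order length directly */
    return len + 1;
}

/* Usage: a 3-byte option followed by one unrelated byte. */
void demo_user_copy(void)
{
    const uint8_t pkt[] = { 3, 'a', 'b', 'c', 9 };
    struct optdata opt;

    assert(user_copy(pkt, sizeof(pkt), &opt) == 4);
    assert(le16_to_host(opt.optl) == 3);
}
```

Keeping `len` in host order avoids converting the stored field back for every use, which is where the original code mixed up byte orders.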
Re: [DECNET] Endian bug fixes
Hi, On Mon, 2006-11-06 at 10:34 +, Al Viro wrote: On Mon, Nov 06, 2006 at 10:32:43AM +, Al Viro wrote: On Mon, Nov 06, 2006 at 10:31:02AM +, Steven Whitehouse wrote: + opt-opt_optl = dn_htons((__u16)*ptr++); Lose that cast; it's only confusing the things... + memcpy(opt-opt_data, ptr, dn_ntohs(opt-opt_optl)); + skb_pull(skb, dn_ntohs(opt-opt_optl) + 1); ... and I'd actually do u16 len = *ptr++; /* yes, it's 8bit on the wire */ opt-opt_optl = dn_htons(len); BUG_ON(len 16); /* we've checked the contents earlier */ memcpy(opt-opt_data, ptr, len); skb_pull(skb, len + 1); BTW, why the hell do we keep -opt_optl __le16 internally? If we ever pass it to userland, fine, but let's convert to __le16 *then*... Really the only thing that we do with this data is verify it and pass to userland. It does mean that getsockopt() is simpler for just being able to use copy_to_user() with a ptr len depending on which of the structures the user has requested rather than having to convert each field of each structure for example. I'm not sure its worth changing it now, for saving just one byte per socket in this case, Steve. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets
32 bit. Of course, a 64-bit kernel could help me overcome the 900M barrier. However, if I can't find the reason why so much memory is getting 'lost', it will be difficult to support more heavily loaded concurrent TCP connections.

2006/11/6, Arjan van de Ven [EMAIL PROTECTED]:
On Mon, 2006-11-06 at 14:07 +0800, Zhao Xiaoming wrote:
Dears, I'm running a linux box with kernel version 2.6.16. The hardware has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC cards is Intel 82571 on PCI-e bus.

are you using a 32 bit or a 64 bit OS?
Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets
On Monday 06 November 2006 10:46, Zhao Xiaoming wrote: 2006/11/6, Eric Dumazet [EMAIL PROTECTED]: On Monday 06 November 2006 09:59, Zhao Xiaoming wrote: Thank you again for your help. To have more detailed statistic data, I did another round of test and gathered some data. I give the overall description here and detailed /proc/net/sockstat, /proc/meminfo, /proc/slabinfo and /proc/buddyinfo follows. = slab mem costtcp mem pages lowmem free with traffic: 254668KB 34693 38772KB without traffic: 104080KB 1 702652KB = Thank you for detailed infos. It appears you have an extensive use of threads (about 1), since : task_struct10095 10095 136031 : tunables 24 12 8 : slabdata 3365 3365 0 Each thread has a kernel stack, 8KB (ie 2 pages, order-1 allocation), plus a user vma vm_area_struct 21346 21504 92 421 : tunables 120 60 8 : slabdata512512 0 Most likely you dont need that much threads. A program with fewer threads will perform better and use less ram. Thanks for the comments. I known the threads may cost many memory. However, I already excluded them from the statistics. The 'after test' info was gotten while the 1 threads running but no traffics relayed. You may look at the meminfo of 'after test', there is still 104080 kB slab memory which should already included the thread kernel memory cost (8K*1=80MB). I know 1 threads are not necessary and just use the simple logic to do some test. In fact, your kernel has CONFIG_4KSTACKS, kernel thread stacks use 4K instead of 8K. If you want to increase LOWMEM, (and keep 32bits kernel), you can chose a 2G/2G user/kernel split, instead of the 3G/1G default split. (see config : CONFIG_VMSPLIT_2G) Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
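The stack arithmetic in this exchange can be checked with a small sketch. It assumes the task_struct object count from the quoted slabinfo line (10095) is a fair proxy for the number of threads, which also counts unrelated tasks; `kernel_stack_kb()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/* Kernel stack memory, in KB, used by `threads` tasks with `stack_kb` KB stacks. */
unsigned long kernel_stack_kb(unsigned long threads, unsigned long stack_kb)
{
    return threads * stack_kb;
}

void demo_kernel_stack(void)
{
    /* 10095 is the task_struct object count from the quoted slabinfo line. */
    assert(kernel_stack_kb(10095, 8) == 80760);  /* 8 KB stacks (two pages)   */
    assert(kernel_stack_kb(10095, 4) == 40380);  /* 4 KB stacks, 4KSTACKS set */
}
```

With 4 KB stacks the thread stacks account for roughly 40 MB rather than 80 MB of the 104080 kB of slab memory reported without traffic.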
Re: pktgen patch available for perusal.
jamal writes:

If you are listening then start with:
1) Do a simple test with just udp traffic as above, doing simple accounting. This helps you to get a feel on how things work.
2) modify the matching rules to match your magic cookie
3) write a simple action invoked by your matching rules and use tc to add it to the policy.
4) integrate in your app now that you know what you are doing.

Yes. Sounds like a simple and general solution. No call-backs, no #ifdef's, no extra modules. Just a little recipe in pktgen.txt

Cheers.
--ro
Re: pktgen patch available for perusal.
Ben Greear writes:

Changes: * use a nano-second timer based on the scheduler timer (TSC) for relative times, instead of get_time_of_day.

Seems I forgot to set tsc as the clocksource. It makes a difference. Performance is normal and I'm less confused.

e1000 82546GB @ 1.6 GHz Opteron. Kernel 2.6.19-rc1_Bifrost_-g18e199c6-dirty

echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource

psize   pps
-----------
60      556333
124     526942
252     452981
508     234996
1020    119748
1496    82248

echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource

psize   pps
-----------
60      819914
124     747286
252     452975
508     234993
1020    119749
1496    82247

Cheers.
--ro
Re: [PATCH 00/18] e1000: features, updates, documentation
Jeff Garzik wrote:
pulled. still waiting on those changes to better modularize the feature detection, etc.

That will start coming in early January, I think. We're currently validating all silicon that the code supports against the old and new code, and that is going to take quite some time to finish!

Cheers,
Auke
[RFC: 2.6 patch] bcm43xx_sprom_write(): add error checks
The Coverity checker noted that these if (err)'s couldn't ever be true. It seems the intention was to check the return values of the bcm43xx_pci_write_config32()'s?

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
---

 drivers/net/wireless/bcm43xx/bcm43xx_main.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c.old 2006-11-06 14:45:47.0 +0100
+++ linux-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c 2006-11-06 14:46:53.0 +0100
@@ -737,47 +737,47 @@ int bcm43xx_sprom_write(struct bcm43xx_p
 	crc = bcm43xx_sprom_crc(sprom);
 	expected_crc = (sprom[BCM43xx_SPROM_VERSION] & 0xFF00) >> 8;
 	if (crc != expected_crc) {
 		printk(KERN_ERR PFX "SPROM input data: Invalid CRC\n");
 		return -EINVAL;
 	}
 
 	printk(KERN_INFO PFX "Writing SPROM. Do NOT turn off the power! Please stand by...\n");
 	err = bcm43xx_pci_read_config32(bcm, BCM43xx_PCICFG_SPROMCTL, &spromctl);
 	if (err)
 		goto err_ctlreg;
 	spromctl |= 0x10; /* SPROM WRITE enable. */
-	bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
+	err = bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
 	if (err)
 		goto err_ctlreg;
 	/* We must burn lots of CPU cycles here, but that does not
 	 * really matter as one does not write the SPROM every other minute...
 	 */
 	printk(KERN_INFO PFX "[ 0%%");
 	mdelay(500);
 	for (i = 0; i < BCM43xx_SPROM_SIZE; i++) {
 		if (i == 16)
 			printk("25%%");
 		else if (i == 32)
 			printk("50%%");
 		else if (i == 48)
 			printk("75%%");
 		else if (i % 2)
 			printk(".");
 		bcm43xx_write16(bcm, BCM43xx_SPROM_BASE + (i * 2), sprom[i]);
 		mmiowb();
 		mdelay(20);
 	}
 	spromctl &= ~0x10; /* SPROM WRITE enable. */
-	bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
+	err = bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
 	if (err)
 		goto err_ctlreg;
 	mdelay(500);
 	printk("100%% ]\n");
 	printk(KERN_INFO PFX "SPROM written.\n");
 	bcm43xx_controller_restart(bcm, "SPROM update");
 
 	return 0;
 err_ctlreg:
 	printk(KERN_ERR PFX "Could not access SPROM control register.\n");
 	return -ENODEV;
 }
[RFC: 2.6 patch] hostap_80211_rx(): fix a use-after-free
This patch fixes a use-after-free for skb spotted by the Coverity checker.

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

--- linux-2.6/drivers/net/wireless/hostap/hostap_80211_rx.c.old 2006-11-06 14:51:36.0 +0100
+++ linux-2.6/drivers/net/wireless/hostap/hostap_80211_rx.c 2006-11-06 14:52:16.0 +0100
@@ -1004,10 +1004,10 @@ void hostap_80211_rx(struct net_device *
 			if (local->hostapd && local->apdev) {
 				/* Send IEEE 802.1X frames to the user
 				 * space daemon for processing */
-				prism2_rx_80211(local->apdev, skb, rx_stats,
-						PRISM2_RX_MGMT);
 				local->apdevstats.rx_packets++;
 				local->apdevstats.rx_bytes += skb->len;
+				prism2_rx_80211(local->apdev, skb, rx_stats,
+						PRISM2_RX_MGMT);
 				goto rx_exit;
 			}
 		} else if (!frame_authorized) {
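The ordering problem this patch addresses can be modelled in userspace. This is a simplified sketch under the assumption that the receive function takes ownership of the buffer and may free it; `struct buf`, `struct stats`, `consume()` and `rx_account_then_consume()` are hypothetical names, not hostap code.

```c
#include <assert.h>
#include <stdlib.h>

struct buf   { size_t len; };
struct stats { unsigned long rx_packets, rx_bytes; };

/* The consumer takes ownership of the buffer and is free to release it. */
void consume(struct buf *b)
{
    free(b);
}

/* Read every field that is needed before ownership is handed over. */
void rx_account_then_consume(struct stats *st, struct buf *b)
{
    st->rx_packets++;
    st->rx_bytes += b->len;     /* b is still owned by the caller here */
    consume(b);                 /* b must not be touched after this call */
}

void demo_rx(void)
{
    struct stats st = { 0, 0 };
    struct buf *b = malloc(sizeof(*b));

    assert(b != NULL);
    b->len = 60;
    rx_account_then_consume(&st, b);
    assert(st.rx_packets == 1 && st.rx_bytes == 60);
}
```

Swapping the two steps in `rx_account_then_consume()` would read `b->len` from freed memory, which is the pattern Coverity flagged.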
Re: [RFC: 2.6 patch] bcm43xx_sprom_write(): add error checks
Adrian Bunk wrote: The Coverity checker noted that these if (err)'s couldn't ever be true. It seems the intention was to check the return values of the bcm43xx_pci_write_config32()'s? Exactly. This patch sent to wireless-2.6. Thanks, Larry - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets
Eric Dumazet wrote:
Zhao Xiaoming wrote:

Dears, I'm running a linux box with kernel version 2.6.16. The hardware has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC cards is Intel 82571 on PCI-e bus. The box is acting as an ethernet bridge between 2 Gigabit Ethernets. By configuring ebtables and iptables, an application is running as a TCP proxy which will intercept all TCP connection requests from the network and set up another TCP connection to the actual server. The TCP proxy then relays all traffic in both directions.

The problem is the memory. Since the box must support thousands of concurrent connections, I know the memory size of ZONE_NORMAL would be a bottleneck as TCP packets would need many buffers. After setting the upper limit of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem to 32K bytes, our test began.

My test scenario employs 2000 concurrent downloading connections to an IIS server's port 80. The throughput is about 500~600 Mbps, which is limited by the capability of the client application. Because all traffic is from server to client and the capability of the client machine is the bottleneck, I believe the receiver side of the sockets connected with the server and the sender side of the sockets connected with the client should be filled with packets in the corresponding windows. Thus, roughly there should be about 32K * 2000 + 32K * 2000 = 128M bytes of memory occupied by the TCP/IP stack for packet buffering.

Data from slabtop confirmed it: it's about 140M bytes of memory cost after I start the traffic. That reasonably matched my estimation. However, /proc/meminfo had a different story. The 'LowFree' dropped from about 710M to 80M. In other words, there's an additional 500M of memory in ZONE_NORMAL allocated by someone other than the slab. Why?

The amount of memory per socket is controlled by the socket buffering. Your application could be setting the value by calling setsockopt(). Otherwise, the tcp memory is limited by the sysctl settings tcp_rmem (receiver) and tcp_wmem (sender).

For example on this server:

$ cat /proc/sys/net/ipv4/tcp_wmem
4096 16384 131072

Each sending socket would start with 16K of buffering, but could grow up to 128K based on TCP send autotuning.
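The buffering estimate in the message above can be written out as a small calculation. This is a rough sketch that counts only the configured buffer limits and ignores per-packet overhead; `proxy_buffer_bytes()` is a hypothetical helper, and the second case simply plugs in the 131072-byte tcp_wmem maximum shown in the example output.

```c
#include <assert.h>

/*
 * Upper bound on payload buffering for a TCP proxy: each proxied connection
 * has one socket whose receive buffer fills (server side) and one socket
 * whose send buffer fills (client side).
 */
unsigned long long proxy_buffer_bytes(unsigned long conns,
                                      unsigned long rcv_buf,
                                      unsigned long snd_buf)
{
    return (unsigned long long)conns * rcv_buf
         + (unsigned long long)conns * snd_buf;
}

void demo_proxy_buffers(void)
{
    /* 2000 connections with both limits capped at 32K, as in the test. */
    assert(proxy_buffer_bytes(2000, 32768, 32768) == 131072000ULL);
    /* Same load if the send side may autotune up to 131072 bytes. */
    assert(proxy_buffer_bytes(2000, 32768, 131072) == 327680000ULL);
}
```

The first case reproduces the roughly 128M figure from the original estimate; the second shows how much the bound moves if the send limit is left at the larger maximum.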
Re: [PATCH 2.6.19-rc4-git10][PKT_SCHED] sch_htb: INIT_HLIST_NODE after hlist_del()
On Mon, 6 Nov 2006 12:33:53 +0100 Jarek Poplawski [EMAIL PROTECTED] wrote:

After hlist_del() next and pprev pointers are not NULL so hlist_unhashed() doesn't work properly.

Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]
---

diff -Nurp linux-2.6.19-rc4-git10-/net/sched/sch_htb.c linux-2.6.19-rc4-git10/net/sched/sch_htb.c
--- linux-2.6.19-rc4-git10-/net/sched/sch_htb.c 2006-11-06 11:42:41.0 +0100
+++ linux-2.6.19-rc4-git10/net/sched/sch_htb.c 2006-11-06 11:53:15.0 +0100
@@ -1284,8 +1284,10 @@ static void htb_destroy_class(struct Qdi
 						  struct htb_class, sibling));
 
 	/* note: this delete may happen twice (see htb_delete) */
-	if (!hlist_unhashed(&cl->hlist))
+	if (!hlist_unhashed(&cl->hlist)) {
 		hlist_del(&cl->hlist);
+		INIT_HLIST_NODE(&cl->hlist);
+	}

why not use hlist_del_init?

 	list_del(&cl->sibling);
 
 	if (cl->prio_activity)
@@ -1333,8 +1335,10 @@ static int htb_delete(struct Qdisc *sch,
 	sch_tree_lock(sch);
 
 	/* delete from hash and active; remainder in destroy_class */
-	if (!hlist_unhashed(&cl->hlist))
+	if (!hlist_unhashed(&cl->hlist)) {
 		hlist_del(&cl->hlist);
+		INIT_HLIST_NODE(&cl->hlist);
+	}
 
 	if (cl->prio_activity)
 		htb_deactivate(q, cl);

--
Stephen Hemminger [EMAIL PROTECTED]
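The difference between the two delete helpers can be shown with a userspace model. This is a minimal sketch, assuming the list.h behaviour of that era in which hlist_del() poisons the node pointers instead of clearing them; the types and functions below are simplified re-implementations for illustration, and the poison values are placeholders rather than the kernel's LIST_POISON constants.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's hlist types (illustration only). */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* Placeholder poison values, never dereferenced in this sketch. */
#define POISON1 ((struct hlist_node *)0x100)
#define POISON2 ((struct hlist_node **)0x200)

void init_hlist_node(struct hlist_node *n)
{
    n->next = NULL;
    n->pprev = NULL;
}

/* A node counts as unhashed only when pprev is NULL. */
int hlist_unhashed(const struct hlist_node *n)
{
    return !n->pprev;
}

void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    struct hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

void unlink_node(struct hlist_node *n)
{
    struct hlist_node *next = n->next;

    *n->pprev = next;
    if (next)
        next->pprev = n->pprev;
}

/* Unlinks and poisons: the node still looks hashed afterwards. */
void hlist_del(struct hlist_node *n)
{
    unlink_node(n);
    n->next = POISON1;
    n->pprev = POISON2;
}

/* Unlinks and re-initialises: a later hlist_unhashed() check is true. */
void hlist_del_init(struct hlist_node *n)
{
    if (!hlist_unhashed(n)) {
        unlink_node(n);
        init_hlist_node(n);
    }
}

void demo_double_delete(void)
{
    struct hlist_head head = { NULL };
    struct hlist_node a, b;

    init_hlist_node(&a);
    init_hlist_node(&b);
    hlist_add_head(&a, &head);
    hlist_add_head(&b, &head);

    hlist_del(&a);
    assert(!hlist_unhashed(&a));    /* guard would not skip a second delete */

    hlist_del_init(&b);
    assert(hlist_unhashed(&b));
    hlist_del_init(&b);             /* second call is a harmless no-op */
    assert(head.first == NULL);
}
```

In this model the hlist_del() plus INIT_HLIST_NODE() pair from the patch and a single hlist_del_init() call leave the node in the same state, which is why the shorter form is suggested.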
Re: [sungem] proposal for a new locking strategy
On Sun, 5 Nov 2006 21:11:34 +0100 Eric Lemoine [EMAIL PROTECTED] wrote:
On 11/5/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
On Sun, 5 Nov 2006 18:52:45 +0100 Eric Lemoine [EMAIL PROTECTED] wrote:
On 11/5/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
On Sun, 5 Nov 2006 18:28:33 +0100 Eric Lemoine [EMAIL PROTECTED] wrote:

You could also just use net_tx_lock() now.

You mean netif_tx_lock()? Thanks for letting me know about that function. Yes, I may need it. tg3 and bnx2 use it to wake up the transmit queue:

	if (unlikely(netif_queue_stopped(tp->dev) &&
		     (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH))) {
		netif_tx_lock(tp->dev);
		if (netif_queue_stopped(tp->dev) &&
		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH))
			netif_wake_queue(tp->dev);
		netif_tx_unlock(tp->dev);
	}

2.6.17 didn't use it. Was it a bug? Thanks,

No, it was introduced in 2.6.18. The functions are just a wrapper around the network device transmit lock that is normally held. If the device does not need to acquire the lock during IRQ, it is a good alternative and avoids a second lock.

For transmit locking there are three common alternatives:

Method A: dev->queue_xmit_lock and per-device tx_lock
  send: dev->xmit_lock held by caller
        dev->hard_start_xmit acquires netdev_priv(dev)->tx_lock
  irq:  netdev_priv(dev)->tx_lock acquired

Method B: dev->queue_xmit_lock only
  send: dev->xmit_lock held by caller
  irq:  schedules softirq (NAPI)
  napi_poll: calls netif_tx_lock() which acquires dev->xmit_lock

Method C: LLTX
  set dev->features LLTX
  send: no locks held by caller
        dev->hard_start_xmit acquires netdev_priv(dev)->tx_lock
  irq:  netdev_priv(dev)->tx_lock acquired

Method A is the only one that works with 2.4 and early (2.6.8?) kernels.

Current sungem does Method C, and uses two locks: lock and tx_lock. What I was planning to do is Method B (which current tg3 uses). It seems to me that Method B is better than Method C. What do you think?

B is better than C because the transmit logic doesn't have to spin in the case of lock contention, but it is not a big difference.

Current sungem does C but uses try_lock() to acquire its private tx_lock. So it doesn't spin either in case of contention.

But the spin is still there, just more complex.. In qdisc_restart() processing of NETDEV_TX_LOCKED causes:

  spin_lock(dev->xmit_lock)
  q->requeue()
  netif_schedule(dev);

  SOFTIRQ: net_tx_action()
    qdisc_run() --> qdisc_restart()

So instead of spinning in tight loop, you end up with a longer code path.

--
Stephen Hemminger [EMAIL PROTECTED]
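The wake-up pattern quoted from tg3 (a cheap check outside the lock, repeated under the transmit lock before waking the queue) can be modelled in userspace. This is a single-threaded sketch only: `struct txq`, `WAKEUP_THRESH` and `tx_complete_maybe_wake()` are hypothetical names, a pthread mutex stands in for the device transmit lock, and a real multi-threaded version would need atomic or otherwise synchronised reads for the unlocked check.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define WAKEUP_THRESH 4u

struct txq {
    pthread_mutex_t tx_lock;    /* stands in for the device transmit lock */
    bool stopped;               /* queue was stopped by the send path */
    unsigned int avail;         /* free descriptors */
};

/* Completion path: reclaim descriptors, wake the queue if there is room. */
bool tx_complete_maybe_wake(struct txq *q, unsigned int freed)
{
    bool woke = false;

    q->avail += freed;
    if (q->stopped && q->avail > WAKEUP_THRESH) {       /* cheap first check */
        pthread_mutex_lock(&q->tx_lock);
        if (q->stopped && q->avail > WAKEUP_THRESH) {   /* recheck under lock */
            q->stopped = false;
            woke = true;
        }
        pthread_mutex_unlock(&q->tx_lock);
    }
    return woke;
}

void demo_tx_wake(void)
{
    struct txq q = { PTHREAD_MUTEX_INITIALIZER, true, 0 };

    assert(!tx_complete_maybe_wake(&q, 2));  /* 2 free: below threshold */
    assert(tx_complete_maybe_wake(&q, 3));   /* 5 free: queue is woken  */
    assert(!q.stopped);
}
```

The lock is only taken on the rare path where a wake-up looks necessary, which is what makes Method B cheap for the completion side.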
Re: SKGE backport to 2.4 : success
On Sat, 4 Nov 2006 22:08:55 +0100 Willy Tarreau [EMAIL PROTECTED] wrote: Hi Stephen, I don't know if you received my mail since I got no reply. Thanks in advance for your comments, Willy On Sat, Oct 28, 2006 at 10:57:07PM +0200, Willy Tarreau wrote: Hi Stephen, In my own kernels, I've added your backport of SKGE to 2.4 that I found here : http://developer.osdl.org/shemminger/releases/skge-sky2-backport.tar.bz2 It seems to work pretty well compared to the original syskonnect driver (up to and including 8.36). Several people around me have reported very slow NFS operations with the official driver, which I finally attributed to a strange effect of UDP packets not going out after a while until they get pushed by a TCP packet. I even noticed the problem at the company and we turned the NFS server to an unused 100 Mbps card to workaround the problem before being able to fully ananlyze the problem. It seems your driver is getting mature and its performance is very close to the official one, while its code is smaller and apparently more reliable. I was thinking about merging it in mainline 2.4 as a fix for people having trouble with the syskonnect driver. It might also be easier to backport fixes from 2.6 to 2.4 when the driver is the same. I don't think we risk any regression because it won't replace an existing driver, but will provide one to people who are used to download new versions from an external tree. Also, I'm not yet sure whether I would also backport the sky2 driver, because I know about a handful boxes running in production with the official one with 88E8053 chips at high packet rates with no trouble at all. Anyway, as long as the backport does not prevent them from using the external driver, there should be no problem. I'd like to get your opinion on this matter, and of course, Jeff's and Davem's. Thanks in advance, Willy The backport needs to be updated. It is of older code. I plan to do a new backport this week. 
The backport version doesn't use NAPI, because of issues with not wanting to change netdevice.h. For a good 2.4 version, I would make a version that was closer to 2.6 code (using NAPI). I did the backport because one of the equipment donors gave a VPN box whose base OS is RHEL based on 2.4. -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
SKB BUG: Invalid truesize, current git
Hi all, I managed to get a backtrace for the Invalid truesize bug. The trigger is running LMbench2, but it's rater intermittent. Traffic should be going over the loopback interface, but the main nic on the machine is e1000. Let me know if anyone has any ideas for things to try. -ben Linux version 2.6.19-rc4 ([EMAIL PROTECTED]) (gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)) #73 SMP Mon Nov 6 13:13:44 EST 2006 Command line: ro root=LABEL=/1 console=ttyS0,38400 BIOS-provided physical RAM map: BIOS-e820: - 0009cc00 (usable) BIOS-e820: 0009cc00 - 000a (reserved) BIOS-e820: 000cc000 - 000d (reserved) BIOS-e820: 000e4000 - 0010 (reserved) BIOS-e820: 0010 - bff6 (usable) BIOS-e820: bff6 - bff69000 (ACPI data) BIOS-e820: bff69000 - bff8 (ACPI NVS) BIOS-e820: bff8 - c000 (reserved) BIOS-e820: e000 - f000 (reserved) BIOS-e820: fec0 - fec1 (reserved) BIOS-e820: fee0 - fee01000 (reserved) BIOS-e820: ff00 - 0001 (reserved) BIOS-e820: 0001 - 00014000 (usable) Entering add_active_range(0, 0, 156) 0 entries of 256 used Entering add_active_range(0, 256, 786272) 1 entries of 256 used Entering add_active_range(0, 1048576, 1310720) 2 entries of 256 used end_pfn_map = 1310720 DMI present. 
ACPI: RSDP (v000 PTLTD ) @ 0x000f58d0 ACPI: RSDT (v001 PTLTDRSDT 0x0604 LTP 0x) @ 0xbff636df ACPI: FADT (v001 INTEL TUMWATER 0x0604 PTL 0x0003) @ 0xbff68e48 ACPI: MADT (v001 PTLTD APIC 0x0604 LTP 0x) @ 0xbff68ebc ACPI: MCFG (v001 PTLTDMCFG 0x0604 LTP 0x) @ 0xbff68f4c ACPI: BOOT (v001 PTLTD $SBFTBL$ 0x0604 LTP 0x0001) @ 0xbff68f88 ACPI: SPCR (v001 PTLTD $UCRTBL$ 0x0604 PTL 0x0001) @ 0xbff68fb0 ACPI: SSDT (v001 PmRefCpuPm 0x3000 INTL 0x20050228) @ 0xbff6371b ACPI: DSDT (v001 Intel BLAKFORD 0x0604 MSFT 0x010e) @ 0x Entering add_active_range(0, 0, 156) 0 entries of 256 used Entering add_active_range(0, 256, 786272) 1 entries of 256 used Entering add_active_range(0, 1048576, 1310720) 2 entries of 256 used Zone PFN ranges: DMA 0 - 4096 DMA324096 - 1048576 Normal1048576 - 1310720 early_node_map[3] active PFN ranges 0:0 - 156 0: 256 - 786272 0: 1048576 - 1310720 On node 0 totalpages: 1048316 DMA zone: 56 pages used for memmap DMA zone: 1395 pages reserved DMA zone: 2545 pages, LIFO batch:0 DMA32 zone: 14280 pages used for memmap DMA32 zone: 767896 pages, LIFO batch:31 Normal zone: 3584 pages used for memmap Normal zone: 258560 pages, LIFO batch:31 ACPI: Local APIC address 0xfee0 ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) Processor #0 (Bootup-CPU) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x06] enabled) Processor #6 ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled) Processor #1 ACPI: LAPIC (acpi_id[0x03] lapic_id[0x07] enabled) Processor #7 ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1]) ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0]) IOAPIC[0]: apic_id 2, address 0xfec0, GSI 0-23 ACPI: IOAPIC (id[0x03] address[0xfec8] gsi_base[24]) IOAPIC[1]: apic_id 3, address 0xfec8, GSI 24-47 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) ACPI: IRQ0 
used by override. ACPI: IRQ2 used by override. ACPI: IRQ9 used by override. Setting APIC routing to flat Using ACPI (MADT) for SMP configuration information Nosave address range: 0009c000 - 0009d000 Nosave address range: 0009d000 - 000a Nosave address range: 000a - 000cc000 Nosave address range: 000cc000 - 000d Nosave address range: 000d - 000e4000 Nosave address range: 000e4000 - 0010 Nosave address range: bff6 - bff69000 Nosave address range: bff69000 - bff8 Nosave address range: bff8 - c000 Nosave address range: c000 - e000 Nosave address range: e000 - f000 Nosave address range: f000 - fec0 Nosave address range: fec0 - fec1 Nosave address range: fec1 - fee0 Nosave address range: fee0 - fee01000 Nosave address range: fee01000 - ff00 Nosave address range: ff00 - 0001 Allocating
Re: SKB BUG: Invalid truesize, current git
On Mon, Nov 06, 2006 at 07:07:26PM +, Benjamin LaHaise wrote:

I managed to get a backtrace for the Invalid truesize bug. The trigger is running LMbench2, but it's rather intermittent. Traffic should be going over the loopback interface, but the main nic on the machine is e1000. Let me know if anyone has any ideas for things to try.

OK, this should cure it. BTW, this indicates that your app is retransmitting unnecessarily which might be a problem in itself. This patch applies to all recent 2.6 kernels.

[NET]: Set truesize in pskb_copy

Since pskb_copy tacks on the non-linear bits from the original skb, it needs to count them in the truesize field of the new skb.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f735455..b8b1063 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -639,6 +639,7 @@ struct sk_buff *pskb_copy(struct sk_buff
 	n->csum = skb->csum;
 	n->ip_summed = skb->ip_summed;
+	n->truesize += skb->data_len;
 	n->data_len = skb->data_len;
 	n->len = skb->len;
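The accounting rule behind the patch above can be illustrated with a small user-space model. This is only a sketch: `struct buf`, `BUF_OVERHEAD`, `truesize_ok()` and `partial_copy()` are hypothetical names, and the sanity check is an assumption modeled on "truesize must cover the buffer overhead plus the full payload length", not a copy of the kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for sizeof(struct sk_buff). */
#define BUF_OVERHEAD 64u

/* Toy model of a packet buffer; not the kernel's struct sk_buff. */
struct buf {
	size_t len;       /* total payload bytes */
	size_t data_len;  /* payload bytes held in shared (non-linear) fragments */
	size_t truesize;  /* memory charged to the buffer's owner */
};

/* Assumed sanity check: truesize must cover overhead plus all payload. */
static bool truesize_ok(const struct buf *b)
{
	return b->truesize >= BUF_OVERHEAD + b->len;
}

/*
 * Model of a partial copy: the linear part is duplicated and the
 * non-linear fragments are attached to the new buffer.
 * count_nonlinear selects the behaviour with (true) or without (false)
 * the extra "truesize += data_len" step.
 */
static struct buf partial_copy(const struct buf *src, bool count_nonlinear)
{
	struct buf n;
	size_t linear = src->len - src->data_len;

	n.truesize = BUF_OVERHEAD + linear;
	if (count_nonlinear)
		n.truesize += src->data_len;
	n.data_len = src->data_len;
	n.len = src->len;
	return n;
}
```

In this model, a 1500-byte buffer with 1000 bytes in fragments passes the check only when the copy also charges the fragment bytes, which is the situation the one-line fix addresses.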
Re: [PATCH 0/11] convert d80211 to a proper protocol
On Sun, 05 Nov 2006 16:39:34 +0100, Johannes Berg wrote:

003-d80211-cookie.patch
d80211: change the cookie to be opaque
This changes the 'cookie' that d80211 returns from alloc_hw to be an opaque value to the driver. Turned out that it wasn't such a great idea but since it was generally a clean up I kept this patch to base my other patches on.

ACK.

005-d80211-reduce-mdev-1.patch
006-d80211-reduce-mdev-2.patch
d80211: reduce mdev usage
These two patches reduce mdev madness and change a lot of functions to take a struct ieee80211_local * instead of the master netdev

ACK.

007-d80211-cleanup-rxmgmt.patch
d80211: reduce mdev usage, fix ieee80211_rx_mgmt
Cleans up the ieee80211_rx_mgmt and related code

Looks good after a quick look. Need to review it more deeply.

008-d80211-scan-sanity.patch
d80211: reduce master ieee80211_ptr deref in scan routines
Similar to the reduce mdev patches, just for the scan routines

ACK.

009-d80211-convert-spaces.patch
d80211: convert leading spaces to tabs
I hated working on the code, so I did this. The next patch breaks everything anyway.

NAK. There are too many patches pending. Let's do this just before merging.

010-d80211-proto.patch
d80211: convert to an 802.11 protocol
Converts d80211 to be a protocol together with tons of cleanups and more. Hard to describe in two lines.

NAK. This is too big a patch for a review, it does too many things and I fundamentally disagree with some parts of the patch. Split it into individual patches.

Just some things which are broken with the patch (the list is probably not complete):

* The mdev no longer has a sub_if_data attached (why ever did it??) Its private area is for the driver since we don't create it but the driver does. I did keep the notation of mdev/master all through, but it's no longer the stack's device. Keep that in mind.

This definitely breaks AP mode. The code heavily (ab)uses the fact that the master device is in fact an AP device. I tried to fix that but it was so difficult I gave up. The whole RX path would need to be rewritten (and even that is probably not enough). As this will be fixed for free when we have native 802.11 devices, I don't think we need to do anything about it now.

* sysfs layout changed. There is no wiphy or an ieee80211 class any more, the attributes that used to be there are now in the net_device that the driver registered, and our attributes are below the devices we created.

You want an ieee80211 class. Once you get rid of a master interface you need something with per-hardware information, statistics etc.

* sysfs layout changed. There is no wiphy or an ieee80211 class any more, the attributes that used to be there are now in the net_device that the driver registered, and our attributes are below the devices we created.

Doesn't belong to this patch.

And probably lots more.

???

What happened to the

d80211: add a function to get the wiphy index
d80211: add a perm_addr hardware property
d80211: add a struct device* hardware property
d80211: add a ethtool_ops hardware property

patches?

Thanks,
Jiri
-- Jiri Benc SUSE Labs
Re: [patch] d80211: fix key access race
On Fri, 03 Nov 2006 11:48:22 +0800, Hong Liu wrote:

It seems we don't have any protection when accessing the key. The RX/TX path may acquire a key which can be freed by the ioctl cmd. I put a key_lock spinlock to protect all the accesses to the key (whether the sta_info->key or ieee80211_sub_if_data->keys[]). Don't find a good way to handle it :(

NAK, this is too expensive. I'm aware of the problem and figured out how to fix it correctly while working on fixing the sta_list locking. Will send a patch later this week, stay tuned :-)

Thanks,
Jiri
-- Jiri Benc SUSE Labs
RE: [PATCH] s2io ppc64 fix for readq/writeq
The 64 bit io operation on the IA64 platform is a 64 bit transaction on the pci bus and it is optimal to leave it as such. I prefer Jeff's suggestion: guaranteeing that a good-enough-for-drivers readq() and writeq() exist on all platforms, even 32-bit platforms where the operation isn't inherently atomic.

Ram

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Benjamin Herrenschmidt
Sent: Monday, November 06, 2006 1:57 AM
To: Jeff Garzik
Cc: Linus Torvalds; netdev@vger.kernel.org
Subject: Re: [PATCH] s2io ppc64 fix for readq/writeq

On Mon, 2006-11-06 at 04:55 -0500, Jeff Garzik wrote:
Benjamin Herrenschmidt wrote:
On Mon, 2006-11-06 at 01:50 -0800, Linus Torvalds wrote:
On Mon, 6 Nov 2006, Benjamin Herrenschmidt wrote:

Anyway, what do you think of Jeff's proposal to just implement them as two 32-bit operations? My arch guy side screams at the idea, but if, indeed, drivers generally cope fine with it, I suppose that's ok.

Last I saw, that's how normal PCI will split the IO anyway, so I guess it makes sense.

Hrm.. true indeed. I'll implement them that way for ppc32 then.

Bonus points if you want to find-and-kill where individual drivers did

	#ifndef readq
	implement readq and writeq by hand...
	#endif

Yes, well, we would have to make sure all archs have them defined first though, but I suppose I can have a look later this week, maybe tomorrow. Shouldn't be too hard :)

Ben.
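The "two 32-bit operations" fallback discussed in this thread can be sketched in user-space C as follows. This is a minimal illustration under stated assumptions, not kernel code: `mmio_read32()`, `mmio_write32()`, `readq_split()` and `writeq_split()` are hypothetical names, plain memory stands in for device registers, and the low word is assumed to live at the lower address. The result is not atomic, which is exactly the concern raised in the discussion.

```c
#include <stdint.h>

/* Hypothetical stand-ins for 32-bit register accessors. */
static inline uint32_t mmio_read32(const volatile void *addr)
{
	return *(const volatile uint32_t *)addr;
}

static inline void mmio_write32(uint32_t val, volatile void *addr)
{
	*(volatile uint32_t *)addr = val;
}

/*
 * 64-bit read built from two 32-bit reads.
 * Assumption: low word at offset 0, high word at offset 4.
 * Another access can slip in between the two halves.
 */
static inline uint64_t readq_split(const volatile void *addr)
{
	const volatile uint8_t *p = addr;
	uint32_t lo = mmio_read32(p);
	uint32_t hi = mmio_read32(p + 4);

	return ((uint64_t)hi << 32) | lo;
}

/* 64-bit write built from two 32-bit writes, same layout assumption. */
static inline void writeq_split(uint64_t val, volatile void *addr)
{
	volatile uint8_t *p = addr;

	mmio_write32((uint32_t)val, p);
	mmio_write32((uint32_t)(val >> 32), p + 4);
}
```

A driver that needs the two halves to reach the device back to back would still have to serialize callers itself; the split helpers only guarantee that both words are transferred.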
Re: [PATCH] s2io ppc64 fix for readq/writeq
On Mon, Nov 06, 2006 at 03:33:19PM -0500, Ramkrishna Vepa wrote:

The 64 bit io operation on the IA64 platform is a 64 bit transaction on the pci bus and is optimal to leave it as such. I prefer Jeff's suggestion - guaranteeing that a good enough for drivers readq() and writeq() exist on all platforms even 32-bit platforms where the operation isn't inherently atomic.

For consistency's sake we really want to have readq() and writeq() available on all platforms. I remember that some IB cards require it to actually be a 64-bit transaction, otherwise they have to do funny workarounds. I think the best solution is to define ARCH_HAS_ATOMIC_READQ_WRITEQ and let drivers do their workarounds based on that. I've Cc'ed Roland because he should be able to explain the IB issue in detail.
Re: [sungem] proposal for a new locking strategy
On 11/6/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
On Sun, 5 Nov 2006 21:11:34 +0100 Eric Lemoine [EMAIL PROTECTED] wrote:
On 11/5/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
On Sun, 5 Nov 2006 18:52:45 +0100 Eric Lemoine [EMAIL PROTECTED] wrote:
On 11/5/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
On Sun, 5 Nov 2006 18:28:33 +0100 Eric Lemoine [EMAIL PROTECTED] wrote:

You could also just use net_tx_lock() now.

You mean netif_tx_lock()? Thanks for letting me know about that function. Yes, I may need it. tg3 and bnx2 use it to wake up the transmit queue:

	if (unlikely(netif_queue_stopped(tp->dev) &&
		     (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH))) {
		netif_tx_lock(tp->dev);
		if (netif_queue_stopped(tp->dev) &&
		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH))
			netif_wake_queue(tp->dev);
		netif_tx_unlock(tp->dev);
	}

2.6.17 didn't use it. Was it a bug? Thanks,

No, it was introduced in 2.6.18. The functions are just a wrapper around the network device transmit lock that is normally held. If the device does not need to acquire the lock during IRQ, it is a good alternative and avoids a second lock.

For transmit locking there are three common alternatives:

Method A: dev->queue_xmit_lock and per-device tx_lock
	send: dev->xmit_lock held by caller
	      dev->hard_start_xmit acquires netdev_priv(dev)->tx_lock
	irq:  netdev_priv(dev)->tx_lock acquired

Method B: dev->queue_xmit_lock only
	send: dev->xmit_lock held by caller
	irq:  schedules softirq (NAPI)
	napi_poll: calls netif_tx_lock() which acquires dev->xmit_lock

Method C: LLTX
	set dev->features LLTX
	send: no locks held by caller
	      dev->hard_start_xmit acquires netdev_priv(dev)->tx_lock
	irq:  netdev_priv(dev)->tx_lock acquired

Method A is the only one that works with 2.4 and early (2.6.8?) kernels.

Current sungem does Method C, and uses two locks: lock and tx_lock. What I was planning to do is Method B (which current tg3 uses). It seems to me that Method B is better than Method C. What do you think?

B is better than C because the transmit logic doesn't have to spin in the case of lock contention, but it is not a big difference.

Current sungem does C but uses try_lock() to acquire its private tx_lock. So it doesn't spin either in case of contention.

But the spin is still there, just more complex.. In qdisc_restart() processing of NETDEV_TX_LOCKED causes:

	spin_lock(dev->xmit_lock)
	q->requeue()
	netif_schedule(dev);
	SOFTIRQ: net_tx_action()
		qdisc_run() --> qdisc_restart()

So instead of spinning in tight loop, you end up with a longer code path.

Stephen, sorry for insisting a bit but I'm failing to see how B is different from C in that respect. With method B, in qdisc_restart(), if netif_tx_trylock() fails to acquire the lock then we also requeue(), etc. Same long code path in case of contention.

-- Eric
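The tg3 wake-up snippet quoted in this thread follows a check, lock, re-check pattern: test the condition cheaply without the lock, and only if it looks true take the transmit lock and test again before waking the queue. A user-space sketch of that pattern is given below. All names (`struct tx_ring`, `TX_WAKEUP_THRESH`, `tx_maybe_wake()`) are hypothetical, and a C11 `atomic_flag` spinlock stands in for the device transmit lock; it is an illustration of the pattern, not of the driver.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TX_WAKEUP_THRESH 4	/* hypothetical wake-up threshold */

/* Toy transmit ring; not a real driver structure. */
struct tx_ring {
	atomic_flag tx_lock;	/* stand-in for the device transmit lock */
	atomic_bool stopped;	/* set when the transmit path stops the queue */
	atomic_int avail;	/* free descriptors in the ring */
};

static void tx_ring_init(struct tx_ring *r, bool stopped, int avail)
{
	atomic_flag_clear(&r->tx_lock);
	atomic_init(&r->stopped, stopped);
	atomic_init(&r->avail, avail);
}

/* Condition shared by the unlocked check and the locked re-check. */
static bool should_wake(struct tx_ring *r)
{
	return atomic_load(&r->stopped) &&
	       atomic_load(&r->avail) > TX_WAKEUP_THRESH;
}

/* Returns true if this call woke the queue. */
static bool tx_maybe_wake(struct tx_ring *r)
{
	bool woke = false;

	if (!should_wake(r))		/* cheap check, no lock taken */
		return false;

	while (atomic_flag_test_and_set_explicit(&r->tx_lock,
						 memory_order_acquire))
		;			/* spin until the lock is free */

	if (should_wake(r)) {		/* re-check under the lock */
		atomic_store(&r->stopped, false);
		woke = true;
	}

	atomic_flag_clear_explicit(&r->tx_lock, memory_order_release);
	return woke;
}
```

The re-check matters because the transmit path may have changed the ring state between the first test and the moment the lock is acquired.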
Re: [PATCH] s2io ppc64 fix for readq/writeq
For consistency's sake we really want to have readq() and writeq() available on all platforms. I remember that some IB cards require it to actually be a 64-bit transaction, otherwise they have to do funny workarounds. I think the best solution is to define ARCH_HAS_ATOMIC_READQ_WRITEQ and let drivers do their workarounds based on that. I've Cc'ed Roland because he should be able to explain the IB issue in detail.

The issue I know about is drivers/infiniband/hw/mthca. The card has 64-bit doorbell registers, and the restriction is that if you write the doorbell with two 32-bit writes, you can't write anything else on the same register page in between writing the two halves. Since different CPUs might be doing stuff on the same doorbell page at the same time, there are two things we can do:

- If writeq() exists then use that and assume it will generate only a single bus transaction that can't let anything sneak in the middle. (That's a fairly safe assumption because the devices being driven are either 64-bit PCI-X or PCIe only)

- If writeq() doesn't exist, use a spinlock to protect access to each doorbell page.

ARCH_HAS_ATOMIC_READQ_WRITEQ would be fine for that, but of course the tricky thing is writing down the exact semantics that HAS_ATOMIC is actually promising.

- R.
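The second option described above (no atomic writeq(), so a lock guards each doorbell page) can be sketched like this. It is a user-space illustration only: `struct doorbell_page`, `doorbell_page_init()` and `doorbell_write64()` are hypothetical names, ordinary memory stands in for the register page, a C11 `atomic_flag` stands in for the kernel spinlock, and writing the high half first is an assumption made for the example.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy model of one doorbell page; not the mthca data structure. */
struct doorbell_page {
	atomic_flag lock;		/* serializes writers to this page */
	volatile uint32_t db[2];	/* one 64-bit doorbell as two halves */
};

static void doorbell_page_init(struct doorbell_page *pg)
{
	atomic_flag_clear(&pg->lock);
	pg->db[0] = 0;
	pg->db[1] = 0;
}

/*
 * Ring the doorbell with two 32-bit writes. Holding the per-page lock
 * across both writes keeps other writers that use the same lock from
 * touching the page between the two halves.
 */
static void doorbell_write64(struct doorbell_page *pg,
			     uint32_t hi, uint32_t lo)
{
	while (atomic_flag_test_and_set_explicit(&pg->lock,
						 memory_order_acquire))
		;	/* spin until no other writer holds the page */

	pg->db[0] = hi;	/* assumed order: high half first */
	pg->db[1] = lo;

	atomic_flag_clear_explicit(&pg->lock, memory_order_release);
}
```

With a single-transaction 64-bit write available, the lock would be unnecessary for this purpose, which is why a precise definition of what "atomic" promises matters to such a driver.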
Re: [sungem] proposal for a new locking strategy
On Mon, 6 Nov 2006 21:55:20 +0100 Eric Lemoine [EMAIL PROTECTED] wrote:
On 11/6/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
On Sun, 5 Nov 2006 21:11:34 +0100 Eric Lemoine [EMAIL PROTECTED] wrote:
On 11/5/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
On Sun, 5 Nov 2006 18:52:45 +0100 Eric Lemoine [EMAIL PROTECTED] wrote:
On 11/5/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
On Sun, 5 Nov 2006 18:28:33 +0100 Eric Lemoine [EMAIL PROTECTED] wrote:

You could also just use net_tx_lock() now.

You mean netif_tx_lock()? Thanks for letting me know about that function. Yes, I may need it. tg3 and bnx2 use it to wake up the transmit queue:

	if (unlikely(netif_queue_stopped(tp->dev) &&
		     (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH))) {
		netif_tx_lock(tp->dev);
		if (netif_queue_stopped(tp->dev) &&
		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH))
			netif_wake_queue(tp->dev);
		netif_tx_unlock(tp->dev);
	}

2.6.17 didn't use it. Was it a bug? Thanks,

No, it was introduced in 2.6.18. The functions are just a wrapper around the network device transmit lock that is normally held. If the device does not need to acquire the lock during IRQ, it is a good alternative and avoids a second lock.

For transmit locking there are three common alternatives:

Method A: dev->queue_xmit_lock and per-device tx_lock
	send: dev->xmit_lock held by caller
	      dev->hard_start_xmit acquires netdev_priv(dev)->tx_lock
	irq:  netdev_priv(dev)->tx_lock acquired

Method B: dev->queue_xmit_lock only
	send: dev->xmit_lock held by caller
	irq:  schedules softirq (NAPI)
	napi_poll: calls netif_tx_lock() which acquires dev->xmit_lock

Method C: LLTX
	set dev->features LLTX
	send: no locks held by caller
	      dev->hard_start_xmit acquires netdev_priv(dev)->tx_lock
	irq:  netdev_priv(dev)->tx_lock acquired

Method A is the only one that works with 2.4 and early (2.6.8?) kernels.

Current sungem does Method C, and uses two locks: lock and tx_lock. What I was planning to do is Method B (which current tg3 uses). It seems to me that Method B is better than Method C. What do you think?

B is better than C because the transmit logic doesn't have to spin in the case of lock contention, but it is not a big difference.

Current sungem does C but uses try_lock() to acquire its private tx_lock. So it doesn't spin either in case of contention.

But the spin is still there, just more complex.. In qdisc_restart() processing of NETDEV_TX_LOCKED causes:

	spin_lock(dev->xmit_lock)
	q->requeue()
	netif_schedule(dev);
	SOFTIRQ: net_tx_action()
		qdisc_run() --> qdisc_restart()

So instead of spinning in tight loop, you end up with a longer code path.

Stephen, sorry for insisting a bit but I'm failing to see how B is different from C in that respect. With method B, in qdisc_restart(), if netif_tx_trylock() fails to acquire the lock then we also requeue(), etc. Same long code path in case of contention.

Method C LLTX causes repeated softirq's which will be slower since the loop requires more instructions than a simple spin loop (Method B).

-- Stephen Hemminger [EMAIL PROTECTED]
Re: [RFC: 2.6 patch] hostap_80211_rx(): fix a use-after-free
On Mon, Nov 06, 2006 at 03:21:48PM +0100, Adrian Bunk wrote:

This patch fixes a use-after-free for skb spotted by the Coverity checker.

--- linux-2.6/drivers/net/wireless/hostap/hostap_80211_rx.c.old
+++ linux-2.6/drivers/net/wireless/hostap/hostap_80211_rx.c
@@ -1004,10 +1004,10 @@ void hostap_80211_rx(struct net_device *
 			if (local->hostapd && local->apdev) {
 				/* Send IEEE 802.1X frames to the user
 				 * space daemon for processing */
-				prism2_rx_80211(local->apdev, skb, rx_stats,
-						PRISM2_RX_MGMT);
 				local->apdevstats.rx_packets++;
 				local->apdevstats.rx_bytes += skb->len;
+				prism2_rx_80211(local->apdev, skb, rx_stats,
+						PRISM2_RX_MGMT);
 				goto rx_exit;

Network drivers set rx_packets and rx_bytes after netif_rx. And last_rx, too. The trick seems to be to use the pkt_len variable.
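The "use a pkt_len variable" suggestion amounts to saving the length before the packet is handed off, so the statistics can still be updated afterwards without touching freed memory. A user-space sketch of that idea follows; `struct pkt`, `struct rx_stats`, `deliver()` and `rx_one()` are hypothetical names, and `deliver()` merely frees the packet to model a consumer that takes ownership.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy packet and statistics types; not kernel structures. */
struct pkt {
	size_t len;
};

struct rx_stats {
	unsigned long rx_packets;
	unsigned long rx_bytes;
};

/* Hypothetical consumer: takes ownership of the packet and frees it. */
static void deliver(struct pkt *p)
{
	free(p);
}

/*
 * Hand the packet off first, then update the counters from a saved
 * copy of the length. After deliver() returns, p must not be used.
 */
static void rx_one(struct rx_stats *st, struct pkt *p)
{
	size_t pkt_len = p->len;	/* read while p is still ours */

	deliver(p);			/* ownership passes to the consumer */

	st->rx_packets++;
	st->rx_bytes += pkt_len;
}
```

Compared with moving the counter updates in front of the hand-off, this keeps the usual "deliver, then account" order while avoiding the dereference of a buffer that may already be gone.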
Re: tg3_read_partno(): possible array overrun
On Mon, 2006-11-06 at 10:45 +0100, Adrian Bunk wrote:

The Coverity checker noted the following in drivers/net/tg3.c:

-- snip --

The problem is that vpd_data[i + 2] could be vpd_data[255 + 2].

Thanks. This should fix it:

[TG3]: Fix array overrun in tg3_read_partno().

Use proper upper limits for the loops and check for all error conditions. The problem was noticed by Adrian Bunk.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 8f059b7..06e4f77 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -10212,7 +10212,7 @@ skip_phy_reset:
 static void __devinit tg3_read_partno(struct tg3 *tp)
 {
 	unsigned char vpd_data[256];
-	int i;
+	unsigned int i;
 	u32 magic;

 	if (tg3_nvram_read_swab(tp, 0x0, &magic))
@@ -10258,9 +10258,9 @@ static void __devinit tg3_read_partno(st
 	}

 	/* Now parse and find the part number. */
-	for (i = 0; i < 256; ) {
+	for (i = 0; i < 254; ) {
 		unsigned char val = vpd_data[i];
-		int block_end;
+		unsigned int block_end;

 		if (val == 0x82 || val == 0x91) {
 			i = (i + 3 +
@@ -10276,21 +10276,26 @@ static void __devinit tg3_read_partno(st
 			     (vpd_data[i + 1] +
 			      (vpd_data[i + 2] << 8)));
 		i += 3;
-		while (i < block_end) {
+
+		if (block_end > 256)
+			goto out_not_found;
+
+		while (i < (block_end - 2)) {
 			if (vpd_data[i + 0] == 'P' &&
 			    vpd_data[i + 1] == 'N') {
 				int partno_len = vpd_data[i + 2];

-				if (partno_len > 24)
+				i += 3;
+				if (partno_len > 24 || (partno_len + i) > 256)
 					goto out_not_found;

 				memcpy(tp->board_part_number,
-				       &vpd_data[i + 3],
-				       partno_len);
+				       &vpd_data[i], partno_len);

 				/* Success. */
 				return;
 			}
+			i += 3 + vpd_data[i + 2];
 		}

 		/* Part number not found. */
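The bounds-checking idea in the patch above (never read a length byte or copy a field that would run past the buffer) can be shown with a simplified, stand-alone scanner. This is a sketch under assumptions: `vpd_find_partno()` is a hypothetical helper, and it only walks a flat run of keyword entries, each made of a two-byte keyword, a one-byte length and the data; it does not handle the surrounding resource tags that the driver code skips.

```c
#include <stddef.h>
#include <string.h>

/*
 * Search a run of keyword entries for "PN" and copy its value into
 * out as a NUL-terminated string. Returns the value length, or -1 if
 * the keyword is missing, an entry is truncated, or out is too small.
 */
static int vpd_find_partno(const unsigned char *vpd, size_t vpd_len,
			   char *out, size_t out_size)
{
	size_t i = 0;

	/* Each entry needs a 3-byte header before anything is read. */
	while (i + 3 <= vpd_len) {
		size_t field_len = vpd[i + 2];
		size_t data = i + 3;

		/* Reject an entry whose payload would leave the buffer. */
		if (field_len > vpd_len - data)
			return -1;

		if (vpd[i] == 'P' && vpd[i + 1] == 'N') {
			/* Leave room for the terminating NUL. */
			if (field_len >= out_size)
				return -1;
			memcpy(out, &vpd[data], field_len);
			out[field_len] = '\0';
			return (int)field_len;
		}

		i = data + field_len;	/* advance to the next entry */
	}

	return -1;
}
```

Every index used for a read is checked against `vpd_len` first, which is the property the original loop lacked when `i` approached the end of the 256-byte array.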
[PATCH] bcm43xx: Drain TX status before starting IRQs
From: Michael Buesch [EMAIL PROTECTED]

Drain the Microcode TX-status-FIFO before we enable IRQs. This is required, because the FIFO may still have entries left from a previous run. Those would immediately fire after enabling IRQs and would lead to an oops in the DMA TXstatus handling code.

Signed-off-by: Michael Buesch [EMAIL PROTECTED]
Signed-off-by: Larry Finger [EMAIL PROTECTED]
---

John,

Please apply this to wireless-2.6 and push it to 2.6.19. It has already been sent to -stable for inclusion in 2.6.18.3.

This patch replaces one with the same name that was sent by Michael on October 19. It had a bug, fixed in this version, that would lock up certain core revisions.

Larry

Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
===
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
@@ -1467,6 +1467,23 @@ static void handle_irq_transmit_status(s
 	}
 }

+static void drain_txstatus_queue(struct bcm43xx_private *bcm)
+{
+	u32 dummy;
+
+	if (bcm->current_core->rev < 5)
+		return;
+	/* Read all entries from the microcode TXstatus FIFO
+	 * and throw them away.
+	 */
+	while (1) {
+		dummy = bcm43xx_read32(bcm, BCM43xx_MMIO_XMITSTAT_0);
+		if (!dummy)
+			break;
+		dummy = bcm43xx_read32(bcm, BCM43xx_MMIO_XMITSTAT_1);
+	}
+}
+
 static void bcm43xx_generate_noise_sample(struct bcm43xx_private *bcm)
 {
 	bcm43xx_shm_write16(bcm, BCM43xx_SHM_SHARED, 0x408, 0x7F7F);
@@ -3569,6 +3586,7 @@ int bcm43xx_select_wireless_core(struct
 	bcm43xx_macfilter_clear(bcm, BCM43xx_MACFILTER_ASSOC);
 	bcm43xx_macfilter_set(bcm, BCM43xx_MACFILTER_SELF, (u8 *)(bcm->net_dev->dev_addr));
 	bcm43xx_security_init(bcm);
+	drain_txstatus_queue(bcm);
 	ieee80211softmac_start(bcm->net_dev);

 	/* Let's go! Be careful after enabling the IRQs.
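The drain-before-enable idea can be modeled in user space with a fake status FIFO. This is a simplified sketch: `struct fake_fifo`, `fifo_read_status()` and `drain_status_fifo()` are hypothetical names, the FIFO is an in-memory array, and each entry is a single word that reads as zero once the FIFO is empty (the patch above reads two registers per entry).

```c
#include <stddef.h>
#include <stdint.h>

/* In-memory stand-in for a hardware status FIFO. */
struct fake_fifo {
	const uint32_t *entries;	/* pending status words, all non-zero */
	size_t count;			/* number of pending entries */
	size_t pos;			/* next entry to pop */
};

/* Pop the next status word; reads as 0 once the FIFO is empty. */
static uint32_t fifo_read_status(struct fake_fifo *f)
{
	if (f->pos >= f->count)
		return 0;
	return f->entries[f->pos++];
}

/*
 * Read and discard stale entries until the FIFO reports empty.
 * Returns how many entries were thrown away.
 */
static size_t drain_status_fifo(struct fake_fifo *f)
{
	size_t drained = 0;

	while (fifo_read_status(f) != 0)
		drained++;

	return drained;
}
```

Calling the drain routine once before interrupts are enabled leaves nothing from an earlier run for the interrupt handler to trip over; a second call finds the FIFO already empty.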
[PATCH] bcm43xx: Add error checking in bcm43xx_sprom_write()
From: Adrian Bunk [EMAIL PROTECTED]

The Coverity checker noted that these if (err)'s couldn't ever be true. It seems the intention was to check the return values of the bcm43xx_pci_write_config32()'s?

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
Signed-off-by: Larry Finger [EMAIL PROTECTED]

Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
===
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
@@ -750,7 +750,7 @@ int bcm43xx_sprom_write(struct bcm43xx_p
 	if (err)
 		goto err_ctlreg;
 	spromctl |= 0x10; /* SPROM WRITE enable. */
-	bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
+	err = bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
 	if (err)
 		goto err_ctlreg;
 	/* We must burn lots of CPU cycles here, but that does not
@@ -772,7 +772,7 @@ int bcm43xx_sprom_write(struct bcm43xx_p
 		mdelay(20);
 	}
 	spromctl &= ~0x10; /* SPROM WRITE enable. */
-	bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
+	err = bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
 	if (err)
 		goto err_ctlreg;
 	mdelay(500);
Re: [PATCH] bcm43xx: Add error checking in bcm43xx_sprom_write()
On Monday 06 November 2006 16:48, Larry Finger wrote:

From: Adrian Bunk [EMAIL PROTECTED]

The Coverity checker noted that these if (err)'s couldn't ever be true. It seems the intention was to check the return values of the bcm43xx_pci_write_config32()'s?

Whoops, I thought I had fixed this bug long time ago. The patch is correct.

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
Signed-off-by: Larry Finger [EMAIL PROTECTED]
Signed-off-by: Michael Buesch [EMAIL PROTECTED]

Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
===
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
@@ -750,7 +750,7 @@ int bcm43xx_sprom_write(struct bcm43xx_p
 	if (err)
 		goto err_ctlreg;
 	spromctl |= 0x10; /* SPROM WRITE enable. */
-	bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
+	err = bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
 	if (err)
 		goto err_ctlreg;
 	/* We must burn lots of CPU cycles here, but that does not
@@ -772,7 +772,7 @@ int bcm43xx_sprom_write(struct bcm43xx_p
 		mdelay(20);
 	}
 	spromctl &= ~0x10; /* SPROM WRITE enable. */
-	bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
+	err = bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
 	if (err)
 		goto err_ctlreg;
 	mdelay(500);

-- Greetings Michael.
Re: [PATCH 0/11] convert d80211 to a proper protocol
[reordering a bit]

This changes the 'cookie' that d80211 returns from alloc_hw to be an opaque value to the driver. Turned out that it wasn't such a great idea but since it was generally a clean up I kept this patch to base my other patches on.

ACK.

What happened to the

d80211: add a function to get the wiphy index
d80211: add a perm_addr hardware property
d80211: add a struct device* hardware property
d80211: add a ethtool_ops hardware property

patches?

Well after some chat with a few people I decided that it was stupid and not very maintainable to copy all the fields in net_device to a new structure.

009-d80211-convert-spaces.patch
d80211: convert leading spaces to tabs
I hated working on the code, so I did this. The next patch breaks everything anyway.

NAK. There are too many patches pending. Let's do this just before merging.

Oh come off it! It's really stupid to have to check all the tabs/spaces all the time. The patch changes 451 lines. And wiggle can handle that just fine. Besides, if you do s/^\+ /+\t/ s/^- /-\t/ s//\t/ on your patches, they'll be fine too.

This is too big a patch for a review,

Yeah. It's pretty bad actually, but I couldn't really find a good way to split it into logical chunks.

* The mdev no longer has a sub_if_data attached (why ever did it??) Its private area is for the driver since we don't create it but the driver does. I did keep the notation of mdev/master all through, but it's no longer the stack's device. Keep that in mind.

This definitely breaks AP mode. The code heavily (ab)uses the fact that the master device is in fact an AP device. I tried to fix that but it was so difficult I gave up. The whole RX path would need to be rewritten (and even that is probably not enough).

Bugger. I didn't notice that. I'll have a look. That is indeed a showstopper.

As this will be fixed for free when we have native 802.11 devices, I don't think we need to do anything about it now.

I don't think I understand this. I mean, my patch actually gives us native 802.11 devices by making the drivers register those and then handling them virtually similar to how 8021q handles ethernet devices. I honestly thought that this was the plan for said native 802.11 devices.

* sysfs layout changed. There is no wiphy or an ieee80211 class any more, the attributes that used to be there are now in the net_device that the driver registered, and our attributes are below the devices we created.

You want an ieee80211 class. Once you get rid of a master interface you need something with per-hardware information, statistics etc.

Yeah, I gave up trying to get rid of the master interface in favour of having a native 802.11 device which is registered by the phy driver instead.

* sysfs layout changed. There is no wiphy or an ieee80211 class any more, the attributes that used to be there are now in the net_device that the driver registered, and our attributes are below the devices we created.

Doesn't belong to this patch.

Had to be here initially due to the way I did things, but ok, probably changeable.

johannes
Re: [sungem] proposal for a new locking strategy
On 11/6/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
On Mon, 6 Nov 2006 21:55:20 +0100 Eric Lemoine [EMAIL PROTECTED] wrote:
On 11/6/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
On Sun, 5 Nov 2006 21:11:34 +0100 Eric Lemoine [EMAIL PROTECTED] wrote:
On 11/5/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
On Sun, 5 Nov 2006 18:52:45 +0100 Eric Lemoine [EMAIL PROTECTED] wrote:
On 11/5/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
On Sun, 5 Nov 2006 18:28:33 +0100 Eric Lemoine [EMAIL PROTECTED] wrote:

You could also just use net_tx_lock() now.

You mean netif_tx_lock()? Thanks for letting me know about that function. Yes, I may need it. tg3 and bnx2 use it to wake up the transmit queue:

	if (unlikely(netif_queue_stopped(tp->dev) &&
		     (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH))) {
		netif_tx_lock(tp->dev);
		if (netif_queue_stopped(tp->dev) &&
		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH))
			netif_wake_queue(tp->dev);
		netif_tx_unlock(tp->dev);
	}

2.6.17 didn't use it. Was it a bug? Thanks,

No, it was introduced in 2.6.18. The functions are just a wrapper around the network device transmit lock that is normally held. If the device does not need to acquire the lock during IRQ, it is a good alternative and avoids a second lock.

For transmit locking there are three common alternatives:

Method A: dev->queue_xmit_lock and per-device tx_lock
	send: dev->xmit_lock held by caller
	      dev->hard_start_xmit acquires netdev_priv(dev)->tx_lock
	irq:  netdev_priv(dev)->tx_lock acquired

Method B: dev->queue_xmit_lock only
	send: dev->xmit_lock held by caller
	irq:  schedules softirq (NAPI)
	napi_poll: calls netif_tx_lock() which acquires dev->xmit_lock

Method C: LLTX
	set dev->features LLTX
	send: no locks held by caller
	      dev->hard_start_xmit acquires netdev_priv(dev)->tx_lock
	irq:  netdev_priv(dev)->tx_lock acquired

Method A is the only one that works with 2.4 and early (2.6.8?) kernels.

Current sungem does Method C, and uses two locks: lock and tx_lock. What I was planning to do is Method B (which current tg3 uses). It seems to me that Method B is better than Method C. What do you think?

B is better than C because the transmit logic doesn't have to spin in the case of lock contention, but it is not a big difference.

Current sungem does C but uses try_lock() to acquire its private tx_lock. So it doesn't spin either in case of contention.

But the spin is still there, just more complex.. In qdisc_restart() processing of NETDEV_TX_LOCKED causes:

	spin_lock(dev->xmit_lock)
	q->requeue()
	netif_schedule(dev);
	SOFTIRQ: net_tx_action()
		qdisc_run() --> qdisc_restart()

So instead of spinning in tight loop, you end up with a longer code path.

Stephen, sorry for insisting a bit but I'm failing to see how B is different from C in that respect. With method B, in qdisc_restart(), if netif_tx_trylock() fails to acquire the lock then we also requeue(), etc. Same long code path in case of contention.

Method C LLTX causes repeated softirq's which will be slower since the loop requires more instructions than a simple spin loop (Method B).

What I'm saying above is that Method B also causes repeated tx softirqs in case of contention on netif_tx_lock. The code path is:

	netif_tx_trylock() fails -> requeue() -> netif_schedule() -> raise_softirq(NET_TX_SOFTIRQ)

Am I missing anything?

-- Eric
Re: SKGE backport to 2.4 : success
On Mon, Nov 06, 2006 at 10:56:09AM -0800, Stephen Hemminger wrote: On Sat, 4 Nov 2006 22:08:55 +0100 Willy Tarreau [EMAIL PROTECTED] wrote:

Hi Stephen, I don't know if you received my mail since I got no reply. Thanks in advance for your comments, Willy

On Sat, Oct 28, 2006 at 10:57:07PM +0200, Willy Tarreau wrote:

Hi Stephen, In my own kernels, I've added your backport of SKGE to 2.4 that I found here: http://developer.osdl.org/shemminger/releases/skge-sky2-backport.tar.bz2

It seems to work pretty well compared to the original syskonnect driver (up to and including 8.36). Several people around me have reported very slow NFS operations with the official driver, which I finally attributed to a strange effect of UDP packets not going out after a while until they get pushed by a TCP packet. I even noticed the problem at the company and we moved the NFS server to an unused 100 Mbps card to work around the problem before being able to fully analyze it.

It seems your driver is getting mature and its performance is very close to the official one, while its code is smaller and apparently more reliable. I was thinking about merging it in mainline 2.4 as a fix for people having trouble with the syskonnect driver. It might also be easier to backport fixes from 2.6 to 2.4 when the driver is the same. I don't think we risk any regression because it won't replace an existing driver, but will provide one to people who are used to downloading new versions from an external tree.

Also, I'm not yet sure whether I would also backport the sky2 driver, because I know about a handful of boxes running in production with the official one with 88E8053 chips at high packet rates with no trouble at all. Anyway, as long as the backport does not prevent them from using the external driver, there should be no problem.

I'd like to get your opinion on this matter, and of course, Jeff's and Davem's. Thanks in advance, Willy

The backport needs to be updated. It is of older code.
I plan to do a new backport this week. The backport version doesn't use NAPI, because of issues with not wanting to change netdevice.h. For a good 2.4 version, I would make a version that was closer to 2.6 code (using NAPI).

That would be perfect, it would make backport of fixes even easier. I have turned the last version into a patch against 2.4.33 for in-tree inclusion, so if you're interested in getting it for the Config.in, Makefiles and Configure.help, do not hesitate.

I did the backport because one of the equipment donors gave a VPN box whose base OS is RHEL based on 2.4.

It's amazing how having the hardware stimulates development, isn't it? :-)

Thanks, Willy
Re: [PATCH/RFC] add netpoll support for gianfar
On Nov 6, 2006, at 05:19, Vitaly Wool wrote:

The patch inlined below adds NET_POLL_CONTROLLER support for gianfar network driver.

 drivers/net/gianfar.c | 34 ++
 1 file changed, 34 insertions(+)

Signed-off-by: Vitaly Wool [EMAIL PROTECTED]

Index: powerpc/drivers/net/gianfar.c
===
--- powerpc.orig/drivers/net/gianfar.c
+++ powerpc/drivers/net/gianfar.c
@@ -133,6 +133,9 @@ static void gfar_set_hash_for_addr(struc
 #ifdef CONFIG_GFAR_NAPI
 static int gfar_poll(struct net_device *dev, int *budget);
 #endif
+#ifdef CONFIG_NET_POLL_CONTROLLER
+static void gfar_netpoll(struct net_device *dev);
+#endif
 int gfar_clean_rx_ring(struct net_device *dev, int rx_work_limit);
 static int gfar_process_frame(struct net_device *dev, struct sk_buff *skb, int length);
 static void gfar_vlan_rx_register(struct net_device *netdev,
@@ -260,6 +263,9 @@ static int gfar_probe(struct platform_de
     dev->poll = gfar_poll;
     dev->weight = GFAR_DEV_WEIGHT;
 #endif
+#ifdef CONFIG_NET_POLL_CONTROLLER
+    dev->poll_controller = gfar_netpoll;
+#endif
     dev->stop = gfar_close;
     dev->get_stats = gfar_get_stats;
     dev->change_mtu = gfar_change_mtu;
@@ -1536,6 +1542,34 @@ static int gfar_poll(struct net_device *
 }
 #endif
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+/*
+ * Polling 'interrupt' - used by things like netconsole to send skbs
+ * without having to re-enable interrupts. It's not called while
+ * the interrupt routine is executing.
+ */
+static void gfar_netpoll(struct net_device *dev)
+{
+    struct gfar_private *priv = netdev_priv(dev);
+
+    /* If the device has multiple interrupts, run tx/rx */
+    if (priv->einfo->device_flags & FSL_GIANFAR_DEV_HAS_MULTI_INTR) {
+        disable_irq(priv->interruptTransmit);
+        disable_irq(priv->interruptReceive);
+        disable_irq(priv->interruptError);
+        gfar_transmit(priv->interruptTransmit, dev, NULL);
+        gfar_receive(priv->interruptReceive, dev, NULL);

You are passing extra arguments, here

+        enable_irq(priv->interruptError);
+        enable_irq(priv->interruptReceive);
+        enable_irq(priv->interruptTransmit);
+    } else {
+        disable_irq(priv->interruptTransmit);
+        gfar_interrupt(priv->interruptTransmit, dev, NULL);

and here (pt_regs got eliminated).

Also, a few more comments:

1) Do we need the disable/enable irq stuff? It seems like we should be able to either just *mask* the interrupts at the controller, or rely on the locks to disable the interrupts.

2) If we are calling gfar_transmit and gfar_receive, shouldn't we call gfar_error?

3) I think it should be possible to just call gfar_interrupt() in every situation, but I'm not very familiar with net poll's requirements (You can add that into your evaluation of #1, too).

Andy
Zero checksum in netconsole/netdump packets
Hello,

I was reading some tcpdump's of netdump traffic today, and I realized that all of the packets that go from the crashing machine to the netdump server have a zero checksum. Looking at the code, it looks like netconsole/netdump use the function netpoll_send_udp to send out the packets. However, in netdump_send_udp, the checksum is set to 0, and never seems to be computed. Is this intentional, or just an oversight? I would think that we would always want to compute the UDP checksum, but there might be something I am overlooking. Incidentally, it seems like the only user of netpoll_send_udp is netconsole (and netdump in RedHat kernels).

Assuming that this is just an oversight, attached is a simple patch to compute the UDP checksum in netpoll_send_udp.

Signed-off-by: Chris Lalancette [EMAIL PROTECTED]

--- linux-2.6/net/core/netpoll.c.orig 2006-11-06 18:16:58.0 -0500
+++ linux-2.6/net/core/netpoll.c 2006-11-06 18:31:20.0 -0500
@@ -356,6 +356,10 @@ void netpoll_send_udp(struct netpoll *np
     put_unaligned(htonl(np->remote_ip), &(iph->daddr));
     iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
 
+    udph->check = csum_tcpudp_magic(iph->saddr, iph->daddr, udp_len,
+                                    IPPROTO_UDP,
+                                    csum_partial((unsigned char *)udph, udp_len, 0));
+
     eth = (struct ethhdr *) skb_push(skb, ETH_HLEN);
     skb->mac.raw = skb->data;
     skb->protocol = eth->h_proto = htons(ETH_P_IP);
Re: [PATCH] s2io ppc64 fix for readq/writeq
Generally the kernel code should write the two 32-bit chunks to the memory-mapped region in order (low dword first), and let things take care of themselves from there. That's pretty much the implementation that -every- driver copies, when they need readq/writeq to work on a 32-bit platform.

What do you mean by low dword first? For example, the implementation in the s2io driver does:

    static inline u64 readq(void __iomem *addr)
    {
        u64 ret = 0;
        ret = readl(addr + 4);
        ret <<= 32;
        ret |= readl(addr);
        return ret;
    }

    static inline void writeq(u64 val, void __iomem *addr)
    {
        writel((u32) (val), addr);
        writel((u32) (val >> 32), (addr + 4));
    }

As you can see, it reads the -second- dword first (high order dword in little endian), but writes the first dword first (low order dword in little endian). If there is any logic here, it's card specific. Or is this really what PCI does when doing 64-bit accesses on a 32-bit PCI bus? I would have expected the latter (what write does) but this driver does it in reverse on reads.

I'm tempted to go to the simple #define readq readq for now until we clear that up.

Ben.
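The access order discussed in this message can be tried out in user space. The following is only a sketch: fake_readl() and fake_writel() are hypothetical stand-ins that operate on an ordinary byte buffer in little-endian layout, not on real MMIO, and split_readq()/split_writeq() are illustrative names that mirror the order used by the quoted s2io helpers (high dword first on read, low dword first on write). It says nothing about atomicity, which is the actual point of contention in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for readl(): little-endian 32-bit load from a buffer. */
static uint32_t fake_readl(const uint8_t *addr)
{
    assert(addr != NULL);
    return (uint32_t)addr[0] |
           ((uint32_t)addr[1] << 8) |
           ((uint32_t)addr[2] << 16) |
           ((uint32_t)addr[3] << 24);
}

/* Hypothetical stand-in for writel(): little-endian 32-bit store to a buffer. */
static void fake_writel(uint32_t val, uint8_t *addr)
{
    assert(addr != NULL);
    addr[0] = (uint8_t)(val & 0xff);
    addr[1] = (uint8_t)((val >> 8) & 0xff);
    addr[2] = (uint8_t)((val >> 16) & 0xff);
    addr[3] = (uint8_t)((val >> 24) & 0xff);
}

/* Two 32-bit reads, high dword (addr + 4) first, as in the quoted readq(). */
static uint64_t split_readq(const uint8_t *addr)
{
    uint64_t ret = fake_readl(addr + 4);

    ret <<= 32;
    ret |= fake_readl(addr);
    return ret;
}

/* Two 32-bit writes, low dword (addr) first, as in the quoted writeq(). */
static void split_writeq(uint64_t val, uint8_t *addr)
{
    fake_writel((uint32_t)val, addr);
    fake_writel((uint32_t)(val >> 32), addr + 4);
}
```

Writing 0x1122334455667788 into an 8-byte buffer with split_writeq() and reading it back with split_readq() returns the same value; between the two 32-bit halves, however, another writer could slip in, which is why the thread distinguishes these helpers from a true 64-bit bus transaction.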
Re: [PATCH 1/1] Net: kconfig, correct traffic shaper
From: Jeff Garzik [EMAIL PROTECTED] Date: Mon, 06 Nov 2006 02:52:02 -0500 ACK from me, though I think that since it relates to traffic schedulers I think this patch should be merged through DaveM... I've merged it into my tree, thanks everyone.
Re: [PATCH 2/3] add dev_to_node()
On Sun, Nov 05, 2006 at 12:22:37AM -0800, David Miller wrote: Looks good to me. So what's the right path to get this in? There's one patch touching MM code, one adding something to the driver core and then finally a networking patch depending on the previous two. Do you want to take them all and send them in through the networking tree? Or should we put the burden on Andrew?
[PATCH] don't use highmem in tcp hash size calculation
This patch removes consideration of high memory when determining TCP hash table sizes. Taking into account high memory results in tcp_mem values that are too large.

Signed-off-by: John Heffner [EMAIL PROTECTED]

---
commit ea55b7c31b47edf90132baea9a088da3bbe2bb5c
tree 82311e12d4e4e006fba1688cb537de06cf7a4e4b
parent 4f6f9ba021f8a2149238f7c081cd7cf55c70c775
author John Heffner [EMAIL PROTECTED] Mon, 06 Nov 2006 20:03:01 -0500
committer John Heffner [EMAIL PROTECTED] Mon, 06 Nov 2006 20:03:01 -0500

 net/ipv4/tcp.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 66e9a72..4322318 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2270,7 +2270,7 @@ void __init tcp_init(void)
                     thash_entries,
                     (num_physpages >= 128 * 1024) ?
                     13 : 15,
-                    HASH_HIGHMEM,
+                    0,
                     &tcp_hashinfo.ehash_size,
                     NULL,
                     0);
@@ -2286,7 +2286,7 @@ void __init tcp_init(void)
                     tcp_hashinfo.ehash_size,
                     (num_physpages >= 128 * 1024) ?
                     13 : 15,
-                    HASH_HIGHMEM,
+                    0,
                     &tcp_hashinfo.bhash_size,
                     NULL,
                     64 * 1024);
RE: [PATCH] s2io ppc64 fix for readq/writeq
-Original Message- From: Roland Dreier [mailto:[EMAIL PROTECTED] Sent: Monday, November 06, 2006 12:55 PM To: Christoph Hellwig Cc: Ramkrishna Vepa; Benjamin Herrenschmidt; Jeff Garzik; Linus Torvalds; netdev@vger.kernel.org; [EMAIL PROTECTED] Subject: Re: [PATCH] s2io ppc64 fix for readq/writeq

For consistency's sake we really want to have readq() and writeq() available on all platforms. I remember that some IB cards require it to actually be a 64-bit transaction, otherwise they have to do funny workarounds. I think the best solution is to define ARCH_HAS_ATOMIC_READQ_WRITEQ and let drivers do their workarounds based on that. I've Cc'ed Roland because he should be able to explain the IB issue in detail.

The issue I know about is drivers/infiniband/hw/mthca. The card has 64-bit doorbell registers, and the restriction is that if you write the doorbell with two 32-bit writes, you can't write anything else on the same register page in between writing the two halves. Since different CPUs might be doing stuff on the same doorbell page at the same time, there are two things we can do:

- If writeq() exists then use that and assume it will generate only a single bus transaction that can't let anything sneak in the middle. (That's a fairly safe assumption because the devices being driven are either 64-bit PCI-X or PCIe only)

- If writeq() doesn't exist, use a spinlock to protect access to each doorbell page.

ARCH_HAS_ATOMIC_READQ_WRITEQ would be fine for that, but of course the tricky thing is writing down the exact semantics that HAS_ATOMIC is actually promising.

- R.

[Ram] If the writes are broken up into 32-bit writes, they are posted to the bridge and need to be flushed with a lock around the whole access. This is in the domain of the driver and need not be part of the platform specific code.

Ram
Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets
On 11/6/06, Eric Dumazet [EMAIL PROTECTED] wrote: On Monday 06 November 2006 10:46, Zhao Xiaoming wrote: 2006/11/6, Eric Dumazet [EMAIL PROTECTED]: On Monday 06 November 2006 09:59, Zhao Xiaoming wrote: Thank you again for your help. To have more detailed statistic data, I did another round of test and gathered some data. I give the overall description here and detailed /proc/net/sockstat, /proc/meminfo, /proc/slabinfo and /proc/buddyinfo follows. = slab mem costtcp mem pages lowmem free with traffic: 254668KB 34693 38772KB without traffic: 104080KB 1 702652KB = Thank you for detailed infos. It appears you have an extensive use of threads (about 1), since : task_struct10095 10095 136031 : tunables 24 12 8 : slabdata 3365 3365 0 Each thread has a kernel stack, 8KB (ie 2 pages, order-1 allocation), plus a user vma vm_area_struct 21346 21504 92 421 : tunables 120 60 8 : slabdata512512 0 Most likely you dont need that much threads. A program with fewer threads will perform better and use less ram. Thanks for the comments. I known the threads may cost many memory. However, I already excluded them from the statistics. The 'after test' info was gotten while the 1 threads running but no traffics relayed. You may look at the meminfo of 'after test', there is still 104080 kB slab memory which should already included the thread kernel memory cost (8K*1=80MB). I know 1 threads are not necessary and just use the simple logic to do some test. In fact, your kernel has CONFIG_4KSTACKS, kernel thread stacks use 4K instead of 8K. If you want to increase LOWMEM, (and keep 32bits kernel), you can chose a 2G/2G user/kernel split, instead of the 3G/1G default split. (see config : CONFIG_VMSPLIT_2G) Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Thank you for your advice. 
I know increasing LOWMEM could help, but now my concern is why I lose my 500M bytes of memory after excluding all known memory costs.
Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets
On 11/7/06, Stephen Hemminger [EMAIL PROTECTED] wrote: Eric Dumazet wrote: Zhao Xiaoming a écrit :

Dears, I'm running a linux box with kernel version 2.6.16. The hardware has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC cards are Intel 82571 on the PCI-e bus. The box is acting as an ethernet bridge between 2 Gigabit Ethernets. By configuring ebtables and iptables, an application is running as a TCP proxy which intercepts all TCP connection requests from the network and sets up another TCP connection to the actual server. The TCP proxy then relays all traffic in both directions.

The problem is the memory. Since the box must support thousands of concurrent connections, I know the memory size of ZONE_NORMAL would be a bottleneck as TCP packets would need many buffers. After setting the upper limit of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem to 32K bytes, our test began.

My test scenario employs 2000 concurrent downloading connections to an IIS server's port 80. The throughput is about 500~600 Mbps which is limited by the capability of the client application. Because all traffic is from server to client and the capability of the client machine is the bottleneck, I believe the receiver side of the sockets connected with the server and the sender side of the sockets connected with the client should be filled with packets in the corresponding windows. Thus, roughly there should be about 32K * 2000 + 32K * 2000 = 128M bytes of memory occupied by the TCP/IP stack for packet buffering.

Data from slabtop confirmed it: it's about 140M bytes of memory cost after I start the traffic. That reasonably matched my estimation. However, /proc/meminfo had a different story. The 'LowFree' dropped from about 710M to 80M. In other words, there's an additional 500M of memory in ZONE_NORMAL allocated by someone other than the slab. Why?

The amount of memory per socket is controlled by the socket buffering. Your application could be setting the value by calling setsockopt().
Otherwise, the tcp memory is limited by the sysctl settings tcp_rmem (receiver) and tcp_wmem (sender). For example on this server:

    $ cat /proc/sys/net/ipv4/tcp_wmem
    4096 16384 131072

Each sending socket would start with 16K of buffering, but could grow up to 128K based on TCP send autotuning.

Of course I can change the TCP buffers and I already described that I set the upper limits of both tcp_rmem and tcp_wmem to 32K. And if you go through my former posts, you should notice that the TCP stack on my machine only occupied 34K memory pages for buffering, which is close to my theoretical estimation: 128M. But at the same time, my free LOWMEM size decreased from over 700M to less than 100M. The question is where the additional 500M bytes have gone?
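As a rough check of the arithmetic in this thread, the estimate can be recomputed with a few lines of C. This is only a sketch: the function names are made up for illustration and the 4096-byte page size is an assumption (typical for 32-bit x86); the 32K buffer limits, the 2000 proxied connections and the 34693-page figure are the numbers quoted in the posts.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Worst-case buffer bytes for a proxy: each proxied connection has one
 * socket whose receive buffer fills up (rmem_max) and one socket whose
 * send buffer fills up (wmem_max).
 */
static uint64_t tcp_buffer_estimate(uint64_t conns, uint64_t rmem_max,
                                    uint64_t wmem_max)
{
    return conns * (rmem_max + wmem_max);
}

/* Convert a page count (as reported in /proc/net/sockstat) to bytes. */
static uint64_t pages_to_bytes(uint64_t pages, uint64_t page_size)
{
    assert(page_size > 0);
    return pages * page_size;
}
```

With 2000 connections and 32768-byte limits the estimate is 131072000 bytes (the "128M" of the post, about 125 MiB), and 34693 pages of 4096 bytes come to 142102528 bytes (about 135 MiB), so the two figures do agree to within roughly 10%. The unexplained part is the remaining drop in LowFree, not the TCP buffers themselves.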
Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets
Zhao Xiaoming wrote: On 11/6/06, Eric Dumazet [EMAIL PROTECTED] wrote:

In fact, your kernel has CONFIG_4KSTACKS, kernel thread stacks use 4K instead of 8K. If you want to increase LOWMEM (and keep a 32-bit kernel), you can choose a 2G/2G user/kernel split, instead of the 3G/1G default split. (see config: CONFIG_VMSPLIT_2G) Eric

Thank you for your advice. I know increasing LOWMEM could help, but now my concern is why I lose my 500M bytes of memory after excluding all known memory costs.

Unfortunately you don't provide very many details. AFAIK you didn't even say which version of Linux you run, which programs you run... You keep asking where you 'lost' your mem, it's quite bugging. Maybe some Oracles on this list will see the light for you, before exchanging 100 mails with you?
Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets
The latest update: It seems that the Linux kernel memory management mechanisms, including the buddy and slab algorithms, are not very efficient under my test conditions, where the TCP stack requires a lot of (hundreds of MB) packet buffers and releases them very frequently. Here is the proof. After changing my kernel configuration to use the 2/2 VM split, LOWMEM consumption dropped to 270M bytes, compared with 640M bytes on the 1/3 kernel. All test conditions are the same and the memory pages allocated by the TCP stack are also the same, 34K ~ 38K pages. In other words, the 'lost' memory changed from ~500M to ~130M. Thus, I can only guess that having many more free pages makes the slab/buddy algorithms more efficient and waste less memory. Finally I got what I want. Thank you all for your help and advice. Xiaoming.
Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets
On 11/7/06, Eric Dumazet [EMAIL PROTECTED] wrote: Zhao Xiaoming wrote: On 11/6/06, Eric Dumazet [EMAIL PROTECTED] wrote:

In fact, your kernel has CONFIG_4KSTACKS, kernel thread stacks use 4K instead of 8K. If you want to increase LOWMEM (and keep a 32-bit kernel), you can choose a 2G/2G user/kernel split, instead of the 3G/1G default split. (see config: CONFIG_VMSPLIT_2G) Eric

Thank you for your advice. I know increasing LOWMEM could help, but now my concern is why I lose my 500M bytes of memory after excluding all known memory costs.

Unfortunately you don't provide very many details. AFAIK you didn't even say which version of Linux you run, which programs you run... You keep asking where you 'lost' your mem, it's quite bugging. Maybe some Oracles on this list will see the light for you, before exchanging 100 mails with you?

I think I already gave the kernel version and introduced my application in the first post. What further details do you want? The reason I keep asking about the 'lost mem' is that I want to focus on the problem, not on workarounds that may lead to further problems if I keep increasing the concurrency scale. Anyway, since the problem is already solved (see my last post), I'd like to thank you for the help. Xiaoming.
Re: [PATCH 2/3] add dev_to_node()
On Sun, Nov 05, 2006 at 12:53:23AM +0100, Christoph Hellwig wrote: On Sat, Nov 04, 2006 at 06:06:48PM -0500, Dave Jones wrote: On Sat, Nov 04, 2006 at 11:56:29PM +0100, Christoph Hellwig wrote:

This will break the compile for !NUMA if someone ends up doing a bisect and lands here as a bisect point. You introduce this nice wrapper..

The dev_to_node wrapper is not enough as we can't assign to (-1) for the non-NUMA case. So I added a second macro, set_dev_node for that. The patch below compiles and works on NUMA and non-NUMA platforms.

Hi Christoph, dev_to_node does not work as expected on x86_64 (and i386). This is because the node value returned by pcibus_to_node is initialized after a struct device is created with current x86_64 code. We need the node value initialized before the call to pci_scan_bus_parented, as the generic devices are allocated and initialized off pci_scan_child_bus, which gets called from pci_scan_bus_parented. The following patch does that using pci_sysdata introduced by the PCI domain patches in -mm.

Signed-off-by: Alok N Kataria [EMAIL PROTECTED]
Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.19-rc4mm2/arch/i386/pci/acpi.c
===
--- linux-2.6.19-rc4mm2.orig/arch/i386/pci/acpi.c 2006-11-06 11:03:50.0 -0800
+++ linux-2.6.19-rc4mm2/arch/i386/pci/acpi.c 2006-11-06 22:04:14.0 -0800
@@ -9,6 +9,7 @@ struct pci_bus * __devinit pci_acpi_scan
 {
     struct pci_bus *bus;
     struct pci_sysdata *sd;
+    int pxm;
 
     /* Allocate per-root-bus (not per bus) arch-specific data.
      * TODO: leak; this memory is never freed.
@@ -30,15 +31,21 @@ struct pci_bus * __devinit pci_acpi_scan
     }
 #endif /* CONFIG_PCI_DOMAINS */
 
+    sd->node = -1;
+
+    pxm = acpi_get_pxm(device->handle);
+#ifdef CONFIG_ACPI_NUMA
+    if (pxm >= 0)
+        sd->node = pxm_to_node(pxm);
+#endif
+
     bus = pci_scan_bus_parented(NULL, busnum, &pci_root_ops, sd);
     if (!bus)
         kfree(sd);
 
 #ifdef CONFIG_ACPI_NUMA
     if (bus != NULL) {
-        int pxm = acpi_get_pxm(device->handle);
         if (pxm >= 0) {
-            sd->node = pxm_to_node(pxm);
             printk("bus %d -> pxm %d -> node %d\n",
                 busnum, pxm, sd->node);
         }
TCP stack sometimes loses ACKs ... or something
I upgraded my notebook from 2.6.16 to 2.6.18 recently and noticed that I couldn't talk to my VOIP device (which has a WEB interface). Watching traffic I see the three-way-handshake working perfectly, and then the first data packet is sent (a partial HTTP request: GET / HTTP/1.1 ) and an ACK comes back from the device. Then the next data packet (remainder of the HTTP request) is sent, but tcpdump never sees the ACK, nor does the TCP stack. So the data gets resent repeatedly. No ack. Ever. With 2.6.16, the ack comes back just fine and the connection proceeds as you would expect.

As it was a very reproducible problem I decided to try git bisect and found bad: [7b4f4b5ebceab67ce440a61081a69f0265e17c2a] [TCP]: Set default max buffers from memory pool size

I double checked as this seemed a fairly unlikely patch to cause the problem, but this definitely is it. The net effect of this patch is to change the last of the three numbers in cat /proc/sys/net/ipv4/tcp_[rw]mem from well below 2^20 to well above. 2^20 seems to be a significant number. I set tcp_wmem to that and the ACK was lost. I set it to one less and the first ACK (at least) was accepted. I ended up setting both r and w to 10 and everything is fine.

Exploring more deeply, and comparing:
- a failing connection (to VOIP box, [rw]mem large)
- a working connection to VOIP box ([rw]mem small)
- a working connection to another machine ([rw]mem irrelevant).

I find:

The VOIP box returns MSS=1360 in the SYN/ACK packet. The other machine returns MSS=1460.

The ack that is getting lost contains data as well as the ACK, i.e. the same packet that ACKs at the TCP level includes the HTTP level reply. The matching ACK from the other machine (some Linux 2.6.8 I think) is a data-less ACK followed very quickly by the HTTP reply in a separate packet.

The 'Timestamps' option coming back from the VOIP box is a little odd. The Timestamp in the SYN/ACK is the same as the timestamp in the next ACK (the ack for the first partial HTTP request). The Timestamp in the next packet, which is the one that gets lost, has exactly the same TSval as previous packets, and TSecr is one more than in the previous packet.

I assume that one (or more) of these differences combined with the large tcp_[rw]mem value cause the packet loss, but I have no idea which. Help? I can make the tcp traces available if needed, but these are really the only non-trivial differences. I'm willing to test patches.

NeilBrown
Re: [PATCH 2.6.19-rc4-git10][PKT_SCHED] sch_htb: INIT_HLIST_NODE after hlist_del()
On Mon, Nov 06, 2006 at 09:44:49AM -0800, Stephen Hemminger wrote: On Mon, 6 Nov 2006 12:33:53 +0100 Jarek Poplawski [EMAIL PROTECTED] wrote:

After hlist_del() next and pprev pointers are not NULL so hlist_unhashed() doesn't work properly.

Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]
---
diff -Nurp linux-2.6.19-rc4-git10-/net/sched/sch_htb.c linux-2.6.19-rc4-git10/net/sched/sch_htb.c
--- linux-2.6.19-rc4-git10-/net/sched/sch_htb.c 2006-11-06 11:42:41.0 +0100
+++ linux-2.6.19-rc4-git10/net/sched/sch_htb.c 2006-11-06 11:53:15.0 +0100
@@ -1284,8 +1284,10 @@ static void htb_destroy_class(struct Qdi
                           struct htb_class, sibling));
 
     /* note: this delete may happen twice (see htb_delete) */
-    if (!hlist_unhashed(&cl->hlist))
+    if (!hlist_unhashed(&cl->hlist)) {
         hlist_del(&cl->hlist);
+        INIT_HLIST_NODE(&cl->hlist);
+    }

why not use hlist_del_init?

Yes, this is the question! As a matter of fact I expected another question. Yesterday I was short on time so I didn't describe the bug enough. I'm not sure if you know the problem, so here are more details (for me the problem is 199% repeatable). After something like this:

    # tc qdisc add dev lo root handle 1: htb
    # tc class add dev lo parent 1: classid 1:1 htb rate 200kbps
    # tc class del dev lo classid 1:1

enter the BUG... I've found the last command is the culprit and if you do:

    # tc qdisc del dev lo root

there is no problem. And probably it is enough to do the change only in htb_delete - btw. is this hlist_del really needed there? and shouldn't all deletions be done after zeroing the refcount? - but you should know better.

     list_del(&cl->sibling);
 
     if (cl->prio_activity)
@@ -1333,8 +1335,10 @@ static int htb_delete(struct Qdisc *sch,
     sch_tree_lock(sch);
 
     /* delete from hash and active; remainder in destroy_class */
-    if (!hlist_unhashed(&cl->hlist))
+    if (!hlist_unhashed(&cl->hlist)) {
         hlist_del(&cl->hlist);
+        INIT_HLIST_NODE(&cl->hlist);
+    }
 
     if (cl->prio_activity)
         htb_deactivate(q, cl);

Best regards, Jarek P.
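To see why hlist_unhashed() stops being reliable after a plain hlist_del(), here is a simplified user-space sketch of the hlist primitives. It is not the kernel's include/linux/list.h: the helper names hlist_unlink() and init_hlist_node() and the two poison constants are stand-ins chosen for illustration. The behaviour it models is the one described in the thread: hlist_del() leaves non-NULL pointers behind, while hlist_del_init() resets the node so that a second delete becomes a no-op.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

static void init_hlist_node(struct hlist_node *n)
{
    n->next = NULL;
    n->pprev = NULL;
}

/* A node counts as "unhashed" only when pprev is NULL. */
static int hlist_unhashed(const struct hlist_node *n)
{
    return n->pprev == NULL;
}

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    struct hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Unlink n from its chain without touching n's own pointers. */
static void hlist_unlink(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    *pprev = next;
    if (next)
        next->pprev = pprev;
}

/* Plain delete: unlink, then poison with non-NULL stand-in values. */
static void hlist_del(struct hlist_node *n)
{
    assert(n->pprev != NULL);
    hlist_unlink(n);
    n->next = (struct hlist_node *)(uintptr_t)0x100100;
    n->pprev = (struct hlist_node **)(uintptr_t)0x200200;
}

/* Delete and reinitialise: safe to call again on the same node. */
static void hlist_del_init(struct hlist_node *n)
{
    if (!hlist_unhashed(n)) {
        hlist_unlink(n);
        init_hlist_node(n);
    }
}
```

After hlist_del() the node still reports "hashed", so a guard of the form `if (!hlist_unhashed(...)) hlist_del(...)` would run a second time on poisoned pointers. With hlist_del_init() the guard is built in and the second call does nothing, which is what the question "why not use hlist_del_init?" is getting at.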
Re: [PATCH] don't use highmem in tcp hash size calculation
Thanks very much for catching this John, patch applied. Guess what? Nobody uses HASH_HIGHMEM after this change, and frankly I can't think of any valid use of it besides perhaps something such as a page cache hash table but that's irrelevant since we use a per-object tree data structure for that these days. We should probably kill off HASH_HIGHMEM.
Re: Zero checksum in netconsole/netdump packets
From: Chris Lalancette [EMAIL PROTECTED] Date: Mon, 06 Nov 2006 18:40:59 -0500 Assuming that this is just an oversight, attached is a simple patch to compute the UDP checksum in netpoll_send_udp. If the resulting checksum is zero, you should set it to all 1's, like the real UDP code does.
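The rule mentioned here comes from the UDP specification: a transmitted checksum of zero means "no checksum", so a computed value of zero is sent as 0xFFFF instead. The following user-space sketch shows that rule in isolation. It is not the kernel's csum_tcpudp_magic()/csum_partial() code; sum16() and udp_checksum() are illustrative names, and addresses are passed in host byte order purely to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add the 16-bit big-endian words of buf to acc (not yet folded). */
static uint32_t sum16(const uint8_t *buf, size_t len, uint32_t acc)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        acc += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (len & 1)                      /* odd length: pad with a zero byte */
        acc += (uint32_t)buf[len - 1] << 8;
    return acc;
}

/*
 * udp points at the UDP header plus payload, with the checksum field
 * already set to zero. saddr/daddr are IPv4 addresses in host order.
 */
static uint16_t udp_checksum(uint32_t saddr, uint32_t daddr,
                             const uint8_t *udp, uint16_t udp_len)
{
    uint32_t acc = 0;
    uint16_t csum;

    assert(udp != NULL && udp_len >= 8);

    /* Pseudo-header: source, destination, protocol (17), UDP length. */
    acc += (saddr >> 16) + (saddr & 0xffff);
    acc += (daddr >> 16) + (daddr & 0xffff);
    acc += 17;
    acc += udp_len;

    acc = sum16(udp, udp_len, acc);
    while (acc >> 16)                 /* fold carries back into 16 bits */
        acc = (acc & 0xffff) + (acc >> 16);

    csum = (uint16_t)~acc;
    return csum == 0 ? 0xffff : csum; /* zero is reserved for "no checksum" */
}
```

A datagram whose one's-complement sum happens to be 0xFFFF would otherwise produce a checksum field of zero, which the receiver would interpret as "checksum not computed"; the final line of udp_checksum() is the substitution the reply asks for.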
Re: TCP stack sometimes loses ACKs ... or something
Neil Brown wrote:

I upgraded my notebook from 2.6.16 to 2.6.18 recently and noticed that I couldn't talk to my VOIP device (which has a WEB interface). Watching traffic I see the three-way-handshake working perfectly, and then the first data packet is sent (a partial HTTP request: GET / HTTP/1.1 ) and an ACK comes back from the device. Then the next data packet (remainder of the HTTP request) is sent, but tcpdump never sees the ACK, nor does the TCP stack. So the data gets resent repeatedly. No ack. Ever. With 2.6.16, the ack comes back just fine and the connection proceeds as you would expect.

As it was a very reproducible problem I decided to try git bisect and found bad: [7b4f4b5ebceab67ce440a61081a69f0265e17c2a] [TCP]: Set default max buffers from memory pool size

I double checked as this seemed a fairly unlikely patch to cause the problem, but this definitely is it. The net effect of this patch is to change the last of the three numbers in cat /proc/sys/net/ipv4/tcp_[rw]mem from well below 2^20 to well above. 2^20 seems to be a significant number. I set tcp_wmem to that and the ACK was lost. I set it to one less and the first ACK (at least) was accepted. I ended up setting both r and w to 10 and everything is fine.

Exploring more deeply, and comparing:
- a failing connection (to VOIP box, [rw]mem large)
- a working connection to VOIP box ([rw]mem small)
- a working connection to another machine ([rw]mem irrelevant).

I find:

The VOIP box returns MSS=1360 in the SYN/ACK packet. The other machine returns MSS=1460.

The ack that is getting lost contains data as well as the ACK, i.e. the same packet that ACKs at the TCP level includes the HTTP level reply. The matching ACK from the other machine (some Linux 2.6.8 I think) is a data-less ACK followed very quickly by the HTTP reply in a separate packet.

The 'Timestamps' option coming back from the VOIP box is a little odd. The Timestamp in the SYN/ACK is the same as the timestamp in the next ACK (the ack for the first partial HTTP request). The Timestamp in the next packet, which is the one that gets lost, has exactly the same TSval as previous packets, and TSecr is one more than in the previous packet.

I assume that one (or more) of these differences combined with the large tcp_[rw]mem value cause the packet loss, but I have no idea which. Help? I can make the tcp traces available if needed, but these are really the only non-trivial differences. I'm willing to test patches.

NeilBrown

You almost certainly have a window scale corrupting firewall in your path. See http://lwn.net/Articles/92727/

2.6.18 increased the maximum window size, so it aggravated a pre-existing condition in your network. You can turn off window scaling globally (with sysctl) or use a per-route congestion window limit.

It could also be that the VOIP application is getting aggravated by TCP ABC. That can be turned off with sysctl (net.ipv4.tcp_abc=0)
Re: TCP stack sometimes loses ACKs ... or something
Window scaling... there is some intermediate device which is trying to prevent out of window segments from passing through, but it is not taking the negotiated window scale into account. So it thinks that segments are outside of the window, when they are not.
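The failure mode described here can be illustrated with a small C sketch. It is not taken from any firewall or from the kernel; scaled_window() and in_window() are illustrative names, and the numbers in the usage note are made up. The only protocol facts relied on are that the TCP header carries a 16-bit window field and that both ends shift it left by the scale factor negotiated in the SYN exchange (at most 14).

```c
#include <assert.h>
#include <stdint.h>

/* Real window: 16-bit header field shifted by the negotiated scale. */
static uint32_t scaled_window(uint16_t win_field, unsigned int wscale)
{
    assert(wscale <= 14);             /* largest shift the option allows */
    return (uint32_t)win_field << wscale;
}

/* Is seq inside [ack, ack + window)? Unsigned math handles wraparound. */
static int in_window(uint32_t seq, uint32_t ack, uint32_t window)
{
    return (uint32_t)(seq - ack) < window;
}
```

Suppose a receiver advertises a window field of 46 with a scale of 7, i.e. a real window of 5888 bytes. A segment starting 1000 bytes past the last acknowledged byte is well inside that window for both endpoints, but a middlebox that ignores the scale sees a window of only 46 bytes, classifies the segment as out of window and drops it. A larger default buffer makes the kernel negotiate a larger scale factor, which would make such a device misjudge more segments.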