date:20061106

Re: Pull request for 'jg-20061103-00' tag

2006-11-06 Thread Jeff Garzik


Francois Romieu wrote:

Please pull from tag 'jg-20061103-00' in repository

git://electric-eye.fr.zoreil.com/home/romieu/linux-2.6.git jg-20061103-00

to get the changes below.

Distance from 'upstream-fixes'
-

17fddc34b36fc26aa8b5f130fe32b446d9d88fa2

Diffstat


 drivers/net/r8169.c |   22 --
 1 files changed, 20 insertions(+), 2 deletions(-)

Shortlog


Francois Romieu:
  r8169: perform a PHY reset before any other operation at boot time


This warrants much more testing than pushing into 2.6.19-rc4 would give 
us, so I'm pulling it into #upstream.


In the past, with 10/100 hubs or ancient Cisco switches, we really 
didn't want to reset the phy and restart autonegotiation, because that 
might be problematic.


In any case, this is a behavior change that may solve problems... but 
also needs testing to ensure that it doesn't also cause problems.


Jeff


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 00/18] e1000: features, updates, documentation

2006-11-06 Thread Jeff Garzik


pulled.

still waiting on those changes to better modularize the feature 
detection, etc.




Re: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Benjamin Herrenschmidt


 This seems a bit ugly.  Could you add
 
   #define readq readq
 
 to your platform instead?

That's ugly too imho but I suppose I can do it :-)

 I generally think it's a bug in the kernel-wide API, if use of said API 
 requires arch-specific ifdefs.

Yes. I agree. In that specific case, I suppose what you propose is the
least ugly of the solutions. HAVE_ARCH_* is pretty much out of fashion
(and I tend to agree with Linus that it's not pretty anyway).

Actually, I tend to think in that specific case that the driver defining
something called readq and writeq based on a pair of readl's and
writel's is fairly bogus though.

 Or maybe the problem could be solved another way, by guaranteeing that a 
 good enough for drivers readq() and writeq() exist on all platforms, 
 even 32-bit platforms where the operation isn't inherently atomic.

I'd rather not provide readq/writeq if they aren't atomic.

Ben.



Re: [PATCH 2/2] forcedeth: add recoverable error support

2006-11-06 Thread Jeff Garzik


Ayaz Abdulla wrote:
This patch adds support to recover from a previously fatal MAC error. In 
the past the MAC would be hung on an internal fatal error. On new 
chipsets, the MAC has the ability to enter a non-fatal state and allow 
the driver to re-init it.


Signed-Off-By: Ayaz Abdulla [EMAIL PROTECTED]


applied



Re: [PATCH 1/2] forcedeth: add new NVIDIA pci ids

2006-11-06 Thread Jeff Garzik


Ayaz Abdulla wrote:

This patch adds pci device ids for the NVIDIA MCP67 chip.

Signed-Off-By: Ayaz Abdulla [EMAIL PROTECTED]


ACK, but please rediff and resend against netdev-2.6.git#upstream



Re: [PATCH 2.6.18] defxx: Big-endian hosts support

2006-11-06 Thread Jeff Garzik


Maciej W. Rozycki wrote:
 The PDQ DMA engine requires a different byte-swapping mode for big-endian 
hosts; also the MAC address which is read from a register through PIO has 
to be byte-swapped.  These changes have been verified with DEFPA-DC (PCI) 
boards and a Broadcom BCM91250A (MIPS CPU based) host.


Signed-off-by: Maciej W. Rozycki [EMAIL PROTECTED]


applied



Re: [PATCH 1/2] forcedeth: add mgmt unit support

2006-11-06 Thread Jeff Garzik


Ayaz Abdulla wrote:
This patch adds support for the mgmt unit in certain chipsets. The MAC 
and the mgmt unit share the PHY and therefore proper initialization 
procedures are needed for them to maintain coexistence.


Signed-Off-By: Ayaz Abdulla [EMAIL PROTECTED]


applied



Re: [PATCH 2/2] forcedeth: add support for new mcp67 device

2006-11-06 Thread Jeff Garzik


Ayaz Abdulla wrote:

This patch adds support for the new mcp67 device into forcedeth.

Signed-Off-By: Ayaz Abdulla [EMAIL PROTECTED]


ACK, but please rediff and resend against latest netdev-2.6.git#upstream



Re: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Jeff Garzik


Benjamin Herrenschmidt wrote:

This seems a bit ugly.  Could you add

#define readq readq

to your platform instead?


That's ugly too imho but I suppose I can do it :-)

I generally think it's a bug in the kernel-wide API, if use of said API 
requires arch-specific ifdefs.


Yes. I agree. In that specific case, I suppose what you propose is the
least ugly of the solutions. HAVE_ARCH_* is pretty much out of fashion
(and I tend to agree with Linus that it's not pretty anyway).

Actually, I tend to think in that specific case that the driver defining
something called readq and writeq based on a pair of readl's and
writel's is fairly bogus though.

Or maybe the problem could be solved another way, by guaranteeing that a 
good enough for drivers readq() and writeq() exist on all platforms, 
even 32-bit platforms where the operation isn't inherently atomic.


I'd rather not provide readq/writeq if they aren't atomic.


This is why I said good enough for drivers.  This is _key_.

I have run into several [PCI] devices with 64-bit registers, and 
__none__ of them had requirements such that the Linux platform code 
-must- provide an atomic readq/writeq.  Probably because everybody wants 
to support 32-bit platforms with their devices.


What you call fairly bogus is precisely what drivers need.  These 
devices with 64-bit registers just don't need the atomicity that arch 
developers harp about :)


Jeff



Re: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Benjamin Herrenschmidt

 This is why I said good enough for drivers.  This is _key_.
 
 I have run into several [PCI] devices with 64-bit registers, and 
 __none__ of them had requirements such that the Linux platform code 
 -must- provide an atomic readq/writeq.  Probably because everybody wants 
 to support 32-bit platforms with their devices.
 
 What you call fairly bogus is precisely what drivers need.  These 
 devices with 64-bit registers just don't need the atomicity that arch 
 developers harp about :)

Is there any consistency in that case in which half need to be
read/written first ? Or none of these ever had side effects ?

Ben.



[git patches] net driver fixes

2006-11-06 Thread Jeff Garzik


Please pull from 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git 
upstream-linus

to receive the following updates:

 drivers/net/Kconfig |4 ++--
 drivers/net/ehea/ehea.h |5 +
 drivers/net/ehea/ehea_ethtool.c |2 +-
 drivers/net/ehea/ehea_main.c|   26 +-
 drivers/net/ehea/ehea_phyp.c|2 +-
 drivers/net/ehea/ehea_phyp.h|6 --
 drivers/net/ehea/ehea_qmr.c |   17 +
 drivers/net/wireless/bcm43xx/bcm43xx_leds.c |7 ++-
 drivers/net/wireless/bcm43xx/bcm43xx_leds.h |6 ++
 drivers/net/wireless/bcm43xx/bcm43xx_main.c |   16 +++-
 drivers/net/wireless/hostap/hostap_plx.c|4 ++--
 net/ieee80211/ieee80211_rx.c|   12 ++--
 12 files changed, 66 insertions(+), 41 deletions(-)

Jiri Benc:
  ieee80211: don't flood log with errors

Larry Finger:
  bcm43xx: fix unexpected LED control values in BCM4303 sprom

Michael Buesch:
  bcm43xx: Fix low-traffic netdev watchdog TX timeouts

Pavel Roskin:
  hostap_plx: fix CIS verification

Randy Dunlap:
  Kconfig: remove redundant NETDEVICES depends

Thomas Klein:
  ehea: Nullpointer dereferencation fix
  ehea: Removed redundant define
  ehea: 64K page support fix

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 28c17d1..9cb3ca5 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -486,7 +486,7 @@ config SGI_IOC3_ETH_HW_TX_CSUM
 
 config MIPS_SIM_NET
 	tristate "MIPS simulator Network device (EXPERIMENTAL)"
-	depends on NETDEVICES && MIPS_SIM && EXPERIMENTAL
+	depends on MIPS_SIM && EXPERIMENTAL
 	help
 	  The MIPSNET device is a simple Ethernet network device which is
 	  emulated by the MIPS Simulator.
@@ -2467,7 +2467,7 @@ config ISERIES_VETH
 
 config RIONET
 	tristate "RapidIO Ethernet over messaging driver support"
-	depends on NETDEVICES && RAPIDIO
+	depends on RAPIDIO
 
 config RIONET_TX_SIZE
 	int "Number of outbound queue entries"
diff --git a/drivers/net/ehea/ehea.h b/drivers/net/ehea/ehea.h
index b40724f..39ad9f7 100644
--- a/drivers/net/ehea/ehea.h
+++ b/drivers/net/ehea/ehea.h
@@ -39,7 +39,7 @@ #include <asm/abs_addr.h>
 #include <asm/io.h>
 
 #define DRV_NAME	"ehea"
-#define DRV_VERSION	"EHEA_0034"
+#define DRV_VERSION	"EHEA_0043"
 
 #define EHEA_MSG_DEFAULT (NETIF_MSG_LINK | NETIF_MSG_TIMER \
 	| NETIF_MSG_RX_ERR | NETIF_MSG_TX_ERR)
@@ -105,9 +105,6 @@ #define EHEA_BCMC_TAGGED	0x00
 #define EHEA_BCMC_VLANID_ALL	0x01
 #define EHEA_BCMC_VLANID_SINGLE	0x00
 
-/* Use this define to kmallocate pHYP control blocks */
-#define H_CB_ALIGNMENT 4096
-
 #define EHEA_CACHE_LINE  128
 
 /* Memory Regions */
diff --git a/drivers/net/ehea/ehea_ethtool.c b/drivers/net/ehea/ehea_ethtool.c
index 82eb2fb..9f57c2e 100644
--- a/drivers/net/ehea/ehea_ethtool.c
+++ b/drivers/net/ehea/ehea_ethtool.c
@@ -238,7 +238,7 @@ static void ehea_get_ethtool_stats(struc
 	data[i++] = port->port_res[0].swqe_refill_th;
 	data[i++] = port->resets;
 
-	cb6 = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL);
+	cb6 = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!cb6) {
 		ehea_error("no mem for cb6");
 		return;
diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index 4538c99..6ad6961 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -92,7 +92,7 @@ static struct net_device_stats *ehea_get
 
 	memset(stats, 0, sizeof(*stats));
 
-	cb2 = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL);
+	cb2 = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!cb2) {
 		ehea_error("no mem for cb2");
 		goto out;
@@ -586,8 +586,8 @@ int ehea_sense_port_attr(struct ehea_por
 	u64 hret;
 	struct hcp_ehea_port_cb0 *cb0;
 
-	cb0 = kzalloc(H_CB_ALIGNMENT, GFP_ATOMIC);   /* May be called via */
-	if (!cb0) {                                  /* ehea_neq_tasklet() */
+	cb0 = kzalloc(PAGE_SIZE, GFP_ATOMIC);   /* May be called via */
+	if (!cb0) {                             /* ehea_neq_tasklet() */
 		ehea_error("no mem for cb0");
 		ret = -ENOMEM;
 		goto out;
@@ -670,7 +670,7 @@ int ehea_set_portspeed(struct ehea_port 
 	u64 hret;
 	int ret = 0;
 
-	cb4 = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL);
+	cb4 = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!cb4) {
 		ehea_error("no mem for cb4");
 		ret = -ENOMEM;
@@ -985,7 +985,7 @@ static int ehea_configure_port(struct eh
 	struct hcp_ehea_port_cb0 *cb0;
 
 	ret = -ENOMEM;
-	cb0 = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL);
+	cb0 = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!cb0)
 		goto out;
 
@@ -1443,7 +1443,7 @@ static

Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-06 Thread Zhao Xiaoming


2006/11/6, Eric Dumazet [EMAIL PROTECTED]:

We don't know. You might post some data so that we can have some ideas.

Also, this kind of question is better handled by the linux netdev mailing list,
so I added a CC to this list.

cat /proc/slabinfo
cat /proc/meminfo
cat /proc/net/sockstat
cat /proc/buddyinfo


TCP stack is one thing, but other things may consume ram on your kernel.

Also, kernel memory allocation might use twice the ram you intend to use
because of power of two alignments.

Are you using iptables connection tracking ?

If you plan to use a lot of RAM in kernel, why dont you use a 64 bits kernel,
so that all ram is available for kernel, not only 900 MB ?

Eric



Thank you again for your help. To have more detailed statistic data, I
did another round of test and gathered some data.  I give the overall
description here and detailed /proc/net/sockstat, /proc/meminfo,
/proc/slabinfo and /proc/buddyinfo follows.
=
                   slab mem cost   tcp mem pages   lowmem free
with traffic:      254668KB        34693           38772KB
without traffic:   104080KB        1               702652KB
=

detailed info:

during the test (with traffic):

[EMAIL PROTECTED] ~]# cat /proc/net/sockstat
sockets: used 12058
TCP: inuse 4007 orphan 0 tw 0 alloc 4010 mem 34693
UDP: inuse 4
RAW: inuse 0
FRAG: inuse 0 memory 0
[EMAIL PROTECTED] ~]# cat /proc/meminfo
MemTotal:  4136580 kB
MemFree:   3169160 kB
Buffers: 42092 kB
Cached:  20048 kB
SwapCached:  0 kB
Active: 146808 kB
Inactive:35492 kB
HighTotal: 3276160 kB
HighFree:  3130388 kB
LowTotal:   860420 kB
LowFree: 38772 kB
SwapTotal: 2031608 kB
SwapFree:  2031608 kB
Dirty:   0 kB
Writeback:   0 kB
Mapped: 127720 kB
Slab:   254668 kB
CommitLimit:   4099896 kB
Committed_AS:   367784 kB
PageTables:   1696 kB
VmallocTotal:   116728 kB
VmallocUsed:  3876 kB
VmallocChunk:   110548 kB
HugePages_Total: 0
HugePages_Free:  0
Hugepagesize: 2048 kB
[EMAIL PROTECTED] ~]# cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ip_conntrack_expect      0      0     92   42    1 : tunables  120   60    8 : slabdata      0      0      0
ip_conntrack          4049   4352    228   17    1 : tunables  120   60    8 : slabdata    256    256      0
bridge_fdb_cache         6     59     64   59    1 : tunables  120   60    8 : slabdata      1      1      0
fib6_nodes               7    113     32  113    1 : tunables  120   60    8 : slabdata      1      1      0
ip6_dst_cache           10     30    256   15    1 : tunables  120   60    8 : slabdata      2      2      0
ndisc_cache              1     20    192   20    1 : tunables  120   60    8 : slabdata      1      1      0
RAWv6                    7     10    768    5    1 : tunables   54   27    8 : slabdata      2      2      0
UDPv6                    0      0    704   11    2 : tunables   54   27    8 : slabdata      0      0      0
tw_sock_TCPv6            0      0    128   30    1 : tunables  120   60    8 : slabdata      0      0      0
request_sock_TCPv6       0      0    128   30    1 : tunables  120   60    8 : slabdata      0      0      0
TCPv6                    3      3   1344    3    1 : tunables   24   12    8 : slabdata      1      1      0
cifs_small_rq           30     36    448    9    1 : tunables   54   27    8 : slabdata      4      4      0
cifs_request             4      4  16512    1    8 : tunables    8    4    0 : slabdata      4      4      0
cifs_oplock_structs      0      0     32  113    1 : tunables  120   60    8 : slabdata      0      0      0
cifs_mpx_ids             3     59     64   59    1 : tunables  120   60    8 : slabdata      1      1      0
cifs_inode_cache         0      0    496    8    1 : tunables   54   27    8 : slabdata      0      0      0
rpc_buffers              8      8   2048    2    1 : tunables   24   12    8 : slabdata      4      4      0
rpc_tasks                8     20    192   20    1 : tunables  120   60    8 : slabdata      1      1      0
rpc_inode_cache          6      7    576    7    1 : tunables   54   27    8 : slabdata      1      1      0
ip_fib_alias             9    113     32  113    1 : tunables  120   60    8 : slabdata      1      1      0
ip_fib_hash              9    113     32  113    1 : tunables  120   60    8 : slabdata      1      1      0
uhci_urb_priv            0      0     40   92    1 : tunables  120   60    8 : slabdata      0      0      0
dm-snapshot-in         128    134     56   67    1 : tunables  120   60    8 : slabdata      2      2      0
dm-snapshot-ex           0      0     24  145    1 : tunables  120   60    8 : slabdata      0      0      0
ext3_inode_cache      8275  18378    640    6    1 : tunables

Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-06 Thread Zhao Xiaoming


2006/11/6, Eric Dumazet [EMAIL PROTECTED]:


Slab:   293952 kB
So 292 MB used by slab for 2000 sessions.

Expect 600 MB used by slab for 4000 sessions.

So your precious LOWMEM is not gone at all. It *IS* used by SLAB.

You forgot to send
cat /proc/slabinfo


sorry I didn't make myself clear enough. 2000 sessions means 4000
sockets, 2000 for the server, 2000 for the client.

Re: !! SPAM Suspect : SPAM-URL-DBL !! Re: (usagi-core 31424) Re: [PATCH 7/13] [RFC] [IPV6] Move source address selection into route lookup.

2006-11-06 Thread Jean-Mickael Guerin




The host testlab.linux-ipv6.org doesn't seem to be visible to the
outside world so could you post the results somewhere where I could take
a closer look at the results?


It is visible world-wide, assuming you have IPv6 connection.


With IPv4-only connection, one can try to append .ipv4.sixxs.org:

http://testlab.linux-ipv6.org.ipv4.sixxs.org/tahi-autorun.2/net-2.6_20061018/

Jean-Mickael



Re: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Jeff Garzik


Benjamin Herrenschmidt wrote:

This is why I said good enough for drivers.  This is _key_.

I have run into several [PCI] devices with 64-bit registers, and 
__none__ of them had requirements such that the Linux platform code 
-must- provide an atomic readq/writeq.  Probably because everybody wants 
to support 32-bit platforms with their devices.


What you call fairly bogus is precisely what drivers need.  These 
devices with 64-bit registers just don't need the atomicity that arch 
developers harp about :)


Is there any consistency in that case in which half need to be
read/written first ? Or none of these ever had side effects ?


Generally the kernel code should write the two 32-bit chunks to the 
memory-mapped region in order (low dword first), and let things take 
care of themselves from there.


That's pretty much the implementation that -every- driver copies, when 
they need readq/writeq to work on a 32-bit platform.
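
The split access described above can be sketched as a user-space model. This is an illustration only, not code from any driver in the thread: `model_readl()`, `model_writel()`, `split_readq()` and `split_writeq()` are hypothetical names, plain memory stands in for a memory-mapped region, and the register is assumed to be laid out with the low dword at the lower address.

```c
#include <stdint.h>

/* Hypothetical stand-ins for 32-bit MMIO accessors. In a driver these
 * would be readl()/writel() on an __iomem pointer; here they touch
 * ordinary memory so the sketch can run in user space. */
static inline uint32_t model_readl(const volatile void *addr)
{
    return *(const volatile uint32_t *)addr;
}

static inline void model_writel(uint32_t val, volatile void *addr)
{
    *(volatile uint32_t *)addr = val;
}

/* Non-atomic 64-bit read built from two 32-bit reads, low dword first. */
static inline uint64_t split_readq(const volatile void *addr)
{
    const volatile uint8_t *p = addr;
    uint32_t lo = model_readl(p);
    uint32_t hi = model_readl(p + 4);

    return ((uint64_t)hi << 32) | lo;
}

/* Non-atomic 64-bit write built from two 32-bit writes, low dword first. */
static inline void split_writeq(uint64_t val, volatile void *addr)
{
    volatile uint8_t *p = addr;

    model_writel((uint32_t)val, p);
    model_writel((uint32_t)(val >> 32), p + 4);
}
```

Another agent can observe the register between the two halves, so this only suits devices that tolerate a torn access, which is the "good enough for drivers" condition under discussion.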


Jeff




Re: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Linus Torvalds



On Mon, 6 Nov 2006, Jeff Garzik wrote:

 This seems a bit ugly.  Could you add
 
   #define readq readq
 
 to your platform instead?

Heartily agreed. MUCH better than adding unrelated #if defined() stuff, 
whether arch-related or otherwise.
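
A minimal sketch of why the self-referential define helps, assuming a platform that provides readq() as an inline function. Everything here is a user-space model: the function bodies and the `DRIVER_USES_FALLBACK_READQ` flag are hypothetical and exist only to show which branch the preprocessor takes.

```c
#include <stdint.h>

/* "Platform" side: readq() is a real function, so the preprocessor
 * cannot see it on its own. Defining a macro with the same name makes
 * its presence testable with #ifndef, and the macro expands to itself. */
static inline uint64_t readq(const volatile void *addr)
{
    return *(const volatile uint64_t *)addr;
}
#define readq readq

/* "Driver" side: supply a hand-rolled version only when the platform
 * did not announce one. No arch-specific #if is needed. */
#ifndef readq
static inline uint64_t readq(const volatile void *addr)
{
    const volatile uint32_t *p = addr;

    return ((uint64_t)p[1] << 32) | p[0];   /* low dword first */
}
#define DRIVER_USES_FALLBACK_READQ 1
#else
#define DRIVER_USES_FALLBACK_READQ 0
#endif
```

With the define in place the driver's fallback is compiled out; deleting the `#define readq readq` line would instead produce a duplicate definition, which is the situation the define is meant to avoid.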

Linus

Re: (usagi-core 31424) Re: [PATCH 7/13] [RFC] [IPV6] Move source address selection into route lookup.

2006-11-06 Thread Jean-Mickael Guerin


[ reposted, with better subject ]

http://testlab.linux-ipv6.org/tahi-autorun.2/net-2.6_20061018/

The host testlab.linux-ipv6.org doesn't seem to be visible to the
outside world so could you post the results somewhere where I could take
a closer look at the results?


It is visible world-wide, assuming you have IPv6 connection.

With IPv4-only connection, one can try to append .ipv4.sixxs.org:

http://testlab.linux-ipv6.org.ipv4.sixxs.org/tahi-autorun.2/net-2.6_20061018/ 



Jean-Mickael

Re: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Benjamin Herrenschmidt

On Mon, 2006-11-06 at 01:37 -0800, Linus Torvalds wrote:
 
 On Mon, 6 Nov 2006, Jeff Garzik wrote:
 
  This seems a bit ugly.  Could you add
  
  #define readq readq
  
  to your platform instead?
 
 Heartily agreed. MUCH better than adding unrelated #if defined() stuff, 
 whether arch-related or otherwise.

I agree it's less ugly, though I still don't like it much :-)

Anyway, what do you think of Jeff proposal to just implement them as two
32 bits operations ? My arch guy side screams at the idea, but if,
indeed, drivers generally cope fine with it, I suppose that's ok.

Ben.



Re: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Linus Torvalds



On Mon, 6 Nov 2006, Benjamin Herrenschmidt wrote:

 Anyway, what do you think of Jeff proposal to just implement them as two
 32 bits operations ? My arch guy side screams at the idea, but if,
 indeed, drivers generally cope fine with it, I suppose that's ok.

Last I saw, that's how normal PCI will split the IO anyway, so I guess it 
makes sense.

Linus

Re: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Benjamin Herrenschmidt

On Mon, 2006-11-06 at 01:50 -0800, Linus Torvalds wrote:
 
 On Mon, 6 Nov 2006, Benjamin Herrenschmidt wrote:
 
  Anyway, what do you think of Jeff proposal to just implement them as two
  32 bits operations ? My arch guy side screams at the idea, but if,
  indeed, drivers generally cope fine with it, I suppose that's ok.
 
 Last I saw, that's how normal PCI will split the IO anyway, so I guess it 
 makes sense.

Hrm.. true indeed. I'll implement them that way for ppc32 then.

Ben.



Re: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Jeff Garzik


Benjamin Herrenschmidt wrote:

On Mon, 2006-11-06 at 01:50 -0800, Linus Torvalds wrote:

On Mon, 6 Nov 2006, Benjamin Herrenschmidt wrote:

Anyway, what do you think of Jeff proposal to just implement them as two
32 bits operations ? My arch guy side screams at the idea, but if,
indeed, drivers generally cope fine with it, I suppose that's ok.
Last I saw, that's how normal PCI will split the IO anyway, so I guess it 
makes sense.


Hrm.. true indeed. I'll implement them that way for ppc32 then.


Bonus points if you want to find-and-kill where individual drivers did

#ifndef readq
implement readq and writeq by hand...
#endif

:)


Re: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Benjamin Herrenschmidt

On Mon, 2006-11-06 at 04:55 -0500, Jeff Garzik wrote:
 Benjamin Herrenschmidt wrote:
  On Mon, 2006-11-06 at 01:50 -0800, Linus Torvalds wrote:
  On Mon, 6 Nov 2006, Benjamin Herrenschmidt wrote:
  Anyway, what do you think of Jeff proposal to just implement them as two
  32 bits operations ? My arch guy side screams at the idea, but if,
  indeed, drivers generally cope fine with it, I suppose that's ok.
  Last I saw, that's how normal PCI will split the IO anyway, so I guess it 
  makes sense.
  
  Hrm.. true indeed. I'll implement them that way for ppc32 then.
 
 Bonus points if you want to find-and-kill where individual drivers did
 
   #ifndef readq
   implement readq and writeq by hand...
   #endif

Yes, well, we would have to make sure all archs have them defined
first though, but I suppose I can have a look later this week, maybe
tomorrow. Shouldn't be too hard :)

Ben.




[DECNET] Endian bug fixes

2006-11-06 Thread Steven Whitehouse

Hi,

Here is a patch which fixes some endianness problems. Patrick: since you
have both big & little endian machines at your disposal, can you test to
ensure this is ok? Thanks,

Steve.

From ed3de950e89f8b02302308a2bedd59123ff3b88e Mon Sep 17 00:00:00 2001
From: Steven Whitehouse [EMAIL PROTECTED]
Date: Mon, 6 Nov 2006 10:30:30 -0500
Subject: [PATCH] [DECNET] Endianess fixes

Here are some fixes to endianess problems spotted by Al Viro.

Cc: Al Viro [EMAIL PROTECTED]
Cc: Patrick Caulfield [EMAIL PROTECTED]
Signed-off-by: Steven Whitehouse [EMAIL PROTECTED]
---
 net/decnet/af_decnet.c |   21 ++---
 net/decnet/dn_rules.c  |4 ++--
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 3456cd3..37b4720 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -166,7 +166,7 @@ static struct hlist_head *dn_find_list(s
 	if (scp->addr.sdn_flags & SDF_WILD)
 		return hlist_empty(&dn_wild_sk) ? &dn_wild_sk : NULL;
 
-	return &dn_sk_hash[scp->addrloc & DN_SK_HASH_MASK];
+	return &dn_sk_hash[dn_ntohs(scp->addrloc) & DN_SK_HASH_MASK];
 }
 
 /* 
@@ -180,7 +180,7 @@ static int check_port(__le16 port)
 	if (port == 0)
 		return -1;
 
-	sk_for_each(sk, node, &dn_sk_hash[port & DN_SK_HASH_MASK]) {
+	sk_for_each(sk, node, &dn_sk_hash[dn_ntohs(port) & DN_SK_HASH_MASK]) {
 		struct dn_scp *scp = DN_SK(sk);
 		if (scp->addrloc == port)
 			return -1;
@@ -194,12 +194,12 @@ static unsigned short port_alloc(struct 
 	static unsigned short port = 0x2000;
 	unsigned short i_port = port;
 
-	while(check_port(++port) != 0) {
+	while(check_port(dn_htons(++port)) != 0) {
 		if (port == i_port)
 			return 0;
 	}
 
-	scp->addrloc = port;
+	scp->addrloc = dn_htons(port);
 
 	return 1;
 }
@@ -418,7 +418,7 @@ struct sock *dn_find_by_skb(struct sk_bu
 	struct dn_scp *scp;
 
 	read_lock(&dn_hash_lock);
-	sk_for_each(sk, node, &dn_sk_hash[cb->dst_port & DN_SK_HASH_MASK]) {
+	sk_for_each(sk, node, &dn_sk_hash[dn_ntohs(cb->dst_port) & DN_SK_HASH_MASK]) {
 		scp = DN_SK(sk);
 		if (cb->src != dn_saddr2dn(&scp->peer))
 			continue;
@@ -1016,13 +1016,12 @@ static void dn_access_copy(struct sk_buf
 
 static void dn_user_copy(struct sk_buff *skb, struct optdata_dn *opt)
 {
-        unsigned char *ptr = skb->data;
-
-        opt->opt_optl   = *ptr++;
-        opt->opt_status = 0;
-        memcpy(opt->opt_data, ptr, opt->opt_optl);
-        skb_pull(skb, dn_ntohs(opt->opt_optl) + 1);
+	unsigned char *ptr = skb->data;
 
+	opt->opt_optl   = dn_htons((__u16)*ptr++);
+	opt->opt_status = 0;
+	memcpy(opt->opt_data, ptr, dn_ntohs(opt->opt_optl));
+	skb_pull(skb, dn_ntohs(opt->opt_optl) + 1);
 }
 
 static struct sk_buff *dn_wait_for_connect(struct sock *sk, long *timeo)
diff --git a/net/decnet/dn_rules.c b/net/decnet/dn_rules.c
index 3e0c882..590e0a7 100644
--- a/net/decnet/dn_rules.c
+++ b/net/decnet/dn_rules.c
@@ -124,8 +124,8 @@ static struct nla_policy dn_fib_rule_pol
 static int dn_fib_rule_match(struct fib_rule *rule, struct flowi *fl, int flags)
 {
 	struct dn_fib_rule *r = (struct dn_fib_rule *)rule;
-	u16 daddr = fl->fld_dst;
-	u16 saddr = fl->fld_src;
+	__le16 daddr = fl->fld_dst;
+	__le16 saddr = fl->fld_src;
 
 	if (((saddr ^ r->src) & r->srcmask) ||
 	    ((daddr ^ r->dst) & r->dstmask))
-- 
1.4.1

Re: [DECNET] Endian bug fixes

2006-11-06 Thread Al Viro

On Mon, Nov 06, 2006 at 10:32:43AM +, Al Viro wrote:
 On Mon, Nov 06, 2006 at 10:31:02AM +, Steven Whitehouse wrote:
  +   opt->opt_optl   = dn_htons((__u16)*ptr++);
 
 Lose that cast; it's only confusing the things...
 
  +   memcpy(opt->opt_data, ptr, dn_ntohs(opt->opt_optl));
  +   skb_pull(skb, dn_ntohs(opt->opt_optl) + 1);
 
 ... and I'd actually do
 
   u16 len = *ptr++; /* yes, it's 8bit on the wire */
   opt->opt_optl   = dn_htons(len);
   BUG_ON(len > 16); /* we've checked the contents earlier */
   memcpy(opt->opt_data, ptr, len);
   skb_pull(skb, len + 1);

BTW, why the hell do we keep ->opt_optl __le16 internally?  If we ever
pass it to userland, fine, but let's convert to __le16 *then*...
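
The suggestion to keep the length in host order while working with it, and to convert only when storing, can be modelled in user space roughly as follows. This is a sketch under assumptions: `optdata_model`, `cpu_to_le16_model()` and `user_copy_model()` are hypothetical names, the 16-byte data limit is taken from the `BUG_ON` in the quoted snippet, and an error return replaces `BUG_ON` so the model can be exercised safely.

```c
#include <stdint.h>
#include <string.h>

/* Model of the option block: the length field is stored little-endian. */
struct optdata_model {
    uint16_t opt_optl;       /* little-endian 16-bit length */
    uint8_t  opt_data[16];
};

/* Portable host-to-little-endian conversion for a 16-bit value. */
static uint16_t cpu_to_le16_model(uint16_t v)
{
    uint8_t b[2] = { (uint8_t)(v & 0xff), (uint8_t)(v >> 8) };
    uint16_t out;

    memcpy(&out, b, sizeof(out));
    return out;
}

/* Parse one option: a single length byte followed by that many data
 * bytes. Returns the number of bytes consumed, or -1 if the option is
 * too long or the packet is too short. */
static int user_copy_model(const uint8_t *pkt, size_t pkt_len,
                           struct optdata_model *opt)
{
    uint16_t len;

    if (pkt_len < 1)
        return -1;
    len = pkt[0];                        /* 8 bits on the wire */
    if (len > sizeof(opt->opt_data) || pkt_len < (size_t)len + 1)
        return -1;

    opt->opt_optl = cpu_to_le16_model(len);   /* convert once, for storage */
    memcpy(opt->opt_data, pkt + 1, len);      /* host-order len for copying */
    return (int)len + 1;
}
```

Using the host-order local for `memcpy()` and the pull length avoids converting the stored field back and forth, which is where the original byte-order bug crept in.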

Re: [DECNET] Endian bug fixes

2006-11-06 Thread Al Viro

On Mon, Nov 06, 2006 at 10:31:02AM +, Steven Whitehouse wrote:
 + opt->opt_optl   = dn_htons((__u16)*ptr++);

Lose that cast; it's only confusing the things...

 + memcpy(opt->opt_data, ptr, dn_ntohs(opt->opt_optl));
 + skb_pull(skb, dn_ntohs(opt->opt_optl) + 1);

... and I'd actually do

u16 len = *ptr++; /* yes, it's 8bit on the wire */
opt->opt_optl   = dn_htons(len);
BUG_ON(len > 16); /* we've checked the contents earlier */
memcpy(opt->opt_data, ptr, len);
skb_pull(skb, len + 1);

Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-06 Thread Eric Dumazet

On Monday 06 November 2006 09:59, Zhao Xiaoming wrote:

 Thank you again for your help. To have more detailed statistic data, I
 did another round of test and gathered some data.  I give the overall
 description here and detailed /proc/net/sockstat, /proc/meminfo,
 /proc/slabinfo and /proc/buddyinfo follows.
 =
                    slab mem cost   tcp mem pages   lowmem free
 with traffic:      254668KB        34693           38772KB
 without traffic:   104080KB        1               702652KB
 =

Thank you for detailed infos.

It appears you make extensive use of threads (about 10000), since:

 task_struct        10095  10095   1360    3    1 : tunables   24   12    8 : slabdata   3365   3365      0

Each thread has a kernel stack, 8KB (i.e. 2 pages, an order-1 allocation), plus a 
user vma

 vm_area_struct     21346  21504     92   42    1 : tunables  120   60    8 : slabdata    512    512      0

Most likely you don't need that many threads. A program with fewer threads will 
perform better and use less RAM.
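
As a rough check of the arithmetic behind this estimate, the helpers below compute the per-thread kernel stack total and the memory held by a slab cache. They are a sketch with hypothetical names; the 4 KB page size and the one page per slab are assumptions read from the slabinfo rows quoted above, not figures stated in the thread.

```c
#include <stdint.h>

/* Total kernel stack memory in KB for a given number of threads. */
static uint64_t kernel_stack_kb(uint64_t threads, uint64_t stack_kb)
{
    return threads * stack_kb;
}

/* Memory held by one slab cache in KB: slabs * pages per slab * page size. */
static uint64_t slab_cache_kb(uint64_t num_slabs, uint64_t pages_per_slab,
                              uint64_t page_kb)
{
    return num_slabs * pages_per_slab * page_kb;
}
```

For example, `kernel_stack_kb(10095, 8)` covers the stacks for the 10095 tasks listed, and `slab_cache_kb(3365, 1, 4)` covers the task_struct cache itself.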


tg3_read_partno(): possible array overrun

2006-11-06 Thread Adrian Bunk

The Coverity checker noted the following in drivers/net/tg3.c:

<--  snip  -->

...
static void __devinit tg3_read_partno(struct tg3 *tp)
{
	unsigned char vpd_data[256];
	int i;
...
	/* Now parse and find the part number. */
	for (i = 0; i < 256; ) {
		unsigned char val = vpd_data[i];
		int block_end;

		if (val == 0x82 || val == 0x91) {
			i = (i + 3 +
			     (vpd_data[i + 1] +
			      (vpd_data[i + 2] << 8)));
			continue;
		}

		if (val != 0x90)
			goto out_not_found;

		block_end = (i + 3 +
			     (vpd_data[i + 1] +
			      (vpd_data[i + 2] << 8)));
		i += 3;
...

<--  snip  -->

The problem is that vpd_data[i + 2] could be vpd_data[255 + 2].
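
One way to avoid the overrun is to require that the tag byte and both length bytes lie inside the buffer before reading them. The following is a user-space sketch of that idea, not the fix that went into tg3.c: `find_vpd_ro_block()` is a hypothetical helper, and only the tag values (0x82, 0x91, 0x90) and the 3-byte block header are taken from the snippet above.

```c
#include <stddef.h>
#include <stdint.h>

/* Walk VPD blocks until a 0x90 block is found. Each block is a tag byte
 * followed by a 16-bit little-endian length and then the payload.
 * Returns the offset of the 0x90 block's payload and stores the offset
 * one past its end in *block_end, or returns -1 if no such block lies
 * entirely inside the buffer. */
static int find_vpd_ro_block(const uint8_t *vpd, size_t vpd_len,
                             size_t *block_end)
{
    size_t i = 0;

    /* The 3-byte header must be in range before any of it is read. */
    while (i + 3 <= vpd_len) {
        uint8_t tag = vpd[i];
        size_t len = (size_t)vpd[i + 1] | ((size_t)vpd[i + 2] << 8);
        size_t next = i + 3 + len;

        if (tag == 0x82 || tag == 0x91) {
            i = next;            /* skip this block */
            continue;
        }
        if (tag != 0x90)
            return -1;
        if (next > vpd_len)      /* payload would run past the buffer */
            return -1;

        *block_end = next;
        return (int)(i + 3);
    }
    return -1;
}
```

With this loop condition a tag at offset 254 or 255 of a 256-byte buffer ends the walk instead of reading `vpd[i + 2]` out of bounds.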

cu
Adrian

-- 

   Is there not promise of rain? Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   Only a promise, Lao Er said.
   Pearl S. Buck - Dragon Seed


Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-06 Thread Zhao Xiaoming


2006/11/6, Zhao Xiaoming [EMAIL PROTECTED]:

2006/11/6, Eric Dumazet [EMAIL PROTECTED]:
 On Monday 06 November 2006 09:59, Zhao Xiaoming wrote:

  Thank you again for your help. To have more detailed statistic data, I
  did another round of test and gathered some data.  I give the overall
  description here and detailed /proc/net/sockstat, /proc/meminfo,
  /proc/slabinfo and /proc/buddyinfo follows.
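  ================================================================
                     slab mem cost   tcp mem pages   lowmem free
  with traffic:      254668KB        34693           38772KB
  without traffic:   104080KB        1               702652KB
  ================================================================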

 Thank you for detailed infos.

 It appears you have an extensive use of threads (about 10000), since:

  task_struct 10095 10095 1360 3 1 : tunables 24 12 8 : slabdata 3365 3365 0

 Each thread has a kernel stack, 8KB (i.e. 2 pages, an order-1 allocation), plus a
 user vma:

  vm_area_struct 21346 21504 92 42 1 : tunables 120 60 8 : slabdata 512 512 0

 Most likely you don't need that many threads. A program with fewer threads will
 perform better and use less RAM.


Thanks for the comments. I know the threads may cost a lot of memory.
However, I already excluded them from the statistics. The 'after test'
info was gathered while the 10000 threads were running but no traffic
was being relayed. If you look at the meminfo of 'after test', there is
still 104080 kB of slab memory, which should already include the thread
kernel memory cost (8K*10000=80MB). I know 10000 threads are not
necessary; I just used the simplest logic for the test.


I also just tried 2500 threads. The results are the same.

Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-06 Thread Zhao Xiaoming


2006/11/6, Arjan van de Ven [EMAIL PROTECTED]:

On Mon, 2006-11-06 at 14:07 +0800, Zhao Xiaoming wrote:
 Dears,
 I'm running a linux box with kernel version 2.6.16. The hardware
 has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC cards is
 Intel 82571 on PCI-e bus.

are you using a 32 bit or a 64 bit OS?





Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-06 Thread Zhao Xiaoming


2006/11/6, Eric Dumazet [EMAIL PROTECTED]:

On Monday 06 November 2006 09:59, Zhao Xiaoming wrote:

 Thank you again for your help. To have more detailed statistic data, I
 did another round of test and gathered some data.  I give the overall
 description here and detailed /proc/net/sockstat, /proc/meminfo,
 /proc/slabinfo and /proc/buddyinfo follows.
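 ================================================================
                    slab mem cost   tcp mem pages   lowmem free
 with traffic:      254668KB        34693           38772KB
 without traffic:   104080KB        1               702652KB
 ================================================================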

Thank you for detailed infos.

It appears you have an extensive use of threads (about 10000), since:

 task_struct 10095 10095 1360 3 1 : tunables 24 12 8 : slabdata 3365 3365 0

Each thread has a kernel stack, 8KB (i.e. 2 pages, an order-1 allocation), plus a
user vma:

 vm_area_struct 21346 21504 92 42 1 : tunables 120 60 8 : slabdata 512 512 0

Most likely you don't need that many threads. A program with fewer threads will
perform better and use less RAM.



Thanks for the comments. I know the threads may cost a lot of memory.
However, I already excluded them from the statistics. The 'after test'
info was gathered while the 10000 threads were running but no traffic
was being relayed. If you look at the meminfo of 'after test', there is
still 104080 kB of slab memory, which should already include the thread
kernel memory cost (8K*10000=80MB). I know 10000 threads are not
necessary; I just used the simplest logic for the test.

Re: [PATCH] ieee80211softmac: fix verbosity when debug disabled

2006-11-06 Thread Johannes Berg

On Sat, 2006-11-04 at 13:29 -0600, Larry Finger wrote:
 SoftMAC contains a number of debug-type messages that continue to print
 even when debugging is turned off. This patch substitutes dprintkl for
 printkl for those lines.
 
 Signed-off-by: Larry Finger [EMAIL PROTECTED]

Fine with me.
Acked-by: Johannes Berg [EMAIL PROTECTED]


 Index: wireless-2.6/net/ieee80211/softmac/ieee80211softmac_auth.c
 ===================================================================
 --- wireless-2.6.orig/net/ieee80211/softmac/ieee80211softmac_auth.c
 +++ wireless-2.6/net/ieee80211/softmac/ieee80211softmac_auth.c
 @@ -158,7 +158,7 @@ ieee80211softmac_auth_resp(struct net_de
  	/* Make sure that we've got an auth queue item for this request */
  	if(aq == NULL)
  	{
 -		printkl(KERN_DEBUG PFX "Authentication response received from "MAC_FMT" but no queue item exists.\n", MAC_ARG(auth->header.addr2));
 +		dprintkl(KERN_DEBUG PFX "Authentication response received from "MAC_FMT" but no queue item exists.\n", MAC_ARG(auth->header.addr2));
  		/* Error #? */
  		return -1;
  	}
 @@ -166,7 +166,7 @@ ieee80211softmac_auth_resp(struct net_de
  	/* Check for out of order authentication */
  	if(!net->authenticating)
  	{
 -		printkl(KERN_DEBUG PFX "Authentication response received from "MAC_FMT" but did not request authentication.\n",MAC_ARG(auth->header.addr2));
 +		dprintkl(KERN_DEBUG PFX "Authentication response received from "MAC_FMT" but did not request authentication.\n",MAC_ARG(auth->header.addr2));
  		return -1;
  	}
  
 @@ -342,7 +342,7 @@ ieee80211softmac_deauth_req(struct ieee8
  	/* Make sure the network is authenticated */
  	if (!net->authenticated)
  	{
 -		printkl(KERN_DEBUG PFX "Can't send deauthentication packet, network is not authenticated.\n");
 +		dprintkl(KERN_DEBUG PFX "Can't send deauthentication packet, network is not authenticated.\n");
  		/* Error okay? */
  		return -EPERM;
  	}
 @@ -376,7 +376,7 @@ ieee80211softmac_deauth_resp(struct net_
  	net = ieee80211softmac_get_network_by_bssid(mac, deauth->header.addr2);
  	
  	if (net == NULL) {
 -		printkl(KERN_DEBUG PFX "Received deauthentication packet from "MAC_FMT", but that network is unknown.\n",
 +		dprintkl(KERN_DEBUG PFX "Received deauthentication packet from "MAC_FMT", but that network is unknown.\n",
  			MAC_ARG(deauth->header.addr2));
  		return 0;
  	}
 @@ -384,7 +384,7 @@ ieee80211softmac_deauth_resp(struct net_
  	/* Make sure the network is authenticated */
  	if(!net->authenticated)
  	{
 -		printkl(KERN_DEBUG PFX "Can't perform deauthentication, network is not authenticated.\n");
 +		dprintkl(KERN_DEBUG PFX "Can't perform deauthentication, network is not authenticated.\n");
  		/* Error okay? */
  		return -EPERM;
  	}
 

Source address selection + multicast

2006-11-06 Thread Pierre Ynard

Hello,

Preferred source address selection in the routing table (src field)
currently does not work properly with multicast destination addresses:
it causes packets to be routed through the wrong network device (see
http://bugzilla.kernel.org/show_bug.cgi?id=7398).

It seems to me that the main reason for this is compatibility with
old multicast applications, and I can see no fundamental reason
preventing the use of these two features together.

Why not find a way to let them coexist?

What about a sysctl option that lets users who really want to do so
disable the compatibility hack and restore normal behavior? I am thinking
about something like the patch below. Or does another simple way to
do it come to mind?

What do you think about it?

diff -urNp linux-2.6.18/Documentation/filesystems/proc.txt 
linux-2.6.18/Documentation/filesystems/proc.txt
--- linux-2.6.18/Documentation/filesystems/proc.txt 2006-09-19 
20:42:06.0 -0700
+++ linux-2.6.18/Documentation/filesystems/proc.txt 2006-10-26 
05:13:15.0 -0700
@@ -1758,6 +1758,15 @@ max_delay, min_delay

 Delays for flushing the routing cache.

+mc_src_strict
+-------------
+
+There is a hack in the kernel router which provides compatibility for old
+multicast applications such as vic, vat and friends. Unfortunately, this
+hack also breaks normal behavior of preferred source address selection
+(iproute2 src field) with multicast and limited broadcast. Enabling this
+option disables this hack and restores normal (strict) behavior.
+
 redirect_load, redirect_number
 ------------------------------

diff -urNp linux-2.6.18/include/linux/sysctl.h 
linux-2.6.18/include/linux/sysctl.h
--- linux-2.6.18/include/linux/sysctl.h 2006-09-19 20:42:06.0 -0700
+++ linux-2.6.18/include/linux/sysctl.h 2006-10-26 04:25:00.0 -0700
@@ -433,6 +433,7 @@ enum {
NET_IPV4_ROUTE_MIN_ADVMSS=17,
NET_IPV4_ROUTE_SECRET_INTERVAL=18,
NET_IPV4_ROUTE_GC_MIN_INTERVAL_MS=19,
+   NET_IPV4_ROUTE_MC_SRC_STRICT=20,
 };

 enum
diff -urNp linux-2.6.18/net/ipv4/route.c linux-2.6.18/net/ipv4/route.c
--- linux-2.6.18/net/ipv4/route.c   2006-09-19 20:42:06.0 -0700
+++ linux-2.6.18/net/ipv4/route.c   2006-10-26 05:11:00.0 -0700
@@ -132,6 +132,7 @@ static int ip_rt_mtu_expires= 10 * 60 
 static int ip_rt_min_pmtu  = 512 + 20 + 20;
 static int ip_rt_min_advmss= 256;
 static int ip_rt_secret_interval   = 10 * 60 * HZ;
+static int ip_rt_mc_src_strict = 0;
 static unsigned long rt_deadline;

 #define RTprint(a...)  printk(KERN_DEBUG a)
@@ -2416,7 +2417,7 @@ static int ip_route_output_slow(struct r
  of another iface. --ANK
 */

-	if (oldflp->oif == 0
+	if (!ip_rt_mc_src_strict && oldflp->oif == 0
 	    && (MULTICAST(oldflp->fl4_dst) || oldflp->fl4_dst == 0xFFFFFFFF)) {
/* Special hack: user can direct multicasts
   and limited broadcast via necessary interface
@@ -2431,6 +2432,12 @@ static int ip_route_output_slow(struct r
   cannot know, that ttl is zero, so that packet
   will not leave this host and route is valid).
   Luckily, this hack is good workaround.
+
+  Unfortunately, it also breaks normal behavior of
+  source address preference, so I added a sysctl option
+  to let the user disable this hack and restore normal
+  behavior. By default, the hack is still enabled (old
+  compatibility behavior). -- PY
 */

fl.oif = dev_out-ifindex;
@@ -3057,6 +3064,15 @@ ctl_table ipv4_route_table[] = {
.proc_handler   = proc_dointvec_jiffies,
.strategy   = sysctl_jiffies,
},
+   {
+   .ctl_name   = NET_IPV4_ROUTE_MC_SRC_STRICT,
+		.procname	= "mc_src_strict",
+		.data		= &ip_rt_mc_src_strict,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = ipv4_doint_and_flush,
+   .strategy   = ipv4_doint_and_flush_strategy,
+   },
{ .ctl_name = 0 }
 };
 #endif
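
The effect of the proposed switch on the condition in
ip_route_output_slow() can be sketched in user space as follows. This is
a simplified illustration, assuming host-byte-order addresses; the names
mc_src_strict, is_multicast() and use_compat_hack() are hypothetical
stand-ins, not kernel symbols.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed sysctl; 0 keeps the old hack. */
int mc_src_strict = 0;

/* True for 224.0.0.0/4; addr is in host byte order here for simplicity. */
bool is_multicast(uint32_t addr)
{
    return (addr & 0xF0000000u) == 0xE0000000u;
}

/*
 * Decide whether the compatibility path (output device chosen from the
 * source address) applies. Mirrors the patched condition, simplified.
 */
bool use_compat_hack(int oif, uint32_t dst)
{
    return !mc_src_strict && oif == 0 &&
           (is_multicast(dst) || dst == 0xFFFFFFFFu);
}
```

With mc_src_strict left at 0 the predicate matches the existing
behavior; setting it to 1 makes multicast and limited-broadcast
destinations fall through to the normal route lookup.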

-- 
Pierre Ynard









[PATCH 2.6.19-rc4-git10][PKT_SCHED] sch_htb: INIT_HLIST_NODE after hlist_del()

2006-11-06 Thread Jarek Poplawski

After hlist_del() next and pprev pointers are not NULL
so hlist_unhashed() doesn't work properly.


Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]
---


diff -Nurp linux-2.6.19-rc4-git10-/net/sched/sch_htb.c linux-2.6.19-rc4-git10/net/sched/sch_htb.c
--- linux-2.6.19-rc4-git10-/net/sched/sch_htb.c	2006-11-06 11:42:41.0 +0100
+++ linux-2.6.19-rc4-git10/net/sched/sch_htb.c	2006-11-06 11:53:15.0 +0100
@@ -1284,8 +1284,10 @@ static void htb_destroy_class(struct Qdi
 						  struct htb_class, sibling));
 
 	/* note: this delete may happen twice (see htb_delete) */
-	if (!hlist_unhashed(&cl->hlist))
+	if (!hlist_unhashed(&cl->hlist)) {
 		hlist_del(&cl->hlist);
+		INIT_HLIST_NODE(&cl->hlist);
+	}
 	list_del(&cl->sibling);
 
 	if (cl->prio_activity)
@@ -1333,8 +1335,10 @@ static int htb_delete(struct Qdisc *sch,
 	sch_tree_lock(sch);
 
 	/* delete from hash and active; remainder in destroy_class */
-	if (!hlist_unhashed(&cl->hlist))
+	if (!hlist_unhashed(&cl->hlist)) {
 		hlist_del(&cl->hlist);
+		INIT_HLIST_NODE(&cl->hlist);
+	}
 
 	if (cl->prio_activity)
 		htb_deactivate(q, cl);
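
Why the second delete misbehaves can be shown with a small user-space
imitation of the kernel's hlist. This is a sketch, not the kernel's
list.h: the poison constants are arbitrary non-NULL stand-ins, and the
function bodies are simplified versions of their kernel namesakes.

```c
#include <stdbool.h>
#include <stddef.h>

struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* Stand-ins for the kernel's poison values (non-NULL on purpose). */
#define POISON1 ((struct hlist_node *)0x100)
#define POISON2 ((struct hlist_node **)0x200)

void init_hlist_node(struct hlist_node *n)
{
    n->next = NULL;
    n->pprev = NULL;
}

/* A node counts as unhashed only when pprev is NULL. */
bool hlist_unhashed(const struct hlist_node *n)
{
    return !n->pprev;
}

void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Unlink and poison: afterwards hlist_unhashed() still returns false. */
void hlist_del(struct hlist_node *n)
{
    *n->pprev = n->next;
    if (n->next)
        n->next->pprev = n->pprev;
    n->next = POISON1;
    n->pprev = POISON2;
}

/* Unlink and reinitialise: a repeated call is a harmless no-op. */
void hlist_del_init(struct hlist_node *n)
{
    if (!hlist_unhashed(n)) {
        hlist_del(n);
        init_hlist_node(n);
    }
}
```

After a plain hlist_del() the node still looks hashed, so a guarded
second delete would dereference the poison pointer; reinitialising the
node, as the patch does, makes the guard effective.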

Re: [DECNET] Endian bug fixes

2006-11-06 Thread Steven Whitehouse

Hi,

On Mon, 2006-11-06 at 10:32 +0000, Al Viro wrote:
 On Mon, Nov 06, 2006 at 10:31:02AM +0000, Steven Whitehouse wrote:
  +   opt->opt_optl   = dn_htons((__u16)*ptr++);
 
 Lose that cast; it's only confusing the things...
 
  +   memcpy(opt->opt_data, ptr, dn_ntohs(opt->opt_optl));
  +   skb_pull(skb, dn_ntohs(opt->opt_optl) + 1);
 
 ... and I'd actually do
 
   u16 len = *ptr++; /* yes, it's 8bit on the wire */
   opt->opt_optl   = dn_htons(len);
   BUG_ON(len > 16); /* we've checked the contents earlier */
   memcpy(opt->opt_data, ptr, len);
   skb_pull(skb, len + 1);

OK, and I've made the same change in the other places too, as far
as it's relevant in those cases. New patch attached.

Steve.

From a184f89a13fa292589f309057cc0775a8256a89e Mon Sep 17 00:00:00 2001
From: Steven Whitehouse [EMAIL PROTECTED]
Date: Mon, 6 Nov 2006 11:51:00 -0500
Subject: [DECNET] Endianess fixes (try #2)

Here are some fixes to endianess problems spotted by Al Viro.

Cc: Al Viro [EMAIL PROTECTED]
Cc: Patrick Caulfield [EMAIL PROTECTED]
Signed-off-by: Steven Whitehouse [EMAIL PROTECTED]
---
 net/decnet/af_decnet.c  |   25 +++++++++++++------------
 net/decnet/dn_nsp_in.c  |    8 ++++----
 net/decnet/dn_nsp_out.c |    2 +-
 net/decnet/dn_rules.c   |    4 ++--
 4 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 3456cd3..21f20f2 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -166,7 +166,7 @@ static struct hlist_head *dn_find_list(s
 	if (scp->addr.sdn_flags & SDF_WILD)
 		return hlist_empty(&dn_wild_sk) ? &dn_wild_sk : NULL;
 
-	return &dn_sk_hash[scp->addrloc & DN_SK_HASH_MASK];
+	return &dn_sk_hash[dn_ntohs(scp->addrloc) & DN_SK_HASH_MASK];
 }
 
 /* 
@@ -180,7 +180,7 @@ static int check_port(__le16 port)
 	if (port == 0)
 		return -1;
 
-	sk_for_each(sk, node, &dn_sk_hash[port & DN_SK_HASH_MASK]) {
+	sk_for_each(sk, node, &dn_sk_hash[dn_ntohs(port) & DN_SK_HASH_MASK]) {
 		struct dn_scp *scp = DN_SK(sk);
 		if (scp->addrloc == port)
 			return -1;
@@ -194,12 +194,12 @@ static unsigned short port_alloc(struct 
 	static unsigned short port = 0x2000;
 	unsigned short i_port = port;
 
-	while(check_port(++port) != 0) {
+	while(check_port(dn_htons(++port)) != 0) {
 		if (port == i_port)
 			return 0;
 	}
 
-	scp->addrloc = port;
+	scp->addrloc = dn_htons(port);
 
 	return 1;
 }
@@ -418,7 +418,7 @@ struct sock *dn_find_by_skb(struct sk_bu
 	struct dn_scp *scp;
 
 	read_lock(&dn_hash_lock);
-	sk_for_each(sk, node, &dn_sk_hash[cb->dst_port & DN_SK_HASH_MASK]) {
+	sk_for_each(sk, node, &dn_sk_hash[dn_ntohs(cb->dst_port) & DN_SK_HASH_MASK]) {
 		scp = DN_SK(sk);
 		if (cb->src != dn_saddr2dn(&scp->peer))
 			continue;
@@ -1016,13 +1016,14 @@ static void dn_access_copy(struct sk_buf
 
 static void dn_user_copy(struct sk_buff *skb, struct optdata_dn *opt)
 {
-	unsigned char *ptr = skb->data;
-
-	opt->opt_optl   = *ptr++;
-	opt->opt_status = 0;
-	memcpy(opt->opt_data, ptr, opt->opt_optl);
-	skb_pull(skb, dn_ntohs(opt->opt_optl) + 1);
-
+	unsigned char *ptr = skb->data;
+	u16 len = *ptr++; /* yes, it's 8bit on the wire */
+
+	BUG_ON(len > 16); /* we've checked the contents earlier */
+	opt->opt_optl   = dn_htons(len);
+	opt->opt_status = 0;
+	memcpy(opt->opt_data, ptr, len);
+	skb_pull(skb, len + 1);
 }
 
 static struct sk_buff *dn_wait_for_connect(struct sock *sk, long *timeo)
diff --git a/net/decnet/dn_nsp_in.c b/net/decnet/dn_nsp_in.c
index 72ecc6e..7683d4f 100644
--- a/net/decnet/dn_nsp_in.c
+++ b/net/decnet/dn_nsp_in.c
@@ -360,9 +360,9 @@ static void dn_nsp_conn_conf(struct sock
 		scp->max_window = decnet_no_fc_max_cwnd;
 
 	if (skb->len > 0) {
-		unsigned char dlen = *skb->data;
+		u16 dlen = *skb->data;
 		if ((dlen <= 16) && (dlen <= skb->len)) {
-			scp->conndata_in.opt_optl = dn_htons((__u16)dlen);
+			scp->conndata_in.opt_optl = dn_htons(dlen);
 			memcpy(scp->conndata_in.opt_data, skb->data + 1, dlen);
 		}
 	}
@@ -404,9 +404,9 @@ static void dn_nsp_disc_init(struct sock
 	memset(scp->discdata_in.opt_data, 0, 16);
 
 	if (skb->len > 0) {
-		unsigned char dlen = *skb->data;
+		u16 dlen = *skb->data;
 		if ((dlen <= 16) && (dlen <= skb->len)) {
-			scp->discdata_in.opt_optl = dn_htons((__u16)dlen);
+			scp->discdata_in.opt_optl = dn_htons(dlen);
 			memcpy(scp->discdata_in.opt_data, skb->data + 1, dlen);
 		}
 	}
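
The idea behind the dn_user_copy() change is to keep the length in CPU
byte order for all arithmetic and convert only when storing the
little-endian field. A user-space sketch of that pattern follows; the
type le16_t and every function name ending in _sketch are hypothetical
stand-ins, not the DECnet helpers.

```c
#include <stdint.h>
#include <string.h>

/* A 16-bit value stored little-endian, byte by byte, on any host. */
typedef struct { uint8_t b[2]; } le16_t;

le16_t cpu_to_le16_sketch(uint16_t v)
{
    le16_t r = { { (uint8_t)(v & 0xFF), (uint8_t)(v >> 8) } };
    return r;
}

uint16_t le16_to_cpu_sketch(le16_t v)
{
    return (uint16_t)(v.b[0] | (v.b[1] << 8));
}

struct optdata_sketch {
    le16_t  opt_optl;       /* stored little-endian, as in the patch */
    uint8_t opt_data[16];
};

/*
 * Parse <length byte><payload>. Returns the number of bytes consumed,
 * or -1 if the length exceeds 16 or the buffer is too short.
 */
int user_copy_sketch(const uint8_t *buf, size_t buflen,
                     struct optdata_sketch *opt)
{
    uint16_t len;

    if (buflen < 1)
        return -1;
    len = buf[0];                           /* 8 bits on the wire */
    if (len > 16 || (size_t)len + 1 > buflen)
        return -1;
    opt->opt_optl = cpu_to_le16_sketch(len);
    memcpy(opt->opt_data, buf + 1, len);    /* CPU-order length here */
    return len + 1;
}
```

Unlike the kernel code, which uses BUG_ON() because the length was
validated earlier, this sketch returns an error for an oversized length.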

Re: [DECNET] Endian bug fixes

2006-11-06 Thread Steven Whitehouse

Hi,

On Mon, 2006-11-06 at 10:34 +0000, Al Viro wrote:
 On Mon, Nov 06, 2006 at 10:32:43AM +0000, Al Viro wrote:
  On Mon, Nov 06, 2006 at 10:31:02AM +0000, Steven Whitehouse wrote:
   + opt->opt_optl   = dn_htons((__u16)*ptr++);
  
  Lose that cast; it's only confusing the things...
  
   + memcpy(opt->opt_data, ptr, dn_ntohs(opt->opt_optl));
   + skb_pull(skb, dn_ntohs(opt->opt_optl) + 1);
  
  ... and I'd actually do
  
  u16 len = *ptr++; /* yes, it's 8bit on the wire */
  opt->opt_optl   = dn_htons(len);
  BUG_ON(len > 16); /* we've checked the contents earlier */
  memcpy(opt->opt_data, ptr, len);
  skb_pull(skb, len + 1);
 
 BTW, why the hell do we keep ->opt_optl __le16 internally?  If we ever
 pass it to userland, fine, but let's convert to __le16 *then*...

Really the only thing that we do with this data is verify it and pass it
to userland. Keeping it in wire order means getsockopt() is simpler: it
can just use copy_to_user() with a ptr & len, depending on which of the
structures the user has requested, rather than having to convert each
field of each structure.

I'm not sure it's worth changing it now to save just one byte per
socket in this case.

Steve.



Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-06 Thread Zhao Xiaoming


32-bit. Of course a 64-bit kernel can help me overcome the 900M
barrier. However, if I can't find the reason why so much memory is
getting 'lost', it will be difficult to support more heavily loaded
concurrent TCP connections.


2006/11/6, Arjan van de Ven [EMAIL PROTECTED]:
 On Mon, 2006-11-06 at 14:07 +0800, Zhao Xiaoming wrote:
  Dears,
  I'm running a linux box with kernel version 2.6.16. The hardware
  has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC cards is
  Intel 82571 on PCI-e bus.

 are you using a 32 bit or a 64 bit OS?






Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-06 Thread Eric Dumazet

On Monday 06 November 2006 10:46, Zhao Xiaoming wrote:
 2006/11/6, Eric Dumazet [EMAIL PROTECTED]:
  On Monday 06 November 2006 09:59, Zhao Xiaoming wrote:
   Thank you again for your help. To have more detailed statistic data, I
   did another round of test and gathered some data.  I give the overall
   description here and detailed /proc/net/sockstat, /proc/meminfo,
   /proc/slabinfo and /proc/buddyinfo follows.
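   ================================================================
                      slab mem cost   tcp mem pages   lowmem free
   with traffic:      254668KB        34693           38772KB
   without traffic:   104080KB        1               702652KB
   ================================================================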
 
  Thank you for detailed infos.
 
  It appears you have an extensive use of threads (about 10000), since:

   task_struct 10095 10095 1360 3 1 : tunables 24 12 8 : slabdata 3365 3365 0
 
  Each thread has a kernel stack, 8KB (i.e. 2 pages, an order-1 allocation),
  plus a user vma:
 
   vm_area_struct 21346 21504 92 42 1 : tunables 120 60 8 : slabdata 512 512 0
 
  Most likely you don't need that many threads. A program with fewer threads
  will perform better and use less RAM.

 Thanks for the comments. I know the threads may cost a lot of memory.
 However, I already excluded them from the statistics. The 'after test'
 info was gathered while the 10000 threads were running but no traffic
 was being relayed. If you look at the meminfo of 'after test', there is
 still 104080 kB of slab memory, which should already include the thread
 kernel memory cost (8K*10000=80MB). I know 10000 threads are not
 necessary; I just used the simplest logic for the test.

In fact, your kernel has CONFIG_4KSTACKS, so kernel thread stacks use 4K
instead of 8K.
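
The difference between the two stack sizes can be worked out directly.
The helper below is only an illustrative sketch (the name
kernel_stack_bytes() is made up); the thread count of 10095 is taken
from the task_struct line quoted above.

```c
#include <stdint.h>

/*
 * Kernel stack memory pinned by `threads` threads, given the per-thread
 * stack size in KiB (8 in the original estimate, 4 with CONFIG_4KSTACKS).
 */
uint64_t kernel_stack_bytes(uint64_t threads, uint64_t stack_kib)
{
    return threads * stack_kib * 1024;
}
```

For 10095 threads this gives 82698240 bytes (about 79 MiB) with 8K
stacks and 41349120 bytes (about 39 MiB) with 4K stacks.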

If you want to increase LOWMEM (and keep a 32-bit kernel), you can choose a
2G/2G user/kernel split instead of the default 3G/1G split
(see config: CONFIG_VMSPLIT_2G).

Eric

Re: pktgen patch available for perusal.

2006-11-06 Thread Robert Olsson


jamal writes:

  If you are listening then start with:
  
  1) Do a simple test with just udp traffic as above, doing simple
  accounting. This helps you to get a feel on how things work.
  2) modify the matching rules to match your magic cookie
  3) write a simple action invoked by your matching rules and use tc to
  add it to the policy.
  4) integrate in your app now that you know what you are doing.

 Yes. Sounds like a simple and general solution. No callbacks, no #ifdefs,
 no extra modules. Just a little recipe in pktgen.txt.

 Cheers.
--ro
 

Re: pktgen patch available for perusal.

2006-11-06 Thread Robert Olsson


Ben Greear writes:
   Changes:
   * use a nano-second timer based on the scheduler timer (TSC) for relative 
   times, instead of get_time_of_day.

It seems I forgot to set tsc as the clocksource. It makes a difference.
Performance is normal and I'm less confused.

e1000 82546GB @ 1.6 GHz Opteron. Kernel 2.6.19-rc1_Bifrost_-g18e199c6-dirty 

echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource

psize     pps
-------------
   60  556333
  124  526942
  252  452981
  508  234996
 1020  119748
 1496   82248

echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource

psize     pps
-------------
   60  819914
  124  747286
  252  452975
  508  234993
 1020  119749
 1496   82247

Cheers. 
--ro

Re: [PATCH 00/18] e1000: features, updates, documentation

2006-11-06 Thread Auke Kok


Jeff Garzik wrote:

pulled.

still waiting on those changes to better modularize the feature 
detection, etc.


That will start coming in early January, I think. We're currently validating all
silicon that the code supports against the old and new code, and that is going to
take quite some time to finish!


Cheers,

Auke

[RFC: 2.6 patch] bcm43xx_sprom_write(): add error checks

2006-11-06 Thread Adrian Bunk

The Coverity checker noted that these if (err)'s couldn't ever be 
true.

It seems the intention was to check the return values of the 
bcm43xx_pci_write_config32()'s?

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

---

 drivers/net/wireless/bcm43xx/bcm43xx_main.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c.old	2006-11-06 14:45:47.0 +0100
+++ linux-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c	2006-11-06 14:46:53.0 +0100
@@ -737,47 +737,47 @@ int bcm43xx_sprom_write(struct bcm43xx_p
 	crc = bcm43xx_sprom_crc(sprom);
 	expected_crc = (sprom[BCM43xx_SPROM_VERSION] & 0xFF00) >> 8;
 	if (crc != expected_crc) {
 		printk(KERN_ERR PFX "SPROM input data: Invalid CRC\n");
 		return -EINVAL;
 	}
 
 	printk(KERN_INFO PFX "Writing SPROM. Do NOT turn off the power! Please stand by...\n");
 	err = bcm43xx_pci_read_config32(bcm, BCM43xx_PCICFG_SPROMCTL, &spromctl);
 	if (err)
 		goto err_ctlreg;
 	spromctl |= 0x10; /* SPROM WRITE enable. */
-	bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
+	err = bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
 	if (err)
 		goto err_ctlreg;
 	/* We must burn lots of CPU cycles here, but that does not
 	 * really matter as one does not write the SPROM every other minute...
 	 */
 	printk(KERN_INFO PFX "[ 0%%");
 	mdelay(500);
 	for (i = 0; i < BCM43xx_SPROM_SIZE; i++) {
 		if (i == 16)
 			printk("25%%");
 		else if (i == 32)
 			printk("50%%");
 		else if (i == 48)
 			printk("75%%");
 		else if (i % 2)
 			printk(".");
 		bcm43xx_write16(bcm, BCM43xx_SPROM_BASE + (i * 2), sprom[i]);
 		mmiowb();
 		mdelay(20);
 	}
 	spromctl &= ~0x10; /* SPROM WRITE enable. */
-	bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
+	err = bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
 	if (err)
 		goto err_ctlreg;
 	mdelay(500);
 	printk("100%% ]\n");
 	printk(KERN_INFO PFX "SPROM written.\n");
 	bcm43xx_controller_restart(bcm, "SPROM update");
 
 	return 0;
 err_ctlreg:
 	printk(KERN_ERR PFX "Could not access SPROM control register.\n");
 	return -ENODEV;
 }
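
The pattern being fixed, stripped of driver details, looks like the
sketch below. The accessors cfg_read() and cfg_write() and the variables
fake_reg and fail_writes are hypothetical stand-ins for the PCI config
helpers, used only so the example is self-contained.

```c
/* Hypothetical register model: writes can be forced to fail. */
int fail_writes;
unsigned int fake_reg;

/* Each accessor returns 0 on success or a negative value on failure. */
int cfg_read(unsigned int *val)
{
    *val = fake_reg;
    return 0;
}

int cfg_write(unsigned int val)
{
    if (fail_writes)
        return -1;
    fake_reg = val;
    return 0;
}

/* Set the write-enable bit (0x10), capturing the result of every access. */
int enable_sprom_write(void)
{
    unsigned int ctl;
    int err;

    err = cfg_read(&ctl);
    if (err)
        return err;
    ctl |= 0x10;
    /* Without this assignment the check below would only re-test the
     * stale result of cfg_read() and could never be true. */
    err = cfg_write(ctl);
    if (err)
        return err;
    return 0;
}
```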


[RFC: 2.6 patch] hostap_80211_rx(): fix a use-after-free

2006-11-06 Thread Adrian Bunk

This patch fixes a use-after-free for skb spotted by the Coverity 
checker.

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

--- linux-2.6/drivers/net/wireless/hostap/hostap_80211_rx.c.old	2006-11-06 14:51:36.0 +0100
+++ linux-2.6/drivers/net/wireless/hostap/hostap_80211_rx.c	2006-11-06 14:52:16.0 +0100
@@ -1004,10 +1004,10 @@ void hostap_80211_rx(struct net_device *
 			if (local->hostapd && local->apdev) {
 				/* Send IEEE 802.1X frames to the user
 				 * space daemon for processing */
-				prism2_rx_80211(local->apdev, skb, rx_stats,
-						PRISM2_RX_MGMT);
 				local->apdevstats.rx_packets++;
 				local->apdevstats.rx_bytes += skb->len;
+				prism2_rx_80211(local->apdev, skb, rx_stats,
+						PRISM2_RX_MGMT);
 				goto rx_exit;
 			}
 		} else if (!frame_authorized) {
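
The reordering follows a general rule: read everything you need from a
buffer before handing it to a function that may free it. A minimal
user-space sketch of that rule, with made-up types and a hypothetical
consume() that takes ownership:

```c
#include <stddef.h>
#include <stdlib.h>

struct buf { size_t len; };
struct stats { unsigned long packets, bytes; };

/* Hypothetical consumer: takes ownership of b and frees it. */
void consume(struct buf *b)
{
    free(b);
}

/* Account first, hand over last: b must not be touched after consume(). */
void account_and_deliver(struct stats *st, struct buf *b)
{
    st->packets++;
    st->bytes += b->len;    /* read while we still own the buffer */
    consume(b);
}
```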


Re: [RFC: 2.6 patch] bcm43xx_sprom_write(): add error checks

2006-11-06 Thread Larry Finger


Adrian Bunk wrote:
The Coverity checker noted that these if (err)'s couldn't ever be 
true.


It seems the intention was to check the return values of the 
bcm43xx_pci_write_config32()'s?


Exactly. This patch sent to wireless-2.6.

Thanks,

Larry

Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-06 Thread Stephen Hemminger


Eric Dumazet wrote:

Zhao Xiaoming wrote:

Dears,
   I'm running a Linux box with kernel version 2.6.16. The hardware
has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC is an
Intel 82571 on the PCI-e bus.
   The box is acting as an Ethernet bridge between 2 Gigabit Ethernets.
By configuring ebtables and iptables, an application runs as a TCP
proxy which intercepts all TCP connection requests from the
network and sets up another TCP connection to the actual server. The
TCP proxy then relays all traffic in both directions.
   The problem is the memory. Since the box must support thousands of
concurrent connections, I know the memory size of ZONE_NORMAL would be
a bottleneck, as TCP packets need many buffers. After setting the
upper limit of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem to 32K bytes,
our test began.
   My test scenario employs 2000 concurrent downloading connections
to an IIS server's port 80. The throughput is about 500~600 Mbps, which
is limited by the capability of the client application. Because all
traffic flows from server to client and the capability of the client
machine is the bottleneck, I believe the receiver side of the sockets
connected to the server and the sender side of the sockets connected
to the client should be filled with packets up to the corresponding
windows. Thus, roughly there should be about 32K*2000 + 32K*2000 = 128M
bytes of memory occupied by the TCP/IP stack for packet buffering. Data
from slabtop confirmed it: about 140M bytes of memory were in use after I
started the traffic. That reasonably matched my estimate. However,
/proc/meminfo told a different story. 'LowFree' dropped from about
710M to 80M. In other words, there is an additional 500M of memory in
ZONE_NORMAL allocated by someone other than the slab. Why?

The amount of memory per socket is controlled by the socket buffering.
Your application could be setting the value by calling setsockopt().
Otherwise, the TCP memory is limited by the sysctl settings tcp_rmem
(receiver) and tcp_wmem (sender).

For example on this server:
$ cat /proc/sys/net/ipv4/tcp_wmem
4096    16384   131072

Each sending socket would start with 16K of buffering, but could grow up
to 128K based on TCP send autotuning.
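
The buffering estimate discussed in this thread is simple enough to
write down. The function below is an illustrative sketch with a made-up
name; it assumes each relayed connection holds one full receive buffer
and one full send buffer, which is the worst case described above.

```c
#include <stdint.h>

/*
 * Worst-case bytes of TCP buffering for a proxy that holds, per relayed
 * connection, one full receive buffer (server side) and one full send
 * buffer (client side). All three parameters are caller-supplied.
 */
uint64_t proxy_buffer_bytes(uint64_t conns, uint64_t rmem_max,
                            uint64_t wmem_max)
{
    return conns * rmem_max + conns * wmem_max;
}
```

With 2000 connections and both limits at 32768 bytes this gives
131072000 bytes, the "128M" figure from the original post. If the send
side could grow to 131072 bytes instead, the same 2000 connections would
need 327680000 bytes.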



Re: [PATCH 2.6.19-rc4-git10][PKT_SCHED] sch_htb: INIT_HLIST_NODE after hlist_del()

2006-11-06 Thread Stephen Hemminger

On Mon, 6 Nov 2006 12:33:53 +0100
Jarek Poplawski [EMAIL PROTECTED] wrote:

 After hlist_del() next and pprev pointers are not NULL
 so hlist_unhashed() doesn't work properly.
 
 
 Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]
 ---
 
 
 diff -Nurp linux-2.6.19-rc4-git10-/net/sched/sch_htb.c linux-2.6.19-rc4-git10/net/sched/sch_htb.c
 --- linux-2.6.19-rc4-git10-/net/sched/sch_htb.c	2006-11-06 11:42:41.0 +0100
 +++ linux-2.6.19-rc4-git10/net/sched/sch_htb.c	2006-11-06 11:53:15.0 +0100
 @@ -1284,8 +1284,10 @@ static void htb_destroy_class(struct Qdi
  						  struct htb_class, sibling));
  
  	/* note: this delete may happen twice (see htb_delete) */
 -	if (!hlist_unhashed(&cl->hlist))
 +	if (!hlist_unhashed(&cl->hlist)) {
  		hlist_del(&cl->hlist);
 +		INIT_HLIST_NODE(&cl->hlist);
 +	}

Why not use hlist_del_init()?

  	list_del(&cl->sibling);
  
  	if (cl->prio_activity)
 @@ -1333,8 +1335,10 @@ static int htb_delete(struct Qdisc *sch,
  	sch_tree_lock(sch);
  
  	/* delete from hash and active; remainder in destroy_class */
 -	if (!hlist_unhashed(&cl->hlist))
 +	if (!hlist_unhashed(&cl->hlist)) {
  		hlist_del(&cl->hlist);
 +		INIT_HLIST_NODE(&cl->hlist);
 +	}
  
  	if (cl->prio_activity)
  		htb_deactivate(q, cl);


-- 
Stephen Hemminger [EMAIL PROTECTED]

Re: [sungem] proposal for a new locking strategy

2006-11-06 Thread Stephen Hemminger

On Sun, 5 Nov 2006 21:11:34 +0100
Eric Lemoine [EMAIL PROTECTED] wrote:

 On 11/5/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
  On Sun, 5 Nov 2006 18:52:45 +0100
  Eric Lemoine [EMAIL PROTECTED] wrote:
 
   On 11/5/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
On Sun, 5 Nov 2006 18:28:33 +0100
Eric Lemoine [EMAIL PROTECTED] wrote:
   
  You could also just use net_tx_lock() now.

 You mean netif_tx_lock()?

 Thanks for letting me know about that function. Yes, I may need it.
 tg3 and bnx2 use it to wake up the transmit queue:

  if (unlikely(netif_queue_stopped(tp->dev) &&
               (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH))) {
          netif_tx_lock(tp->dev);
          if (netif_queue_stopped(tp->dev) &&
              (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH))
                  netif_wake_queue(tp->dev);
          netif_tx_unlock(tp->dev);
  }

 2.6.17 didn't use it. Was it a bug?

 Thanks,
   
No, it was introduced in 2.6.18. The functions are just a wrapper
around the network device transmit lock that is normally held.
   
If the device does not need to acquire the lock during IRQ, it
is a good alternative and avoids a second lock.
   
For transmit locking there are three common alternatives:
   
Method A: dev->queue_xmit_lock and per-device tx_lock
    send: dev->xmit_lock held by caller
          dev->hard_start_xmit acquires netdev_priv(dev)->tx_lock

    irq:  netdev_priv(dev)->tx_lock acquired

Method B: dev->queue_xmit_lock only
    send: dev->xmit_lock held by caller
    irq:  schedules softirq (NAPI)
    napi_poll: calls netif_tx_lock() which acquires dev->xmit_lock

Method C: LLTX
    set dev->features LLTX
    send: no locks held by caller
          dev->hard_start_xmit acquires netdev_priv(dev)->tx_lock
    irq:  netdev_priv(dev)->tx_lock acquired
   
Method A is the only one that works with 2.4 and early (2.6.8?) kernels.
   
  
   Current sungem does Method C, and uses two locks: lock and tx_lock.
   What I was planning to do is Method B (which current tg3 uses). It
   seems to me that Method B is better than Method C. What do you think?
 
  B is better than C because the transmit logic doesn't have to
  spin in the case of lock contention, but it is not a big difference.
 
 Current sungem does C but uses try_lock() to acquire its private
 tx_lock. So it doesn't spin either in case of contention.


But the spin is still there, just more complex.
In qdisc_restart(), processing of NETDEV_TX_LOCKED causes:
	spin_lock(&dev->xmit_lock)

	q->requeue()
	netif_schedule(dev);

SOFTIRQ:
	net_tx_action()
		qdisc_run() --> qdisc_restart()

So instead of spinning in a tight loop, you end up with a longer code
path.


-- 
Stephen Hemminger [EMAIL PROTECTED]

Re: SKGE backport to 2.4 : success

2006-11-06 Thread Stephen Hemminger

On Sat, 4 Nov 2006 22:08:55 +0100
Willy Tarreau [EMAIL PROTECTED] wrote:

 Hi Stephen,
 
 I don't know if you received my mail since I got no reply.
 
 Thanks in advance for your comments,
 Willy
 
 On Sat, Oct 28, 2006 at 10:57:07PM +0200, Willy Tarreau wrote:
  Hi Stephen,
  
  In my own kernels, I've added your backport of SKGE to 2.4 that I found
  here :
  
 http://developer.osdl.org/shemminger/releases/skge-sky2-backport.tar.bz2
  
  It seems to work pretty well compared to the original syskonnect driver
  (up to and including 8.36). Several people around me have reported very
  slow NFS operations with the official driver, which I finally attributed
  to a strange effect of UDP packets not going out after a while until they
  get pushed by a TCP packet. I even noticed the problem at the company
  and we moved the NFS server to an unused 100 Mbps card to work around the
  problem before being able to fully analyze it.
  
  It seems your driver is getting mature and its performance is very close to
  the official one, while its code is smaller and apparently more reliable. I
  was thinking about merging it in mainline 2.4 as a fix for people having
  trouble with the syskonnect driver. It might also be easier to backport
  fixes from 2.6 to 2.4 when the driver is the same.
  
  I don't think we risk any regression because it won't replace an existing
  driver, but will provide one to people who are used to downloading new
  versions from an external tree.
  
  Also, I'm not yet sure whether I would also backport the sky2 driver,
  because I know about a handful of boxes running in production with the
  official one with 88E8053 chips at high packet rates with no trouble at
  all. Anyway, as long as the backport does not prevent them from using the
  external driver, there should be no problem.
  
  I'd like to get your opinion on this matter, and of course, Jeff's and 
  Davem's.
  
  Thanks in advance,
  Willy
  


The backport needs to be updated; it is based on older code.  I plan to do a
new backport this week. The backport version doesn't use NAPI, because I did
not want to change netdevice.h. For a good 2.4 version, I would make one that
is closer to the 2.6 code (using NAPI).

I did the backport because one of the equipment donors gave a VPN box whose
base OS is RHEL based on 2.4.


-- 
Stephen Hemminger [EMAIL PROTECTED]

SKB BUG: Invalid truesize, current git

2006-11-06 Thread Benjamin LaHaise

Hi all,

I managed to get a backtrace for the Invalid truesize bug.  The trigger is 
running LMbench2, but it's rather intermittent.  Traffic should be going 
over the loopback interface, but the main NIC on the machine is e1000.  
Let me know if anyone has any ideas for things to try.

-ben

Linux version 2.6.19-rc4 ([EMAIL PROTECTED]) (gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)) #73 SMP Mon Nov 6 13:13:44 EST 2006
Command line: ro root=LABEL=/1 console=ttyS0,38400
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009cc00 (usable)
 BIOS-e820: 0009cc00 - 000a (reserved)
 BIOS-e820: 000cc000 - 000d (reserved)
 BIOS-e820: 000e4000 - 0010 (reserved)
 BIOS-e820: 0010 - bff6 (usable)
 BIOS-e820: bff6 - bff69000 (ACPI data)
 BIOS-e820: bff69000 - bff8 (ACPI NVS)
 BIOS-e820: bff8 - c000 (reserved)
 BIOS-e820: e000 - f000 (reserved)
 BIOS-e820: fec0 - fec1 (reserved)
 BIOS-e820: fee0 - fee01000 (reserved)
 BIOS-e820: ff00 - 0001 (reserved)
 BIOS-e820: 0001 - 00014000 (usable)
Entering add_active_range(0, 0, 156) 0 entries of 256 used
Entering add_active_range(0, 256, 786272) 1 entries of 256 used
Entering add_active_range(0, 1048576, 1310720) 2 entries of 256 used
end_pfn_map = 1310720
DMI present.
ACPI: RSDP (v000 PTLTD ) @ 0x000f58d0
ACPI: RSDT (v001 PTLTDRSDT   0x0604  LTP 0x) @ 0xbff636df
ACPI: FADT (v001 INTEL  TUMWATER 0x0604 PTL  0x0003) @ 0xbff68e48
ACPI: MADT (v001 PTLTD   APIC   0x0604  LTP 0x) @ 0xbff68ebc
ACPI: MCFG (v001 PTLTDMCFG   0x0604  LTP 0x) @ 0xbff68f4c
ACPI: BOOT (v001 PTLTD  $SBFTBL$ 0x0604  LTP 0x0001) @ 0xbff68f88
ACPI: SPCR (v001 PTLTD  $UCRTBL$ 0x0604 PTL  0x0001) @ 0xbff68fb0
ACPI: SSDT (v001  PmRefCpuPm 0x3000 INTL 0x20050228) @ 0xbff6371b
ACPI: DSDT (v001  Intel BLAKFORD 0x0604 MSFT 0x010e) @ 0x
Entering add_active_range(0, 0, 156) 0 entries of 256 used
Entering add_active_range(0, 256, 786272) 1 entries of 256 used
Entering add_active_range(0, 1048576, 1310720) 2 entries of 256 used
Zone PFN ranges:
  DMA             0 ->     4096
  DMA32        4096 ->  1048576
  Normal    1048576 ->  1310720
early_node_map[3] active PFN ranges
    0:        0 ->      156
    0:      256 ->   786272
    0:  1048576 ->  1310720
On node 0 totalpages: 1048316
  DMA zone: 56 pages used for memmap
  DMA zone: 1395 pages reserved
  DMA zone: 2545 pages, LIFO batch:0
  DMA32 zone: 14280 pages used for memmap
  DMA32 zone: 767896 pages, LIFO batch:31
  Normal zone: 3584 pages used for memmap
  Normal zone: 258560 pages, LIFO batch:31
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x06] enabled)
Processor #6
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x07] enabled)
Processor #7
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 2, address 0xfec0, GSI 0-23
ACPI: IOAPIC (id[0x03] address[0xfec8] gsi_base[24])
IOAPIC[1]: apic_id 3, address 0xfec8, GSI 24-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Setting APIC routing to flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 0009c000 - 0009d000
Nosave address range: 0009d000 - 000a
Nosave address range: 000a - 000cc000
Nosave address range: 000cc000 - 000d
Nosave address range: 000d - 000e4000
Nosave address range: 000e4000 - 0010
Nosave address range: bff6 - bff69000
Nosave address range: bff69000 - bff8
Nosave address range: bff8 - c000
Nosave address range: c000 - e000
Nosave address range: e000 - f000
Nosave address range: f000 - fec0
Nosave address range: fec0 - fec1
Nosave address range: fec1 - fee0
Nosave address range: fee0 - fee01000
Nosave address range: fee01000 - ff00
Nosave address range: ff00 - 0001
Allocating

Re: SKB BUG: Invalid truesize, current git

2006-11-06 Thread Herbert Xu

On Mon, Nov 06, 2006 at 07:07:26PM +, Benjamin LaHaise wrote:
 
 I managed to get a backtrace for the Invalid truesize bug.  The trigger is 
 running LMbench2, but it's rather intermittent.  Traffic should be going 
 over the loopback interface, but the main NIC on the machine is e1000.  
 Let me know if anyone has any ideas for things to try.

OK, this should cure it.  BTW, this indicates that your app is
retransmitting unnecessarily which might be a problem in itself.

This patch applies to all recent 2.6 kernels.

[NET]: Set truesize in pskb_copy

Since pskb_copy tacks on the non-linear bits from the original
skb, it needs to count them in the truesize field of the new skb.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f735455..b8b1063 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -639,6 +639,7 @@ struct sk_buff *pskb_copy(struct sk_buff
 	n->csum	     = skb->csum;
 	n->ip_summed = skb->ip_summed;
 
+	n->truesize += skb->data_len;
 	n->data_len  = skb->data_len;
 	n->len	     = skb->len;
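
To illustrate the accounting that the one-line fix restores, here is a toy userspace model. Everything in it is hypothetical: struct toy_skb, TOY_OVERHEAD and the toy_* functions are invented for this sketch, and the invariant in toy_truesize_ok() is only an assumption about what a truesize sanity check verifies (truesize must cover the bookkeeping overhead plus all data bytes), not a copy of the kernel's check.

```c
#include <assert.h>

/* Toy buffer descriptor; only the accounting fields are modelled. */
struct toy_skb {
	unsigned int len;	/* total bytes: linear + paged */
	unsigned int data_len;	/* bytes held in paged fragments */
	unsigned int truesize;	/* memory charged to the owner */
};

#define TOY_OVERHEAD 64u	/* hypothetical per-buffer bookkeeping cost */

/* Allocate a buffer with only a linear area of the given size. */
static struct toy_skb toy_alloc(unsigned int linear)
{
	struct toy_skb s = { linear, 0, linear + TOY_OVERHEAD };
	return s;
}

/* Copy that duplicates the linear part and shares the fragments. */
static struct toy_skb toy_pskb_copy(const struct toy_skb *src)
{
	struct toy_skb n = toy_alloc(src->len - src->data_len);

	n.truesize += src->data_len;	/* account for the shared fragments */
	n.data_len = src->data_len;
	n.len = src->len;
	return n;
}

/* Assumed invariant: truesize covers overhead plus every data byte. */
static int toy_truesize_ok(const struct toy_skb *s)
{
	return s->truesize >= s->len + TOY_OVERHEAD;
}

/* Returns 1 if the copy passes the check and matches the source. */
static int demo_truesize(void)
{
	struct toy_skb src = toy_alloc(100);
	struct toy_skb copy;

	src.data_len = 400;		/* pretend 400 bytes live in fragments */
	src.len = 500;
	src.truesize += 400;

	copy = toy_pskb_copy(&src);
	return toy_truesize_ok(&copy) && copy.truesize == src.truesize;
}
```

Without the `n.truesize += src->data_len` line the copy would report 164 bytes of truesize for 500 bytes of data and fail the check in this model.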
 

Re: [PATCH 0/11] convert d80211 to a proper protocol

2006-11-06 Thread Jiri Benc

On Sun, 05 Nov 2006 16:39:34 +0100, Johannes Berg wrote:
 003-d80211-cookie.patch
d80211: change the cookie to be opaque
 
This changes the 'cookie' that d80211 returns from alloc_hw
to be an opaque value to the driver. Turned out that it wasn't
such a great idea but since it was generally a clean up I kept
this patch to base my other patches on.

ACK.

 005-d80211-reduce-mdev-1.patch
 006-d80211-reduce-mdev-2.patch
d80211: reduce mdev usage
 
These two patches reduce mdev madness and change a lot of functions
to take a struct ieee80211_local * instead of the master netdev

ACK.

 007-d80211-cleanup-rxmgmt.patch
d80211: reduce mdev usage, fix ieee80211_rx_mgmt
 
Cleans up the ieee80211_rx_mgmt and related code

Looks good after a quick look. Need to review it more deeply.

 008-d80211-scan-sanity.patch
d80211: reduce master ieee80211_ptr deref in scan routines
 
Similar to the reduce mdev patches, just for the scan routines

ACK.

 009-d80211-convert-spaces.patch
d80211: convert leading spaces to tabs
 
I hated working on the code, so I did this. The next patch
breaks everything anyway.

NAK. There are too many patches pending. Let's do this just before
merging.

 010-d80211-proto.patch
d80211: convert to an 802.11 protocol
 
Converts d80211 to be a protocol together with tons of
cleanups and more. Hard to describe in two lines.

NAK.

This patch is too big to review, it does too many things, and I
fundamentally disagree with some parts of it. Split it into
individual patches.

Just some things which are broken with the patch (the list is probably
not complete):

 * The mdev no longer has a sub_if_data attached (why ever did it??)
   Its private area is for the driver since we don't create it but
   the driver does. I did keep the notation of mdev/master all through,
   but it's no longer the stack's device. Keep that in mind.

This definitely breaks AP mode. The code heavily (ab)uses the fact
that the master device is in fact an AP device. I tried to fix
that but it was so difficult I gave up. The whole RX path would need
to be rewritten (and even that is probably not enough). As this will
be fixed for free when we have native 802.11 devices, I don't think we
need to do anything about it now.

 * sysfs layout changed. There is no wiphy or an ieee80211 class any more,
   the attributes that used to be there are now in the net_device that
   the driver registered, and our attributes are below the devices we created.

You want an ieee80211 class. Once you get rid of a master interface you
need something with per-hardware information, statistics etc.

 * sysfs layout changed. There is no wiphy or an ieee80211 class any more,
   the attributes that used to be there are now in the net_device that
   the driver registered, and our attributes are below the devices we created.

Doesn't belong to this patch.

 And probably lots more.

???


What happened to the
d80211: add a function to get the wiphy index
d80211: add a perm_addr hardware property
d80211: add a struct device* hardware property
d80211: add a ethtool_ops hardware property
patches?

Thanks,

 Jiri

-- 
Jiri Benc
SUSE Labs

Re: [patch] d80211: fix key access race

2006-11-06 Thread Jiri Benc

On Fri, 03 Nov 2006 11:48:22 +0800, Hong Liu wrote:
 It seems we don't have any protection when accessing the key.
 The RX/TX path may acquire a key which can be freed by the
 ioctl cmd.
 
 I put a key_lock spinlock to protect all the accesses to the key
 (whether the sta_info->key or ieee80211_sub_if_data->keys[]).
 I couldn't find a good way to handle it :(

NAK, this is too expensive.

I'm aware of the problem and figured out how to fix it correctly while
working on fixing the sta_list locking. Will send a patch later this
week, stay tuned :-)

Thanks,

 Jiri

-- 
Jiri Benc
SUSE Labs

RE: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Ramkrishna Vepa

The 64 bit io operation on the IA64 platform is a 64 bit transaction on
the pci bus and is optimal to leave it as such. I prefer Jeff's
suggestion  - 

guaranteeing that a good enough for drivers readq() and writeq() exist
on all platforms even 32-bit platforms where the operation isn't
inherently atomic.

Ram

 -Original Message-
 From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]
 On Behalf Of Benjamin Herrenschmidt
 Sent: Monday, November 06, 2006 1:57 AM
 To: Jeff Garzik
 Cc: Linus Torvalds; netdev@vger.kernel.org
 Subject: Re: [PATCH] s2io ppc64 fix for readq/writeq
 
 On Mon, 2006-11-06 at 04:55 -0500, Jeff Garzik wrote:
  Benjamin Herrenschmidt wrote:
   On Mon, 2006-11-06 at 01:50 -0800, Linus Torvalds wrote:
   On Mon, 6 Nov 2006, Benjamin Herrenschmidt wrote:
   Anyway, what do you think of Jeff's proposal to just implement them
   as two 32-bit operations? My arch guy side screams at the idea, but
   if, indeed, drivers generally cope fine with it, I suppose that's ok.
   Last I saw, that's how normal PCI will split the IO anyway, so I
   guess it makes sense.
  
   Hrm.. true indeed. I'll implement them that way for ppc32 then.
 
  Bonus points if you want to find-and-kill where individual drivers did
 
  #ifndef readq
  implement readq and writeq by hand...
  #endif
 
 Yes, well, we would have to make sure all archs have them defined
 first though, but I suppose I can have a look later this week, maybe
 tomorrow. Shouldn't be too hard :)
 
 Ben.
 
 
 

Re: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Christoph Hellwig

On Mon, Nov 06, 2006 at 03:33:19PM -0500, Ramkrishna Vepa wrote:
 The 64 bit io operation on the IA64 platform is a 64 bit transaction on
 the pci bus and is optimal to leave it as such. I prefer Jeff's
 suggestion  - 
 
 guaranteeing that a good enough for drivers readq() and writeq() exist
 on all platforms even 32-bit platforms where the operation isn't
 inherently atomic.

For consistency's sake we really want to have readq() and writeq() available
on all platforms.  I remember that some IB cards require it to actually
be a 64-bit transaction, otherwise they have to do funny workarounds.
I think the best solution is to define ARCH_HAS_ATOMIC_READQ_WRITEQ
and let drivers do their workarounds based on that.

I've Cc'ed Roland because he should be able to explain the IB issue in
detail.
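
The fallback discussed in this thread, a 64-bit register access built from two 32-bit accesses, can be sketched in userspace as below. This is an illustration only: fake_regs stands in for a device register window, the fake_* and split_* names are invented here, and the low-word-first ordering is an assumption (a real device may require the opposite order). The point is that the two halves are separate accesses, so nothing makes the pair atomic.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a device register window; real code would use MMIO. */
static volatile uint32_t fake_regs[4];

static uint32_t fake_readl(unsigned int idx)
{
	return fake_regs[idx];
}

static void fake_writel(uint32_t v, unsigned int idx)
{
	fake_regs[idx] = v;
}

/* Non-atomic 64-bit read: low word first, then high word. */
static uint64_t split_readq(unsigned int idx)
{
	uint64_t lo = fake_readl(idx);
	uint64_t hi = fake_readl(idx + 1);

	return lo | (hi << 32);
}

/* Non-atomic 64-bit write: low word first, then high word. */
static void split_writeq(uint64_t v, unsigned int idx)
{
	fake_writel((uint32_t)v, idx);
	fake_writel((uint32_t)(v >> 32), idx + 1);
}

/* Returns 1 if a value survives a split write followed by a split read. */
static int demo_split_io(void)
{
	split_writeq(0x1122334455667788ULL, 0);
	return fake_regs[0] == 0x55667788u &&
	       fake_regs[1] == 0x11223344u &&
	       split_readq(0) == 0x1122334455667788ULL;
}
```

A driver that needs the access to be a single bus transaction cannot rely on helpers like these, which is why a flag along the lines of ARCH_HAS_ATOMIC_READQ_WRITEQ is being proposed.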


Re: [sungem] proposal for a new locking strategy

2006-11-06 Thread Eric Lemoine


On 11/6/06, Stephen Hemminger [EMAIL PROTECTED] wrote:

But the spin is still there, just more complex.
In qdisc_restart(), processing of NETDEV_TX_LOCKED causes:
	spin_lock(&dev->xmit_lock)

	q->requeue()
	netif_schedule(dev);

SOFTIRQ:
	net_tx_action()
		qdisc_run() --> qdisc_restart()

So instead of spinning in a tight loop, you end up with a longer code
path.


Stephen, sorry for insisting a bit but I'm failing to see how B is
different from C in that respect. With method B, in qdisc_restart(),
if netif_tx_trylock() fails to acquire the lock then we also
requeue(), etc. Same long code path in case of contention.

--
Eric

Re: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Roland Dreier

  For consistency's sake we really want to have readq() and writeq() available
  on all platforms.  I remember that some IB cards require it to actually
  be a 64-bit transaction, otherwise they have to do funny workarounds.
  I think the best solution is to define ARCH_HAS_ATOMIC_READQ_WRITEQ
  and let drivers do their workarounds based on that.
  
  I've Cc'ed Roland because he should be able to explain the IB issue in
  detail.

The issue I know about is drivers/infiniband/hw/mthca.  The card has
64-bit doorbell registers, and the restriction is that if you write
the doorbell with two 32-bit writes, you can't write anything else on
the same register page in between writing the two halves.  Since
different CPUs might be doing stuff on the same doorbell page at the
same time, there are two things we can do:
 - If writeq() exists then use that and assume it will generate only a
   single bus transaction that can't let anything sneak in the
   middle.  (That's a fairly safe assumption because the devices being
   driven are either 64-bit PCI-X or PCIe only)
 - If writeq() doesn't exist, use a spinlock to protect access to each
   doorbell page.

ARCH_HAS_ATOMIC_READQ_WRITEQ would be fine for that, but of course the
tricky thing is writing down the exact semantics that HAS_ATOMIC is
actually promising.
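
The second option above, a lock around the two half-writes, might look roughly like this in a userspace model. Nothing here is taken from the mthca driver: the doorbell array stands in for the register, an atomic_flag stands in for the per-page spinlock, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Fake doorbell register, viewed as two 32-bit halves. */
static volatile uint32_t doorbell[2];

/* Stand-in for a per-page spinlock. */
static atomic_flag page_lock = ATOMIC_FLAG_INIT;
static unsigned int lock_count;	/* how often the fallback took the lock */

static void page_lock_acquire(void)
{
	while (atomic_flag_test_and_set(&page_lock))
		;	/* spin until the holder releases the lock */
	lock_count++;
}

static void page_lock_release(void)
{
	atomic_flag_clear(&page_lock);
}

/*
 * Fallback path: two 32-bit writes, with the lock keeping any other
 * writer that uses the same lock from landing between the halves.
 */
static void ring_doorbell_split(uint32_t first, uint32_t second)
{
	page_lock_acquire();
	doorbell[0] = first;
	doorbell[1] = second;
	page_lock_release();
}

/* Returns 1 if both halves were written under exactly one lock hold. */
static int demo_doorbell(void)
{
	ring_doorbell_split(0xdeadbeefu, 0x00c0ffeeu);
	return doorbell[0] == 0xdeadbeefu &&
	       doorbell[1] == 0x00c0ffeeu &&
	       lock_count == 1;
}
```

On a platform where a single 64-bit store is known to reach the device as one transaction, the lock and the second store would both disappear, which is the case the proposed flag is meant to identify.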

 - R.

Re: [sungem] proposal for a new locking strategy

2006-11-06 Thread Stephen Hemminger

On Mon, 6 Nov 2006 21:55:20 +0100
Eric Lemoine [EMAIL PROTECTED] wrote:

 Stephen, sorry for insisting a bit but I'm failing to see how B is
 different from C in that respect. With method B, in qdisc_restart(),
 if netif_tx_trylock() fails to acquire the lock then we also
 requeue(), etc. Same long code path in case of contention.
 

Method C (LLTX) causes repeated softirqs, which will be slower since that
loop requires more instructions than a simple spin loop (Method B).
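
The contended path being argued about can be modelled in userspace as follows. This is only a control-flow sketch under stated assumptions, not kernel code: struct fake_dev, the TX_OK/TX_LOCKED results and every function name are hypothetical, and an atomic_flag stands in for the driver-private tx_lock. It shows that a trylock-based transmit routine does not spin itself but pushes the retry back to its caller.

```c
#include <assert.h>
#include <stdatomic.h>

enum tx_result { TX_OK, TX_LOCKED };

struct fake_dev {
	atomic_flag tx_lock;	/* stands in for the driver-private tx_lock */
	int sent;		/* packets handed to the "hardware" */
	int requeued;		/* packets put back for a later retry */
};

/* Returns nonzero if the lock was taken, zero if it was already held. */
static int fake_trylock(atomic_flag *l)
{
	return !atomic_flag_test_and_set(l);
}

static void fake_unlock(atomic_flag *l)
{
	atomic_flag_clear(l);
}

/* LLTX-style transmit: never spins, reports contention to the caller. */
static enum tx_result lltx_xmit(struct fake_dev *dev)
{
	if (!fake_trylock(&dev->tx_lock))
		return TX_LOCKED;
	dev->sent++;
	fake_unlock(&dev->tx_lock);
	return TX_OK;
}

/* Caller side: on contention the packet is requeued for a later pass. */
static void queue_restart(struct fake_dev *dev)
{
	if (lltx_xmit(dev) == TX_LOCKED)
		dev->requeued++;
}

/* Returns 1 if one contended and one uncontended attempt behave as expected. */
static int demo_contention(void)
{
	struct fake_dev dev = { .tx_lock = ATOMIC_FLAG_INIT };

	fake_trylock(&dev.tx_lock);	/* pretend the IRQ path holds the lock */
	queue_restart(&dev);		/* contended: requeued, nothing sent */
	fake_unlock(&dev.tx_lock);
	queue_restart(&dev);		/* uncontended: sent */

	return dev.sent == 1 && dev.requeued == 1;
}
```

Whether the retry costs more than a short spin depends on how long the lock is held and on the cost of rescheduling, which this model does not measure.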


-- 
Stephen Hemminger [EMAIL PROTECTED]

Re: [RFC: 2.6 patch] hostap_80211_rx(): fix a use-after-free

2006-11-06 Thread Alexey Dobriyan

On Mon, Nov 06, 2006 at 03:21:48PM +0100, Adrian Bunk wrote:
 This patch fixes a use-after-free for skb spotted by the Coverity
 checker.

 --- linux-2.6/drivers/net/wireless/hostap/hostap_80211_rx.c.old
 +++ linux-2.6/drivers/net/wireless/hostap/hostap_80211_rx.c
 @@ -1004,10 +1004,10 @@ void hostap_80211_rx(struct net_device *
  		if (local->hostapd && local->apdev) {
  			/* Send IEEE 802.1X frames to the user
  			 * space daemon for processing */
 -			prism2_rx_80211(local->apdev, skb, rx_stats,
 -					PRISM2_RX_MGMT);
  			local->apdevstats.rx_packets++;
  			local->apdevstats.rx_bytes += skb->len;
 +			prism2_rx_80211(local->apdev, skb, rx_stats,
 +					PRISM2_RX_MGMT);
  			goto rx_exit;

Network drivers set rx_packets and rx_bytes after netif_rx(), and last_rx
too. The trick seems to be to use a pkt_len variable.
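
The pkt_len idea can be sketched in userspace like this: copy the length out of the buffer before ownership is handed over, then update the counters from the saved copy. The types and functions below (struct rx_buf, struct rx_stats, consume_buf(), receive_one()) are invented for illustration and do not correspond to hostap or kernel code.

```c
#include <assert.h>
#include <stdlib.h>

struct rx_buf {
	unsigned int len;
};

struct rx_stats {
	unsigned long rx_packets;
	unsigned long rx_bytes;
};

/* Consumes the buffer, like handing a packet to the upper layer. */
static void consume_buf(struct rx_buf *buf)
{
	free(buf);
}

static void receive_one(struct rx_buf *buf, struct rx_stats *st)
{
	unsigned int pkt_len = buf->len;	/* read before giving up ownership */

	consume_buf(buf);			/* buf must not be touched after this */
	st->rx_packets++;
	st->rx_bytes += pkt_len;
}

/* Returns 1 if the counters reflect one 1500-byte packet. */
static int demo_rx(void)
{
	struct rx_stats st = { 0, 0 };
	struct rx_buf *buf = malloc(sizeof(*buf));

	if (!buf)
		return 0;
	buf->len = 1500;
	receive_one(buf, &st);
	return st.rx_packets == 1 && st.rx_bytes == 1500;
}
```

This keeps the statistics update after the hand-off, as other drivers do, without reading from a buffer that may already have been freed.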


Re: tg3_read_partno(): possible array overrun

2006-11-06 Thread Michael Chan

On Mon, 2006-11-06 at 10:45 +0100, Adrian Bunk wrote:
 The Coverity checker noted the following in drivers/net/tg3.c:
 
 --  snip  --
 
 The problem is that vpd_data[i + 2] could be vpd_data[255 + 2].

Thanks.  This should fix it:

[TG3]: Fix array overrun in tg3_read_partno().

Use proper upper limits for the loops and check for all error
conditions.

The problem was noticed by Adrian Bunk.

Signed-off-by: Michael Chan [EMAIL PROTECTED] 

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 8f059b7..06e4f77 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -10212,7 +10212,7 @@ skip_phy_reset:
 static void __devinit tg3_read_partno(struct tg3 *tp)
 {
 	unsigned char vpd_data[256];
-	int i;
+	unsigned int i;
 	u32 magic;
 
 	if (tg3_nvram_read_swab(tp, 0x0, &magic))
@@ -10258,9 +10258,9 @@ static void __devinit tg3_read_partno(st
 	}
 
 	/* Now parse and find the part number. */
-	for (i = 0; i < 256; ) {
+	for (i = 0; i < 254; ) {
 		unsigned char val = vpd_data[i];
-		int block_end;
+		unsigned int block_end;
 
 		if (val == 0x82 || val == 0x91) {
 			i = (i + 3 +
@@ -10276,21 +10276,26 @@ static void __devinit tg3_read_partno(st
 			     (vpd_data[i + 1] +
 			      (vpd_data[i + 2] << 8)));
 		i += 3;
-		while (i < block_end) {
+
+		if (block_end > 256)
+			goto out_not_found;
+
+		while (i < (block_end - 2)) {
 			if (vpd_data[i + 0] == 'P' &&
 			    vpd_data[i + 1] == 'N') {
 				int partno_len = vpd_data[i + 2];
 
-				if (partno_len > 24)
+				i += 3;
+				if (partno_len > 24 || (partno_len + i) > 256)
 					goto out_not_found;
 
 				memcpy(tp->board_part_number,
-				       &vpd_data[i + 3],
-				       partno_len);
+				       &vpd_data[i], partno_len);
 
 				/* Success. */
 				return;
 			}
+			i += 3 + vpd_data[i + 2];
 		}
 
 		/* Part number not found. */
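
As a standalone illustration of the bounds checks involved, here is a userspace sketch of a parser for a 256-byte VPD image that looks for the "PN" keyword. It is not the tg3 code and is somewhat stricter than the patch: it also rejects keyword fields that run past the end of their enclosing block. The function names, the constants and the sample part number "TOY12345" are all made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VPD_LEN    256u
#define PARTNO_MAX 24u

/*
 * Find the "PN" keyword and copy its value into out, which must hold
 * PARTNO_MAX + 1 bytes. Returns the part-number length, or -1 if it is
 * missing or any length field points outside the buffer or its block.
 */
static int vpd_find_partno(const unsigned char vpd[VPD_LEN], char *out)
{
	unsigned int i = 0;

	while (i + 3 <= VPD_LEN) {	/* tag plus 2-byte length must fit */
		unsigned int tag = vpd[i];
		unsigned int block_len = vpd[i + 1] |
					 ((unsigned int)vpd[i + 2] << 8);
		unsigned int block_end = i + 3 + block_len;

		if (tag == 0x82 || tag == 0x91) {	/* blocks we skip */
			i = block_end;
			continue;
		}
		if (tag != 0x90 || block_end > VPD_LEN)
			return -1;

		i += 3;
		while (i + 3 <= block_end) {	/* keyword header must fit */
			unsigned int klen = vpd[i + 2];

			if (i + 3 + klen > block_end)
				return -1;
			if (vpd[i] == 'P' && vpd[i + 1] == 'N') {
				if (klen > PARTNO_MAX)
					return -1;
				memcpy(out, &vpd[i + 3], klen);
				out[klen] = '\0';
				return (int)klen;
			}
			i += 3 + klen;
		}
		return -1;	/* read-only block had no PN keyword */
	}
	return -1;
}

/* Returns 1 if a good image parses and a corrupted length is rejected. */
static int demo_vpd(void)
{
	/* Identifier string "ab", then a block holding PN=TOY12345. */
	static const unsigned char image[] = {
		0x82, 2, 0, 'a', 'b',
		0x90, 11, 0, 'P', 'N', 8,
		'T', 'O', 'Y', '1', '2', '3', '4', '5',
	};
	unsigned char vpd[VPD_LEN] = { 0 };
	char partno[PARTNO_MAX + 1];
	int ok;

	memcpy(vpd, image, sizeof(image));
	ok = vpd_find_partno(vpd, partno) == 8 &&
	     strcmp(partno, "TOY12345") == 0;

	vpd[7] = 1;	/* corrupt the block length: now 267 bytes */
	return ok && vpd_find_partno(vpd, partno) == -1;
}
```

Using unsigned indices throughout, as the patch does, means a bogus length can only make an index too large, which a single upper-bound comparison catches.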




[PATCH] bcm43xx: Drain TX status before starting IRQs

2006-11-06 Thread Larry Finger

From: Michael Buesch [EMAIL PROTECTED]

Drain the Microcode TX-status-FIFO before we enable IRQs.
This is required, because the FIFO may still have entries left
from a previous run. Those would immediately fire after enabling
IRQs and would lead to an oops in the DMA TXstatus handling code.

Signed-off-by: Michael Buesch [EMAIL PROTECTED]
Signed-off-by: Larry Finger [EMAIL PROTECTED]
---

John,

Please apply this to wireless-2.6 and push it to 2.6.19. It has already
been sent to -stable for inclusion in 2.6.18.3. This patch replaces one
with the same name that was sent by Michael on October 19. It had a bug,
fixed in this version, that would lock up certain core revisions.

Larry

Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
===================================================================
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
@@ -1467,6 +1467,23 @@ static void handle_irq_transmit_status(s
 	}
 }
 
+static void drain_txstatus_queue(struct bcm43xx_private *bcm)
+{
+	u32 dummy;
+
+	if (bcm->current_core->rev < 5)
+		return;
+	/* Read all entries from the microcode TXstatus FIFO
+	 * and throw them away.
+	 */
+	while (1) {
+		dummy = bcm43xx_read32(bcm, BCM43xx_MMIO_XMITSTAT_0);
+		if (!dummy)
+			break;
+		dummy = bcm43xx_read32(bcm, BCM43xx_MMIO_XMITSTAT_1);
+	}
+}
+
 static void bcm43xx_generate_noise_sample(struct bcm43xx_private *bcm)
 {
 	bcm43xx_shm_write16(bcm, BCM43xx_SHM_SHARED, 0x408, 0x7F7F);
@@ -3569,6 +3586,7 @@ int bcm43xx_select_wireless_core(struct 
 	bcm43xx_macfilter_clear(bcm, BCM43xx_MACFILTER_ASSOC);
 	bcm43xx_macfilter_set(bcm, BCM43xx_MACFILTER_SELF, (u8 *)(bcm->net_dev->dev_addr));
 	bcm43xx_security_init(bcm);
+	drain_txstatus_queue(bcm);
 	ieee80211softmac_start(bcm->net_dev);
 
 	/* Let's go! Be careful after enabling the IRQs.

[PATCH] bcm43xx: Add error checking in bcm43xx_sprom_write()

2006-11-06 Thread Larry Finger

From: Adrian Bunk [EMAIL PROTECTED]

The Coverity checker noted that these if (err)'s couldn't ever be 
true.

It seems the intention was to check the return values of the 
bcm43xx_pci_write_config32()'s?

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
Signed-off-by: Larry Finger [EMAIL PROTECTED]
Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
===================================================================
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
@@ -750,7 +750,7 @@ int bcm43xx_sprom_write(struct bcm43xx_p
 	if (err)
 		goto err_ctlreg;
 	spromctl |= 0x10; /* SPROM WRITE enable. */
-	bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
+	err = bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
 	if (err)
 		goto err_ctlreg;
 	/* We must burn lots of CPU cycles here, but that does not
@@ -772,7 +772,7 @@ int bcm43xx_sprom_write(struct bcm43xx_p
 		mdelay(20);
 	}
 	spromctl &= ~0x10; /* SPROM WRITE enable. */
-	bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
+	err = bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
 	if (err)
 		goto err_ctlreg;
 	mdelay(500);

Re: [PATCH] bcm43xx: Add error checking in bcm43xx_sprom_write()

2006-11-06 Thread Michael Buesch

On Monday 06 November 2006 16:48, Larry Finger wrote:
> From: Adrian Bunk [EMAIL PROTECTED]
> 
> The Coverity checker noted that these if (err)'s couldn't ever be 
> true.
> 
> It seems the intention was to check the return values of the 
> bcm43xx_pci_write_config32()'s?

Whoops, I thought I had fixed this bug a long time ago.
The patch is correct.

> Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
> Signed-off-by: Larry Finger [EMAIL PROTECTED]

Signed-off-by: Michael Buesch [EMAIL PROTECTED]

> Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> ===
> --- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> +++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> @@ -750,7 +750,7 @@ int bcm43xx_sprom_write(struct bcm43xx_p
>  	if (err)
>  		goto err_ctlreg;
>  	spromctl |= 0x10; /* SPROM WRITE enable. */
> -	bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
> +	err = bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
>  	if (err)
>  		goto err_ctlreg;
>  	/* We must burn lots of CPU cycles here, but that does not
> @@ -772,7 +772,7 @@ int bcm43xx_sprom_write(struct bcm43xx_p
>  		mdelay(20);
>  	}
>  	spromctl &= ~0x10; /* SPROM WRITE enable. */
> -	bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
> +	err = bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
>  	if (err)
>  		goto err_ctlreg;
>  	mdelay(500);

-- 
Greetings Michael.

Re: [PATCH 0/11] convert d80211 to a proper protocol

2006-11-06 Thread Johannes Berg

[reordering a bit]

 This changes the 'cookie' that d80211 returns from alloc_hw
 to be an opaque value to the driver. Turned out that it wasn't
 such a great idea but since it was generally a clean up I kept
 this patch to base my other patches on.
 
 ACK.

 What did happen with
 d80211: add a function to get the wiphy index
 d80211: add a perm_addr hardware property
 d80211: add a struct device* hardware property
 d80211: add a ethtool_ops hardware property
 patches?

Well after some chat with a few people I decided that it was stupid and
not very maintainable to copy all the fields in net_device to a new
structure.

  009-d80211-convert-spaces.patch
 d80211: convert leading spaces to tabs
  
 I hated working on the code, so I did this. The next patch
 breaks everything anyway.
 
 NAK. There are too many patches pending. Let's do this just before
 merging.

Oh come off it! It's really stupid to have to check all the tabs/spaces
all the time. The patch changes 451 lines. And wiggle can handle that
just fine. Besides, if you do
s/^\+   /+\t/
s/^-   /-\t/
s//\t/
on your patches, they'll be fine too.

 This is too big patch for a review, 

Yeah. It's pretty bad actually, but I couldn't really find a good way to
split it into logical chunks.

  * The mdev no longer has a sub_if_data attached (why ever did it??)
It's private area is for the driver since we don't create it but
the driver does. I did keep the notation of mdev/master all through,
but it's no longer the stacks device. Keep that in mind.
 
 This definitely breaks AP mode. In the code, there is heavily (ab)used
 the fact that the master device is in fact an AP device. I tried to fix
 that but it was so difficult I gave up. It is needed to rewrite the
 whole RX path (and even that is probably not enough).

Bugger. I didn't notice that. I'll have a look. That is indeed a
showstopper.

 As this will
 be fixed for free when we have native 802.11 devices, I don't think we
 need to do anything about it now.

I don't think I understand this. I mean, my patch actually gives us
native 802.11 devices by making the drivers register those and then
handling them virtually similar to how 8021q handles ethernet devices. I
honestly thought that this was the plan for said native 802.11
devices.

  * sysfs layout changed. There is no wiphy or an ieee80211 class any more,
the attributes that used to be there are now in the net_device that
the driver registered, and our attributes are below the devices we 
  created.
 
 You want an ieee80211 class. Once you get rid of a master interface you
 need something with per-hardware information, statistics etc.

Yeah, I gave up trying to get rid of the master interface in favour of
having a native 802.11 device which is registered by the phy driver
instead.

  * sysfs layout changed. There is no wiphy or an ieee80211 class any more,
the attributes that used to be there are now in the net_device that
the driver registered, and our attributes are below the devices we 
  created.
 
 Doesn't belong to this patch.

Had to be here initially due to the way I did things, but ok, probably
changeable.

johannes



Re: [sungem] proposal for a new locking strategy

2006-11-06 Thread Eric Lemoine


On 11/6/06, Stephen Hemminger [EMAIL PROTECTED] wrote:

On Mon, 6 Nov 2006 21:55:20 +0100
Eric Lemoine [EMAIL PROTECTED] wrote:

 On 11/6/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
  On Sun, 5 Nov 2006 21:11:34 +0100
  Eric Lemoine [EMAIL PROTECTED] wrote:
 
   On 11/5/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
On Sun, 5 Nov 2006 18:52:45 +0100
Eric Lemoine [EMAIL PROTECTED] wrote:
   
 On 11/5/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
  On Sun, 5 Nov 2006 18:28:33 +0100
  Eric Lemoine [EMAIL PROTECTED] wrote:
 
You could also just use net_tx_lock() now.
  
   You mean netif_tx_lock()?
  
   Thanks for letting me know about that function. Yes, I may need 
it.
   tg3 and bnx2 use it to wake up the transmit queue:
  
    if (unlikely(netif_queue_stopped(tp->dev) &&
                 (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH))) {
            netif_tx_lock(tp->dev);
            if (netif_queue_stopped(tp->dev) &&
                (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH))
                    netif_wake_queue(tp->dev);
            netif_tx_unlock(tp->dev);
    }
  
   2.6.17 didn't use it. Was it a bug?
  
   Thanks,
 
  No, it was introduced in 2.6.18. The functions are just a wrapper
  around the network device transmit lock that is normally held.
 
  If the device does not need to acquire the lock during IRQ, it
  is a good alternative and avoids a second lock.
 
  For transmit locking there are three common alternatives:
 
  Method A: dev->queue_xmit_lock and per-device tx_lock
  send: dev->xmit_lock held by caller
        dev->hard_start_xmit acquires netdev_priv(dev)->tx_lock
 
  irq:  netdev_priv(dev)->tx_lock acquired
 
  Method B: dev->queue_xmit_lock only
  send: dev->xmit_lock held by caller
  irq:  schedules softirq (NAPI)
  napi_poll: calls netif_tx_lock() which acquires dev->xmit_lock
 
  Method C: LLTX
  set dev->features LLTX
  send: no locks held by caller
        dev->hard_start_xmit acquires netdev_priv(dev)->tx_lock
  irq:  netdev_priv(dev)->tx_lock acquired
 
  Method A is the only one that works with 2.4 and early (2.6.8?) kernels.
 

 Current sungem does Method C, and uses two locks: lock and tx_lock.
 What I was planning to do is Method B (which current tg3 uses). It
 seems to me that Method B is better than Method C. What do you think?
   
B is better than C because the transmit logic doesn't have to
spin in the case of lock contention, but it is not a big difference.
  
   Current sungem does C but uses try_lock() to acquire its private
   tx_lock. So it doesn't spin either in case of contention.
 
 
  But the spin is still there, just more complex..
  In qdisc_restart() processing of NETDEV_TX_LOCKED causes:
          spin_lock(&dev->xmit_lock)
 
          q->requeue()
          netif_schedule(dev);
 
  SOFTIRQ:
          net_tx_action()
          qdisc_run() --> qdisc_restart()
 
  So instead of spinning in tight loop, you end up with a longer code
  path.

 Stephen, sorry for insisting a bit but I'm failing to see how B is
 different from C in that respect. With method B, in qdisc_restart(),
 if netif_tx_trylock() fails to acquire the lock then we also
 requeue(), etc. Same long code path in case of contention.


Method C LLTX causes repeated softirq's which will be slower since the loop
requires more instructions than a simple spin loop (Method B).


What I'm saying above is that Method B also causes repeated tx
softirqs in case of contention on netif_tx_lock. The code path is:
netif_tx_trylock() fails -> requeue() -> netif_schedule() ->
raise_softirq(NET_TX_SOFTIRQ). Am I missing anything?


--
Eric
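
The trylock/requeue path discussed above can be modelled in userspace. This is
only a sketch under stated assumptions: struct model_dev and the model_*()
functions are hypothetical stand-ins, not kernel APIs, and the counters merely
record how often a pass had to requeue and reschedule instead of sending.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct model_dev {
	atomic_flag tx_lock;	/* plays the role of the transmit lock */
	int sent;		/* packets handed to the "hardware" */
	int requeued;		/* packets put back on the queue */
	int softirqs_raised;	/* times the TX softirq was rescheduled */
};

void model_init(struct model_dev *dev)
{
	atomic_flag_clear(&dev->tx_lock);
	dev->sent = 0;
	dev->requeued = 0;
	dev->softirqs_raised = 0;
}

/* Returns true if the lock was taken, false if someone else holds it. */
bool model_tx_trylock(struct model_dev *dev)
{
	return !atomic_flag_test_and_set(&dev->tx_lock);
}

void model_tx_unlock(struct model_dev *dev)
{
	atomic_flag_clear(&dev->tx_lock);
}

/* One pass of the dequeue loop: send if the lock is free, otherwise
 * requeue the packet and schedule another softirq pass. */
bool model_restart(struct model_dev *dev)
{
	if (!model_tx_trylock(dev)) {
		dev->requeued++;
		dev->softirqs_raised++;
		return false;
	}
	dev->sent++;
	model_tx_unlock(dev);
	return true;
}
```

In this model every contended pass costs one requeue and one rescheduled
softirq, regardless of whether the contended lock is the device-wide transmit
lock or a driver-private one.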

Re: SKGE backport to 2.4 : success

2006-11-06 Thread Willy Tarreau

On Mon, Nov 06, 2006 at 10:56:09AM -0800, Stephen Hemminger wrote:
> On Sat, 4 Nov 2006 22:08:55 +0100
> Willy Tarreau [EMAIL PROTECTED] wrote:
> 
> > Hi Stephen,
> > 
> > I don't know if you received my mail since I got no reply.
> > 
> > Thanks in advance for your comments,
> > Willy
> > 
> > On Sat, Oct 28, 2006 at 10:57:07PM +0200, Willy Tarreau wrote:
> > > Hi Stephen,
> > > 
> > > In my own kernels, I've added your backport of SKGE to 2.4 that I
> > > found here :
> > > 
> > > http://developer.osdl.org/shemminger/releases/skge-sky2-backport.tar.bz2
> > > 
> > > It seems to work pretty well compared to the original syskonnect
> > > driver (up to and including 8.36). Several people around me have
> > > reported very slow NFS operations with the official driver, which I
> > > finally attributed to a strange effect of UDP packets not going out
> > > after a while until they get pushed by a TCP packet. I even noticed
> > > the problem at the company and we turned the NFS server to an unused
> > > 100 Mbps card to work around the problem before being able to fully
> > > analyze it.
> > > 
> > > It seems your driver is getting mature and its performance is very
> > > close to the official one, while its code is smaller and apparently
> > > more reliable. I was thinking about merging it in mainline 2.4 as a
> > > fix for people having trouble with the syskonnect driver. It might
> > > also be easier to backport fixes from 2.6 to 2.4 when the driver is
> > > the same.
> > > 
> > > I don't think we risk any regression because it won't replace an
> > > existing driver, but will provide one to people who are used to
> > > downloading new versions from an external tree.
> > > 
> > > Also, I'm not yet sure whether I would also backport the sky2
> > > driver, because I know about a handful of boxes running in
> > > production with the official one with 88E8053 chips at high packet
> > > rates with no trouble at all. Anyway, as long as the backport does
> > > not prevent them from using the external driver, there should be no
> > > problem.
> > > 
> > > I'd like to get your opinion on this matter, and of course, Jeff's
> > > and Davem's.
> > > 
> > > Thanks in advance,
> > > Willy
> > > 
> 
> 
> The backport needs to be updated. It is of older code.  I plan to do a new
> backport this week. The backport version doesn't use NAPI, because of issues
> with not wanting to change netdevice.h. For a good 2.4 version, I would
> make a version that was closer to 2.6 code (using NAPI).

That would be perfect, it would make backport of fixes even easier. I have
turned last version into a patch against 2.4.33 for in-tree inclusion, so
if you're interested in getting it for the Config.in, Makefiles and
Configure.help, do not hesitate.

> I did the backport because one of the equipment donors gave a VPN box whose
> base OS is RHEL based on 2.4.

It's amazing how having the hardware stimulates development, isn't it? :-)

Thanks,
Willy


Re: [PATCH/RFC] add netpoll support for gianfar

2006-11-06 Thread Andy Fleming



On Nov 6, 2006, at 05:19, Vitaly Wool wrote:

The patch inlined below adds NET_POLL_CONTROLLER support for  
gianfar network driver.


 drivers/net/gianfar.c |   34 ++
 1 file changed, 34 insertions(+)

Signed-off-by: Vitaly Wool [EMAIL PROTECTED]

Index: powerpc/drivers/net/gianfar.c
===
--- powerpc.orig/drivers/net/gianfar.c
+++ powerpc/drivers/net/gianfar.c
@@ -133,6 +133,9 @@ static void gfar_set_hash_for_addr(struc
 #ifdef CONFIG_GFAR_NAPI
 static int gfar_poll(struct net_device *dev, int *budget);
 #endif
+#ifdef CONFIG_NET_POLL_CONTROLLER
+static void gfar_netpoll(struct net_device *dev);
+#endif
 int gfar_clean_rx_ring(struct net_device *dev, int rx_work_limit);
 static int gfar_process_frame(struct net_device *dev, struct sk_buff *skb, int length);

 static void gfar_vlan_rx_register(struct net_device *netdev,
@@ -260,6 +263,9 @@ static int gfar_probe(struct platform_de
 	dev->poll = gfar_poll;
 	dev->weight = GFAR_DEV_WEIGHT;
 #endif
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	dev->poll_controller = gfar_netpoll;
+#endif
 	dev->stop = gfar_close;
 	dev->get_stats = gfar_get_stats;
 	dev->change_mtu = gfar_change_mtu;
@@ -1536,6 +1542,34 @@ static int gfar_poll(struct net_device *
 }
 #endif

+#ifdef CONFIG_NET_POLL_CONTROLLER
+/*
+ * Polling 'interrupt' - used by things like netconsole to send skbs
+ * without having to re-enable interrupts. It's not called while
+ * the interrupt routine is executing.
+ */
+static void gfar_netpoll(struct net_device *dev)
+{
+	struct gfar_private *priv = netdev_priv(dev);
+
+	/* If the device has multiple interrupts, run tx/rx */
+	if (priv->einfo->device_flags & FSL_GIANFAR_DEV_HAS_MULTI_INTR) {
+		disable_irq(priv->interruptTransmit);
+		disable_irq(priv->interruptReceive);
+		disable_irq(priv->interruptError);
+		gfar_transmit(priv->interruptTransmit, dev, NULL);
+		gfar_receive(priv->interruptReceive, dev, NULL);



You are passing extra arguments, here



+		enable_irq(priv->interruptError);
+		enable_irq(priv->interruptReceive);
+		enable_irq(priv->interruptTransmit);
+	} else {
+		disable_irq(priv->interruptTransmit);
+		gfar_interrupt(priv->interruptTransmit, dev, NULL);



and here (pt_regs got eliminated).

Also, a few more comments:

1) Do we need the disable/enable irq stuff?  It seems like we should  
be able to either just *mask* the interrupts at the controller, or  
rely on the locks to disable the interrupts.


2) If we are calling gfar_transmit and gfar_receive, shouldn't we  
call gfar_error?


3) I think it should be possible to just call gfar_interrupt() in  
every situation, but I'm not very familiar with net poll's  
requirements (You can add that into your evaluation of #1, too).


Andy



Zero checksum in netconsole/netdump packets

2006-11-06 Thread Chris Lalancette

Hello,
  I was reading some tcpdump's of netdump traffic today, and I realized 
that all of the packets that go from the crashing machine to the netdump server 
have a zero checksum.  Looking at the code, it looks like netconsole/netdump 
use the function netpoll_send_udp to send out the packets.  However, in 
netdump_send_udp, the checksum is set to 0, and never seems to be computed.  Is 
this intentional, or just an oversight?  I would think that we would always 
want to compute the UDP checksum, but there might be something I am 
overlooking.  Incidentally, it seems like the only user of netpoll_send_udp is 
netconsole (and netdump in RedHat kernels).
 Assuming that this is just an oversight, attached is a simple patch to 
compute the UDP checksum in netpoll_send_udp.
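
For reference, the UDP checksum over the IPv4 pseudo-header (RFC 768) can be
sketched in plain userspace C. This is an illustration only, not the kernel's
csum_partial()/csum_tcpudp_magic() implementation: all function names below
are hypothetical, addresses are passed in host byte order, and the special
case where a received checksum field of zero means "no checksum" is not
handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Add a buffer to a running sum as big-endian 16-bit words,
 * padding an odd trailing byte with zero. */
uint32_t csum_add(uint32_t sum, const uint8_t *p, size_t len)
{
	while (len > 1) {
		sum += ((uint32_t)p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}
	if (len)
		sum += (uint32_t)p[0] << 8;
	return sum;
}

/* Fold the carries back in to get the 16-bit one's-complement sum. */
uint16_t csum_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* One's-complement sum of the pseudo-header plus UDP header and payload. */
uint16_t udp_sum(uint32_t saddr, uint32_t daddr,
		 const uint8_t *udp, size_t udp_len)
{
	const uint8_t pseudo[12] = {
		(uint8_t)(saddr >> 24), (uint8_t)(saddr >> 16),
		(uint8_t)(saddr >> 8), (uint8_t)saddr,
		(uint8_t)(daddr >> 24), (uint8_t)(daddr >> 16),
		(uint8_t)(daddr >> 8), (uint8_t)daddr,
		0, 17,				/* zero byte, protocol = UDP */
		(uint8_t)(udp_len >> 8), (uint8_t)udp_len,
	};
	uint32_t sum = csum_add(0, pseudo, sizeof(pseudo));

	return csum_fold16(csum_add(sum, udp, udp_len));
}

/* Checksum to store in the header; the checksum field in udp[] must be
 * zero when this is called. A computed value of 0 is sent as 0xffff. */
uint16_t udp_checksum(uint32_t saddr, uint32_t daddr,
		      const uint8_t *udp, size_t udp_len)
{
	uint16_t csum = (uint16_t)~udp_sum(saddr, daddr, udp, udp_len);

	return csum ? csum : 0xffff;
}

/* A datagram with a correct checksum in place sums to 0xffff. */
int udp_checksum_valid(uint32_t saddr, uint32_t daddr,
		       const uint8_t *udp, size_t udp_len)
{
	return udp_sum(saddr, daddr, udp, udp_len) == 0xffff;
}
```

A receiver that verifies checksums would accept a datagram built this way and
reject it after any single-bit corruption of the payload.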

Signed-off-by: Chris Lalancette [EMAIL PROTECTED]
--- linux-2.6/net/core/netpoll.c.orig	2006-11-06 18:16:58.0 -0500
+++ linux-2.6/net/core/netpoll.c	2006-11-06 18:31:20.0 -0500
@@ -356,6 +356,10 @@ void netpoll_send_udp(struct netpoll *np
 	put_unaligned(htonl(np->remote_ip), &(iph->daddr));
 	iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
 
+	udph->check = csum_tcpudp_magic(iph->saddr, iph->daddr, udp_len,
+					IPPROTO_UDP,
+					csum_partial((unsigned char *)udph, udp_len, 0));
+
 	eth = (struct ethhdr *) skb_push(skb, ETH_HLEN);
 	skb->mac.raw = skb->data;
 	skb->protocol = eth->h_proto = htons(ETH_P_IP);

Re: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Benjamin Herrenschmidt


> Generally the kernel code should write the two 32-bit chunks to the 
> memory-mapped region in order (low dword first), and let things take 
> care of themselves from there.
> 
> That's pretty much the implementation that -every- driver copies, when 
> they need readq/writeq to work on a 32-bit platform.

What do you mean by "low dword first"? For example, the implementation
in the s2io driver does:

static inline u64 readq(void __iomem *addr)
{
	u64 ret = 0;
	ret = readl(addr + 4);
	ret <<= 32;
	ret |= readl(addr);

	return ret;
}

static inline void writeq(u64 val, void __iomem *addr)
{
	writel((u32) (val), addr);
	writel((u32) (val >> 32), (addr + 4));
}

As you can see, it reads the -second- dword first (high order dword in
little endian), but writes the first dword first (low order dword in
little endian).
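
To make the two access orders concrete, here is a small userspace model. It is
a sketch only: model_readl()/model_writel() are hypothetical stand-ins for
readl()/writel() that operate on ordinary memory, with regs[0] as the low
dword and regs[1] as the high dword.

```c
#include <stdint.h>

/* Hypothetical stand-ins for readl()/writel() on a register pair:
 * regs[0] is the low dword, regs[1] is the high dword. */
uint32_t model_readl(const volatile uint32_t *reg)
{
	return *reg;
}

void model_writel(uint32_t val, volatile uint32_t *reg)
{
	*reg = val;
}

/* High dword first, then low, matching the read order quoted above. */
uint64_t model_readq(const volatile uint32_t *regs)
{
	uint64_t ret = model_readl(&regs[1]);

	ret <<= 32;
	ret |= model_readl(&regs[0]);
	return ret;
}

/* Low dword first, then high, matching the write order quoted above. */
void model_writeq(uint64_t val, volatile uint32_t *regs)
{
	model_writel((uint32_t)val, &regs[0]);
	model_writel((uint32_t)(val >> 32), &regs[1]);
}
```

On plain memory the resulting value does not depend on which half is accessed
first. The ordering only matters on real hardware, where a read or write of
one half may have side effects on the device.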

If there is any logic here, it's card specific.

Or is this really what PCI does when doing 64-bit accesses on a 32-bit
PCI bus? I would have expected the latter (what write does), but this
driver does the reverse on reads.

I'm tempted to go to the simple

#define readq readq for now until we clear that up.
 
Ben.



Re: [PATCH 1/1] Net: kconfig, correct traffic shaper

2006-11-06 Thread David Miller

From: Jeff Garzik [EMAIL PROTECTED]
Date: Mon, 06 Nov 2006 02:52:02 -0500

> ACK from me, though I think that since it relates to traffic schedulers 
> I think this patch should be merged through DaveM...

I've merged it into my tree, thanks everyone.

Re: [PATCH 2/3] add dev_to_node()

2006-11-06 Thread Christoph Hellwig

On Sun, Nov 05, 2006 at 12:22:37AM -0800, David Miller wrote:
> Looks good to me.

So what's the right path to get this in?  There's one patch touching
MM code, one adding something to the driver core and then finally a
networking patch depending on the previous two.  Do you want to take
them all and send them in through the networking tree?  Or should
we put the burden on Andrew?

[PATCH] don't use highmem in tcp hash size calculation

2006-11-06 Thread John Heffner

 This patch removes consideration of high memory when determining TCP hash
table sizes.  Taking into account high memory results in tcp_mem values that
are too large.

Signed-off-by: John Heffner [EMAIL PROTECTED]

---
commit ea55b7c31b47edf90132baea9a088da3bbe2bb5c
tree 82311e12d4e4e006fba1688cb537de06cf7a4e4b
parent 4f6f9ba021f8a2149238f7c081cd7cf55c70c775
author John Heffner [EMAIL PROTECTED] Mon, 06 Nov 2006 20:03:01 -0500
committer John Heffner [EMAIL PROTECTED] Mon, 06 Nov 2006 20:03:01 -0500

 net/ipv4/tcp.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 66e9a72..4322318 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2270,7 +2270,7 @@ void __init tcp_init(void)
 					thash_entries,
 					(num_physpages >= 128 * 1024) ?
 					13 : 15,
-					HASH_HIGHMEM,
+					0,
 					&tcp_hashinfo.ehash_size,
 					NULL,
 					0);
@@ -2286,7 +2286,7 @@ void __init tcp_init(void)
 					tcp_hashinfo.ehash_size,
 					(num_physpages >= 128 * 1024) ?
 					13 : 15,
-					HASH_HIGHMEM,
+					0,
 					&tcp_hashinfo.bhash_size,
 					NULL,
 					64 * 1024);

RE: [PATCH] s2io ppc64 fix for readq/writeq

2006-11-06 Thread Ramkrishna Vepa



> -Original Message-
> From: Roland Dreier [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 06, 2006 12:55 PM
> To: Christoph Hellwig
> Cc: Ramkrishna Vepa; Benjamin Herrenschmidt; Jeff Garzik; Linus Torvalds;
> netdev@vger.kernel.org; [EMAIL PROTECTED]
> Subject: Re: [PATCH] s2io ppc64 fix for readq/writeq
> 
> > For consistency's sake we really want to have readq() and writeq()
> > available on all platforms.  I remember that some IB cards require it
> > to actually be a 64bit transaction, otherwise they have to do funny
> > workarounds.
> > I think the best solution is to define ARCH_HAS_ATOMIC_READQ_WRITEQ
> > and let drivers do their workarounds based on that.
> >
> > I've Cc'ed Roland because he should be able to explain the IB issue
> > in detail.
> 
> The issue I know about is drivers/infiniband/hw/mthca.  The card has
> 64-bit doorbell registers, and the restriction is that if you write
> the doorbell with two 32-bit writes, you can't write anything else on
> the same register page in between writing the two halves.  Since
> different CPUs might be doing stuff on the same doorbell page at the
> same time, there are two things we can do:
>  - If writeq() exists then use that and assume it will generate only a
>    single bus transaction that can't let anything sneak in the
>    middle.  (That's a fairly safe assumption because the devices being
>    driven are either 64-bit PCI-X or PCIe only)
>  - If writeq() doesn't exist, use a spinlock to protect access to each
>    doorbell page.
> 
> ARCH_HAS_ATOMIC_READQ_WRITEQ would be fine for that, but of course the
> tricky thing is writing down the exact semantics that HAS_ATOMIC is
> actually promising.
> 
>  - R.
[Ram] If the writes are broken up into 32-bit writes, they are posted to
the bridge and need to be flushed, with a lock around the whole access.
This is in the domain of the driver and need not be part of the
platform-specific code.

Ram
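
The "lock around the whole access" fallback can be sketched in userspace as
follows. This is a model under stated assumptions, not the mthca code: struct
doorbell_page and both functions are hypothetical, the order of the two halves
is chosen arbitrarily, and the flush of posted writes mentioned above is not
modelled.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of one doorbell page; not the mthca layout. */
struct doorbell_page {
	atomic_flag lock;	/* protects the two-halves sequence */
	volatile uint32_t hi;
	volatile uint32_t lo;
};

void doorbell_page_init(struct doorbell_page *pg)
{
	atomic_flag_clear(&pg->lock);
	pg->hi = 0;
	pg->lo = 0;
}

/* Write a 64-bit doorbell as two 32-bit halves, keeping other
 * writers off the page until both halves are out. */
void doorbell_write_split(struct doorbell_page *pg, uint64_t val)
{
	while (atomic_flag_test_and_set_explicit(&pg->lock,
						 memory_order_acquire))
		;	/* spin until the page is free */
	pg->hi = (uint32_t)(val >> 32);
	pg->lo = (uint32_t)val;
	atomic_flag_clear_explicit(&pg->lock, memory_order_release);
}
```

Where a single 64-bit store is known to reach the device as one bus
transaction, the lock would be unnecessary; that guarantee is exactly what a
flag such as ARCH_HAS_ATOMIC_READQ_WRITEQ would have to promise.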


Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-06 Thread Zhao Xiaoming


On 11/6/06, Eric Dumazet [EMAIL PROTECTED] wrote:

On Monday 06 November 2006 10:46, Zhao Xiaoming wrote:
 2006/11/6, Eric Dumazet [EMAIL PROTECTED]:
  On Monday 06 November 2006 09:59, Zhao Xiaoming wrote:
   Thank you again for your help. To have more detailed statistic data, I
   did another round of tests and gathered some data.  I give the overall
   description here; detailed /proc/net/sockstat, /proc/meminfo,
   /proc/slabinfo and /proc/buddyinfo follow.
   ==================================================================
                      slab mem cost   tcp mem pages   lowmem free
   with traffic:      254668KB        34693           38772KB
   without traffic:   104080KB        1               702652KB
   ==================================================================
 
  Thank you for the detailed info.
 
  It appears you make extensive use of threads (about 10000), since :
   task_struct     10095  10095   1360    3    1 : tunables   24   12    8 : slabdata   3365   3365      0
 
  Each thread has a kernel stack, 8KB (ie 2 pages, order-1 allocation),
  plus a user vma
 
   vm_area_struct  21346  21504     92   42    1 : tunables  120   60    8 : slabdata    512    512      0
 
  Most likely you don't need that many threads. A program with fewer threads
  will perform better and use less RAM.

 Thanks for the comments. I know the threads may cost a lot of memory.
 However, I already excluded them from the statistics. The 'after test'
 info was gathered while the 10000 threads were running but no traffic was
 being relayed. If you look at the meminfo of 'after test', there is still
 104080 kB of slab memory, which should already include the thread kernel
 memory cost (8K*10000=80MB). I know 10000 threads are not necessary;
 I just used this simple logic to do some tests.

In fact, your kernel has CONFIG_4KSTACKS, so kernel thread stacks use 4K
instead of 8K.

If you want to increase LOWMEM (and keep a 32-bit kernel), you can choose a
2G/2G user/kernel split instead of the default 3G/1G split.
(see config : CONFIG_VMSPLIT_2G)

Eric


Thank you for your advice. I know increasing LOWMEM could help, but
now my concern is why I lose 500M bytes of memory after excluding all
known memory costs.

Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-06 Thread Zhao Xiaoming


On 11/7/06, Stephen Hemminger [EMAIL PROTECTED] wrote:

Eric Dumazet wrote:
 Zhao Xiaoming a écrit :
 Dears,
    I'm running a linux box with kernel version 2.6.16. The hardware
 has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC is an
 Intel 82571 on the PCI-e bus.
    The box is acting as an ethernet bridge between 2 Gigabit Ethernets.
 By configuring ebtables and iptables, an application is running as a TCP
 proxy which intercepts all TCP connection requests from the
 network and sets up another TCP connection to the actual server.  The
 TCP proxy then relays all traffic in both directions.
    The problem is the memory. Since the box must support thousands of
 concurrent connections, I know the memory size of ZONE_NORMAL would be
 a bottleneck as TCP packets would need many buffers. After setting the
 upper limit of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem to 32K bytes,
 our test began.
    My test scenario employs 2000 concurrent downloading connections
 to an IIS server's port 80. The throughput is about 500~600 Mbps, which
 is limited by the capability of the client application. Because all
 traffic is from server to client and the capability of the client
 machine is the bottleneck, I believe the receiver side of the sockets
 connected with the server and the sender side of the sockets connected
 with the client should be filled with packets in the corresponding windows.
 Thus, roughly there should be about 32K * 2000 + 32K * 2000 = 128M bytes
 of memory occupied by the TCP/IP stack for packet buffering. Data from
 slabtop confirmed it: it's about 140M bytes of memory cost after I start
 the traffic. That reasonably matched my estimation. However,
 /proc/meminfo had a different story. The 'LowFree' dropped from about
 710M to 80M. In other words, there's an additional 500M of memory in
 ZONE_NORMAL allocated by someone other than the slab. Why?
The amount of memory per socket is controlled by the socket buffering.
Your application could be setting the value by calling setsockopt().
Otherwise, the tcp memory is limited by the sysctl settings tcp_rmem
(receiver) and tcp_wmem (sender).

For example on this server:
$ cat /proc/sys/net/ipv4/tcp_wmem
4096    16384   131072

Each sending socket would start with 16K of buffering, but could grow up
to 128K based on TCP send autotuning.




Of course I can change the TCP buffers, and I already described that I set
the upper limits of both tcp_rmem and tcp_wmem to 32K. If you go
through my earlier posts, you will notice that the TCP stack on my machine
only occupied about 34K memory pages for buffering, which is close to my
theoretical estimate of 128M. But at the same time, my free LOWMEM
decreased from over 700M to less than 100M. The question is: where did the
additional 500M bytes go?
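
The arithmetic behind these figures can be checked with a few lines of C. This
is a sketch assuming 4 KiB pages and the numbers quoted in this thread (2000
connections, 32K buffer limits, 34693 TCP memory pages); the function names
are hypothetical.

```c
#include <stdint.h>

#define PAGE_KIB 4	/* assumption: 4 KiB pages, as on i386 */

/* Worst-case buffering in KiB when, for each relayed connection, one
 * receive buffer and one send buffer are filled to their limits. */
uint64_t tcp_buffer_estimate_kib(uint64_t connections,
				 uint64_t rmem_kib, uint64_t wmem_kib)
{
	return connections * (rmem_kib + wmem_kib);
}

/* Convert a page count, as reported for TCP memory, to KiB. */
uint64_t pages_to_kib(uint64_t pages)
{
	return pages * PAGE_KIB;
}
```

With these inputs, tcp_buffer_estimate_kib(2000, 32, 32) gives 128000 KiB
(about 125 MiB) and pages_to_kib(34693) gives 138772 KiB (about 135.5 MiB), so
the measured TCP memory is within roughly 10 MiB of the estimate.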

Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-06 Thread Eric Dumazet


Zhao Xiaoming a écrit :

On 11/6/06, Eric Dumazet [EMAIL PROTECTED] wrote:
In fact, your kernel has CONFIG_4KSTACKS, kernel thread stacks use 4K instead
of 8K.

If you want to increase LOWMEM, (and keep 32bits kernel), you can chose a
2G/2G user/kernel split, instead of the 3G/1G default split.
(see config : CONFIG_VMSPLIT_2G)

Eric



Thank you for your advice. I know increase LOMEM could be help, but
now my concern is why I lose my 500M bytes memory after excluding all
known memory cost.


Unfortunately you don't provide very many details.
AFAIK you didn't even say which version of linux you run, or which programs
you run...

You keep asking where you 'lost' your memory; it's quite bugging.
Maybe some Oracles on this list will see the light for you, before exchanging 
100 mails with you?



Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-06 Thread Zhao Xiaoming


The latest update:
   It seems that the Linux kernel memory management mechanisms, including
the buddy and slab algorithms, are not very efficient under my test
conditions, where the TCP stack requires a lot of (hundreds of MB) packet
buffers and releases them very frequently.
   Here is the evidence. After changing my kernel configuration to use the
2G/2G VM split, LOWMEM consumption dropped to 270M bytes, compared with
640M bytes on the default 3G/1G kernel. All test conditions are the same and
the memory pages allocated by the TCP stack are also the same, 34K ~ 38K
pages. In other words, the 'lost' memory changed from ~500M to ~130M.
Thus, I can only guess that having many more free pages makes
the slab/buddy algorithms more efficient and wastes less memory.
   Finally I got what I wanted. Thank you all for your help and advice.

Xiaoming.

Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-06 Thread Zhao Xiaoming


On 11/7/06, Eric Dumazet [EMAIL PROTECTED] wrote:

Zhao Xiaoming a écrit :
 On 11/6/06, Eric Dumazet [EMAIL PROTECTED] wrote:
 In fact, your kernel has CONFIG_4KSTACKS, kernel thread stacks use 4K
 instead
 of 8K.

 If you want to increase LOWMEM, (and keep 32bits kernel), you can chose a
 2G/2G user/kernel split, instead of the 3G/1G default split.
 (see config : CONFIG_VMSPLIT_2G)

 Eric

 Thank you for your advice. I know increase LOMEM could be help, but
 now my concern is why I lose my 500M bytes memory after excluding all
 known memory cost.

Unfortunatly you dont provide very much details.
AFAIK you didnt even gave whcih version of linux you run, which programs you
run...
You keep answering where you 'lost' your mem, it's quite buging.
Maybe some Oracles on this list will see the light for you, before exchanging
100 mails with you ?



I think I already gave the kernel version and introduced my application
in the first post. What further details do you want? The reason I
keep asking about the 'lost mem' is that I want to focus on the problem,
not on workarounds that may lead to further problems if I keep
increasing the number of concurrent connections.
Anyway, since the problem is already solved (see my last post), I'd
like to thank you for the help.

Xiaoming.

Re: [PATCH 2/3] add dev_to_node()

2006-11-06 Thread Ravikiran G Thirumalai

On Sun, Nov 05, 2006 at 12:53:23AM +0100, Christoph Hellwig wrote:
> On Sat, Nov 04, 2006 at 06:06:48PM -0500, Dave Jones wrote:
> > On Sat, Nov 04, 2006 at 11:56:29PM +0100, Christoph Hellwig wrote:
> > 
> > This will break the compile for !NUMA if someone ends up doing a bisect
> > and lands here as a bisect point.
> > 
> > You introduce this nice wrapper..
> 
> The dev_to_node wrapper is not enough as we can't assign to (-1) for
> the non-NUMA case.  So I added a second macro, set_dev_node for that.
> 
> The patch below compiles and works on numa and non-NUMA platforms.
> 
> 

Hi Christoph,
dev_to_node does not work as expected on x86_64 (and i386).  This is because
the node value returned by pcibus_to_node is initialized after a struct
device is created with the current x86_64 code.

We need the node value initialized before the call to pci_scan_bus_parented,
as the generic devices are allocated and initialized
from pci_scan_child_bus, which gets called from pci_scan_bus_parented.
The following patch does that using the pci_sysdata introduced by the PCI
domain patches in -mm.

Signed-off-by: Alok N Kataria [EMAIL PROTECTED]
Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.19-rc4mm2/arch/i386/pci/acpi.c
===================================================================
--- linux-2.6.19-rc4mm2.orig/arch/i386/pci/acpi.c	2006-11-06 11:03:50.0 -0800
+++ linux-2.6.19-rc4mm2/arch/i386/pci/acpi.c	2006-11-06 22:04:14.0 -0800
@@ -9,6 +9,7 @@ struct pci_bus * __devinit pci_acpi_scan
 {
 	struct pci_bus *bus;
 	struct pci_sysdata *sd;
+	int pxm;
 
 	/* Allocate per-root-bus (not per bus) arch-specific data.
 	 * TODO: leak; this memory is never freed.
@@ -30,15 +31,21 @@ struct pci_bus * __devinit pci_acpi_scan
 	}
 #endif /* CONFIG_PCI_DOMAINS */
 
+	sd->node = -1;
+
+	pxm = acpi_get_pxm(device->handle);
+#ifdef CONFIG_ACPI_NUMA
+	if (pxm >= 0)
+		sd->node = pxm_to_node(pxm);
+#endif
+
 	bus = pci_scan_bus_parented(NULL, busnum, &pci_root_ops, sd);
 	if (!bus)
 		kfree(sd);
 
 #ifdef CONFIG_ACPI_NUMA
 	if (bus != NULL) {
-		int pxm = acpi_get_pxm(device->handle);
 		if (pxm >= 0) {
-			sd->node = pxm_to_node(pxm);
 			printk("bus %d -> pxm %d -> node %d\n",
 				busnum, pxm, sd->node);
 		}

TCP stack sometimes loses ACKs ... or something

2006-11-06 Thread Neil Brown


I upgraded my notebook from 2.6.16 to 2.6.18 recently and noticed that
I couldn't talk to my VOIP device (which has a WEB interface).
Watching traffic I see the three-way-handshake working perfectly, and
then the first data packet is sent (a partial HTTP request: 
GET / HTTP/1.1 ) and an ACK comes back from the device.
Then the next data packet (remainder of the HTTP request) is sent, but
tcpdump never sees the ACK, nor does the TCP stack.  So the data gets
resent repeatedly.  No ACK. Ever.

With 2.6.16, the ACK comes back just fine and the connection proceeds
as you would expect.

As it was a very reproducible problem I decided to try git bisect
and found 

 bad: [7b4f4b5ebceab67ce440a61081a69f0265e17c2a] [TCP]: Set default max buffers from memory pool size

I double checked as this seemed a fairly unlikely patch to cause the
problem, but this definitely is it.
The net effect of this patch is to change the last of the three
numbers in 
cat /proc/sys/net/ipv4/tcp_[rw]mem 
from well below 2^20 to well above. 2^20 seems to be a significant
number. I set tcp_wmem to that and the ACK was lost.  I set it to
one less and the first ACK (at least) was accepted.
I ended up setting both r and w to 10 and everything is fine.

Exploring more deeply, and comparing:
  - a failing connection (to VOIP box, [rw]mem large)
  - a working connection to VOIP box ([rw]mem small)
  - a working connection to another machine ([rw]mem irrelevant).
I find:

  The VOIP box returns MSS=1360 in the SYN/ACK packet.  The other machine
  returns MSS=1460.

  The ack that is getting lost contains data as well as the
  ACK. i.e. the same packet that ACKs at the TCP level includes the
  HTTP level reply.
  The matching ACK from the other machine (some Linux 2.6.8 I think)
   is a data-less ACK followed very quickly by the HTTP reply in
   a separate packet.

  The 'Timestamps' option coming back from the VOIP box is a little
  odd.  The Timestamp in the SYN/ACK is the same as the timestamp in
  the next ACK (the ack for the first partial HTTP request).
  The Timestamp in the next packet which is the one that gets lost has
  exactly the same TSval as previous packets, and TSecr is one more
  than in the previous packet.

I assume that one (or more) of these differences combined with the
large tcp_[rw]mem value cause the packet loss, but I have no idea
which.

Help?

I can make the tcp traces available if needed, but these are really
the only non-trivial differences.

I'm willing to test patches.

NeilBrown

Re: [PATCH 2.6.19-rc4-git10][PKT_SCHED] sch_htb: INIT_HLIST_NODE after hlist_del()

2006-11-06 Thread Jarek Poplawski

On Mon, Nov 06, 2006 at 09:44:49AM -0800, Stephen Hemminger wrote:
 On Mon, 6 Nov 2006 12:33:53 +0100
 Jarek Poplawski [EMAIL PROTECTED] wrote:
 
  After hlist_del() next and pprev pointers are not NULL
  so hlist_unhashed() doesn't work properly.
  
  
  Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]
  ---
  
  
  diff -Nurp linux-2.6.19-rc4-git10-/net/sched/sch_htb.c linux-2.6.19-rc4-git10/net/sched/sch_htb.c
  --- linux-2.6.19-rc4-git10-/net/sched/sch_htb.c	2006-11-06 11:42:41.0 +0100
  +++ linux-2.6.19-rc4-git10/net/sched/sch_htb.c	2006-11-06 11:53:15.0 +0100
  @@ -1284,8 +1284,10 @@ static void htb_destroy_class(struct Qdi
  				  struct htb_class, sibling));
   
  	/* note: this delete may happen twice (see htb_delete) */
  -	if (!hlist_unhashed(&cl->hlist))
  +	if (!hlist_unhashed(&cl->hlist)) {
  		hlist_del(&cl->hlist);
  +		INIT_HLIST_NODE(&cl->hlist);
  +	}
 
 why not use hlist_del_init?

Yes, this is the question!

As a matter of fact I expected another question. Yesterday
I was short on time so I didn't describe the bug well enough.
I'm not sure if you know the problem, so here are more
details (for me the problem is 199% repeatable).

After something like this:

# tc qdisc add dev lo root handle 1: htb
# tc class add dev lo parent 1: classid 1:1 htb rate 200kbps
# tc class del dev lo classid 1:1

enter the BUG...

I've found the last command is the culprit and if you do:

# tc qdisc del dev lo root
there is no problem.

And probably it is enough to do the change only in htb_delete
- btw. is this hlist_del really needed there? and shouldn't
all deletions be done after zeroing the refcount? - but you
should know better. 

 
  	list_del(&cl->sibling);
   
  	if (cl->prio_activity)
  @@ -1333,8 +1335,10 @@ static int htb_delete(struct Qdisc *sch,
  	sch_tree_lock(sch);
   
  	/* delete from hash and active; remainder in destroy_class */
  -	if (!hlist_unhashed(&cl->hlist))
  +	if (!hlist_unhashed(&cl->hlist)) {
  		hlist_del(&cl->hlist);
  +		INIT_HLIST_NODE(&cl->hlist);
  +	}
   
  	if (cl->prio_activity)
  		htb_deactivate(q, cl);

Best regards,

Jarek P.
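
For readers following the hlist_del() versus hlist_del_init() question in
this message, here is a minimal userspace sketch of the behaviour under
discussion. It assumes, as the patch description says, that hlist_del()
leaves non-NULL values in next and pprev. The struct layout and helper names
mirror the kernel's list helpers but are re-implemented here purely for
illustration, and the poison constants are placeholders rather than the
kernel's actual values.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of an hlist node; not the kernel's own code. */
struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;
};

struct hlist_head {
    struct hlist_node *first;
};

/* Placeholder poison values (hypothetical; only "non-NULL" matters here). */
#define POISON_NEXT  ((struct hlist_node *)0x100)
#define POISON_PPREV ((struct hlist_node **)0x200)

static void init_hlist_node(struct hlist_node *n)
{
    n->next = NULL;
    n->pprev = NULL;
}

/* A node counts as "unhashed" only when its pprev pointer is NULL. */
static int hlist_unhashed(const struct hlist_node *n)
{
    return n->pprev == NULL;
}

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    struct hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Unlink n from its chain without touching n's own pointers. */
static void unlink_node(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    *pprev = next;
    if (next)
        next->pprev = pprev;
}

/* Plain delete: pointers are poisoned, so the node still looks hashed. */
static void hlist_del(struct hlist_node *n)
{
    unlink_node(n);
    n->next = POISON_NEXT;
    n->pprev = POISON_PPREV;
}

/* Delete and reinitialize: safe to call again on the same node. */
static void hlist_del_init(struct hlist_node *n)
{
    if (!hlist_unhashed(n)) {
        unlink_node(n);
        init_hlist_node(n);
    }
}
```

In this model a node removed with hlist_del() still reports as hashed, so a
second "if (!hlist_unhashed(...)) hlist_del(...)" would write through the
poison pointer. A node removed with hlist_del_init() reports as unhashed and
the second call becomes a no-op.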

Re: [PATCH] don't use highmem in tcp hash size calculation

2006-11-06 Thread David Miller


Thanks very much for catching this John, patch applied.

Guess what?  Nobody uses HASH_HIGHMEM after this change, and
frankly I can't think of any valid use of it besides perhaps
something such as a page cache hash table but that's irrelevant
since we use a per-object tree data structure for that these
days.

We should probably kill off HASH_HIGHMEM.

Re: Zero checksum in netconsole/netdump packets

2006-11-06 Thread David Miller

From: Chris Lalancette [EMAIL PROTECTED]
Date: Mon, 06 Nov 2006 18:40:59 -0500

  Assuming that this is just an oversight, attached is a simple
  patch to compute the UDP checksum in netpoll_send_udp.

If the resulting checksum is zero, you should set it to
all 1's, like the real UDP code does.
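
A minimal sketch of that rule, for illustration only. In UDP over IPv4 a
transmitted checksum field of zero means "no checksum was computed", so a
computed value of zero is sent as 0xFFFF instead. The sketch assumes the
caller has already assembled the pseudo-header, UDP header and payload into
one buffer; the function names are hypothetical and are not the ones used by
netpoll or the UDP code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones'-complement Internet checksum over a buffer of big-endian
 * 16-bit words; an odd trailing byte is padded with a zero byte. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (uint32_t)data[len - 1] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}

/* Value to place in the UDP checksum field: a computed zero is
 * replaced by all ones so it is not mistaken for "no checksum". */
static uint16_t udp_checksum_field(const uint8_t *data, size_t len)
{
    uint16_t csum = inet_checksum(data, len);

    return csum == 0 ? 0xffff : csum;
}
```

A buffer whose words sum to 0xFFFF is the case that matters: inet_checksum()
returns 0 for it, and udp_checksum_field() returns 0xFFFF.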

Re: TCP stack sometimes loses ACKs ... or something

2006-11-06 Thread Stephen Hemminger


Neil Brown wrote:

I upgraded my notebook from 2.6.16 to 2.6.18 recently and noticed that
I couldn't talk to my VOIP device (which has a WEB interface).
Watching traffic I see the three-way-handshake working perfectly, and
then the first data packet is sent (a partial HTTP request: 
GET / HTTP/1.1 ) and an ACK comes back from the device.

Then the next data packet (remainder of the HTTP request) is sent, but
tcpdump never sees the ACK, nor does the TCP stack.  So the data gets
resent repeatedly.  No ACK. Ever.

With 2.6.16, the ACK comes back just fine and the connection proceeds
as you would expect.

As it was a very reproducible problem I decided to try git bisect
and found 


 bad: [7b4f4b5ebceab67ce440a61081a69f0265e17c2a] [TCP]: Set default max buffers from memory pool size

I double checked as this seemed a fairly unlikely patch to cause the
problem, but this definitely is it.
The net effect of this patch is to change the last of the three
numbers in 
cat /proc/sys/net/ipv4/tcp_[rw]mem 
from well below 2^20 to well above. 2^20 seems to be a significant
number. I set tcp_wmem to that and the ACK was lost.  I set it to
one less and the first ACK (at least) was accepted.
I ended up setting both r and w to 10 and everything is fine.

Exploring more deeply, and comparing:
  - a failing connection (to VOIP box, [rw]mem large)
  - a working connection to VOIP box ([rw]mem small)
  - a working connection to another machine ([rw]mem irrelevant).
I find:

  The VOIP box returns MSS=1360 in the SYN/ACK packet.  The other machine
  returns MSS=1460.

  The ack that is getting lost contains data as well as the
  ACK. i.e. the same packet that ACKs at the TCP level includes the
  HTTP level reply.
  The matching ACK from the other machine (some Linux 2.6.8 I think)
   is a data-less ACK followed very quickly by the HTTP reply in
   a separate packet.

  The 'Timestamps' option coming back from the VOIP box is a little
  odd.  The Timestamp in the SYN/ACK is the same as the timestamp in
  the next ACK (the ack for the first partial HTTP request).
  The Timestamp in the next packet which is the one that gets lost has
  exactly the same TSval as previous packets, and TSecr is one more
  than in the previous packet.

I assume that one (or more) of these differences combined with the
large tcp_[rw]mem value cause the packet loss, but I have no idea
which.

Help?

I can make the tcp traces available if needed, but these are really
the only non-trivial differences.

I'm willing to test patches.

NeilBrown
  


You almost certainly have a window-scale-corrupting firewall in your path.
See http://lwn.net/Articles/92727/

2.6.18 increased the maximum window size, so it aggravated a pre-existing
condition in your network. You can turn off window scaling globally
(with sysctl) or use a per-route congestion window limit.

It could also be that the VOIP application is getting aggravated by TCP ABC.
That can be turned off with sysctl (net.ipv4.tcp_abc=0).
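
To make the link between buffer maximums and window scaling concrete, here
is a small sketch of how a window-scale shift could be derived from a
maximum receive buffer size. It is modeled on a reading of the kernel's
initial-window selection and should be treated as an assumption, not as the
kernel's code; wscale_for() is a hypothetical name. The cap of 14 is the
largest shift the TCP window scale option permits.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest shift (capped at 14) that lets a 16-bit window field
 * describe max_buffer bytes after halving once per shift step. */
static unsigned int wscale_for(uint32_t max_buffer)
{
    unsigned int wscale = 0;
    uint32_t space = max_buffer;

    while (space > 65535 && wscale < 14) {
        space >>= 1;
        wscale++;
    }
    return wscale;
}
```

Under that assumption a maximum of 2^20 - 1 bytes gives a shift of 4 and a
maximum of 2^20 bytes gives a shift of 5. That would be consistent with the
threshold reported earlier in this thread, although the thread itself does
not confirm that this is the mechanism.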





Re: TCP stack sometimes loses ACKs ... or something

2006-11-06 Thread David Miller


Window scaling... there is some intermediate device which is
trying to prevent out of window segments from passing through,
but it is not taking the negotiated window scale into account.
So it thinks that segments are outside of the window, when they
are not.
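
The failure mode described above can be sketched as follows. This is an
illustrative model, not code from any real firewall: the function names are
hypothetical, and the check is reduced to "does this sequence number fall
inside the window the receiver advertised". The endpoint applies the
negotiated shift to the 16-bit window field; the broken middlebox uses the
raw field.

```c
#include <assert.h>
#include <stdint.h>

/* True if seq lies in [win_start, win_start + win_len), using
 * modulo-2^32 sequence arithmetic. */
static int seq_in_window(uint32_t seq, uint32_t win_start, uint32_t win_len)
{
    return (uint32_t)(seq - win_start) < win_len;
}

/* Window in bytes once the negotiated scale (0..14) is applied. */
static uint32_t effective_window(uint16_t raw_window, unsigned int wscale)
{
    return (uint32_t)raw_window << wscale;
}

/* What the receiving endpoint accepts. */
static int endpoint_accepts(uint32_t seq, uint32_t ack,
                            uint16_t raw_window, unsigned int wscale)
{
    return seq_in_window(seq, ack, effective_window(raw_window, wscale));
}

/* What a middlebox accepts if it ignores the window scale option. */
static int naive_middlebox_accepts(uint32_t seq, uint32_t ack,
                                   uint16_t raw_window)
{
    return seq_in_window(seq, ack, raw_window);
}
```

For example, with a raw window field of 183 and a shift of 5 the receiver is
really offering 5856 bytes. A segment starting 1360 bytes past the
acknowledged point is well inside that window, but the naive check compares
1360 against 183 and drops it.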
