Re: skge dysfunction on Amd X2 machine with 4GB memory
On Sun, Feb 11, 2007 at 04:57:55PM +0200, Matti Aarnio wrote:
> With the skge driver there seems to be some sort of problem working
> in a system with memory above the 4 GB of PCI address space.

The chipset (apparently) doesn't handle bus addresses above 4GB even though the MAC does. I guess the right way to fix this in the long term is to detect systems with these chipsets and mask the dma_mask globally (or, if you're clever, per bus)?
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 8042] New: Cisco VPN Client cannot connect using TCP with Intel 82573L NIC
On Mon, 19 Feb 2007 15:55:19 -0800 [EMAIL PROTECTED] wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=8042
>
>            Summary: Cisco VPN Client cannot connect using TCP with Intel
>                     82573L NIC
>     Kernel Version: 2.6.18.6
>             Status: NEW
>           Severity: normal
>              Owner: [EMAIL PROTECTED]
>          Submitter: [EMAIL PROTECTED]
>
> Most recent kernel where this bug did *NOT* occur: -
> Distribution: Ubuntu, Debian
> Hardware Environment: Lenovo Thinkpad T60p
> Software Environment: -
> Problem Description:
>
> I have an issue with the Cisco VPN client
> (vpnclient-linux-x86_64-4.8.00.0490-k9.tar.gz) that appears to be related to
> packet fragmentation and the e1000 driver (the hardware is an 82573L; I don't
> believe this issue affects earlier chips).
>
> When I try to connect to a VPN using Cisco's TCP tunneling feature, I am
> unable to connect to the VPN concentrator.
>
> If I recompile the e1000 module, setting the option
>
> CONFIG_E1000_DISABLE_PACKET_SPLIT=y
>
> then I am able to connect without issue.
>
> I have experienced this problem with the following kernels:
>
> ubuntu edgy 2.6.16-11-generic
> debian sid 2.6.18-4-686 (based on 2.6.18.6 w/hand-picked later patches)
> kernel.org 2.6.18.6
>
> There was a perhaps-related bug resolved for UDP recently; see this changelog
> entry:
>
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=753eab76a3337863a0d86ce045fa4eb6c3cbeef9
>
> You can also see some discussion surrounding the issue (I had initially
> believed it was related to another issue with the 82573L), starting from this
> comment:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=6929#c9
>
> Please let me know if there is anything else I can do to better explain the
> problem.
> Steps to reproduce:
>
> It's not possible to reproduce this issue without:
>
> - An 82573L-based network card
> - A Cisco VPN concentrator you can access using TCP tunneling
> - The Cisco VPN client ()
>
> I have all of these, and would be more than pleased to reproduce the problem,
> provide packet captures, etc. If you want to reproduce the problem yourself,
> and have the above equipment, try to open a TCP-encapsulated connection to
> the VPN concentrator; you should not be able to unless you have compiled
> e1000 with CONFIG_E1000_DISABLE_PACKET_SPLIT=y.
Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function
Angelo P. Castellani wrote:
> John Heffner wrote:
>> Note the patch is compile-tested only!
>
> I can do some real testing if you'd like to apply this, Dave.
>
> The date you read on the patch is due to the fact that I've split this
> patchset into 2 diff files. This isn't compile-tested only; I've used this
> piece of code for about 3 months. Sorry for the confusion.

The patch I attached to my message was compile-tested only.

Thanks,
  -John
Re: [RFC][PATCH][IPSEC][2/3] IPv6 over IPv4 IPsec tunnel
Hi,

A further fix is needed for __xfrm6_bundle_create().

Signed-off-by: Noriaki TAKAMIYA <[EMAIL PROTECTED]>
Acked-by: Masahide NAKAMURA <[EMAIL PROTECTED]>

--
Fixed to set fl_tunnel.fl6_src correctly in __xfrm6_bundle_create().
---
 net/ipv6/xfrm6_policy.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index b1133f2..d8a585b 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -189,7 +189,7 @@ __xfrm6_bundle_create(struct xfrm_policy
 		case AF_INET6:
 			ipv6_addr_copy(&fl_tunnel.fl6_dst, __xfrm6_bundle_addr_remote(xfrm[i], &fl->fl6_dst));
-			ipv6_addr_copy(&fl_tunnel.fl6_src, __xfrm6_bundle_addr_remote(xfrm[i], &fl->fl6_src));
+			ipv6_addr_copy(&fl_tunnel.fl6_src, __xfrm6_bundle_addr_local(xfrm[i], &fl->fl6_src));
 			break;
 		default:
 			BUG_ON(1);

--
Noriaki TAKAMIYA
[PATCH 3/3] remove irq_sem from ixgb
From: Chris Snook <[EMAIL PROTECTED]>

Remove irq_sem from ixgb. Currently untested, but similar to tested patches on atl1 and e1000.

Signed-off-by: Chris Snook <[EMAIL PROTECTED]>

--
diff -urp linux-2.6.20-git14.orig/drivers/net/ixgb/ixgb.h linux-2.6.20-git14/drivers/net/ixgb/ixgb.h
--- linux-2.6.20-git14.orig/drivers/net/ixgb/ixgb.h	2007-02-19 14:32:16.0 -0500
+++ linux-2.6.20-git14/drivers/net/ixgb/ixgb.h	2007-02-19 15:04:50.0 -0500
@@ -161,7 +161,6 @@ struct ixgb_adapter {
 	uint16_t link_speed;
 	uint16_t link_duplex;
 	spinlock_t tx_lock;
-	atomic_t irq_sem;

 	struct work_struct tx_timeout_task;
 	struct timer_list blink_timer;
diff -urp linux-2.6.20-git14.orig/drivers/net/ixgb/ixgb_main.c linux-2.6.20-git14/drivers/net/ixgb/ixgb_main.c
--- linux-2.6.20-git14.orig/drivers/net/ixgb/ixgb_main.c	2007-02-19 14:32:16.0 -0500
+++ linux-2.6.20-git14/drivers/net/ixgb/ixgb_main.c	2007-02-19 15:06:52.0 -0500
@@ -201,7 +201,6 @@ module_exit(ixgb_exit_module);
 static void
 ixgb_irq_disable(struct ixgb_adapter *adapter)
 {
-	atomic_inc(&adapter->irq_sem);
 	IXGB_WRITE_REG(&adapter->hw, IMC, ~0);
 	IXGB_WRITE_FLUSH(&adapter->hw);
 	synchronize_irq(adapter->pdev->irq);
@@ -215,12 +214,10 @@ ixgb_irq_disable(struct ixgb_adapter *ad
 static void
 ixgb_irq_enable(struct ixgb_adapter *adapter)
 {
-	if(atomic_dec_and_test(&adapter->irq_sem)) {
-		IXGB_WRITE_REG(&adapter->hw, IMS,
-			       IXGB_INT_RXT0 | IXGB_INT_RXDMT0 | IXGB_INT_TXDW |
-			       IXGB_INT_LSC);
-		IXGB_WRITE_FLUSH(&adapter->hw);
-	}
+	IXGB_WRITE_REG(&adapter->hw, IMS,
+		       IXGB_INT_RXT0 | IXGB_INT_RXDMT0 | IXGB_INT_TXDW |
+		       IXGB_INT_LSC);
+	IXGB_WRITE_FLUSH(&adapter->hw);
 }

 int
@@ -584,7 +581,6 @@ ixgb_sw_init(struct ixgb_adapter *adapte
 	/* enable flow control to be programmed */
 	hw->fc.send_xon = 1;

-	atomic_set(&adapter->irq_sem, 1);
 	spin_lock_init(&adapter->tx_lock);

 	return 0;
@@ -1755,7 +1751,6 @@ ixgb_intr(int irq, void *data)
 	   of the posted write is intentionally left out.
 	 */
-	atomic_inc(&adapter->irq_sem);
 	IXGB_WRITE_REG(&adapter->hw, IMC, ~0);
 	__netif_rx_schedule(netdev);
 }
[PATCH 2/3] remove irq_sem from e1000
From: Chris Snook <[EMAIL PROTECTED]>

Remove unnecessary irq_sem accounting from e1000. Tested with no problems.

Signed-off-by: Chris Snook <[EMAIL PROTECTED]>

--
diff -urp linux-2.6.20-git14.orig/drivers/net/e1000/e1000.h linux-2.6.20-git14/drivers/net/e1000/e1000.h
--- linux-2.6.20-git14.orig/drivers/net/e1000/e1000.h	2007-02-19 14:32:15.0 -0500
+++ linux-2.6.20-git14/drivers/net/e1000/e1000.h	2007-02-19 15:07:37.0 -0500
@@ -252,7 +252,6 @@ struct e1000_adapter {
 #ifdef CONFIG_E1000_NAPI
 	spinlock_t tx_queue_lock;
 #endif
-	atomic_t irq_sem;
 	unsigned int total_tx_bytes;
 	unsigned int total_tx_packets;
 	unsigned int total_rx_bytes;
diff -urp linux-2.6.20-git14.orig/drivers/net/e1000/e1000_main.c linux-2.6.20-git14/drivers/net/e1000/e1000_main.c
--- linux-2.6.20-git14.orig/drivers/net/e1000/e1000_main.c	2007-02-19 14:32:15.0 -0500
+++ linux-2.6.20-git14/drivers/net/e1000/e1000_main.c	2007-02-19 15:09:28.0 -0500
@@ -349,7 +349,6 @@ static void e1000_free_irq(struct e1000_
 static void
 e1000_irq_disable(struct e1000_adapter *adapter)
 {
-	atomic_inc(&adapter->irq_sem);
 	E1000_WRITE_REG(&adapter->hw, IMC, ~0);
 	E1000_WRITE_FLUSH(&adapter->hw);
 	synchronize_irq(adapter->pdev->irq);
@@ -363,10 +362,8 @@ e1000_irq_disable(struct e1000_adapter *
 static void
 e1000_irq_enable(struct e1000_adapter *adapter)
 {
-	if (likely(atomic_dec_and_test(&adapter->irq_sem))) {
-		E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK);
-		E1000_WRITE_FLUSH(&adapter->hw);
-	}
+	E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK);
+	E1000_WRITE_FLUSH(&adapter->hw);
 }

 static void
@@ -1336,7 +1333,6 @@ e1000_sw_init(struct e1000_adap
 	spin_lock_init(&adapter->tx_queue_lock);
 #endif

-	atomic_set(&adapter->irq_sem, 1);
 	spin_lock_init(&adapter->stats_lock);

 	set_bit(__E1000_DOWN, &adapter->flags);
@@ -3758,11 +3754,6 @@ e1000_intr_msi(int irq, void *data)
 #endif
 	uint32_t icr = E1000_READ_REG(hw, ICR);

-#ifdef CONFIG_E1000_NAPI
-	/* read ICR disables interrupts using IAM, so keep up with our
-	 * enable/disable accounting */
-	atomic_inc(&adapter->irq_sem);
-#endif
 	if (icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC)) {
 		hw->get_link_status = 1;
 		/* 80003ES2LAN workaround-- For packet buffer work-around on
@@ -3832,13 +3823,6 @@ e1000_intr(int irq, void *data)
 	if (unlikely(hw->mac_type >= e1000_82571 &&
 		     !(icr & E1000_ICR_INT_ASSERTED)))
 		return IRQ_NONE;
-
-	/* Interrupt Auto-Mask...upon reading ICR,
-	 * interrupts are masked. No need for the
-	 * IMC write, but it does mean we should
-	 * account for it ASAP. */
-	if (likely(hw->mac_type >= e1000_82571))
-		atomic_inc(&adapter->irq_sem);
 #endif

 	if (unlikely(icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))) {
@@ -3862,7 +3846,6 @@ e1000_intr(int irq, void *data)
 #ifdef CONFIG_E1000_NAPI
 	if (unlikely(hw->mac_type < e1000_82571)) {
 		/* disable interrupts, without the synchronize_irq bit */
-		atomic_inc(&adapter->irq_sem);
 		E1000_WRITE_REG(hw, IMC, ~0);
 		E1000_WRITE_FLUSH(hw);
 	}
@@ -3888,7 +3871,6 @@ e1000_intr(int irq, void *data)
 	 * de-assertion state.
 	 */
 	if (hw->mac_type == e1000_82547 || hw->mac_type == e1000_82547_rev_2) {
-		atomic_inc(&adapter->irq_sem);
 		E1000_WRITE_REG(hw, IMC, ~0);
 	}
Re: [PATCH 0/3] remove irq_sem cruft from e1000 and derivatives
Chris Snook wrote:
> Hey folks --
>
> While digging through the atl1 source, I was troubled by the code using
> irq_sem. I did some digging and found the same code in e1000 and ixgb. I'm
> not entirely sure what it was originally intended to do, but it doesn't seem
> to be doing anything useful now, except possibly locking interrupts off if
> NAPI is flipped on and off enough times to cause an integer overflow. The
> following patches completely remove irq_sem from each of the drivers. This
> has been tested successfully on atl1 and e1000. If someone would like to
> send me ixgb hardware I'd be glad to test that, otherwise you'll have to
> test it yourself. :)
>
> -- Chris

I'm not yet seeing patch 1/3 appear, but I'll certainly take a look at the patches and have them tested in our labs for e1000 and ixgb once they appear.

Cheers,
  Auke
[PATCH 1/3] remove irq_sem from atl1
From: Chris Snook <[EMAIL PROTECTED]>

Remove unnecessary irq_sem code from atl1 driver. Tested with no problems.

Signed-off-by: Chris Snook <[EMAIL PROTECTED]>
Signed-off-by: Jay Cliburn <[EMAIL PROTECTED]>

--
diff -urp linux-2.6.20-git14.orig/drivers/net/atl1/atl1.h linux-2.6.20-git14/drivers/net/atl1/atl1.h
--- linux-2.6.20-git14.orig/drivers/net/atl1/atl1.h	2007-02-19 14:32:15.0 -0500
+++ linux-2.6.20-git14/drivers/net/atl1/atl1.h	2007-02-19 15:10:07.0 -0500
@@ -236,7 +236,6 @@ struct atl1_adapter {
 	u16 link_speed;
 	u16 link_duplex;
 	spinlock_t lock;
-	atomic_t irq_sem;
 	struct work_struct tx_timeout_task;
 	struct work_struct link_chg_task;
 	struct work_struct pcie_dma_to_rst_task;
diff -urp linux-2.6.20-git14.orig/drivers/net/atl1/atl1_main.c linux-2.6.20-git14/drivers/net/atl1/atl1_main.c
--- linux-2.6.20-git14.orig/drivers/net/atl1/atl1_main.c	2007-02-19 14:32:15.0 -0500
+++ linux-2.6.20-git14/drivers/net/atl1/atl1_main.c	2007-02-19 15:10:44.0 -0500
@@ -163,7 +163,6 @@ static int __devinit atl1_sw_init(struct
 	hw->cmb_tx_timer = 1;	/* about 2us */
 	hw->smb_timer = 10;	/* about 200ms */

-	atomic_set(&adapter->irq_sem, 0);
 	spin_lock_init(&adapter->lock);
 	spin_lock_init(&adapter->mb_lock);

@@ -272,8 +271,7 @@ err_nomem:
 */
 static void atl1_irq_enable(struct atl1_adapter *adapter)
 {
-	if (likely(!atomic_dec_and_test(&adapter->irq_sem)))
-		iowrite32(IMR_NORMAL_MASK, adapter->hw.hw_addr + REG_IMR);
+	iowrite32(IMR_NORMAL_MASK, adapter->hw.hw_addr + REG_IMR);
 }

 static void atl1_clear_phy_int(struct atl1_adapter *adapter)
@@ -1205,7 +1203,6 @@ static u32 atl1_configure(struct atl1_ad
 */
 static void atl1_irq_disable(struct atl1_adapter *adapter)
 {
-	atomic_inc(&adapter->irq_sem);
 	iowrite32(0, adapter->hw.hw_addr + REG_IMR);
 	ioread32(adapter->hw.hw_addr + REG_IMR);
 	synchronize_irq(adapter->pdev->irq);
[PATCH 0/3] remove irq_sem cruft from e1000 and derivatives
Hey folks -- While digging through the atl1 source, I was troubled by the code using irq_sem. I did some digging and found the same code in e1000 and ixgb. I'm not entirely sure what it was originally intended to do, but it doesn't seem to be doing anything useful now, except possibly locking interrupts off if NAPI is flipped on and off enough times to cause an integer overflow. The following patches completely remove irq_sem from each of the drivers. This has been tested successfully on atl1 and e1000. If someone would like to send me ixgb hardware I'd be glad to test that, otherwise you'll have to test it yourself. :) -- Chris
[PATCH 3/3] forcedeth: fix checksum feature in mcp65
This patch removes the checksum offload feature from the MCP65 chipsets, as it is not supported in hardware.

Signed-Off-By: Ayaz Abdulla <[EMAIL PROTECTED]>

--- orig/drivers/net/forcedeth.c	2007-02-19 09:17:41.0 -0500
+++ new/drivers/net/forcedeth.c	2007-02-19 09:19:43.0 -0500
@@ -5374,19 +5374,19 @@
 	},
 	{	/* MCP65 Ethernet Controller */
 		PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_20),
-		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
+		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
 	},
 	{	/* MCP65 Ethernet Controller */
 		PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_21),
-		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
+		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
 	},
 	{	/* MCP65 Ethernet Controller */
 		PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_22),
-		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
+		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
 	},
 	{	/* MCP65 Ethernet Controller */
 		PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_23),
-		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
+		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
 	},
 	{	/* MCP67 Ethernet Controller */
 		PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_24),
Re: forcedeth problems on 2.6.20-rc6-mm3
Robert Hancock wrote:
> Ayaz Abdulla wrote:
>> For all those who are having issues, please try out the attached patch.
>>
>> Ayaz
>>
>> --- orig/drivers/net/forcedeth.c	2007-02-08 21:41:59.0 -0500
>> +++ new/drivers/net/forcedeth.c	2007-02-08 21:44:53.0 -0500
>> @@ -3104,13 +3104,17 @@
>>  	struct fe_priv *np = netdev_priv(dev);
>>  	u8 __iomem *base = get_hwbase(dev);
>>  	unsigned long flags;
>> +	u32 retcode;
>>
>> -	if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2)
>> +	if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) {
>>  		pkts = nv_rx_process(dev, limit);
>> -	else
>> +		retcode = nv_alloc_rx(dev);
>> +	} else {
>>  		pkts = nv_rx_process_optimized(dev, limit);
>> +		retcode = nv_alloc_rx_optimized(dev);
>> +	}
>>
>> -	if (nv_alloc_rx(dev)) {
>> +	if (retcode) {
>>  		spin_lock_irqsave(&np->lock, flags);
>>  		if (!np->in_shutdown)
>>  			mod_timer(&np->oom_kick, jiffies + OOM_REFILL);
>
> Did anyone push this patch into mainline? forcedeth on 2.6.20-git14 is
> still completely broken without this patch.

I have submitted the patch to the netdev mailing list.
[PATCH 2/3] forcedeth: disable msix
There seems to be an issue when both MSI-X is enabled and NAPI is configured. This patch disables MSI-X until the issue is root-caused.

Signed-Off-By: Ayaz Abdulla <[EMAIL PROTECTED]>

--- orig/drivers/net/forcedeth.c	2007-02-19 09:17:02.0 -0500
+++ new/drivers/net/forcedeth.c	2007-02-19 09:17:07.0 -0500
@@ -839,7 +839,7 @@
 	NV_MSIX_INT_DISABLED,
 	NV_MSIX_INT_ENABLED
 };
-static int msix = NV_MSIX_INT_ENABLED;
+static int msix = NV_MSIX_INT_DISABLED;

 /*
  * DMA 64bit
[PATCH 1/3] forcedeth: fixed missing call in napi poll
The napi poll routine was refilling the rx ring with the legacy allocator even on the optimized path. This patch adds the missing call to the optimized refill routine for that path.

Signed-Off-By: Ayaz Abdulla <[EMAIL PROTECTED]>

--- orig/drivers/net/forcedeth.c	2007-02-19 09:13:10.0 -0500
+++ new/drivers/net/forcedeth.c	2007-02-19 09:13:46.0 -0500
@@ -3104,13 +3104,17 @@
 	struct fe_priv *np = netdev_priv(dev);
 	u8 __iomem *base = get_hwbase(dev);
 	unsigned long flags;
+	u32 retcode;

-	if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2)
+	if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) {
 		pkts = nv_rx_process(dev, limit);
-	else
+		retcode = nv_alloc_rx(dev);
+	} else {
 		pkts = nv_rx_process_optimized(dev, limit);
+		retcode = nv_alloc_rx_optimized(dev);
+	}

-	if (nv_alloc_rx(dev)) {
+	if (retcode) {
 		spin_lock_irqsave(&np->lock, flags);
 		if (!np->in_shutdown)
 			mod_timer(&np->oom_kick, jiffies + OOM_REFILL);
Re: forcedeth problems on 2.6.20-rc6-mm3
Ayaz Abdulla wrote:
> For all those who are having issues, please try out the attached patch.
>
> Ayaz
>
> --- orig/drivers/net/forcedeth.c	2007-02-08 21:41:59.0 -0500
> +++ new/drivers/net/forcedeth.c	2007-02-08 21:44:53.0 -0500
> @@ -3104,13 +3104,17 @@
>  	struct fe_priv *np = netdev_priv(dev);
>  	u8 __iomem *base = get_hwbase(dev);
>  	unsigned long flags;
> +	u32 retcode;
>
> -	if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2)
> +	if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) {
>  		pkts = nv_rx_process(dev, limit);
> -	else
> +		retcode = nv_alloc_rx(dev);
> +	} else {
>  		pkts = nv_rx_process_optimized(dev, limit);
> +		retcode = nv_alloc_rx_optimized(dev);
> +	}
>
> -	if (nv_alloc_rx(dev)) {
> +	if (retcode) {
>  		spin_lock_irqsave(&np->lock, flags);
>  		if (!np->in_shutdown)
>  			mod_timer(&np->oom_kick, jiffies + OOM_REFILL);

Did anyone push this patch into mainline? forcedeth on 2.6.20-git14 is still completely broken without this patch.

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
[2.6 patch] kill net/rxrpc/rxrpc_syms.c
This patch moves the EXPORT_SYMBOLs from net/rxrpc/rxrpc_syms.c to the files containing the actual functions.

Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

---

This patch was already sent on:
- 26 Nov 2006

 net/rxrpc/Makefile     |    1 -
 net/rxrpc/call.c       |    5 +
 net/rxrpc/connection.c |    2 ++
 net/rxrpc/rxrpc_syms.c |   34 --
 net/rxrpc/transport.c  |    4
 5 files changed, 11 insertions(+), 35 deletions(-)

--- linux-2.6.19-rc6-mm1/net/rxrpc/Makefile.old	2006-11-26 04:49:25.0 +0100
+++ linux-2.6.19-rc6-mm1/net/rxrpc/Makefile	2006-11-26 04:50:08.0 +0100
@@ -12,7 +12,6 @@
 	krxtimod.o \
 	main.o \
 	peer.o \
-	rxrpc_syms.o \
 	transport.o

 ifeq ($(CONFIG_PROC_FS),y)
--- linux-2.6.19-rc6-mm1/net/rxrpc/call.c.old	2006-11-26 04:50:51.0 +0100
+++ linux-2.6.19-rc6-mm1/net/rxrpc/call.c	2006-11-26 04:51:58.0 +0100
@@ -314,6 +314,7 @@
 	_leave(" = %d", ret);
 	return ret;
 } /* end rxrpc_create_call() */
+EXPORT_SYMBOL(rxrpc_create_call);
 /*/
 /*
@@ -465,6 +466,7 @@
 	_leave(" [destroyed]");
 } /* end rxrpc_put_call() */
+EXPORT_SYMBOL(rxrpc_put_call);
 /*/
 /*
@@ -923,6 +925,7 @@
 	return __rxrpc_call_abort(call, error);
 } /* end rxrpc_call_abort() */
+EXPORT_SYMBOL(rxrpc_call_abort);
 /*/
 /*
@@ -1910,6 +1913,7 @@
 	}
 } /* end rxrpc_call_read_data() */
+EXPORT_SYMBOL(rxrpc_call_read_data);
 /*/
 /*
@@ -2076,6 +2080,7 @@
 	return ret;
 } /* end rxrpc_call_write_data() */
+EXPORT_SYMBOL(rxrpc_call_write_data);
 /*/
 /*
--- linux-2.6.19-rc6-mm1/net/rxrpc/connection.c.old	2006-11-26 04:52:08.0 +0100
+++ linux-2.6.19-rc6-mm1/net/rxrpc/connection.c	2006-11-26 04:52:32.0 +0100
@@ -207,6 +207,7 @@
 	spin_unlock(&peer->conn_gylock);
 	goto make_active;
 } /* end rxrpc_create_connection() */
+EXPORT_SYMBOL(rxrpc_create_connection);
 /*/
 /*
@@ -411,6 +412,7 @@
 	_leave(" [killed]");
 } /* end rxrpc_put_connection() */
+EXPORT_SYMBOL(rxrpc_put_connection);
 /*/
 /*
--- linux-2.6.19-rc6-mm1/net/rxrpc/transport.c.old	2006-11-26 04:52:43.0 +0100
+++ linux-2.6.19-rc6-mm1/net/rxrpc/transport.c	2006-11-26 04:53:36.0 +0100
@@ -146,6 +146,7 @@
 	_leave(" = %d", ret);
 	return ret;
 } /* end rxrpc_create_transport() */
+EXPORT_SYMBOL(rxrpc_create_transport);
 /*/
 /*
@@ -196,6 +197,7 @@
 	_leave("");
 } /* end rxrpc_put_transport() */
+EXPORT_SYMBOL(rxrpc_put_transport);
 /*/
 /*
@@ -231,6 +233,7 @@
 	_leave("= %d", ret);
 	return ret;
 } /* end rxrpc_add_service() */
+EXPORT_SYMBOL(rxrpc_add_service);
 /*/
 /*
@@ -248,6 +251,7 @@
 	_leave("");
 } /* end rxrpc_del_service() */
+EXPORT_SYMBOL(rxrpc_del_service);
 /*/
 /*
--- linux-2.6.19-rc6-mm1/net/rxrpc/rxrpc_syms.c	2006-09-20 05:42:06.0 +0200
+++ /dev/null	2006-09-19 00:45:31.0 +0200
@@ -1,34 +0,0 @@
-/* rxrpc_syms.c: exported Rx RPC layer interface symbols
- *
- * Copyright (C) 2002 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells ([EMAIL PROTECTED])
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-
-#include
-
-#include
-#include
-#include
-#include
-
-/* call.c */
-EXPORT_SYMBOL(rxrpc_create_call);
-EXPORT_SYMBOL(rxrpc_put_call);
-EXPORT_SYMBOL(rxrpc_call_abort);
-EXPORT_SYMBOL(rxrpc_call_read_data);
-EXPORT_SYMBOL(rxrpc_call_write_data);
-
-/* connection.c */
-EXPORT_SYMBOL(rxrpc_create_connection);
-EXPORT_SYMBOL(rxrpc_put_connection);
-
-/* transport.c */
-EXPORT_SYMBOL(rxrpc_create_transport);
-EXPORT_SYMBOL(rxrpc_put_transport);
-EXPORT_SYMBOL(rxrpc_add_service);
-EXPORT_SYMBOL(rxrpc_del_service);
[-mm patch] drivers/net/vioc/: possible cleanups
On Thu, Feb 15, 2007 at 05:14:08AM -0800, Andrew Morton wrote: >... > Changes since 2.6.20-rc6-mm3: >... > +Fabric7-VIOC-driver.patch >... > netdev stuff >... This patch contains the following possible cleanups: - remove dead #ifdef EXPORT_SYMTAB code - no "inline" functions in C files - gcc knows best whether or not to inline static functions - move the vioc_ethtool_ops prototype to a header file - make needlessly global code static - #if 0 unused code - vioc_irq.c: remove the unused vioc_driver_lock Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> --- drivers/net/vioc/f7/sppapi.h |2 - drivers/net/vioc/khash.h | 10 - drivers/net/vioc/spp.c| 60 +++--- drivers/net/vioc/vioc_api.c |5 ++ drivers/net/vioc/vioc_api.h |3 + drivers/net/vioc/vioc_driver.c|3 - drivers/net/vioc/vioc_ethtool.c |2 - drivers/net/vioc/vioc_irq.c |9 +--- drivers/net/vioc/vioc_provision.c | 16 +--- drivers/net/vioc/vioc_receive.c |2 - drivers/net/vioc/vioc_spp.c |6 +-- drivers/net/vioc/vioc_transmit.c | 40 +++- 12 files changed, 85 insertions(+), 73 deletions(-) --- linux-2.6.20-mm1/drivers/net/vioc/vioc_driver.c.old 2007-02-18 01:14:31.0 +0100 +++ linux-2.6.20-mm1/drivers/net/vioc/vioc_driver.c 2007-02-18 01:14:35.0 +0100 @@ -868,6 +868,3 @@ module_init(vioc_module_init); module_exit(vioc_module_exit); -#ifdef EXPORT_SYMTAB -EXPORT_SYMBOL(vioc_viocdev); -#endif /* EXPORT_SYMTAB */ --- linux-2.6.20-mm1/drivers/net/vioc/khash.h.old 2007-02-18 01:16:28.0 +0100 +++ linux-2.6.20-mm1/drivers/net/vioc/khash.h 2007-02-18 01:25:29.0 +0100 @@ -52,14 +52,4 @@ }; -struct shash_t *hashT_create(u32, size_t, size_t, u32(*)(unsigned char *, unsigned long), int(*)(void *, void *), unsigned int); -int hashT_delete(struct shash_t * , void *); -struct hash_elem_t *hashT_lookup(struct shash_t * , void *); -struct hash_elem_t *hashT_add(struct shash_t *, void *); -void hashT_destroy(struct shash_t *); -/* Accesors */ -void **hashT_getkeys(struct shash_t *); -size_t hashT_tablesize(struct shash_t *); -size_t 
hashT_size(struct shash_t *); - #endif --- linux-2.6.20-mm1/drivers/net/vioc/f7/sppapi.h.old 2007-02-18 01:26:44.0 +0100 +++ linux-2.6.20-mm1/drivers/net/vioc/f7/sppapi.h 2007-02-18 01:26:57.0 +0100 @@ -234,7 +234,5 @@ extern void spp_msg_unregister(u32 key_facility); -extern int read_spp_regbank32(int vioc, int bank, char *buffer); - #endif /* _SPPAPI_H_ */ --- linux-2.6.20-mm1/drivers/net/vioc/spp.c.old 2007-02-18 01:19:34.0 +0100 +++ linux-2.6.20-mm1/drivers/net/vioc/spp.c 2007-02-18 01:27:13.0 +0100 @@ -50,6 +50,15 @@ c -= a; c -= b; c ^= (b >> 15); \ } +static struct hash_elem_t *hashT_add(struct shash_t *htable, void *key); +static struct shash_t *hashT_create(u32 sizehint, size_t keybuf_size, +size_t databuf_size, +u32(*hfunc) (unsigned char *, +unsigned long), +int (*cfunc) (void *, void *), +unsigned int flags); +static void hashT_destroy(struct shash_t *htable); +static struct hash_elem_t *hashT_lookup(struct shash_t *htable, void *key); struct shash_t { /* Common fields for all hash tables types */ @@ -65,7 +74,9 @@ }; struct hash_ops { +#if 0 int (*delete) (struct shash_t *, void *); +#endif /* 0 */ struct hash_elem_t *(*lookup) (struct shash_t *, void *); void (*destroy) (struct shash_t *); struct hash_elem_t *(*add) (struct shash_t *, void *); @@ -143,6 +154,7 @@ return ((htable->hash_fn(key, len)) & (htable->tsize - 1)); } +#if 0 /* Data associated to this key MUST be freed by the caller */ static int ch_delete(struct shash_t *htable, void *key) { @@ -181,6 +193,7 @@ return -1; } +#endif /* 0 */ static void ch_destroy(struct shash_t *htable) { @@ -232,16 +245,21 @@ } /* Accesors **/ -inline size_t hashT_tablesize(struct shash_t * htable) + +#if 0 + +size_t hashT_tablesize(struct shash_t * htable) { return htable->tsize; } -inline size_t hashT_size(struct shash_t * htable) +size_t hashT_size(struct shash_t * htable) { return htable->nelems; } +#endif /* 0 */ + static struct hash_elem_t *ch_lookup(struct shash_t *htable, void *key) { u32 idx; @@ 
-330,15 +348,17 @@ return 1; } -struct hash_ops ch_ops = { +static struct hash_ops ch_ops = { +#if 0 .delete = ch_delete, +#endif /* 0 */ .lookup = ch_lookup, .destroy = ch_destroy, .getkeys = ch_getkeys, .add = ch_add }; -struct facility fTable[FACILITY_CNT];
[2.6 patch] net/irda/: proper prototypes
On Mon, Feb 05, 2007 at 06:01:42PM -0800, David Miller wrote:
> From: [EMAIL PROTECTED]
> Date: Mon, 05 Feb 2007 16:30:53 -0800
>
>> From: Adrian Bunk <[EMAIL PROTECTED]>
>>
>> Add proper prototypes for some functions in include/net/irda/irda.h
>>
>> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
>> Acked-by: Samuel Ortiz <[EMAIL PROTECTED]>
>> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
>
> I NAK'd this so that Adrian would go add "extern" to the
> function declarations in the header file.
>
> Please drop this, Adrian will resend once he fixes it up.

Sorry, I should have sent this earlier. Updated patch below.

cu
Adrian

<-- snip -->

This patch adds proper prototypes for some functions in include/net/irda/irda.h

Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

---

 include/net/irda/irda.h |   16
 net/irda/irmod.c        |   13 -
 2 files changed, 16 insertions(+), 13 deletions(-)

--- linux-2.6.20-rc1-mm1/include/net/irda/irda.h.old	2006-12-18 02:49:02.0 +0100
+++ linux-2.6.20-rc1-mm1/include/net/irda/irda.h	2006-12-18 02:58:02.0 +0100
@@ -113,4 +113,20 @@
 #define IAS_IRCOMM_ID 0x2343
 #define IAS_IRLPT_ID 0x9876

+struct net_device;
+struct packet_type;
+
+extern void irda_proc_register(void);
+extern void irda_proc_unregister(void);
+
+extern int irda_sysctl_register(void);
+extern void irda_sysctl_unregister(void);
+
+extern int irsock_init(void);
+extern void irsock_cleanup(void);
+
+extern int irlap_driver_rcv(struct sk_buff *skb, struct net_device *dev,
+			    struct packet_type *ptype,
+			    struct net_device *orig_dev);
+
 #endif /* NET_IRDA_H */
--- linux-2.6.20-rc1-mm1/net/irda/irmod.c.old	2006-12-18 02:52:18.0 +0100
+++ linux-2.6.20-rc1-mm1/net/irda/irmod.c	2006-12-18 02:53:59.0 +0100
@@ -42,19 +42,6 @@
 #include 	/* irttp_init */
 #include 	/* irda_device_init */

-/* irproc.c */
-extern void irda_proc_register(void);
-extern void irda_proc_unregister(void);
-/* irsysctl.c */
-extern int irda_sysctl_register(void);
-extern void irda_sysctl_unregister(void);
-/* af_irda.c */
-extern int irsock_init(void);
-extern void irsock_cleanup(void);
-/* irlap_frame.c */
-extern int irlap_driver_rcv(struct sk_buff *, struct net_device *,
-			    struct packet_type *, struct net_device *);
-
 /*
  * Module parameters
 */
[RFC: 2.6 patch] zd1211rw: possible cleanups
This patch contains the following possible cleanups: - make needlessly global functions static - #if 0 the following unused global functions: - zd_chip.c: zd_ioread16() - zd_chip.c: zd_ioread32() - zd_chip.c: zd_iowrite16() - zd_chip.c: zd_ioread32v() - zd_chip.c: zd_read_mac_addr() - zd_chip.c: zd_set_beacon_interval() - zd_util.c: zd_hexdump() Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> --- drivers/net/wireless/zd1211rw/zd_chip.c | 27 +++- drivers/net/wireless/zd1211rw/zd_chip.h | 26 +-- drivers/net/wireless/zd1211rw/zd_mac.h |6 - drivers/net/wireless/zd1211rw/zd_util.c |5 +--- drivers/net/wireless/zd1211rw/zd_util.h |6 - 5 files changed, 30 insertions(+), 40 deletions(-) --- linux-2.6.19-rc6-mm1/drivers/net/wireless/zd1211rw/zd_chip.h.old 2006-11-26 00:18:00.0 +0100 +++ linux-2.6.19-rc6-mm1/drivers/net/wireless/zd1211rw/zd_chip.h 2006-11-26 00:26:41.0 +0100 @@ -709,15 +709,6 @@ return zd_usb_ioread16(&chip->usb, value, addr); } -int zd_ioread32v_locked(struct zd_chip *chip, u32 *values, - const zd_addr_t *addresses, unsigned int count); - -static inline int zd_ioread32_locked(struct zd_chip *chip, u32 *value, -const zd_addr_t addr) -{ - return zd_ioread32v_locked(chip, value, (const zd_addr_t *)&addr, 1); -} - static inline int zd_iowrite16_locked(struct zd_chip *chip, u16 value, zd_addr_t addr) { @@ -747,9 +738,6 @@ return _zd_iowrite32v_locked(chip, &ioreq, 1); } -int zd_iowrite32a_locked(struct zd_chip *chip, -const struct zd_ioreq32 *ioreqs, unsigned int count); - static inline int zd_rfwrite_locked(struct zd_chip *chip, u32 value, u8 bits) { ZD_ASSERT(mutex_is_locked(&chip->mutex)); @@ -766,12 +754,7 @@ /* Locking functions for reading and writing registers. * The different parameters are intentional. 
*/ -int zd_ioread16(struct zd_chip *chip, zd_addr_t addr, u16 *value); -int zd_iowrite16(struct zd_chip *chip, zd_addr_t addr, u16 value); -int zd_ioread32(struct zd_chip *chip, zd_addr_t addr, u32 *value); int zd_iowrite32(struct zd_chip *chip, zd_addr_t addr, u32 value); -int zd_ioread32v(struct zd_chip *chip, const zd_addr_t *addresses, - u32 *values, unsigned int count); int zd_iowrite32a(struct zd_chip *chip, const struct zd_ioreq32 *ioreqs, unsigned int count); @@ -783,7 +766,6 @@ u8 zd_chip_get_channel(struct zd_chip *chip); int zd_read_regdomain(struct zd_chip *chip, u8 *regdomain); void zd_get_e2p_mac_addr(struct zd_chip *chip, u8 *mac_addr); -int zd_read_mac_addr(struct zd_chip *chip, u8 *mac_addr); int zd_write_mac_addr(struct zd_chip *chip, const u8 *mac_addr); int zd_chip_switch_radio_on(struct zd_chip *chip); int zd_chip_switch_radio_off(struct zd_chip *chip); @@ -794,20 +776,24 @@ int zd_chip_enable_hwint(struct zd_chip *chip); int zd_chip_disable_hwint(struct zd_chip *chip); +#if 0 static inline int zd_get_encryption_type(struct zd_chip *chip, u32 *type) { return zd_ioread32(chip, CR_ENCRYPTION_TYPE, type); } +#endif /* 0 */ static inline int zd_set_encryption_type(struct zd_chip *chip, u32 type) { return zd_iowrite32(chip, CR_ENCRYPTION_TYPE, type); } +#if 0 static inline int zd_chip_get_basic_rates(struct zd_chip *chip, u16 *cr_rates) { return zd_ioread16(chip, CR_BASIC_RATE_TBL, cr_rates); } +#endif /* 0 */ int zd_chip_set_basic_rates(struct zd_chip *chip, u16 cr_rates); @@ -827,12 +813,12 @@ int zd_chip_control_leds(struct zd_chip *chip, enum led_status status); -int zd_set_beacon_interval(struct zd_chip *chip, u32 interval); - +#if 0 static inline int zd_get_beacon_interval(struct zd_chip *chip, u32 *interval) { return zd_ioread32(chip, CR_BCN_INTERVAL, interval); } +#endif /* 0 */ struct rx_status; --- linux-2.6.19-rc6-mm1/drivers/net/wireless/zd1211rw/zd_chip.c.old 2006-11-26 00:18:10.0 +0100 +++ 
linux-2.6.19-rc6-mm1/drivers/net/wireless/zd1211rw/zd_chip.c 2006-11-26 00:37:13.0 +0100 @@ -87,8 +87,8 @@ /* Read a variable number of 32-bit values. Parameter count is not allowed to * exceed USB_MAX_IOREAD32_COUNT. */ -int zd_ioread32v_locked(struct zd_chip *chip, u32 *values, const zd_addr_t *addr, -unsigned int count) +static int zd_ioread32v_locked(struct zd_chip *chip, u32 *values, + const zd_addr_t *addr, unsigned int count) { int r; int i; @@ -135,6 +135,12 @@ return r; } +static int zd_ioread32_locked(struct zd_chip *chip, u32 *value, + const zd_addr_t addr) +{ + return zd_ioread32v_locked(chip, value, (const zd_addr_t *)&addr, 1); +} + int _zd_iowrite32v_locked(struct zd_chip *chip, const struct zd_ioreq32 *ioreqs, u
Re: MediaGX/GeodeGX1 requires X86_OOSTORE.
On Tue, Feb 20, 2007 at 08:56:39AM +0900, takada wrote:
> /proc/cpuinfo with MediaGXm :
>
> processor       : 0
> vendor_id       : CyrixInstead
> cpu family      : 5
> model           : 5
> model name      : Cyrix MediaGXtm MMXtm Enhanced
> stepping        : 2
> cpu MHz         : 199.750
> cache size      : 16 KB
> fdiv_bug        : no
> hlt_bug         : no
> f00f_bug        : no
> coma_bug        : no
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 2
> wp              : yes
> flags           : fpu tsc msr cx8 cmov mmx cxmmx
> bogomips        : 401.00
> clflush size    : 32

Hmm with 2.6.18 I am seeing:

processor       : 0
vendor_id       : CyrixInstead
cpu family      : 5
model           : 9
model name      : Geode(TM) Integrated Processor by National Semi
stepping        : 1
cpu MHz         : 266.648
cache size      : 16 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu tsc msr cx8 cmov mmx cxmmx
bogomips        : 534.50

Similar, but the last line isn't there. It looks like 2.6.18 doesn't actually have code to print that information though.

--
Len Sorensen
Re: 2.6.19-rc6-mm1: drivers/net/chelsio/: unused code
On Tue, Nov 28, 2006 at 11:47:19PM -0800, Andrew Morton wrote:
> On Wed, 29 Nov 2006 08:36:09 +0100
> Adrian Bunk <[EMAIL PROTECTED]> wrote:
>
> > On Mon, Nov 27, 2006 at 10:24:55AM -0800, Stephen Hemminger wrote:
> > > On Fri, 24 Nov 2006 01:17:31 +0100
> > > Adrian Bunk <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Thu, Nov 23, 2006 at 02:17:03AM -0800, Andrew Morton wrote:
> > > > >...
> > > > > Changes since 2.6.19-rc5-mm2:
> > > > >...
> > > > > +chelsio-22-driver.patch
> > > > >...
> > > > > netdev updates
> > > >
> > > > It is suspicious that the following newly added code is completely
> > > > unused:
> > > >   drivers/net/chelsio/ixf1010.o
> > > >     t1_ixf1010_ops
> > > >   drivers/net/chelsio/mac.o
> > > >     t1_chelsio_mac_ops
> > > >   drivers/net/chelsio/vsc8244.o
> > > >     t1_vsc8244_ops
> > > >
> > > > cu
> > > > Adrian
> > >
> > > All that is gone in later version. I reposted new patches
> > > after -mm2 was done.
> >
> > It seems these patches didn't make it into 2.6.19-rc6-mm2 ?
>
> I dropped that patch and picked up Francois's tree instead.

These structs are still both present and unused as of 2.6.20-mm1.

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
Re: Re: Strange connection slowdown on pcnet32
On Mon, Feb 19, 2007 at 06:45:48PM -0500, Lennart Sorensen wrote:
> It seems the problem actually occurs when the receive descriptor ring
> is full. This seems to generate one (or sometimes more) descriptors in
> the ring which claim to be owned by the MAC, but at the head of the
> receive ring as far as the driver is concerned. I see some note in the
> driver about an SP3G chipset sometimes causing this. How would one
> identify this and clear such descriptors out of the way? Getting stuck
> until the next time the MAC gets around to the descriptor and overwrites
> it is not good, since it causes delays, and out of order packets.

I am also noticing the receive error count going up, and the source is this code:

	if (status & 0x01)			/* Only count a general error at the */
		lp->stats.rx_errors++;		/* end of a packet. */

It appears this means I am receiving a frame marked with "End Of Packet" but without "Start of Packet". I have no idea how that happens, but it shouldn't be able to make the driver and MAC stop processing the receive ring.

--
Len Sorensen
Re: MediaGX/GeodeGX1 requires X86_OOSTORE.
From: Roland Dreier <[EMAIL PROTECTED]>
Subject: Re: MediaGX/GeodeGX1 requires X86_OOSTORE.
Date: Mon, 19 Feb 2007 11:48:27 -0800

> > Does anyone know if there is any way to flush a cache line of the cpu to
> > force rereading system memory for a given address or address range?
>
> There is the "clflush" instruction, but not all x86 CPUs support it.
> You need to check the CPUID flag to know for sure (/proc/cpuinfo will
> show a "clflush" flag if it is supported).

/proc/cpuinfo with MediaGXm :

processor       : 0
vendor_id       : CyrixInstead
cpu family      : 5
model           : 5
model name      : Cyrix MediaGXtm MMXtm Enhanced
stepping        : 2
cpu MHz         : 199.750
cache size      : 16 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu tsc msr cx8 cmov mmx cxmmx
bogomips        : 401.00
clflush size    : 32
Re: nonblocking UDPv4 recvfrom() taking 4usec @ 3GHz?
On Tue, 20 Feb 2007 00:14:47 +0100 bert hubert <[EMAIL PROTECTED]> wrote: > Hi people, > > I'm trying to save people the cost of buying extra servers by making > PowerDNS (GPL) ever faster, but I've hit a rather fundamental problem. > > Linux 2.6.20-rc4 appears to take 4 microseconds on my P4 3GHz for a > non-blocking UDPv4 recvfrom() call, both on loopback and ethernet. > > Linux 2.6.18 on my 64 bit Athlon64 3200+ takes a similar amount of time. > > This seems like rather a lot for a 50 byte datagram, but perhaps I'm > overestimating your abilities :-) > > The program is unthreaded, and I measure like this: > > #define RDTSC(qp) \ > do { \ > unsigned long lowPart, highPart;\ > __asm__ __volatile__("rdtsc" : "=a" (lowPart), "=d" (highPart)); \ > qp = (((unsigned long long) highPart) << 32) | lowPart; \ > } while (0) > > ... > > uint64_t tsc1, tsc2; > RDTSC(tsc1); > > if((len=recvfrom(fd, data, sizeof(data), 0, (sockaddr *)&fromaddr, &addrlen)) > >= 0) { > RDTSC(tsc2); > printf("%f\n", (tsc2-tsc1)/3000.0); // 3GHz P4 > } > > gdb generates the following dump from the actual program, > x=_Z20handleNewUDPQuestioniRN5boost3anyE, I see nothing untoward happening > between the two 'rdtsc' opcodes. 
> > 0x08091de0 : push %ebp > 0x08091de1 : mov%esp,%ebp > 0x08091de3 : push %edi > 0x08091de4 : push %esi > 0x08091de5 : push %ebx > 0x08091de6 : sub$0x78c,%esp > 0x08091dec : mov%gs:0x14,%eax > 0x08091df2 : mov%eax,0xffe4(%ebp) > 0x08091df5 : xor%eax,%eax > 0x08091df7 : movw $0x2,0xffac(%ebp) > 0x08091dfd : movl $0x0,0xffb0(%ebp) > 0x08091e04 : movw $0x0,0xffae(%ebp) > 0x08091e0a : movl $0x1c,0xf8f4(%ebp) > 0x08091e14 : rdtsc > 0x08091e16 : mov%edx,%ebx > 0x08091e18 : mov0x8(%ebp),%edx > 0x08091e1b : mov%eax,%esi > 0x08091e1d : lea0xf8f4(%ebp),%eax > 0x08091e23 : mov%eax,0x14(%esp) > 0x08091e27 : lea0xffac(%ebp),%ecx > 0x08091e2a : lea0xf950(%ebp),%eax > 0x08091e30 : mov%ecx,0x10(%esp) > 0x08091e34 : movl $0x0,0xc(%esp) > 0x08091e3c : movl $0x5dc,0x8(%esp) > 0x08091e44 :mov%eax,0x4(%esp) > 0x08091e48 :mov%edx,(%esp) > 0x08091e4b :call 0x8192110 > 0x08091e50 :test %eax,%eax > 0x08091e52 :mov%eax,0xf8b0(%ebp) > 0x08091e58 :js 0x8092168 > 0x08091e5e :mov%ebx,%eax > 0x08091e60 :xor%edx,%edx > 0x08091e62 :mov%eax,%edx > 0x08091e64 :mov$0x0,%eax > 0x08091e69 :mov%esi,%ecx > 0x08091e6b :mov%eax,%esi > 0x08091e6d :or %ecx,%esi > 0x08091e6f :mov%edx,%edi > 0x08091e71 :rdtsc > 0x08091e73 :mov%eax,0xf8a0(%ebp) > 0x08091e79 :mov0xf8a0(%ebp),%eax > 0x08091e7f :mov%edx,%ecx > 0x08091e81 :xor%ebx,%ebx > 0x08091e83 :mov%ecx,%ebx > > recvfrom itself is a tad worrisome, x=recvfrom. I didn't ask for the > 'libc_enable_asynccancel' stuff. I'm trying to isolate the actual syscall > but it is proving hard work for an assemnly newbie like me - socketcall > doesn't make things easier. 
> > 0xb7d62410 :cmpl $0x0,%gs:0xc > 0xb7d62418 :jne0xb7d62439 > 0xb7d6241a : mov%ebx,%edx > 0xb7d6241c : mov$0x66,%eax > 0xb7d62421 : mov$0xc,%ebx > 0xb7d62426 : lea0x4(%esp),%ecx > 0xb7d6242a : call *%gs:0x10 > 0xb7d62431 : mov%edx,%ebx > 0xb7d62433 : cmp$0xff83,%eax > 0xb7d62436 : jae0xb7d62469 > 0xb7d62438 : ret > 0xb7d62439 : push %esi > 0xb7d6243a : call 0xb7d6ddd0 <__libc_enable_asynccancel> > 0xb7d6243f : mov%eax,%esi > 0xb7d62441 : mov%ebx,%edx > 0xb7d62443 : mov$0x66,%eax > 0xb7d62448 : mov$0xc,%ebx > 0xb7d6244d : lea0x8(%esp),%ecx > 0xb7d62451 : call *%gs:0x10 > 0xb7d62458 : mov%edx,%ebx > 0xb7d6245a : xchg %eax,%esi > 0xb7d6245b : call 0xb7d6dd90 <__libc_disable_asynccancel> > 0xb7d62460 : mov%esi,%eax > 0xb7d62462 : pop%esi > 0xb7d62463 : cmp$0xff83,%eax > 0xb7d62466 : jae0xb7d62469 > 0xb7d62468 : ret > 0xb7d62469 : call 0xb7d998f8 <__i686.get_pc_thunk.cx> > 0xb7d6246e : add$0x61b86,%ecx > 0xb7d62474 : mov0xff2c(%ecx),%ecx > 0xb7d6247a : xor%edx,%edx > 0xb7d6247c : sub%eax,%edx > 0xb7d6247e : mov%edx,%gs:(%ecx) > 0xb7d62481 : or $0x,%eax > 0xb7d62484 : jmp0xb7d62438 > > Any clues? > Use oprofile to find the hotspot. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function
John Heffner wrote:
> Note the patch is compile-tested only! I can do some real testing if
> you'd like to apply this Dave.

The date you read on the patch is due to the fact that I've split this patchset into 2 diff files.

This isn't compile-tested only; I've used this piece of code for about 3 months. However, more testing is good and welcome.

Regards,
Angelo P. Castellani
Re: Re: Strange connection slowdown on pcnet32
On Mon, Feb 19, 2007 at 05:29:20PM -0500, Lennart Sorensen wrote:
> I just noticed, it seems almost all these problems occur right at the
> start of transfers when the tcp window size is still being worked out
> for the connection speed, and I am seeing the error count go up in
> ifconfig for the port when it happens too. Is it possible for an error
> to get flagged in a receive descriptor without the owner bit being
> updated?

It seems the problem actually occurs when the receive descriptor ring is full. This seems to generate one (or sometimes more) descriptors in the ring which claim to be owned by the MAC, but at the head of the receive ring as far as the driver is concerned. I see some note in the driver about an SP3G chipset sometimes causing this. How would one identify this and clear such descriptors out of the way? Getting stuck until the next time the MAC gets around to the descriptor and overwrites it is not good, since it causes delays and out-of-order packets.

--
Len Sorensen
nonblocking UDPv4 recvfrom() taking 4usec @ 3GHz?
Hi people,

I'm trying to save people the cost of buying extra servers by making PowerDNS (GPL) ever faster, but I've hit a rather fundamental problem.

Linux 2.6.20-rc4 appears to take 4 microseconds on my P4 3GHz for a non-blocking UDPv4 recvfrom() call, both on loopback and ethernet.

Linux 2.6.18 on my 64 bit Athlon64 3200+ takes a similar amount of time.

This seems like rather a lot for a 50 byte datagram, but perhaps I'm overestimating your abilities :-)

The program is unthreaded, and I measure like this:

#define RDTSC(qp) \
do { \
    unsigned long lowPart, highPart; \
    __asm__ __volatile__("rdtsc" : "=a" (lowPart), "=d" (highPart)); \
    qp = (((unsigned long long) highPart) << 32) | lowPart; \
} while (0)

...

uint64_t tsc1, tsc2;
RDTSC(tsc1);

if((len=recvfrom(fd, data, sizeof(data), 0, (sockaddr *)&fromaddr, &addrlen)) >= 0) {
    RDTSC(tsc2);
    printf("%f\n", (tsc2-tsc1)/3000.0); // 3GHz P4
}

gdb generates the following dump from the actual program, x=_Z20handleNewUDPQuestioniRN5boost3anyE; I see nothing untoward happening between the two 'rdtsc' opcodes.
0x08091de0 : push %ebp
0x08091de1 : mov %esp,%ebp
0x08091de3 : push %edi
0x08091de4 : push %esi
0x08091de5 : push %ebx
0x08091de6 : sub $0x78c,%esp
0x08091dec : mov %gs:0x14,%eax
0x08091df2 : mov %eax,0xffe4(%ebp)
0x08091df5 : xor %eax,%eax
0x08091df7 : movw $0x2,0xffac(%ebp)
0x08091dfd : movl $0x0,0xffb0(%ebp)
0x08091e04 : movw $0x0,0xffae(%ebp)
0x08091e0a : movl $0x1c,0xf8f4(%ebp)
0x08091e14 : rdtsc
0x08091e16 : mov %edx,%ebx
0x08091e18 : mov 0x8(%ebp),%edx
0x08091e1b : mov %eax,%esi
0x08091e1d : lea 0xf8f4(%ebp),%eax
0x08091e23 : mov %eax,0x14(%esp)
0x08091e27 : lea 0xffac(%ebp),%ecx
0x08091e2a : lea 0xf950(%ebp),%eax
0x08091e30 : mov %ecx,0x10(%esp)
0x08091e34 : movl $0x0,0xc(%esp)
0x08091e3c : movl $0x5dc,0x8(%esp)
0x08091e44 : mov %eax,0x4(%esp)
0x08091e48 : mov %edx,(%esp)
0x08091e4b : call 0x8192110
0x08091e50 : test %eax,%eax
0x08091e52 : mov %eax,0xf8b0(%ebp)
0x08091e58 : js 0x8092168
0x08091e5e : mov %ebx,%eax
0x08091e60 : xor %edx,%edx
0x08091e62 : mov %eax,%edx
0x08091e64 : mov $0x0,%eax
0x08091e69 : mov %esi,%ecx
0x08091e6b : mov %eax,%esi
0x08091e6d : or %ecx,%esi
0x08091e6f : mov %edx,%edi
0x08091e71 : rdtsc
0x08091e73 : mov %eax,0xf8a0(%ebp)
0x08091e79 : mov 0xf8a0(%ebp),%eax
0x08091e7f : mov %edx,%ecx
0x08091e81 : xor %ebx,%ebx
0x08091e83 : mov %ecx,%ebx

recvfrom itself is a tad worrisome, x=recvfrom. I didn't ask for the 'libc_enable_asynccancel' stuff. I'm trying to isolate the actual syscall but it is proving hard work for an assembly newbie like me - socketcall doesn't make things easier.
0xb7d62410 : cmpl $0x0,%gs:0xc
0xb7d62418 : jne 0xb7d62439
0xb7d6241a : mov %ebx,%edx
0xb7d6241c : mov $0x66,%eax
0xb7d62421 : mov $0xc,%ebx
0xb7d62426 : lea 0x4(%esp),%ecx
0xb7d6242a : call *%gs:0x10
0xb7d62431 : mov %edx,%ebx
0xb7d62433 : cmp $0xff83,%eax
0xb7d62436 : jae 0xb7d62469
0xb7d62438 : ret
0xb7d62439 : push %esi
0xb7d6243a : call 0xb7d6ddd0 <__libc_enable_asynccancel>
0xb7d6243f : mov %eax,%esi
0xb7d62441 : mov %ebx,%edx
0xb7d62443 : mov $0x66,%eax
0xb7d62448 : mov $0xc,%ebx
0xb7d6244d : lea 0x8(%esp),%ecx
0xb7d62451 : call *%gs:0x10
0xb7d62458 : mov %edx,%ebx
0xb7d6245a : xchg %eax,%esi
0xb7d6245b : call 0xb7d6dd90 <__libc_disable_asynccancel>
0xb7d62460 : mov %esi,%eax
0xb7d62462 : pop %esi
0xb7d62463 : cmp $0xff83,%eax
0xb7d62466 : jae 0xb7d62469
0xb7d62468 : ret
0xb7d62469 : call 0xb7d998f8 <__i686.get_pc_thunk.cx>
0xb7d6246e : add $0x61b86,%ecx
0xb7d62474 : mov 0xff2c(%ecx),%ecx
0xb7d6247a : xor %edx,%edx
0xb7d6247c : sub %eax,%edx
0xb7d6247e : mov %edx,%gs:(%ecx)
0xb7d62481 : or $0x,%eax
0xb7d62484 : jmp 0xb7d62438

Any clues?

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services
Re: Kernel bug in bcm43xx-d80211
On Mon, 2007-02-19 at 17:30 -0500, Pavel Roskin wrote:
> Johannes, would it be possible to commit patches faster, please? Now
> that I told Michael about git-update-server-info, his changes are
> downloadable as soon as he makes a commit. wireless-dev.git, on the
> other hand, is a mess and has been for some time (since Friday, I
> believe).

I don't commit to wireless-dev, John does. I'd love it if the patches were in already ;) And I think he even said he had committed them but they didn't show up, so something must have gone wrong (forgot to push out to kernel.org, maybe).

johannes
Re: Kernel bug in bcm43xx-d80211
On Mon, 2007-02-19 at 23:12 +0100, Johannes Berg wrote:
> On Mon, 2007-02-19 at 13:48 -0800, Alex Davis wrote:
> > I got the following Oops with the latest wireless-dev git when starting
> > wpa_supplicant:
> >
> > Feb 19 16:17:42 boss kernel: [ 377.359573] BUG: unable to handle kernel
> > NULL pointer dereference at virtual address 0002
>
> Probably caused by my recent changes that accidentally broke d80211
> pretty much completely. Patches are on the linux-wireless mailing list.

Johannes, would it be possible to commit patches faster, please? Now that I told Michael about git-update-server-info, his changes are downloadable as soon as he makes a commit. wireless-dev.git, on the other hand, is a mess and has been for some time (since Friday, I believe).

It is a problem for projects like DadWifi that recommend using the top of wireless-dev.git.

Yes, I know, breakage is unavoidable to a certain degree, but it shouldn't come to the situation when the patches are known, nobody objects, yet the repository stays broken and all newcomers have to be told about the problem.

That's not to offend you or anyone. It's just something that would help a lot.

--
Regards,
Pavel Roskin
Re: Re: Strange connection slowdown on pcnet32
On Mon, Feb 19, 2007 at 05:18:45PM -0500, Lennart Sorensen wrote: > On Mon, Feb 19, 2007 at 03:11:36PM -0500, Lennart Sorensen wrote: > > I have been poking at things with firescope to see if the MAC is > > actually writing to system memory or not. > > > > The entry that it gets stuch on is _always_ entry 0 in the rx_ring. > > There does not appear to be any exceptions to this. > > > > Here is my firescope (slightly modified for this purpose) dump of the > > rx_ring of eth1: > > > > Descriptor:Address: /--base---\ /buf\ /sta\ /-message-\ /reserved-\ > > : : | | |len| |tus| | length | | | > > RXdesc[00]:6694000: 12 18 5f 05 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[01]:6694010: 12 78 15 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[02]:6694020: 12 a0 52 06 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[03]:6694030: 12 f8 c2 04 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[04]:6694040: 12 70 15 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[05]:6694050: 12 e8 37 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[06]:6694060: 12 e0 37 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[07]:6694070: 12 e8 d5 04 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[08]:6694080: 12 e0 d5 04 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[09]:6694090: 12 d8 d1 05 fa f9 40 03 46 00 00 00 00 00 00 00 > > RXdesc[10]:66940a0: 12 d0 d1 05 fa f9 40 03 4e 00 00 00 00 00 00 00 > > RXdesc[11]:66940b0: 12 d8 02 05 fa f9 10 03 40 00 00 00 00 00 00 00 > > RXdesc[12]:66940c0: 12 d0 02 05 fa f9 40 03 46 00 00 00 00 00 00 00 > > RXdesc[13]:66940d0: 12 38 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[14]:66940e0: 12 30 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[15]:66940f0: 12 78 2c 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[16]:6694100: 12 a0 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[17]:6694110: 12 b0 04 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[18]:6694120: 12 b8 04 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[19]:6694130: 12 70 2c 05 
fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[20]:6694140: 12 f8 56 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[21]:6694150: 12 c8 29 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[22]:6694160: 12 20 03 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[23]:6694170: 12 60 4c 05 fa f9 00 80 87 05 00 00 00 00 00 00 > > RXdesc[24]:6694180: 12 98 53 05 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[25]:6694190: 12 b0 cc 04 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[26]:66941a0: 12 a8 3f 05 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[27]:66941b0: 12 58 e8 04 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[28]:66941c0: 12 b0 4d 06 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[29]:66941d0: 12 38 ef 04 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[30]:66941e0: 12 98 1f 05 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[31]:66941f0: 12 28 f1 04 fa f9 00 80 40 00 00 00 00 00 00 00 > > > > I only ever see entry 0 as status 0080 (0x8000 which is owned by mac), > > and this is while the driver is checking entry 0 every time it tries to > > check for any waiting packets. > > > > Running tcpdump while pinging gives the interesting result that some > > packets are ariving out of order making it seem like the driver is > > processing the packets out of order. Perhaps the driver is wrong to be > > looking at entry 0, and should be looking at entry 1 and is hence stuck > > until the whole receive ring has been filled again? 
> > > > 15:06:04.112812 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 1 > > 15:06:05.119799 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 2 > > 15:06:05.120159 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 2 > > 15:06:05.127045 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 1 > > 15:06:06.119862 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 3 > > 15:06:07.119921 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 4 > > 15:06:08.119994 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 5 > > 15:06:08.426400 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 3 > > 15:06:08.427915 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 4 > > 15:06:08.429033 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 5 > > 15:06:09.120053 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 6 > > 15:06:10.120109 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 7 > > 15:06:10.705332 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 6 > > 15:06:10.707258 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 7 > > 15:06:11.120175 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 8 > > 15:06:12.120233 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 9 > > 15:06:13.120297 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 10 > > 15:06:14.120359 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 11 > > 15:06:14.120737 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 11 >
Re: Kernel bug in bcm43xx-d80211
On Mon, 2007-02-19 at 13:48 -0800, Alex Davis wrote:
> I got the following Oops with the latest wireless-dev git when starting
> wpa_supplicant:

Wireless topics moved from this list to [EMAIL PROTECTED]
Broadcom drivers are discussed in [EMAIL PROTECTED]

wireless-dev is horribly broken, and the fixes haven't been merged yet. The current Broadcom driver can be loaded from http://bu3sch.de/git/wireless-dev.git (please load it on top of wireless-dev.git to save bandwidth). It doesn't include the latest breakage from wireless-dev, but it does include some important fixes.

Although I haven't seen a problem like yours, I strongly suggest that you try the above repository and post your results to the bcm43xx-dev list. Even if the results are more positive :)

--
Regards,
Pavel Roskin
Re: Re: Strange connection slowdown on pcnet32
On Mon, Feb 19, 2007 at 03:11:36PM -0500, Lennart Sorensen wrote: > I have been poking at things with firescope to see if the MAC is > actually writing to system memory or not. > > The entry that it gets stuch on is _always_ entry 0 in the rx_ring. > There does not appear to be any exceptions to this. > > Here is my firescope (slightly modified for this purpose) dump of the > rx_ring of eth1: > > Descriptor:Address: /--base---\ /buf\ /sta\ /-message-\ /reserved-\ > : : | | |len| |tus| | length | | | > RXdesc[00]:6694000: 12 18 5f 05 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[01]:6694010: 12 78 15 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[02]:6694020: 12 a0 52 06 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[03]:6694030: 12 f8 c2 04 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[04]:6694040: 12 70 15 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[05]:6694050: 12 e8 37 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[06]:6694060: 12 e0 37 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[07]:6694070: 12 e8 d5 04 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[08]:6694080: 12 e0 d5 04 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[09]:6694090: 12 d8 d1 05 fa f9 40 03 46 00 00 00 00 00 00 00 > RXdesc[10]:66940a0: 12 d0 d1 05 fa f9 40 03 4e 00 00 00 00 00 00 00 > RXdesc[11]:66940b0: 12 d8 02 05 fa f9 10 03 40 00 00 00 00 00 00 00 > RXdesc[12]:66940c0: 12 d0 02 05 fa f9 40 03 46 00 00 00 00 00 00 00 > RXdesc[13]:66940d0: 12 38 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[14]:66940e0: 12 30 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[15]:66940f0: 12 78 2c 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[16]:6694100: 12 a0 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[17]:6694110: 12 b0 04 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[18]:6694120: 12 b8 04 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[19]:6694130: 12 70 2c 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[20]:6694140: 12 f8 56 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[21]:6694150: 12 
c8 29 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[22]:6694160: 12 20 03 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[23]:6694170: 12 60 4c 05 fa f9 00 80 87 05 00 00 00 00 00 00 > RXdesc[24]:6694180: 12 98 53 05 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[25]:6694190: 12 b0 cc 04 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[26]:66941a0: 12 a8 3f 05 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[27]:66941b0: 12 58 e8 04 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[28]:66941c0: 12 b0 4d 06 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[29]:66941d0: 12 38 ef 04 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[30]:66941e0: 12 98 1f 05 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[31]:66941f0: 12 28 f1 04 fa f9 00 80 40 00 00 00 00 00 00 00 > > I only ever see entry 0 as status 0080 (0x8000 which is owned by mac), > and this is while the driver is checking entry 0 every time it tries to > check for any waiting packets. > > Running tcpdump while pinging gives the interesting result that some > packets are ariving out of order making it seem like the driver is > processing the packets out of order. Perhaps the driver is wrong to be > looking at entry 0, and should be looking at entry 1 and is hence stuck > until the whole receive ring has been filled again? 
> > 15:06:04.112812 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 1 > 15:06:05.119799 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 2 > 15:06:05.120159 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 2 > 15:06:05.127045 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 1 > 15:06:06.119862 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 3 > 15:06:07.119921 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 4 > 15:06:08.119994 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 5 > 15:06:08.426400 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 3 > 15:06:08.427915 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 4 > 15:06:08.429033 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 5 > 15:06:09.120053 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 6 > 15:06:10.120109 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 7 > 15:06:10.705332 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 6 > 15:06:10.707258 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 7 > 15:06:11.120175 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 8 > 15:06:12.120233 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 9 > 15:06:13.120297 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 10 > 15:06:14.120359 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 11 > 15:06:14.120737 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 11 > 15:06:14.127064 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 8 > 15:06:14.127700 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 9 > 15:06:14.128268 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo
Re: Kernel bug in bcm43xx-d80211
On Mon, 2007-02-19 at 13:48 -0800, Alex Davis wrote:
> I got the following Oops with the latest wireless-dev git when starting
> wpa_supplicant:
>
> Feb 19 16:17:42 boss kernel: [ 377.359573] BUG: unable to handle kernel NULL
> pointer dereference at virtual address 0002

Probably caused by my recent changes that accidentally broke d80211
pretty much completely. Patches are on the linux-wireless mailing list.

johannes
Kernel bug in bcm43xx-d80211
I go the following Oops with the latest wireless-dev git when starting wpa_supplicant: Feb 19 16:17:42 boss kernel: [ 377.359573] BUG: unable to handle kernel NULL pointer dereference at virtual address 0002 Feb 19 16:17:42 boss kernel: [ 377.359641] printing eip: Feb 19 16:17:42 boss kernel: [ 377.359670] f8b2a3c3 Feb 19 16:17:42 boss kernel: [ 377.359672] *pde = Feb 19 16:17:42 boss kernel: [ 377.359702] Oops: 0002 [#1] Feb 19 16:17:42 boss kernel: [ 377.359730] SMP Feb 19 16:17:42 boss kernel: [ 377.359799] Modules linked in: af_packet arc4 ecb blkcipher rc80211_simple bcm43xx_d80211 80211 cfg80211 snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd _pcm_oss snd_mixer_oss ipv6 usbhid hid usbmouse snd_intel8x0 snd_ac97_codec b44 ssb ehci_hcd uhci_hcd intel_agp yenta_socket pcmcia ac97_bus serio_raw usbcore agpgart rsrc_nonstatic ohci1394 snd_pcm ide_cd pc mcia_core 8250_pci evdev firmware_class 8250 ieee1394 serial_core snd_timer cdrom snd crc32 soundcore snd_page_alloc unix Feb 19 16:17:42 boss kernel: [ 377.360945] CPU:0 Feb 19 16:17:42 boss kernel: [ 377.360946] EIP:0060:[]Not tainted VLI Feb 19 16:17:42 boss kernel: [ 377.360947] EFLAGS: 00010246 (2.6.20 #1) Feb 19 16:17:42 boss kernel: [ 377.361048] EIP is at do_mark_unused+0x0/0x7 [bcm43xx_d80211] Feb 19 16:17:42 boss kernel: [ 377.361080] eax: f71d7000 ebx: ecx: edx: Feb 19 16:17:42 boss kernel: [ 377.361113] esi: edi: f71d7000 ebp: f8b2a3c3 esp: c192dee0 Feb 19 16:17:42 boss kernel: [ 377.361146] ds: 007b es: 007b ss: 0068 Feb 19 16:17:42 boss kernel: [ 377.361176] Process events/0 (pid: 6, ti=c192c000 task=c191ca70 task.ti=c192c000) Feb 19 16:17:42 boss kernel: [ 377.361210] Stack: f8b28629 c0103587 Feb 19 16:17:42 boss kernel: [ 377.361433]0282 f8b2a3d8 f71d7000 f8b19db7 f71d7000 f8b19f50 f89b64a0 38058a67 Feb 19 16:17:42 boss kernel: [ 377.361655]f71d7000 f8b1a0dd 0011 f71d7274 f71d7270 c18fd2c0 0246 c012a392 Feb 19 16:17:42 boss kernel: [ 377.361878] Call Trace: Feb 19 16:17:42 
boss kernel: [ 377.361932] [] bcm43xx_call_for_each_loctl+0x30/0x9b [bcm43xx_d80211] Feb 19 16:17:42 boss kernel: [ 377.362003] [] common_interrupt+0x23/0x28 Feb 19 16:17:42 boss kernel: [ 377.362060] [] bcm43xx_loctl_mark_all_unused+0xe/0x17 [bcm43xx_d80211] Feb 19 16:17:42 boss kernel: [ 377.362129] [] bcm43xx_periodic_every60sec+0x8/0x2e [bcm43xx_d80211] Feb 19 16:17:42 boss kernel: [ 377.362197] [] do_periodic_work+0xb4/0xe9 [bcm43xx_d80211] Feb 19 16:17:42 boss kernel: [ 377.362258] [] bcm43xx_periodic_work_handler+0xb5/0x16f [bcm43xx_d80211] Feb 19 16:17:42 boss kernel: [ 377.362327] [] run_workqueue+0x7e/0x14e Feb 19 16:17:42 boss kernel: [ 377.362381] [] bcm43xx_periodic_work_handler+0x0/0x16f [bcm43xx_d80211] Feb 19 16:17:42 boss kernel: [ 377.362449] [] worker_thread+0x14e/0x16d Feb 19 16:17:42 boss kernel: [ 377.362503] [] default_wake_function+0x0/0xc Feb 19 16:17:42 boss kernel: [ 377.362558] [] default_wake_function+0x0/0xc Feb 19 16:17:42 boss kernel: [ 377.362613] [] worker_thread+0x0/0x16d Feb 19 16:17:42 boss kernel: [ 377.362665] [] kthread+0xa0/0xd1 Feb 19 16:17:42 boss kernel: [ 377.362717] [] kthread+0x0/0xd1 Feb 19 16:17:42 boss kernel: [ 377.362769] [] kernel_thread_helper+0x7/0x10 Feb 19 16:17:42 boss kernel: [ 377.362823] === Feb 19 16:17:42 boss kernel: [ 377.362852] Code: 04 00 00 00 80 e3 01 0f 44 c1 88 44 24 02 8d 8a 72 03 00 00 89 f0 8d 54 24 02 e8 9f e1 ff ff 8b 5c 24 04 8b 74 24 08 83 c4 0c c3 <80> 62 02 fe 31 c0 c3 53 ba c3 a3 b2 f8 8b 58 6c e8 21 e2 ff ff Feb 19 16:17:42 boss kernel: [ 377.364227] EIP: [] do_mark_unused+0x0/0x7 [bcm43xx_d80211] SS:ESP 0068:c192dee0 lspci -v 02:03.0 Network controller: Broadcom Corporation BCM4309 802.11a/b/g (rev 03) Subsystem: Dell Truemobile 1450 MiniPCI Flags: bus master, fast devsel, latency 32, IRQ 18 Memory at faff6000 (32-bit, non-prefetchable) [size=8K] wpa_supplicant is version 0.4.9: I was trying to connect to a Linksys WRT54G using WEP encryption. 
Relevant part of .config CONFIG_BCM43XX=m CONFIG_BCM43XX_DEBUG=y CONFIG_BCM43XX_DMA=y CONFIG_BCM43XX_PIO=y CONFIG_BCM43XX_DMA_AND_PIO_MODE=y # CONFIG_BCM43XX_DMA_MODE is not set # CONFIG_BCM43XX_PIO_MODE is not set # CONFIG_ZD1211RW is not set CONFIG_BCM43XX_D80211=m CONFIG_BCM43XX_D80211_PCI=y CONFIG_BCM43XX_D80211_PCMCIA=y CONFIG_BCM43XX_D80211_DEBUG=y CONFIG_BCM43XX_D80211_DMA=y CONFIG_BCM43XX_D80211_PIO=y CONFIG_BCM43XX_D80211_DMA_AND_PIO_MODE=y # CONFIG_BCM43XX_D80211_DMA_MODE is not set # CONFIG_BCM43XX_D80211_PIO_MODE is not set # CONFIG_RT2X00 is not set # CONFIG_ADM8211 is not set # CONFIG_P54_COMMON is not set # CONFIG_ZD1211RW_D80211 is not set CONFIG_NET_WIRELESS=y Machine is a Dell Inspiron 9100 laptop with an HT-enabled Pe
Re: [Bugme-new] [Bug 7974] New: BUG: scheduling while atomic: swapper/0x10000100/0
On Thu, Feb 15, 2007 at 03:45:23PM -0800, Jay Vosburgh wrote:
> For the short term, yes, I don't have any disagreement with
> switching the timer based stuff over to workqueues. Basically a one for
> one replacement to get the functions in a process context and tweak the
> locking.

I did some testing of my patch last week and it definitely has some
issues. I'm running into a problem that is similar to the thread started
last week titled "[BUG] RTNL and flush_scheduled_work deadlocks" but I
think I can patch around that if needed.

> I do think we're having a little confusion over details of
> terminology; if I'm not mistaken, you're thinking that workqueue means
> single threaded: even though each individual "monitor thingie" is a
> separate piece of work, they still can't collide.
>
> That's true, but (unless I've missed a call somewhere) there
> isn't a "wq_pause_for_a_bit" type of call (that, e.g., waits for
> anything running to stop, then doesn't run any further work until we
> later tell it to), so suspending all of the periodic things running for
> the bond is more hassle than if there's just one schedulable work thing,
> which internally calls the right functions to do the various things.
> This is also single threaded, but easier to stop and start. It seems to
> be simpler to have multiple link monitors running in such a system as
> well (without having them thrashing the link state as would happen now).

I see by looking at your patch that you keep a list of timers and only
schedule work for the event that will happen next. I've seen timer
implementations like this before and feel it's reasonable. It would be
good to account for skew, but other than that it seems like a reasonable
solution -- though it is too bad that workqueues and their behavior seem
like somewhat of a mystery to most and cause people to code around them
(I don't blame you one bit).

I also plan to start testing your patch later this week and will let you
know what I find.
-andy - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Extensible hashing and RCU
Evgeniy Polyakov <[EMAIL PROTECTED]> writes:
> My experiment shows almost 400 nsecs without _any_ locks - they are
> removed completely - it is pure hash selection/list traverse time.

Are you sure you're not measuring TLB misses too? In user space you
likely use 4K pages. The kernel would use 2MB pages.

I would suggest putting the tables into hugetlbfs-allocated memory in
your test program.

-Andi
Re: [PATCH 3/4] 8139too: RTNL and flush_scheduled_work deadlock
Cc: list trimmed.

Jarek Poplawski <[EMAIL PROTECTED]> :
> On Fri, Feb 16, 2007 at 09:20:34PM +0100, Francois Romieu wrote:
[...]
> > Btw, the thread runs every 3*HZ at most.
>
> You are right (mostly)! But I think rtnl_lock is special
> and should be spared (even this 3*HZ) and here it's used
> for some mainly internal purpose (close synchronization).
> And it looks like mainly for this internal reason holding
> of rtnl_lock is increased. And because rtnl_lock is quite
> popular you have to take into consideration that after
> this 3*HZ it could spend some time waiting for the lock.
> So, maybe it would be nicer to check this netif_running
> twice (after rtnl_lock where needed), but maybe it's a
> matter of taste only, and yours is better, as well.

The region protected by RTNL has been widened to include a tx_timeout
handler. It is supposed to handle an occasional error, something that
should not even happen at 3*HZ. Optimizing it is useless, especially on
a high-end performer like the 8139.

> (Btw. I didn't verify this, but I hope you checked that
> places not under rtnl_lock before the patch are safe from
> some locking problems now.)

I did. It is not a reason to trust the patch though.

-- 
Ueimor
Re: [linux-pm] [Ipw2100-devel] [RFC] Runtime power management on ipw2100
On Thursday 08 February 2007 1:01 am, Zhu Yi wrote:
> A generic requirement for dynamic power management is the hardware
> resource should not be touched when you put it in a low power state.

That is in no way a "generic" requirement. It might apply specifically
to one ipw2100 low power state ... but "in general" devices may support
more than one low power state, with different levels of functionality.
Not all of those levels necessarily disallow touching the hardware.

> But I think
> freeing the irq handler before suspend should be the right way to go.

Some folk like that model a lot for shared IRQs. It shouldn't matter
for non-sharable ones.

- Dave
Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function
I'd prefer to make it apply automatically across all congestion controls
that do slow-start, and also make the max_ssthresh parameter
controllable via sysctl. This patch (attached) should implement this.
Note the default value for sysctl_tcp_max_ssthresh = 0, which disables
limited slow-start. This should make ABC apply during LSS as well.

Note the patch is compile-tested only! I can do some real testing if
you'd like to apply this, Dave.

Thanks,
-John

Angelo P. Castellani wrote:
Forgot the patch..

Angelo P. Castellani wrote:
From: Angelo P. Castellani <[EMAIL PROTECTED]>

RFC3742: limited slow start

See http://www.ietf.org/rfc/rfc3742.txt

Signed-off-by: Angelo P. Castellani <[EMAIL PROTECTED]>
---
To allow code reutilization I've added the limited slow start procedure
as an exported symbol of linux tcp congestion control.

On large BDP networks canonical slow start should be avoided because it
requires large packet losses to converge, whereas at lower BDPs slow
start and limited slow start are identical. Large BDP is defined through
the max_ssthresh variable.

I think limited slow start could safely replace the canonical slow start
procedure in Linux.

Regards,
Angelo P. 
Castellani p.s.: in the attached patch is added an exported function currently used only by YeAH TCP include/net/tcp.h |1 + net/ipv4/tcp_cong.c | 23 +++ 2 files changed, 24 insertions(+) diff -uprN linux-2.6.20-a/include/net/tcp.h linux-2.6.20-c/include/net/tcp.h --- linux-2.6.20-a/include/net/tcp.h2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-c/include/net/tcp.h2007-02-19 10:54:10.0 +0100 @@ -669,6 +669,7 @@ extern void tcp_get_allowed_congestion_c extern int tcp_set_allowed_congestion_control(char *allowed); extern int tcp_set_congestion_control(struct sock *sk, const char *name); extern void tcp_slow_start(struct tcp_sock *tp); +extern void tcp_limited_slow_start(struct tcp_sock *tp); extern struct tcp_congestion_ops tcp_init_congestion_ops; extern u32 tcp_reno_ssthresh(struct sock *sk); diff -uprN linux-2.6.20-a/net/ipv4/tcp_cong.c linux-2.6.20-c/net/ipv4/tcp_cong.c --- linux-2.6.20-a/net/ipv4/tcp_cong.c 2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-c/net/ipv4/tcp_cong.c 2007-02-19 10:54:10.0 +0100 @@ -297,6 +297,29 @@ void tcp_slow_start(struct tcp_sock *tp) } EXPORT_SYMBOL_GPL(tcp_slow_start); +void tcp_limited_slow_start(struct tcp_sock *tp) +{ + /* RFC3742: limited slow start +* the window is increased by 1/K MSS for each arriving ACK, +* for K = int(cwnd/(0.5 max_ssthresh)) +*/ + + const int max_ssthresh = 100; + + if (max_ssthresh > 0 && tp->snd_cwnd > max_ssthresh) { + u32 k = max(tp->snd_cwnd / (max_ssthresh >> 1), 1U); + if (++tp->snd_cwnd_cnt >= k) { + if (tp->snd_cwnd < tp->snd_cwnd_clamp) + tp->snd_cwnd++; + tp->snd_cwnd_cnt = 0; + } + } else { + if (tp->snd_cwnd < tp->snd_cwnd_clamp) + tp->snd_cwnd++; + } +} +EXPORT_SYMBOL_GPL(tcp_limited_slow_start); + /* * TCP Reno congestion control * This is special case used for fallback as well. Add RFC3742 Limited Slow-Start, controlled by variable sysctl_tcp_max_ssthresh. 
Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- commit 97033fa201705e6cfc68ce66f34ede3277c3d645 tree 5df4607728abce93aa05b31015a90f2ce369abff parent 8a03d9a498eaf02c8a118752050a5154852c13bf author John Heffner <[EMAIL PROTECTED]> Mon, 19 Feb 2007 15:52:16 -0500 committer John Heffner <[EMAIL PROTECTED]> Mon, 19 Feb 2007 15:52:16 -0500 include/linux/sysctl.h |1 + include/net/tcp.h |1 + net/ipv4/sysctl_net_ipv4.c |8 net/ipv4/tcp_cong.c| 33 +++-- 4 files changed, 33 insertions(+), 10 deletions(-) diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h index 2c5fb38..a2dce72 100644 --- a/include/linux/sysctl.h +++ b/include/linux/sysctl.h @@ -438,6 +438,7 @@ enum NET_CIPSOV4_RBM_STRICTVALID=121, NET_TCP_AVAIL_CONG_CONTROL=122, NET_TCP_ALLOWED_CONG_CONTROL=123, + NET_TCP_MAX_SSTHRESH=124, }; enum { diff --git a/include/net/tcp.h b/include/net/tcp.h index 5c472f2..521da28 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -230,6 +230,7 @@ extern int sysctl_tcp_mtu_probing; extern int sysctl_tcp_base_mss; extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; +extern int sysctl_tcp_max_ssthresh; extern atomic_t tcp_memory_allocated; extern atomic_t tcp_sockets_allocated; diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 0aa3047..d68effe 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -803,6
[patch 1/2] natsemi: Add support for using MII port with no PHY
This patch provides code paths which allow the natsemi driver to use the external MII port on the chip but ignore any PHYs that may be attached to it. The link state will be left as it was when the driver started and can be configured via ethtool. Any PHYs that are present can be accessed via the MII ioctl()s. This is useful for systems where the device is connected without a PHY or where either information or actions outside the scope of the driver are required in order to use the PHYs. Signed-Off-By: Mark Brown <[EMAIL PROTECTED]> --- This revision of the patch fixes some issues brought up during review. Previous versions of this patch exposed the new functionality as a module option. This has been removed. Any hardware that needs this should be identifiable by a quirk since it unlikely to behave correctly with an unmodified driver. Index: linux/drivers/net/natsemi.c === --- linux.orig/drivers/net/natsemi.c2007-02-19 10:10:40.0 + +++ linux/drivers/net/natsemi.c 2007-02-19 10:20:45.0 + @@ -568,6 +568,8 @@ u32 intr_status; /* Do not touch the nic registers */ int hands_off; + /* Don't pay attention to the reported link state. */ + int ignore_phy; /* external phy that is used: only valid if dev->if_port != PORT_TP */ int mii; int phy_addr_external; @@ -696,7 +698,10 @@ struct netdev_private *np = netdev_priv(dev); u32 tmp; - netif_carrier_off(dev); + if (np->ignore_phy) + netif_carrier_on(dev); + else + netif_carrier_off(dev); /* get the initial settings from hardware */ tmp= mdio_read(dev, MII_BMCR); @@ -806,8 +811,10 @@ np->hands_off = 0; np->intr_status = 0; np->eeprom_size = natsemi_pci_info[chip_idx].eeprom_size; + np->ignore_phy = 0; /* Initial port: +* - If configured to ignore the PHY set up for external. * - If the nic was configured to use an external phy and if find_mii * finds a phy: use external port, first phy that replies. * - Otherwise: internal port. 
@@ -815,7 +822,7 @@ * The address would be used to access a phy over the mii bus, but * the internal phy is accessed through mapped registers. */ - if (readl(ioaddr + ChipConfig) & CfgExtPhy) + if (np->ignore_phy || readl(ioaddr + ChipConfig) & CfgExtPhy) dev->if_port = PORT_MII; else dev->if_port = PORT_TP; @@ -825,7 +832,9 @@ if (dev->if_port != PORT_TP) { np->phy_addr_external = find_mii(dev); - if (np->phy_addr_external == PHY_ADDR_NONE) { + /* If we're ignoring the PHY it doesn't matter if we can't +* find one. */ + if (!np->ignore_phy && np->phy_addr_external == PHY_ADDR_NONE) { dev->if_port = PORT_TP; np->phy_addr_external = PHY_ADDR_INTERNAL; } @@ -891,6 +900,8 @@ printk("%02x, IRQ %d", dev->dev_addr[i], irq); if (dev->if_port == PORT_TP) printk(", port TP.\n"); + else if (np->ignore_phy) + printk(", port MII, ignoring PHY\n"); else printk(", port MII, phy ad %d.\n", np->phy_addr_external); } @@ -1571,9 +1582,13 @@ { struct netdev_private *np = netdev_priv(dev); void __iomem * ioaddr = ns_ioaddr(dev); - int duplex; + int duplex = np->duplex; u16 bmsr; + /* If we are ignoring the PHY then don't try reading it. */ + if (np->ignore_phy) + goto propagate_state; + /* The link status field is latched: it remains low after a temporary * link failure until it's read. We need the current link status, * thus read twice. @@ -1585,7 +1600,7 @@ if (netif_carrier_ok(dev)) { if (netif_msg_link(np)) printk(KERN_NOTICE "%s: link down.\n", - dev->name); + dev->name); netif_carrier_off(dev); undo_cable_magic(dev); } @@ -1609,6 +1624,7 @@ duplex = 1; } +propagate_state: /* if duplex is set then bit 28 must be set, too */ if (duplex ^ !!(np->rx_config & RxAcceptTx)) { if (netif_msg_link(np)) @@ -2819,6 +2835,15 @@ } /* +* If we're ignoring the PHY then autoneg and the internal +* transciever are really not going to work so don't let the +* user select them. +*/ + if (np->ignore_phy && (ecmd->autoneg == AUTONEG_ENABLE || + ecmd
[patch 2/2] natsemi: Support Aculab E1/T1 PMXc cPCI carrier cards
Aculab E1/T1 PMXc cPCI carrier cards present a natsemi on the cPCI bus
with an oversized EEPROM using a direct MII<->MII connection with no
PHY. This patch adds a new device table entry supporting these cards.

Signed-Off-By: Mark Brown <[EMAIL PROTECTED]>
---
This revision removes extra braces from the previous version.

Index: linux/drivers/net/natsemi.c
===
--- linux.orig/drivers/net/natsemi.c2007-02-19 10:16:50.0 +
+++ linux/drivers/net/natsemi.c 2007-02-19 10:18:25.0 +
@@ -244,6 +244,9 @@
 MII_EN_SCRM = 0x0004, /* enable scrambler (tp) */
 };

+enum {
+ NATSEMI_FLAG_IGNORE_PHY = 0x1,
+};

 /* array of board data directly indexed by pci_tbl[x].driver_data */
 static const struct {
@@ -251,10 +254,12 @@
 unsigned long flags;
 unsigned int eeprom_size;
 } natsemi_pci_info[] __devinitdata = {
+ { "Aculab E1/T1 PMXc cPCI carrier card", NATSEMI_FLAG_IGNORE_PHY, 128 },
 { "NatSemi DP8381[56]", 0, 24 },
 };

 static const struct pci_device_id natsemi_pci_tbl[] __devinitdata = {
+ { PCI_VENDOR_ID_NS, 0x0020, 0x12d9, 0x000c, 0, 0, 0 },
 { PCI_VENDOR_ID_NS, 0x0020, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 },
 { } /* terminate list */
 };
@@ -811,7 +816,10 @@
 np->hands_off = 0;
 np->intr_status = 0;
 np->eeprom_size = natsemi_pci_info[chip_idx].eeprom_size;
- np->ignore_phy = 0;
+ if (natsemi_pci_info[chip_idx].flags & NATSEMI_FLAG_IGNORE_PHY)
+ np->ignore_phy = 1;
+ else
+ np->ignore_phy = 0;

 /* Initial port:
 * - If configured to ignore the PHY set up for external.

-- 
"You grabbed my hand and we fell into it, like a daydream - or a fever."
[patch 0/2] natsemi: Support Aculab E1/T1 cPCI carrier cards
These patches add support for the Aculab E1/T1 cPCI carrier card to the
natsemi driver. The first patch provides support for using the MII port
with no PHY and the second adds the quirks required to detect and
configure the card.

This revision should address the issues raised by Jeff over the weekend.
Apologies if I've missed anything.

-- 
"You grabbed my hand and we fell into it, like a daydream - or a fever."
Re: Re: Strange connection slowdown on pcnet32
On Fri, Feb 16, 2007 at 04:01:57PM -0500, Lennart Sorensen wrote:
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: pcnet32_poll: pcnet32_rx() got 16 packets
> eth1: base: 0x05215812 status: 0310 next->status: 0310
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: pcnet32_poll: pcnet32_rx() got 16 packets
> eth1: base: 0x04c51812 status: 8000 next->status: 0310
> eth1: pcnet32_poll: pcnet32_rx() got 0 packets
> eth1: interrupt csr0=0x6f3 new csr=0x33, csr3=0x.
> eth1: exiting interrupt, csr0=0x0033, csr3=0x5f00.
> eth1: base: 0x04c51812 status: 8000 next->status: 0310
> eth1: pcnet32_poll: pcnet32_rx() got 0 packets
> eth1: interrupt csr0=0x4f3 new csr=0x33, csr3=0x.
> eth1: exiting interrupt, csr0=0x0033, csr3=0x5f00.
> eth1: base: 0x04c51812 status: 8000 next->status: 0310
> eth1: pcnet32_poll: pcnet32_rx() got 0 packets
> eth1: interrupt csr0=0x4f3 new csr=0x33, csr3=0x.
> eth1: exiting interrupt, csr0=0x0433, csr3=0x5f00.
> eth1: base: 0x04c51812 status: 8000 next->status: 0310
> eth1: pcnet32_poll: pcnet32_rx() got 0 packets
>
> So somehow it ends up that when it reads the status of the descriptor at
> address 0x04c51812, it sees the status as 0x8000 (which means owned by
> the MAC I believe), even though the next descriptor in the ring has a
> sensible status, indicating that the descriptor is ready to be handled
> by the driver.
> Since the descriptor isn't ready, we exit without
> handling anything and NAPI reschedules us the next time we get an
> interrupt, and after some random number of tries, we finally see the
> right status and handle the packet, along with a bunch of other packets
> waiting in the descriptor ring. Then we seem to hit the exact same
> descriptor address again, with the same problem in the status we read,
> and again we are stuck for a while, until finally we see the right
> status, and another pile of packets get handled, and we again hit the
> same descriptor address and get stuck.

I have been poking at things with firescope to see if the MAC is
actually writing to system memory or not. The entry that it gets stuck
on is _always_ entry 0 in the rx_ring. There does not appear to be any
exceptions to this.

Here is my firescope (slightly modified for this purpose) dump of the
rx_ring of eth1:

Descriptor:Address: /--base---\ /buf\ /sta\ /-message-\ /reserved-\
          :       : |         | |len| |tus| | length  | |         |
RXdesc[00]:6694000: 12 18 5f 05 fa f9 00 80 40 00 00 00 00 00 00 00
RXdesc[01]:6694010: 12 78 15 05 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[02]:6694020: 12 a0 52 06 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[03]:6694030: 12 f8 c2 04 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[04]:6694040: 12 70 15 05 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[05]:6694050: 12 e8 37 05 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[06]:6694060: 12 e0 37 05 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[07]:6694070: 12 e8 d5 04 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[08]:6694080: 12 e0 d5 04 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[09]:6694090: 12 d8 d1 05 fa f9 40 03 46 00 00 00 00 00 00 00
RXdesc[10]:66940a0: 12 d0 d1 05 fa f9 40 03 4e 00 00 00 00 00 00 00
RXdesc[11]:66940b0: 12 d8 02 05 fa f9 10 03 40 00 00 00 00 00 00 00
RXdesc[12]:66940c0: 12 d0 02 05 fa f9 40 03 46 00 00 00 00 00 00 00
RXdesc[13]:66940d0: 12 38 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[14]:66940e0: 12 30 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[15]:66940f0: 12 78 2c 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[16]:6694100: 12 a0 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[17]:6694110: 12 b0 04 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[18]:6694120: 12 b8 04 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[19]:6694130: 12 70 2c 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[20]:6694140: 12 f8 56 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[21]:6694150: 12 c8 29 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[22]:6694160: 12 20 03 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[23]:6694170: 12 60 4c 05 fa f9 00 80 87 05 00 00 00 00 00 00
RXdesc[24]:6694180: 12 98 53 05 fa f9 00 80 40 00 00 00 00 00 00 00
RXdesc[25]:6694190: 12 b0 cc 04 fa f9 00 80 40 00 00 00 00 00 00 00
RXdesc[26]:66941a0: 12 a8 3f 05 fa f9 00 80 40 00 00 00 00 00 00 00
RXdesc[27]:66941b0: 12 58 e8 04 fa f9 00 80 40 00 00 00 00 00 00 00
RXdesc[28]:66941c0: 12 b0 4d 06 fa f9 00 80 40 00 00 00 00 00 00 00
RXdesc[29]:66941d0: 12 38 ef 04 fa f9 00 80 40 00 00 00 00 00 00 00
RXdesc[30]:66941e0: 12 98 1f 05 fa f9 00 80 40 00
Re: MediaGX/GeodeGX1 requires X86_OOSTORE.
On Mon, Feb 19, 2007 at 11:48:27AM -0800, Roland Dreier wrote:
> > Does anyone know if there is any way to flush a cache line of the cpu to
> > force rereading system memory for a given address or address range?
>
> There is the "clflush" instruction, but not all x86 CPUs support it.
> You need to check the CPUID flag to know for sure (/proc/cpuinfo will
> show a "clflush" flag if it is supported).

Well I will check for that. Of course it is still possible that it is
actually the network chip screwing up somehow.

-- 
Len Sorensen
Re: MediaGX/GeodeGX1 requires X86_OOSTORE.
> Does anyone know if there is any way to flush a cache line of the cpu to
> force rereading system memory for a given address or address range?

There is the "clflush" instruction, but not all x86 CPUs support it.
You need to check the CPUID flag to know for sure (/proc/cpuinfo will
show a "clflush" flag if it is supported).
Re: Extensible hashing and RCU
On 19 Feb 2007 13:04:12 +0100, Andi Kleen <[EMAIL PROTECTED]> wrote: LRU tends to be hell for caches in MP systems, because it writes to the cache lines too and makes them exclusive and more expensive. That's why you let the hardware worry about LRU. You don't write to the upper layers of the splay tree when you don't have to. It's the mere traversal of the upper layers that keeps them in cache, causing the cache hierarchy to mimic the data structure hierarchy. RCU changes the whole game, of course, because you don't write to the old copy at all; you have to clone the altered node and all its ancestors and swap out the root node itself under a spinlock. Except you don't use a spinlock; you have a ring buffer of root nodes and atomically increment the writer index. That atomically incremented index is the only thing on which there's any write contention. (Obviously you need a completion flag on the new root node for the next writer to poll on, so the sequence is atomic-increment ... copy and alter from leaf to root ... wmb() ... mark new root complete.) When you share TCP sessions among CPUs, and packets associated with the same session may hit softirq in any CPU, you are going to eat a lot of interconnect bandwidth keeping the sessions coherent. (The only way out of this is to partition the tuple space by CPU at the NIC layer with separate per-core, or perhaps per-cache, receive queues; at which point the NIC is so smart that you might as well put the DDoS handling there.) But at least it's cache coherency protocol bandwidth and not bandwidth to and from DRAM, which has much nastier latencies. The only reason the data structure matters _at_all_ is that DDoS attacks threaten to evict the working set of real sessions out of cache. That's why you add new sessions at the leaves and don't rotate them up until they're hit a second time. Of course the leaf layer can't be RCU, but it doesn't have to be; it's just a bucket of tuples. 
You need an auxiliary structure to hold the session handshake trackers
for the leaf layer, but you assume that you're always hitting cold cache
when diving into this structure and ration accesses accordingly. Maybe
you even explicitly evict entries from cache after sending the SYNACK,
so they don't crowd other stuff out; they go to DRAM and get pulled into
the new CPU (and rotated up) if and when the next packet in the session
arrives. (I'm assuming T/TCP here, so you can't skimp much on session
tracker size during the handshake.)

Every software firewall I've seen yet falls over under DDoS. If you want
to change that, you're going to need more than the back-of-the-napkin
calculations that show that session lookup bandwidth exceeds frame
throughput for min-size packets. You're going to need to strategize
around exploiting the cache hierarchy already present in your commodity
processor to implicitly partition real traffic from the DDoS storm. It's
not a trivial problem, even in the mathematician's sense (in which all
problems are either trivial or unsolved).

Cheers,
- Michael
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 01:26:42PM -0500, Benjamin LaHaise wrote:
> On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet wrote:
> > So even with a lazy hash function, 89 % of lookups are satisfied with less
> > than 6 compares.
>
> Which sucks, as those are typically going to be cache misses (costing many
> hundreds of cpu cycles). Hash chains fare very poorly under DoS conditions,
> and must be removed under a heavy load. Worst case handling is very
> important next to common case.

I should clarify. Back of the napkin calculations show that there are
only 157 cycles on a 3GHz processor in which to decide what happens to a
packet, which means 1 cache miss is more than enough. In theory we can
get pretty close to line rate with quad core processors, but it
definitely needs some of the features that newer chipsets have for
stuffing packets directly into the cache. I would venture a guess that
we also need to intelligently partition packets so that we make the most
use of available cache resources.

-ben

-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet wrote: > So even with a lazy hash function, 89% of lookups are satisfied with less > than 6 compares. Which sucks, as those are typically going to be cache misses (costing many hundreds of cpu cycles). Hash chains fare very poorly under DoS conditions, and must be removed under a heavy load. Worst case handling is very important next to common case. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>.
Re: Extensible hashing and RCU
On Monday 19 February 2007 16:14, Eric Dumazet wrote: > > Because O(1) is different from O(log(N)) ? > if N = 2^20, it certainly makes a difference. > Yes, 1% of chains might have a length > 10, but yet 99% of the lookups are > touching less than 4 cache lines. > With a binary tree, log2(2^20) is 20. Or maybe not? If you tell me it's 4, > I will be very pleased.

Here is the tcp ehash chain length distribution on a real server :

ehash_addr=0x81047600 ehash_size=1048576
333835 used chains, 3365 used twchains

Distribution of sockets/chain length
[chain length]:number of chains  (cumulative % of sockets)
[1]:221019   37.4645%
[2]:56590    56.6495%
[3]:21250    67.4556%
[4]:12534    75.9541%
[5]:8677     83.3082%
[6]:5862     89.2701%
[7]:3640     93.5892%
[8]:2219     96.5983%
[9]:1083     98.2505%
[10]:539     99.1642%
[11]:244     99.6191%
[12]:112     99.8469%
[13]:39      99.9329%
[14]:16      99.9708%
[15]:6       99.9861%
[16]:3       99.9942%
[17]:2       100%
total : 589942 sockets

So even with a lazy hash function, 89% of lookups are satisfied with less than 6 compares.
Re: Extensible hashing and RCU
On Monday 19 February 2007 15:25, Evgeniy Polyakov wrote: > On Mon, Feb 19, 2007 at 03:14:02PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > > > Forget about cache misses and cache lines - we have a hash table, only > > > part of which is used (part for time-wait sockets, part for established > > > ones). > > > > No, you did not read my mail. Current ehash is not as described by you. > > I did. > And I also said that my tests do not have timewait sockets at all - I > removed sk_for_each and so on, which should effectively increase lookup > time twice on a busy system with lots of created/removed sockets per > timeframe (that is theory from my side already). > Anyway, I ran the same test with an increased table too. > > > > Anyway, even with 2^20 (i.e. when the whole table is only used for > > > established sockets) search time is about 360-370 nsec on 3.7 GHz Core > > > Duo (only one CPU is used) with 2 GB of ram. > > > > Your tests are user land, so unfortunately are biased... > > > > (Unless you use hugetlb data ?) > > No I do not. But the same can be applied to the trie test - it is also > performed in userspace and thus suffers from possible swapping/cache > flushing and so on. > > And I doubt moving the test into the kernel will suddenly end up with 10 times > increased rates. At least some architectures pay a high price using vmalloc() instead of kmalloc(), and TLB misses mean something for them. Not everybody has the latest Intel cpu. Normally, the ehash table is using huge pages. > > Anyway, the trie test (broken implementation) is two times slower than the hash > table (resized already), and it does not include locking issues of the > hash table access and further scalability issues. > You mix apples and oranges. We already know locking has nothing to do with hashing or trie-ing. We *can* put RCU on top of the existing ehash. We also can add hash resizing if we really care.
> I think I need to fix my trie implementation to fully show its > potential, but the original question was why a tree/trie based implementation > is not considered at all although it promises better performance and > scalability. Because you mix performance and scalability. That's not exactly the same thing. Sometimes, high performance means *suboptimal* scalability. Because O(1) is different from O(log(N)) ? if N = 2^20, it certainly makes a difference. Yes, 1% of chains might have a length > 10, but yet 99% of the lookups are touching less than 4 cache lines. With a binary tree, log2(2^20) is 20. Or maybe not? If you tell me it's 4, I will be very pleased.
Re: [PATCH] - drivers/net/hamradio remove local random function, use random32()
On Fri, 2007-02-16 at 09:42 -0800, Joe Perches wrote: > Signed-off-by: Joe Perches <[EMAIL PROTECTED]> Acked-By: Thomas Sailer <[EMAIL PROTECTED]> Thanks a lot! Tom
Re: MediaGX/GeodeGX1 requires X86_OOSTORE.
On Sat, Feb 17, 2007 at 11:11:13PM +0900, takada wrote: > does that mean it doesn't help even without calling set_cx86_reorder()? > this function disables reordering in the range 0x4000: to 0x:. > does pcnet32 access anything outside that range? > > --- arch/i386/Kconfig.cpu~2007-02-05 03:44:54.0 +0900 > +++ arch/i386/Kconfig.cpu 2007-02-17 21:25:52.0 +0900 > @@ -322,7 +322,7 @@ config X86_USE_3DNOW > > config X86_OOSTORE > bool > - depends on (MWINCHIP3D || MWINCHIP2 || MWINCHIPC6) && MTRR > + depends on (MWINCHIP3D || MWINCHIP2 || MWINCHIPC6) && MTRR || MGEODEGX1 > default y > > config X86_TSC Well it turns out that enabling OOSTORE doesn't eliminate the problem, but it does make it go from occurring within seconds to occurring within many hours. I am off to investigate some more. Does anyone know if there is any way to flush a cache line of the cpu to force rereading system memory for a given address or address range? -- Len Sorensen
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 03:14:02PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > > Forget about cache misses and cache lines - we have a hash table, only > > part of which is used (part for time-wait sockets, part for established > > ones). > > No, you did not read my mail. Current ehash is not as described by you. I did. And I also said that my tests do not have timewait sockets at all - I removed sk_for_each and so on, which should effectively increase lookup time twice on a busy system with lots of created/removed sockets per timeframe (that is theory from my side already). Anyway, I ran the same test with an increased table too. > > Anyway, even with 2^20 (i.e. when the whole table is only used for > established sockets) search time is about 360-370 nsec on 3.7 GHz Core > Duo (only one CPU is used) with 2 GB of ram. > > Your tests are user land, so unfortunately are biased... > > (Unless you use hugetlb data ?) No I do not. But the same can be applied to the trie test - it is also performed in userspace and thus suffers from possible swapping/cache flushing and so on. And I doubt moving the test into the kernel will suddenly end up with 10 times increased rates. Anyway, the trie test (broken implementation) is two times slower than the hash table (resized already), and it does not include locking issues of the hash table access and further scalability issues. I think I need to fix my trie implementation to fully show its potential, but the original question was why a tree/trie based implementation is not considered at all although it promises better performance and scalability. -- Evgeniy Polyakov
Re: [Bug 8013] New: select for write hangs on a socket after write returned ECONNRESET
On 17-02-2007 17:25, Evgeniy Polyakov wrote: > On Fri, Feb 16, 2007 at 09:34:27PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: >> Otherwise we can extend the select output mask to include hangup too >> (taking into account that hangup is actually an output event). > > This is another possible way to fix select after write after connection > reset. I hope you know what you are doing and that this will change functionality for some users. In my opinion it looks like a problem of interpretation and not a bug. From tcp.c: " * Some poll() documentation says that POLLHUP is incompatible * with the POLLOUT/POLLWR flags, so somebody should check this * all. But careful, it tends to be safer to return too many * bits than too few, and you can easily break real applications * if you don't tell them that something has hung up! ... * Actually, it is interesting to look how Solaris and DUX * solve this dilemma. I would prefer, if PULLHUP were maskable, * then we could set it on SND_SHUTDOWN. BTW examples given * in Stevens' books assume exactly this behaviour, it explains * why PULLHUP is incompatible with POLLOUT.--ANK * * NOTE. Check for TCP_CLOSE is added. The goal is to prevent * blocking on fresh not-connected or disconnected socket. --ANK */" So it seems ANK hesitated and somebody chose not to do this - maybe for some reason... Regards, Jarek P.
Re: Extensible hashing and RCU
On Monday 19 February 2007 14:56, Evgeniy Polyakov wrote: > On Mon, Feb 19, 2007 at 02:38:13PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > > On Monday 19 February 2007 12:41, Evgeniy Polyakov wrote: > > > > 1 microsecond ? Are you kidding ? We want no more than 50 ns. > > > > > > Theory again. > > > > Theory is nice, but I personally prefer oprofile :) > > I base my comments on real facts. > > We *want* 50 ns tcp lookups (2 cache line misses, one with reader intent, > > one for exclusive access intent) > > I said that your words are theory in previous mails :) > > Current code works 10 times worse than you expect. > > > > Existing table does not scale that good - I created (1<<20)/2 (to cover > > > only the established part) entries table and filled it with 1 million of > > > random entries - search time is about half of a microsecond. > > > > I use exactly 2^20 slots, not 2^19 (see commit > > dbca9b2750e3b1ee6f56a616160ccfc12e8b161f , where I changed the layout of the > > ehash table so that two chains (established/timewait) are on the same > > cache line. every cache miss *counts*) > > Forget about cache misses and cache lines - we have a hash table, only > part of which is used (part for time-wait sockets, part for established > ones). No, you did not read my mail. Current ehash is not as described by you. > > Anyway, even with 2^20 (i.e. when the whole table is only used for > established sockets) search time is about 360-370 nsec on 3.7 GHz Core > Duo (only one CPU is used) with 2 GB of ram. Your tests are user land, so unfortunately are biased... (Unless you use hugetlb data ?)
Re: Extensible hashing and RCU
Actually for socket code any other binary tree will work perfectly ok - socket code does not have wildcards (except listening sockets), so it is possible to combine all values into one search key used in a flat one-dimensional tree - it scales as hell and still allows very fast lookups. As for cache usage - such trees can be combined with different protocols to increase cache locality. The only reason I implemented a trie is that netchannels support wildcards, that is how netfilter is implemented on top of them. A tree with lazy deletion (i.e. without deletion at all) can be moved to RCU very easily. -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 02:38:13PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > On Monday 19 February 2007 12:41, Evgeniy Polyakov wrote: > > > > 1 microsecond ? Are you kidding ? We want no more than 50 ns. > > > > Theory again. > > Theory is nice, but I personally prefer oprofile :) > I base my comments on real facts. > We *want* 50 ns tcp lookups (2 cache line misses, one with reader intent, one > for exclusive access intent) I said that your words are theory in previous mails :) Current code works 10 times worse than you expect. > > Existing table does not scale that good - I created (1<<20)/2 (to cover > > only the established part) entries table and filled it with 1 million of random > > entries - search time is about half of a microsecond. > > I use exactly 2^20 slots, not 2^19 (see commit > dbca9b2750e3b1ee6f56a616160ccfc12e8b161f , where I changed the layout of the ehash > table so that two chains (established/timewait) are on the same cache line. > every cache miss *counts*) Forget about cache misses and cache lines - we have a hash table, only part of which is used (part for time-wait sockets, part for established ones). Anyway, even with 2^20 (i.e. when the whole table is only used for established sockets) search time is about 360-370 nsec on 3.7 GHz Core Duo (only one CPU is used) with 2 GB of ram. > http://www.mail-archive.com/netdev@vger.kernel.org/msg31096.html > > (Of course, you may have to change MAX_ORDER to 14 or else the hash table hits > the MAX_ORDER limit) > > Search time under 100 ns, for real traffic (kind of random... but not quite) > Most of this time is taken by the rwlock, so expect 50 ns once RCU is finally > in... My experiment shows almost 400 nsecs without _any_ locks - they are removed completely - it is pure hash selection/list traverse time. > In your tests, please make sure a User process is actually doing real work on > each CPU, ie evicting cpu caches every ms...
> > The rule is : On a normal machine, cpu caches contain UserMode data, not > kernel data. (as a typical machine spends 15% of its cpu time in kernel land, > and 85% in User land). You can assume kernel text is in cache, but even this > assumption may be wrong. In my tests _only_ hash tables are in memory (well with some bits of other stuff) - I use exactly the same approach for both trie and hash table tests - the table/trie is allocated, filled, and lookup of random values is performed in a loop. It is done in userspace - I just moved list.h, inet_hashtables.h and other needed files into a separate project and compiled them (with removed locks, atomic operations and other pure kernel stuff). So actual time is even higher for the hash table - at least it requires locks while the trie implementation works with RCU. -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Monday 19 February 2007 12:41, Evgeniy Polyakov wrote: > > 1 microsecond ? Are you kidding ? We want no more than 50 ns. > > Theory again. Theory is nice, but I personally prefer oprofile :) I base my comments on real facts. We *want* 50 ns tcp lookups (2 cache line misses, one with reader intent, one for exclusive access intent) > > Existing table does not scale that good - I created (1<<20)/2 (to cover > only the established part) entries table and filled it with 1 million of random > entries - search time is about half of a microsecond. I use exactly 2^20 slots, not 2^19 (see commit dbca9b2750e3b1ee6f56a616160ccfc12e8b161f , where I changed the layout of the ehash table so that two chains (established/timewait) are on the same cache line. every cache miss *counts*) http://www.mail-archive.com/netdev@vger.kernel.org/msg31096.html (Of course, you may have to change MAX_ORDER to 14 or else the hash table hits the MAX_ORDER limit) Search time under 100 ns, for real traffic (kind of random... but not quite) Most of this time is taken by the rwlock, so expect 50 ns once RCU is finally in... In your tests, please make sure a User process is actually doing real work on each CPU, ie evicting cpu caches every ms... The rule is : On a normal machine, cpu caches contain UserMode data, not kernel data. (as a typical machine spends 15% of its cpu time in kernel land, and 85% in User land). You can assume kernel text is in cache, but even this assumption may be wrong.
Re: Extensible hashing and RCU
Andi Kleen writes: > > If not, you lose. > > It all depends on if the higher levels of the trie are small > enough to be kept in cache. Even with two cache misses it might > still break even, but have better scalability. Yes, the trick is to keep the root large to allow a very flat tree and few cache misses. Stefan Nilsson (author of LC-trie) and I were able to improve the LC-trie quite a bit; we called this trie+hash -> trash. The paper discusses trie/hash... (you've seen it) http://www.nada.kth.se/~snilsson/public/papers/trash/ > Another advantage would be to eliminate the need for large memory > blocks, which cause problems too e.g. on NUMA. It certainly would > save quite some memory if the tree levels are allocated on demand > only. However breaking it up might also cost more TLB misses, > but those could be eliminated by preallocating the tree in > the same way as the hash today. Don't know if it's needed or not. > > I guess someone needs to code it up and try it. I've implemented trie/trash as a replacement for the dst hash with full key lookup for ipv4 (unicache) to start with. It is still focusing on the nasty parts, packet forwarding, as we don't want to break this. So the benefits of full flow lookup are not accounted for. I.e. the full flow lookup could give the socket at no cost and do some conntrack support like Evgeniy did in the netchannels patches. Below, some recent comparisons and profiles for packet forwarding. Input: 2 * 65k concurrent flows, eth0->eth1, eth2->eth3 in forwarding, on separate CPUs. Opteron 2218 (2.6 GHz), net-2.6.21 git. Numbers are very approximative but should still be representative. Profiles are collected.
Performance comparison
Table below holds: dst-entries in use, lookup hits + slow path = total kpps

Flowlen 40
250k  1020 + 21  = 1041 kpps   Vanilla rt_hash=32k
1M     950 + 29  =  979 kpps   Vanilla rt_hash=131k
260k   980 + 24  = 1004 kpps   Unicache

Flowlen 4 (rdos)
290k   560 + 162 =  722 kpps   Vanilla rt_hash=32k
1M     400 + 165 =  565 kpps   Vanilla rt_hash=131k
230k   570 + 170 =  740 kpps   Unicache

unicache flen=4 pkts
c02df84f 5257 7.72078  tkey_extract_bits
c023151a 5230 7.68112  e1000_clean_rx_irq
c02df908 3306 4.85541  tkey_equals
c014cf31 3296 4.84072  kfree
c02f8c3b 3067 4.5044   ip_route_input
c02fbdf0 2948 4.32963  ip_forward
c023024e 2809 4.12548  e1000_xmit_frame
c02e06f1 2792 4.10052  trie_lookup
c02fd764 2159 3.17085  ip_output
c032591c 1965 2.88593  fn_trie_lookup
c014cd82 1456 2.13838  kmem_cache_alloc
c02fa941 1337 1.96361  ip_rcv
c014ced0 1334 1.9592   kmem_cache_free
c02e1538 1289 1.89311  unicache_tcp_establish
c02e2d70 1218 1.78884  dev_queue_xmit
c02e31af 1074 1.57735  netif_receive_skb
c02f8484 1053 1.54651  ip_route_input_slow
c02db552  987 1.44957  __alloc_skb
c02e626f  913 1.34089  dst_alloc
c02edaad  828 1.21606  __qdisc_run
c0321ccf  810 1.18962  fib_get_table
c02e14c1  782 1.1485   match_pktgen
c02e6375  766 1.125    dst_destroy
c02e10e8  728 1.06919  unicache_hash_code
c0231242  647 0.950227 e1000_clean_tx_irq
c02f7d23  625 0.917916 ipv4_dst_destroy

unicache flen=40 pkts
c023151a 6742 10.3704  e1000_clean_rx_irq
c02df908 4553 7.00332  tkey_equals
c02fbdf0 4455 6.85258  ip_forward
c02e06f1 4067 6.25577  trie_lookup
c02f8c3b 3951 6.07734  ip_route_input
c02df84f 3929 6.0435   tkey_extract_bits
c023024e 3538 5.44207  e1000_xmit_frame
c014cf31 3152 4.84834  kfree
c02fd764 2711 4.17     ip_output
c02e1538 1930 2.96868  unicache_tcp_establish
c02fa941 1696 2.60875  ip_rcv
c02e31af 1466 2.25497  netif_receive_skb
c02e2d70 1412 2.17191  dev_queue_xmit
c014cd82 1397 2.14883  kmem_cache_alloc
c02db552 1394 2.14422  __alloc_skb
c02edaad 1032 1.5874   __qdisc_run
c02ed6b8  957 1.47204  eth_header
c02e15dd  904 1.39051  unicache_garbage_collect_active
c02db94e  861 1.32437  kfree_skb
c0231242  794 1.22131  e1000_clean_tx_irq
c022fd58  778 1.1967   e1000_tx_map
c014ce73  756 1.16286  __kmalloc
c014ced0  740 1.13825  kmem_cache_free
c02e14c1  701 1.07826  match_pktgen
c023002c  621 0.955208 e1000_tx_queue
c02e78fa  519 0.798314 neigh_resolve_output

Vanilla w. flen=4 pkts rt_hash=32k
c02f6852 15704 22.9102 ip_route_input
c023151a  5324 7.76705 e1000_clean_rx_irq
c02f84a1  4457 6.5022  ip_rcv
c02f9950  3065 4.47145 ip_forward
c023024e  2630 3.83684 e1000_xmit_frame
c0323380  2343 3.41814 fn_trie_lookup
c02fb2c4  2181 3.1818  ip_output
c02f4a3b  1839 2.68287 rt_intern_hash
c02f4480  1762 2.57054 rt_may_expire
c02f60
Re: [PATCH 3/4] 8139too: RTNL and flush_scheduled_work deadlock
On Fri, Feb 16, 2007 at 09:20:34PM +0100, Francois Romieu wrote: > Jarek Poplawski <[EMAIL PROTECTED]> : ... > > > @@ -1603,18 +1605,21 @@ static void rtl8139_thread (struct work_struct *work) > > > struct net_device *dev = tp->mii.dev; > > > unsigned long thr_delay = next_tick; > > > > > > + rtnl_lock(); > > > + > > > + if (!netif_running(dev)) > > > + goto out_unlock; > > > > I wonder why you don't do netif_running before > > rtnl_lock ? It's an atomic operation. And I'm not sure if increasing > > the rtnl_lock range is really needed here. > > threadA: netif_running() > user task B: rtnl_lock() > user task B: dev->close() > user task B: rtnl_unlock() > threadA: rtnl_lock() > threadA: mess with closed device > > Btw, the thread runs every 3*HZ at most. You are right (mostly)! But I think rtnl_lock is special and should be spared (even this 3*HZ) and here it's used for a mainly internal purpose (close synchronization). And it looks like the holding of rtnl_lock is increased mainly for this internal reason. And because rtnl_lock is quite popular, you have to take into consideration that after this 3*HZ it could spend some time waiting for the lock. So maybe it would be nicer to check netif_running twice (again after rtnl_lock where needed), but maybe it's a matter of taste only, and yours is better, as well. (Btw. I didn't verify this, but I hope you checked that places not under rtnl_lock before the patch are safe from locking problems now.) Jarek P.
Re: Extensible hashing and RCU
On Sun, Feb 18, 2007 at 09:21:30PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > Evgeniy Polyakov a écrit : > >On Sun, Feb 18, 2007 at 07:46:22PM +0100, Eric Dumazet > >([EMAIL PROTECTED]) wrote: > >>>Why does anyone not want to use a trie - for socket-like loads it has > >>>exactly constant search/insert/delete time and scales as hell. > >>> > >>Because we want to be *very* fast. You cannot beat a hash table. > >> > >>Say you have 1.000.000 tcp connections, with 50.000 incoming packets per > >>second to *random* streams... > > > >What is really good in a trie is that you may have up to 2^32 connections > >without _any_ difference in lookup performance of random streams. > > So are you speaking of one memory cache miss per lookup ? > If not, you lose. With a trie a big part of it _does_ live in cache, compared to a hash table where similar addresses end up in completely different hash entries. > >>With a 2^20 hashtable, a lookup uses one cache line (the hash head > >>pointer) plus one cache line to get the socket (you need it to access its > >>refcounter) > >> > >>Several attempts were done in the past to add RCU to the ehash table (last > >>done by Benjamin LaHaise last March). I believe this was delayed a bit, > >>because David would like to be able to resize the hash table... > > > >This is a theory. > > Not theory, but actual practice, on a real machine. > > # cat /proc/net/sockstat > sockets: used 918944 > TCP: inuse 925413 orphan 7401 tw 4906 alloc 926292 mem 304759 > UDP: inuse 9 > RAW: inuse 0 > FRAG: inuse 9 memory 18360 Theory is speculation about performance. A highly cache-optimized bubble sort is still much worse than a cache-unoptimized binary tree. > >Practice includes the cost of hashing, locking, and list traversal > >(each pointer is in its own cache line btw, which must be fetched) and plus > >the same for time wait sockets (if we are unlucky).
> > > >No need to talk about the price of a cache miss when there might be more > >serious problems - for example the length of the linked list to traverse each > >time a new packet is received. > > > >For example lookup time in a trie with 1.6 million random 3-dimensional > >32bit (saddr/daddr/ports) entries is about 1 microsecond on an amd athlon64 > >3500 cpu (test was run in a userspace emulator though). > > 1 microsecond ? Are you kidding ? We want no more than 50 ns. Theory again. The existing table does not scale that good - I created a (1<<20)/2 (to cover only the established part) entries table and filled it with 1 million random entries - search time is about half of a microsecond. Wanna see the code? I copied the Linux hash table magic into userspace and ran the same inet_hash() and inet_lookup() in a loop. Result above. Trie is still 2 times worse, but I've just found a bug in my implementation. -- Evgeniy Polyakov
[PATCH 9/18] [TCP] FRTO: Response should reset also snd_cwnd_cnt
Since the purpose is to reduce CWND, we prevent immediate growth. This is not a major issue, nor is "the correct way" specified anywhere. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 2679279..9637abd 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2490,6 +2490,7 @@ static int tcp_ack_update_window(struct static void tcp_conservative_spur_to_response(struct tcp_sock *tp) { tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh); + tp->snd_cwnd_cnt = 0; tcp_moderate_cwnd(tp); } -- 1.4.2
[PATCH 10/18] [TCP]: Don't enter to fast recovery while using FRTO
Because TCP is not in Loss state during FRTO recovery, fast recovery could be triggered by accident. Non-SACK FRTO is more robust than the not yet included SACK-enhanced version (which can receive a high number of duplicate ACKs with SACK blocks during FRTO), at least with unidirectional transfers, but under extraordinary patterns fast recovery can be incorrectly triggered, e.g., data loss + ACK losses => cumulative ACK with enough SACK blocks to meet the sacked_out >= dupthresh condition. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 4 + 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9637abd..309da3e 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1547,6 +1547,10 @@ static int tcp_time_to_recover(struct so { __u32 packets_out; + /* Do not perform any recovery during FRTO algorithm */ + if (tp->frto_counter) + return 0; + /* Trick#1: The loss is proven. */ if (tp->lost_out) return 1; -- 1.4.2
[PATCH 15/18] [TCP] FRTO: Fake cwnd for ssthresh callback
TCP without FRTO would be in Loss state with small cwnd. FRTO, however, leaves cwnd (typically) to a larger value which causes ssthresh to become too large in case RTO is triggered again compared to what conventional recovery would do. Because consecutive RTOs result in only a single ssthresh reduction, RTO+cumulative ACK+RTO pattern is required to trigger this event. A large comment is included for congestion control module writers trying to figure out what CA_EVENT_FRTO handler should do because there exists a remote possibility of incompatibility between FRTO and module defined ssthresh functions. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 26 +- 1 files changed, 25 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 2c0b387..5d935b1 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1289,7 +1289,31 @@ void tcp_enter_frto(struct sock *sk) ((icsk->icsk_ca_state == TCP_CA_Loss || tp->frto_counter) && !icsk->icsk_retransmits)) { tp->prior_ssthresh = tcp_current_ssthresh(sk); - tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); + /* Our state is too optimistic in ssthresh() call because cwnd +* is not reduced until tcp_enter_frto_loss() when previous FRTO +* recovery has not yet completed. Pattern would be this: RTO, +* Cumulative ACK, RTO (2xRTO for the same segment does not end +* up here twice). +* RFC4138 should be more specific on what to do, even though +* RTO is quite unlikely to occur after the first Cumulative ACK +* due to back-off and complexity of triggering events ... +*/ + if (tp->frto_counter) { + u32 stored_cwnd; + stored_cwnd = tp->snd_cwnd; + tp->snd_cwnd = 2; + tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); + tp->snd_cwnd = stored_cwnd; + } else { + tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); + } + /* ... 
in theory, cong.control module could do "any tricks" in +* ssthresh(), which means that ca_state, lost bits and lost_out +* counter would have to be faked before the call occurs. We +* consider that too expensive, unlikely and hacky, so modules +* using these in ssthresh() must deal these incompatibility +* issues if they receives CA_EVENT_FRTO and frto_counter != 0 +*/ tcp_ca_event(sk, CA_EVENT_FRTO); } -- 1.4.2
[PATCH 18/18] [TCP] FRTO: Sysctl documentation for SACK enhanced version
The description is overly verbose to avoid ambiguity between "SACK enabled" and "SACK enhanced FRTO" Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- Documentation/networking/ip-sysctl.txt |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index a0f6842..d66777b 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -178,7 +178,10 @@ tcp_frto - BOOLEAN Enables F-RTO, an enhanced recovery algorithm for TCP retransmission timeouts. It is particularly beneficial in wireless environments where packet loss is typically due to random radio interference - rather than intermediate router congestion. + rather than intermediate router congestion. If set to 1, basic + version is enabled. 2 enables SACK enhanced FRTO, which is + EXPERIMENTAL. The basic version can be used also when SACK is + enabled for a flow through tcp_sack sysctl. tcp_keepalive_time - INTEGER How often TCP sends out keepalive messages when keepalive is enabled. -- 1.4.2
[PATCH 14/18] [TCP] FRTO: Reverse RETRANS bit clearing logic
Previously RETRANS bits were cleared on the entry to FRTO. We postpone that into tcp_enter_frto_loss, which is really the place where the clearing should be done anyway. This allows simplification of the logic from a clearing loop to clearing of the head skb only. Besides, the other changes made in the previous patches to tcp_use_frto made it impossible for the non-SACKed FRTO to be entered if anything other than the head has been rexmitted. With SACK-enhanced FRTO (and Appendix B), however, there can be a number of retransmissions in flight when RTO expires (the same thing could happen before this patchset also with non-SACK FRTO). To avoid introducing any jumpiness into the packet counting during FRTO, instead of clearing RETRANS bits from skbs during entry, do it later on. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 35 +++ 1 files changed, 23 insertions(+), 12 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 8f0aa9d..2c0b387 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1268,7 +1268,11 @@ int tcp_use_frto(struct sock *sk) /* RTO occurred, but do not yet enter Loss state. Instead, defer RTO * recovery a bit and use heuristics in tcp_process_frto() to detect if - * the RTO was spurious. + * the RTO was spurious. Only clear SACKED_RETRANS of the head here to + * keep retrans_out counting accurate (with SACK F-RTO, other than head + * may still have that bit set); TCPCB_LOST and remaining SACKED_RETRANS + * bits are handled if the Loss state is really to be entered (in + * tcp_enter_frto_loss). * * Do like tcp_enter_loss() would; when RTO expires the second time it * does: @@ -1289,17 +1293,13 @@ void tcp_enter_frto(struct sock *sk) tcp_ca_event(sk, CA_EVENT_FRTO); } - /* Have to clear retransmission markers here to keep the bookkeeping -* in shape, even though we are not yet in Loss state. -* If something was really lost, it is eventually caught up -* in tcp_enter_frto_loss. 
-*/ - tp->retrans_out = 0; tp->undo_marker = tp->snd_una; tp->undo_retrans = 0; - sk_stream_for_retrans_queue(skb, sk) { + skb = skb_peek(&sk->sk_write_queue); + if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) { TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS; + tp->retrans_out -= tcp_skb_pcount(skb); } tcp_sync_left_out(tp); @@ -1313,7 +1313,7 @@ void tcp_enter_frto(struct sock *sk) * which indicates that we should follow the traditional RTO recovery, * i.e. mark everything lost and do go-back-N retransmission. */ -static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments) +static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments, int flag) { struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; @@ -1322,10 +1322,21 @@ static void tcp_enter_frto_loss(struct s tp->sacked_out = 0; tp->lost_out = 0; tp->fackets_out = 0; + tp->retrans_out = 0; sk_stream_for_retrans_queue(skb, sk) { cnt += tcp_skb_pcount(skb); - TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST; + /* +* Count the retransmission made on RTO correctly (only when +* waiting for the first ACK and did not get it)... +*/ + if ((tp->frto_counter == 1) && !(flag&FLAG_DATA_ACKED)) { + tp->retrans_out += tcp_skb_pcount(skb); + /* ...enter this if branch just for the first segment */ + flag |= FLAG_DATA_ACKED; + } else { + TCP_SKB_CB(skb)->sacked &= ~(TCPCB_LOST|TCPCB_SACKED_RETRANS); + } if (!(TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_ACKED)) { /* Do not mark those segments lost that were @@ -2550,7 +2561,7 @@ static int tcp_process_frto(struct sock inet_csk(sk)->icsk_retransmits = 0; if (!before(tp->snd_una, tp->frto_highmark)) { - tcp_enter_frto_loss(sk, tp->frto_counter + 1); + tcp_enter_frto_loss(sk, tp->frto_counter + 1, flag); return 1; } @@ -2562,7 +2573,7 @@ static int tcp_process_frto(struct sock return 1; if (!(flag&FLAG_DATA_ACKED)) { - tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 0 : 3)); + tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 
0 : 3), flag); return 1; } -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 17/18] [TCP]: SACK enhanced FRTO
Implements the SACK-enhanced FRTO given in RFC4138 using the variant given in Appendix B. RFC4138, Appendix B: "This means that in order to declare timeout spurious, the TCP sender must receive an acknowledgment for non-retransmitted segment between SND.UNA and RecoveryPoint in algorithm step 3. RecoveryPoint is defined in conservative SACK-recovery algorithm [RFC3517]" The basic version of the FRTO algorithm can still be used when SACK is enabled. To enable the SACK-enhanced version, set the tcp_frto sysctl to 2. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 76 +++--- 1 files changed, 65 insertions(+), 11 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 356de02..3ce4019 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -100,6 +100,7 @@ #define FLAG_DATA_SACKED0x20 /* New SAC #define FLAG_ECE 0x40 /* ECE in this ACK */ #define FLAG_DATA_LOST 0x80 /* SACK detected data lossage. */ #define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/ +#define FLAG_ONLY_ORIG_SACKED 0x200 /* SACKs only non-rexmit sent before RTO */ #define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED) #define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED) @@ -110,6 +111,8 @@ #define IsReno(tp) ((tp)->rx_opt.sack_ok #define IsFack(tp) ((tp)->rx_opt.sack_ok & 2) #define IsDSack(tp) ((tp)->rx_opt.sack_ok & 4) +#define IsSackFrto() (sysctl_tcp_frto == 0x2) + #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH) /* Adapt the MSS value used to make delayed ack decision to the @@ -1159,6 +1162,18 @@ tcp_sacktag_write_queue(struct sock *sk, /* clear lost hint */ tp->retransmit_skb_hint = NULL; } + /* SACK enhanced F-RTO detection. +* Set flag if and only if non-rexmitted +* segments below frto_highmark are +* SACKed (RFC4138; Appendix B). 
+* Clearing correct due to in-order walk +*/ + if (after(end_seq, tp->frto_highmark)) { + flag &= ~FLAG_ONLY_ORIG_SACKED; + } else { + if (!(sacked & TCPCB_RETRANS)) + flag |= FLAG_ONLY_ORIG_SACKED; + } } TCP_SKB_CB(skb)->sacked |= TCPCB_SACKED_ACKED; @@ -1240,7 +1255,8 @@ #endif /* F-RTO can only be used if these conditions are satisfied: * - there must be some unsent new data * - the advertised window should allow sending it - * - TCP has never retransmitted anything other than head + * - TCP has never retransmitted anything other than head (SACK enhanced + *variant from Appendix B of RFC4138 is more robust here) */ int tcp_use_frto(struct sock *sk) { @@ -1252,6 +1268,9 @@ int tcp_use_frto(struct sock *sk) tp->snd_una + tp->snd_wnd)) return 0; + if (IsSackFrto()) + return 1; + /* Avoid expensive walking of rexmit queue if possible */ if (tp->retrans_out > 1) return 0; @@ -1328,9 +1347,18 @@ void tcp_enter_frto(struct sock *sk) } tcp_sync_left_out(tp); + /* Earlier loss recovery underway (see RFC4138; Appendix B). +* The last condition is necessary at least in tp->frto_counter case. +*/ + if (IsSackFrto() && (tp->frto_counter || + ((1 << icsk->icsk_ca_state) & (TCPF_CA_Recovery|TCPF_CA_Loss))) && + after(tp->high_seq, tp->snd_una)) { + tp->frto_highmark = tp->high_seq; + } else { + tp->frto_highmark = tp->snd_nxt; + } tcp_set_ca_state(sk, TCP_CA_Disorder); tp->high_seq = tp->snd_nxt; - tp->frto_highmark = tp->snd_nxt; tp->frto_counter = 1; } @@ -2566,6 +2594,10 @@ static void tcp_conservative_spur_to_res * Rationale: if the RTO was spurious, new ACKs should arrive from the * original window even after we transmit two new data segments. * + * SACK version: + * on first step, wait until first cumulative ACK arrives, then move to + * the second step. In second step, the next ACK decides. 
+ * * F-RTO is implemented (mainly) in four functions: * - tcp_use_frto() is used to determine if TCP is can use F-RTO * - tcp_enter_frto() prepares TCP state on RTO if F-RTO is used, it is @@ -
[PATCH 16/18] [TCP]: Prevent reordering adjustments during FRTO
To be honest, I'm not too sure how the reord stuff works in the first place but this seems necessary. When FRTO has been active, the one and only retransmission could be unnecessary but the state and sending order might not be what the sacktag code expects it to be (to work correctly). Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 5d935b1..356de02 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1224,7 +1224,8 @@ tcp_sacktag_write_queue(struct sock *sk, tp->left_out = tp->sacked_out + tp->lost_out; - if ((reord < tp->fackets_out) && icsk->icsk_ca_state != TCP_CA_Loss) + if ((reord < tp->fackets_out) && icsk->icsk_ca_state != TCP_CA_Loss && + (tp->frto_highmark && after(tp->snd_una, tp->frto_highmark))) tcp_update_reordering(sk, ((tp->fackets_out + 1) - reord), 0); #if FASTRETRANS_DEBUG > 0 -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 8/18] [TCP] FRTO: fixes fallback to conventional recovery
The FRTO detection did not account for how the ACK pattern affects the cwnd calculation of conventional recovery. This caused cwnd to be set incorrectly when the fallback became necessary. The knowledge tcp_process_frto() has about the incoming ACK is now passed on to tcp_enter_frto_loss() in the allowed_segments parameter, which gives the number of segments that must be added to packets-in-flight while calculating the new cwnd. Instead of snd_una we use FLAG_DATA_ACKED in duplicate ACK detection because RFC4138 states (in Section 2.2): If the first acknowledgment after the RTO retransmission does not acknowledge all of the data that was retransmitted in step 1, the TCP sender reverts to the conventional RTO recovery. Otherwise, a malicious receiver acknowledging partial segments could cause the sender to declare the timeout spurious in a case where data was lost. If the next ACK after RTO is a duplicate, we do not retransmit anything, which is equal to what conservative conventional recovery does in such a case. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 14 +- 1 files changed, 9 insertions(+), 5 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 5831daa..2679279 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1296,7 +1296,7 @@ void tcp_enter_frto(struct sock *sk) * which indicates that we should follow the traditional RTO recovery, * i.e. mark everything lost and do go-back-N retransmission. 
*/ -static void tcp_enter_frto_loss(struct sock *sk) +static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments) { struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; @@ -1326,7 +1326,7 @@ static void tcp_enter_frto_loss(struct s } tcp_sync_left_out(tp); - tp->snd_cwnd = tp->frto_counter + tcp_packets_in_flight(tp)+1; + tp->snd_cwnd = tcp_packets_in_flight(tp) + allowed_segments; tp->snd_cwnd_cnt = 0; tp->snd_cwnd_stamp = tcp_time_stamp; tp->undo_marker = 0; @@ -2527,6 +2527,11 @@ static void tcp_process_frto(struct sock if (flag&FLAG_DATA_ACKED) inet_csk(sk)->icsk_retransmits = 0; + if (!before(tp->snd_una, tp->frto_highmark)) { + tcp_enter_frto_loss(sk, tp->frto_counter + 1); + return; + } + /* RFC4138 shortcoming in step 2; should also have case c): ACK isn't * duplicate nor advances window, e.g., opposite dir data, winupdate */ @@ -2534,9 +2539,8 @@ static void tcp_process_frto(struct sock !(flag&FLAG_FORWARD_PROGRESS)) return; - if (tp->snd_una == prior_snd_una || - !before(tp->snd_una, tp->frto_highmark)) { - tcp_enter_frto_loss(sk); + if (!(flag&FLAG_DATA_ACKED)) { + tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 0 : 3)); return; } -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/18] [TCP] FRTO: Moved tcp_use_frto from tcp.h to tcp_input.c
In addition, removed inline. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- include/net/tcp.h| 14 +- net/ipv4/tcp_input.c | 13 + 2 files changed, 14 insertions(+), 13 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 5c472f2..572a77b 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -341,6 +341,7 @@ extern struct sock *tcp_check_req(stru extern int tcp_child_process(struct sock *parent, struct sock *child, struct sk_buff *skb); +extern int tcp_use_frto(const struct sock *sk); extern voidtcp_enter_frto(struct sock *sk); extern voidtcp_enter_loss(struct sock *sk, int how); extern voidtcp_clear_retrans(struct tcp_sock *tp); @@ -1033,19 +1034,6 @@ static inline int tcp_paws_check(const s #define TCP_CHECK_TIMER(sk) do { } while (0) -static inline int tcp_use_frto(const struct sock *sk) -{ - const struct tcp_sock *tp = tcp_sk(sk); - - /* F-RTO must be activated in sysctl and there must be some -* unsent new data, and the advertised window should allow -* sending it. -*/ - return (sysctl_tcp_frto && sk->sk_send_head && - !after(TCP_SKB_CB(sk->sk_send_head)->end_seq, - tp->snd_una + tp->snd_wnd)); -} - static inline void tcp_mib_init(void) { /* See RFC 2012 */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c5be3d0..294cb44 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1236,6 +1236,19 @@ #endif return flag; } +int tcp_use_frto(const struct sock *sk) +{ + const struct tcp_sock *tp = tcp_sk(sk); + + /* F-RTO must be activated in sysctl and there must be some +* unsent new data, and the advertised window should allow +* sending it. +*/ + return (sysctl_tcp_frto && sk->sk_send_head && + !after(TCP_SKB_CB(sk->sk_send_head)->end_seq, + tp->snd_una + tp->snd_wnd)); +} + /* RTO occurred, but do not yet enter loss state. Instead, transmit two new * segments to see from the next ACKs whether any data was really missing. * If the RTO was spurious, new ACKs should arrive. 
-- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 13/18] [TCP] FRTO: Entry is allowed only during (New)Reno like recovery
This interpretation comes from RFC4138: "If the sender implements some loss recovery algorithm other than Reno or NewReno [FHG04], the F-RTO algorithm SHOULD NOT be entered when earlier fast recovery is underway." I think the RFC means to say (especially in the light of Appendix B) that ...recovery is underway (not just fast recovery) or was underway when it was interrupted by an earlier (F-)RTO that hasn't yet been resolved (snd_una has not advanced enough). Thus, my interpretation is that whenever TCP has ever retransmitted anything other than the head, the basic version cannot be used, because then the ordering assumptions FRTO is based on do not hold. NewReno has only the head segment retransmitted at a time. Therefore, walk up to the first segment that has not been SACKed; if neither that segment nor anything before it has been retransmitted, we know for sure that nothing after the non-SACKed segment has been either. This assumption is valid because TCPCB_EVER_RETRANS does not leave holes but each non-SACKed segment is rexmitted in-order. The check for retrans_out > 1 avoids the more expensive walk through the skb list, as we can know the result beforehand: F-RTO will not be allowed. A SACKed skb can turn into a non-SACKed one only in the extremely rare case of SACK reneging; in this case we might fail to detect retransmissions of anything other than the head. To get rid of that possibility, the whole rexmit queue would have to be walked (always), or FRTO would have to be prevented when SACK reneging happens. Of course RTO should still trigger after reneging, which makes this issue even less likely to show up. And as long as the response is as conservative as it is now, nothing bad happens even then. 
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- include/net/tcp.h|2 +- net/ipv4/tcp_input.c | 25 + 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 572a77b..7fd6b77 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -341,7 +341,7 @@ extern struct sock *tcp_check_req(stru extern int tcp_child_process(struct sock *parent, struct sock *child, struct sk_buff *skb); -extern int tcp_use_frto(const struct sock *sk); +extern int tcp_use_frto(struct sock *sk); extern voidtcp_enter_frto(struct sock *sk); extern voidtcp_enter_loss(struct sock *sk, int how); extern voidtcp_clear_retrans(struct tcp_sock *tp); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 5e952f0..8f0aa9d 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1239,14 +1239,31 @@ #endif /* F-RTO can only be used if these conditions are satisfied: * - there must be some unsent new data * - the advertised window should allow sending it + * - TCP has never retransmitted anything other than head */ -int tcp_use_frto(const struct sock *sk) +int tcp_use_frto(struct sock *sk) { const struct tcp_sock *tp = tcp_sk(sk); + struct sk_buff *skb; - return (sysctl_tcp_frto && sk->sk_send_head && - !after(TCP_SKB_CB(sk->sk_send_head)->end_seq, - tp->snd_una + tp->snd_wnd)); + if (!sysctl_tcp_frto || !sk->sk_send_head || + after(TCP_SKB_CB(sk->sk_send_head)->end_seq, + tp->snd_una + tp->snd_wnd)) + return 0; + + /* Avoid expensive walking of rexmit queue if possible */ + if (tp->retrans_out > 1) + return 0; + + skb = skb_peek(&sk->sk_write_queue)->next; /* Skips head */ + sk_stream_for_retrans_queue_from(skb, sk) { + if (TCP_SKB_CB(skb)->sacked&TCPCB_RETRANS) + return 0; + /* Short-circuit when first non-SACKed skb has been checked */ + if (!(TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_ACKED)) + break; + } + return 1; } /* RTO occurred, but do not yet enter Loss state. 
Instead, defer RTO -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 11/18] [TCP] FRTO: frto_counter modulo-op converted to two assignments
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 309da3e..9fc7f66 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2551,11 +2551,11 @@ static void tcp_process_frto(struct sock if (tp->frto_counter == 1) { tp->snd_cwnd = tcp_packets_in_flight(tp) + 2; + tp->frto_counter = 2; } else /* frto_counter == 2 */ { tcp_conservative_spur_to_response(tp); + tp->frto_counter = 0; } - - tp->frto_counter = (tp->frto_counter + 1) % 3; } /* This routine deals with incoming acks, but not outgoing ones. */ -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 12/18] [TCP]: Prevent unrelated cwnd adjustment while using FRTO
FRTO controls cwnd when it still processes the ACK input or it has just reverted back to conventional RTO recovery; the normal rules apply when FRTO has reverted to standard congestion control. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 18 +++--- 1 files changed, 11 insertions(+), 7 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9fc7f66..5e952f0 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2522,7 +2522,7 @@ static void tcp_conservative_spur_to_res * to prove that the RTO is indeed spurious. It transfers the control * from F-RTO to the conventional RTO recovery */ -static void tcp_process_frto(struct sock *sk, u32 prior_snd_una, int flag) +static int tcp_process_frto(struct sock *sk, u32 prior_snd_una, int flag) { struct tcp_sock *tp = tcp_sk(sk); @@ -2534,7 +2534,7 @@ static void tcp_process_frto(struct sock if (!before(tp->snd_una, tp->frto_highmark)) { tcp_enter_frto_loss(sk, tp->frto_counter + 1); - return; + return 1; } /* RFC4138 shortcoming in step 2; should also have case c): ACK isn't @@ -2542,20 +2542,22 @@ static void tcp_process_frto(struct sock */ if ((tp->snd_una == prior_snd_una) && (flag&FLAG_NOT_DUP) && !(flag&FLAG_FORWARD_PROGRESS)) - return; + return 1; if (!(flag&FLAG_DATA_ACKED)) { tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 0 : 3)); - return; + return 1; } if (tp->frto_counter == 1) { tp->snd_cwnd = tcp_packets_in_flight(tp) + 2; tp->frto_counter = 2; + return 1; } else /* frto_counter == 2 */ { tcp_conservative_spur_to_response(tp); tp->frto_counter = 0; } + return 0; } /* This routine deals with incoming acks, but not outgoing ones. */ @@ -2569,6 +2571,7 @@ static int tcp_ack(struct sock *sk, stru u32 prior_in_flight; s32 seq_rtt; int prior_packets; + int frto_cwnd = 0; /* If the ack is newer than sent or older than previous acks * then we can probably ignore it. 
@@ -2631,15 +2634,16 @@ static int tcp_ack(struct sock *sk, stru flag |= tcp_clean_rtx_queue(sk, &seq_rtt); if (tp->frto_counter) - tcp_process_frto(sk, prior_snd_una, flag); + frto_cwnd = tcp_process_frto(sk, prior_snd_una, flag); if (tcp_ack_is_dubious(sk, flag)) { /* Advance CWND, if state allows this. */ - if ((flag & FLAG_DATA_ACKED) && tcp_may_raise_cwnd(sk, flag)) + if ((flag & FLAG_DATA_ACKED) && !frto_cwnd && + tcp_may_raise_cwnd(sk, flag)) tcp_cong_avoid(sk, ack, seq_rtt, prior_in_flight, 0); tcp_fastretrans_alert(sk, prior_snd_una, prior_packets, flag); } else { - if ((flag & FLAG_DATA_ACKED)) + if ((flag & FLAG_DATA_ACKED) && !frto_cwnd) tcp_cong_avoid(sk, ack, seq_rtt, prior_in_flight, 1); } -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/18] [TCP] FRTO: Ignore some uninteresting ACKs
Handles RFC4138 shortcoming (in step 2); it should also have case c) which ignores ACKs that are not duplicates nor advance window (opposite dir data, winupdate). Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 13 ++--- 1 files changed, 10 insertions(+), 3 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index d1e731f..5831daa 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2495,9 +2495,9 @@ static void tcp_conservative_spur_to_res /* F-RTO spurious RTO detection algorithm (RFC4138) * - * F-RTO affects during two new ACKs following RTO. State (ACK number) is kept - * in frto_counter. When ACK advances window (but not to or beyond highest - * sequence sent before RTO): + * F-RTO affects during two new ACKs following RTO (well, almost, see inline + * comments). State (ACK number) is kept in frto_counter. When ACK advances + * window (but not to or beyond highest sequence sent before RTO): * On First ACK, send two new segments out. * On Second ACK, RTO was likely spurious. Do spurious response (response * algorithm is not part of the F-RTO detection algorithm @@ -2527,6 +2527,13 @@ static void tcp_process_frto(struct sock if (flag&FLAG_DATA_ACKED) inet_csk(sk)->icsk_retransmits = 0; + /* RFC4138 shortcoming in step 2; should also have case c): ACK isn't +* duplicate nor advances window, e.g., opposite dir data, winupdate +*/ + if ((tp->snd_una == prior_snd_una) && (flag&FLAG_NOT_DUP) && + !(flag&FLAG_FORWARD_PROGRESS)) + return; + if (tp->snd_una == prior_snd_una || !before(tp->snd_una, tp->frto_highmark)) { tcp_enter_frto_loss(sk); -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/18] [TCP] FRTO: Consecutive RTOs keep prior_ssthresh and ssthresh
In case a latency spike causes more than one RTO, the later ones should not cause the already reduced ssthresh to propagate into prior_ssthresh, since FRTO declares all such RTOs spurious at once or none of them. In the treatment of ssthresh, we mimic what tcp_enter_loss() does. The previous state (in frto_counter) must remain available until we have checked it in tcp_enter_frto(), and likewise the ACK information flag in tcp_process_frto(). Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 20 ++-- 1 files changed, 14 insertions(+), 6 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index f645c3e..c846beb 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1252,6 +1252,10 @@ int tcp_use_frto(const struct sock *sk) /* RTO occurred, but do not yet enter Loss state. Instead, defer RTO * recovery a bit and use heuristics in tcp_process_frto() to detect if * the RTO was spurious. + * + * Do like tcp_enter_loss() would; when RTO expires the second time it + * does: + * "Reduce ssthresh if it has not yet been made inside this window." */ void tcp_enter_frto(struct sock *sk) { @@ -1259,11 +1263,10 @@ void tcp_enter_frto(struct sock *sk) struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; - tp->frto_counter = 1; - - if (icsk->icsk_ca_state <= TCP_CA_Disorder || + if ((!tp->frto_counter && icsk->icsk_ca_state <= TCP_CA_Disorder) || tp->snd_una == tp->high_seq || - (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) { + ((icsk->icsk_ca_state == TCP_CA_Loss || tp->frto_counter) && +!icsk->icsk_retransmits)) { tp->prior_ssthresh = tcp_current_ssthresh(sk); tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); tcp_ca_event(sk, CA_EVENT_FRTO); @@ -1285,6 +1288,7 @@ void tcp_enter_frto(struct sock *sk) tcp_set_ca_state(sk, TCP_CA_Open); tp->frto_highmark = tp->snd_nxt; + tp->frto_counter = 1; } /* Enter Loss state after F-RTO was applied. 
Dupack arrived after RTO, @@ -2513,12 +2517,16 @@ static void tcp_conservative_spur_to_res * to prove that the RTO is indeed spurious. It transfers the control * from F-RTO to the conventional RTO recovery */ -static void tcp_process_frto(struct sock *sk, u32 prior_snd_una) +static void tcp_process_frto(struct sock *sk, u32 prior_snd_una, int flag) { struct tcp_sock *tp = tcp_sk(sk); tcp_sync_left_out(tp); + /* Duplicate the behavior from Loss state (fastretrans_alert) */ + if (flag&FLAG_DATA_ACKED) + inet_csk(sk)->icsk_retransmits = 0; + if (tp->snd_una == prior_snd_una || !before(tp->snd_una, tp->frto_highmark)) { tcp_enter_frto_loss(sk); @@ -2607,7 +2615,7 @@ static int tcp_ack(struct sock *sk, stru flag |= tcp_clean_rtx_queue(sk, &seq_rtt); if (tp->frto_counter) - tcp_process_frto(sk, prior_snd_una); + tcp_process_frto(sk, prior_snd_una, flag); if (tcp_ack_is_dubious(sk, flag)) { /* Advance CWND, if state allows this. */ -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/18] [TCP] FRTO: Use Disorder state during operation instead of Open
Retransmission counter assumptions are about to be changed, and a forcing reason to do this exists: using the sysctl in the check would be racy as soon as FRTO starts to ignore some ACKs (done in the following patches). Userspace may disable it at any moment, giving a nice oops if the timing is right. frto_counter would be inaccessible from userspace, but with SACK-enhanced FRTO retrans_out can include segments other than the head, possibly leaving it non-zero after a spurious RTO; boom again. Luckily, the solution seems rather simple: never go directly to Open state but use Disorder instead. This does not really change much, since TCP could anyway change its state to Disorder during FRTO using the path tcp_fastretrans_alert -> tcp_try_to_open (e.g., when a SACK block makes the ACK dubious). Besides, Disorder seems to be the state where TCP should be if not recovering (in Recovery or Loss state) while having some retransmissions in-flight (see tcp_try_to_open), which is exactly what happens with FRTO. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c846beb..d1e731f 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1286,7 +1286,8 @@ void tcp_enter_frto(struct sock *sk) } tcp_sync_left_out(tp); - tcp_set_ca_state(sk, TCP_CA_Open); + tcp_set_ca_state(sk, TCP_CA_Disorder); + tp->high_seq = tp->snd_nxt; tp->frto_highmark = tp->snd_nxt; tp->frto_counter = 1; } @@ -2014,8 +2015,7 @@ tcp_fastretrans_alert(struct sock *sk, u /* E. Check state exit conditions. State can be terminated *when high_seq is ACKed. 
*/ if (icsk->icsk_ca_state == TCP_CA_Open) { - if (!sysctl_tcp_frto) - BUG_TRAP(tp->retrans_out == 0); + BUG_TRAP(tp->retrans_out == 0); tp->retrans_stamp = 0; } else if (!before(tp->snd_una, tp->high_seq)) { switch (icsk->icsk_ca_state) { -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/18] [TCP] FRTO: Comment cleanup & improvement
Moved comments out from the body of process_frto() to the head (preferred way; see Documentation/CodingStyle). Bonus: it's much easier to read in this compacted form. FRTO algorithm and implementation is described in greater detail. For interested reader, more information is available in RFC4138. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 49 - 1 files changed, 32 insertions(+), 17 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 294cb44..f645c3e 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1236,22 +1236,22 @@ #endif return flag; } +/* F-RTO can only be used if these conditions are satisfied: + * - there must be some unsent new data + * - the advertised window should allow sending it + */ int tcp_use_frto(const struct sock *sk) { const struct tcp_sock *tp = tcp_sk(sk); - /* F-RTO must be activated in sysctl and there must be some -* unsent new data, and the advertised window should allow -* sending it. -*/ return (sysctl_tcp_frto && sk->sk_send_head && !after(TCP_SKB_CB(sk->sk_send_head)->end_seq, tp->snd_una + tp->snd_wnd)); } -/* RTO occurred, but do not yet enter loss state. Instead, transmit two new - * segments to see from the next ACKs whether any data was really missing. - * If the RTO was spurious, new ACKs should arrive. +/* RTO occurred, but do not yet enter Loss state. Instead, defer RTO + * recovery a bit and use heuristics in tcp_process_frto() to detect if + * the RTO was spurious. */ void tcp_enter_frto(struct sock *sk) { @@ -2489,6 +2489,30 @@ static void tcp_conservative_spur_to_res tcp_moderate_cwnd(tp); } +/* F-RTO spurious RTO detection algorithm (RFC4138) + * + * F-RTO affects during two new ACKs following RTO. State (ACK number) is kept + * in frto_counter. When ACK advances window (but not to or beyond highest + * sequence sent before RTO): + * On First ACK, send two new segments out. + * On Second ACK, RTO was likely spurious. 
Do spurious response (response + * algorithm is not part of the F-RTO detection algorithm + * given in RFC4138 but can be selected separately). + * Otherwise (basically on duplicate ACK), RTO was (likely) caused by a loss + * and TCP falls back to conventional RTO recovery. + * + * Rationale: if the RTO was spurious, new ACKs should arrive from the + * original window even after we transmit two new data segments. + * + * F-RTO is implemented (mainly) in four functions: + * - tcp_use_frto() is used to determine if TCP is can use F-RTO + * - tcp_enter_frto() prepares TCP state on RTO if F-RTO is used, it is + * called when tcp_use_frto() showed green light + * - tcp_process_frto() handles incoming ACKs during F-RTO algorithm + * - tcp_enter_frto_loss() is called if there is not enough evidence + * to prove that the RTO is indeed spurious. It transfers the control + * from F-RTO to the conventional RTO recovery + */ static void tcp_process_frto(struct sock *sk, u32 prior_snd_una) { struct tcp_sock *tp = tcp_sk(sk); @@ -2497,25 +2521,16 @@ static void tcp_process_frto(struct sock if (tp->snd_una == prior_snd_una || !before(tp->snd_una, tp->frto_highmark)) { - /* RTO was caused by loss, start retransmitting in -* go-back-N slow start -*/ tcp_enter_frto_loss(sk); return; } if (tp->frto_counter == 1) { - /* First ACK after RTO advances the window: allow two new -* segments out. -*/ tp->snd_cwnd = tcp_packets_in_flight(tp) + 2; - } else { + } else /* frto_counter == 2 */ { tcp_conservative_spur_to_response(tp); } - /* F-RTO affects on two new ACKs following RTO. -* At latest on third ACK the TCP behavior is back to normal. -*/ tp->frto_counter = (tp->frto_counter + 1) % 3; } -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHSET 0/18] FRTO: fixes and small changes + SACK enhanced version
Hi, Here is a set of patches that fix most of the flaws in the current FRTO implementation (specified in RFC4138); in addition, the last two patches add SACK-enhanced FRTO (not enabled unless the frto sysctl is set to 2, which allows using the basic version also with SACK). There are some dependencies on the earlier patches in the set (hard to list all the cases I've thought of, but not all combinations are good ones even if they apply cleanly).

Documentation/networking/ip-sysctl.txt |5 - include/net/tcp.h | 14 -- net/ipv4/tcp_input.c | 265 ++-- 3 files changed, 221 insertions(+), 63 deletions(-)

(At least) one interpretation issue exists, see patch "FRTO: Entry is allowed only during (New)Reno like recovery". Besides that, these things should/could be solved (later on):
- Setting undo_marker when the RTO is not spurious (FRTO has been clearing it, which disabled DSACK undos for conventional recovery).
- Interaction with Eifel
- Different responses (a new sysctl to select them?)
- When a cumulative ACK arrives at frto_highseq during FRTO, it could be useful to go directly to CA_Open, because duplicate ACKs for that segment could then be used to initiate recovery if it was lost. Most of the time, the duplicate ACKs won't be false ones (we might have made too many unnecessary retransmissions, but that's less likely with FRTO, and it could be considered while making the state decision).
- Maybe frto_highmark should be reset somewhere during a connection due to wrapping of seqnos (the reord adjustment relies on it having a valid after relation...)?
- tcp_use_frto and tcp_enter_loss now both scan the skb list from the beginning; it might be possible to take advantage of this either by combining them or by passing the skb from the use_frto iteration to tcp_enter_loss.

I did some tests with FACK + SACK FRTO; the results seemed to be correct, but the conservative response had really poor performance. 
I'm more familiar with time-seq graphs of more aggressive responses, and in a couple of cases I was really wondering whether this thing works at all, but yes, after tracing I found that it worked, although the result was not a very good-looking one due to interaction with rate halving; maybe a "rate-halving aware" response could do much better (or alternatively one that does a more aggressive undo).

# Test 1: normal TCP
# Test 2: spurious RTO
# Test 3: drop the segment
# Test 4: drop a delayed segment
# Test 5: drop the next segment
# Test 6: drop in window segment
# Test 7: drop the segment and the next segment
# Test 8: drop the segment and in window segment
# Test 9: delay the first and next (spurious RTOs, for different segments)
# Test 10: delay the first excessively (two spurious RTOs)
# Test n+1: drop rexmission
# Test n+2: delay rexmission (spurious RTO also after frto_highmark)
# Test n+3: delay rexmission (spurious RTO also after highmark), drop RTO seg
# Test n+4: drop the segment and rexmit
# Test n+5: drop the segment and first new data
# Test n+6: drop the segment and second new data

The tests were run on 2.6.18; I have quite a lot of my own modifications included, but they were disabled using sysctls, except for a change in mark_head_lost: the if condition from !TAGBITS -> !(TAGBITS & ~SACKED_RETRANS), but AFAICT it shouldn't matter, and if it does, it should be included (if you received this mail from the previous send attempt: I mistakenly claimed that SACKED_ACKED was the bit that was excluded and had parenthesized it incorrectly here). I couldn't come up with a scenario in mainline-only code where SACKED_RETRANS would be set for an skb when LOST has not been set, except for the head by FRTO itself, which will not be a problem. I have checked that the FRTO parts used in the tests were identical to the result of this patchset. Compile tested against net-2.6 (also the intermediate steps). -- i. ps. 
I'm sorry if you receive these twice; the previous attempt had some charset problems and was rejected at least by netdev.
[PATCH 1/18] [TCP] FRTO: Incorrectly clears TCPCB_EVER_RETRANS bit
FRTO was slightly too brave... Should only clear TCPCB_SACKED_RETRANS bit. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 1a14191..b21e232 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1266,7 +1266,7 @@ void tcp_enter_frto(struct sock *sk) tp->undo_retrans = 0; sk_stream_for_retrans_queue(skb, sk) { - TCP_SKB_CB(skb)->sacked &= ~TCPCB_RETRANS; + TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS; } tcp_sync_left_out(tp); -- 1.4.2
[PATCH 2/18] [TCP] FRTO: Separated response from FRTO detection algorithm
FRTO's spurious RTO detection algorithm (RFC4138) does not include the response to a detected spurious RTO; different response algorithms can be used. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 16 ++-- 1 files changed, 10 insertions(+), 6 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index b21e232..c5be3d0 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2467,6 +2467,15 @@ static int tcp_ack_update_window(struct return flag; } +/* A very conservative spurious RTO response algorithm: reduce cwnd and + * continue in congestion avoidance. + */ +static void tcp_conservative_spur_to_response(struct tcp_sock *tp) +{ + tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh); + tcp_moderate_cwnd(tp); +} + static void tcp_process_frto(struct sock *sk, u32 prior_snd_una) { struct tcp_sock *tp = tcp_sk(sk); @@ -2488,12 +2497,7 @@ static void tcp_process_frto(struct sock */ tp->snd_cwnd = tcp_packets_in_flight(tp) + 2; } else { - /* Also the second ACK after RTO advances the window. -* The RTO was likely spurious. Reduce cwnd and continue -* in congestion avoidance -*/ - tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh); - tcp_moderate_cwnd(tp); + tcp_conservative_spur_to_response(tp); } /* F-RTO affects on two new ACKs following RTO. -- 1.4.2
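As a quick illustration of the response being factored out here: it pulls cwnd back to the slow-start threshold (never raising it) and lets the sender continue in congestion avoidance. A minimal standalone sketch; the struct is a hypothetical stand-in for struct tcp_sock, and the burst-limiting tcp_moderate_cwnd() call of the real code is omitted:

```c
/* Sketch of the conservative spurious-RTO response; the struct is a
 * pared-down stand-in for struct tcp_sock, not the kernel's type. */
#include <assert.h>

struct cc_state {
	unsigned int snd_cwnd;     /* congestion window, in packets */
	unsigned int snd_ssthresh; /* slow-start threshold */
};

static void conservative_spur_to_response(struct cc_state *tp)
{
	/* cwnd = min(cwnd, ssthresh): strictly a clamp, never a raise */
	if (tp->snd_cwnd > tp->snd_ssthresh)
		tp->snd_cwnd = tp->snd_ssthresh;
}
```

Note the asymmetry: a flow already below ssthresh is left alone, so the response can only make the sender less aggressive after a spurious timeout.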
Re: Extensible hashing and RCU
"Michael K. Edwards" <[EMAIL PROTECTED]> writes: > A better data structure for RCU, even with a fixed key space, is > probably a splay tree. Much less vulnerable to cache eviction DDoS > than a hash, because the hot connections get rotated up into non-leaf > layers and get traversed enough to keep them in the LRU set. LRU tends to be hell for caches in MP systems, because it writes to the cache lines too and makes them exclusive and more expensive. -Andi
Re: Extensible hashing and RCU
Eric Dumazet <[EMAIL PROTECTED]> writes: > > So are you speaking of one memory cache miss per lookup ? Actually two: if the trie'ing allows RCUing you would save the spinlock cache line too. This would increase the break-even budget for the trie. > If not, you loose. It all depends on if the higher levels on the trie are small enough to be kept in cache. Even with two cache misses it might still break even, but have better scalability. Another advantage would be to eliminate the need for large memory blocks, which cause problems too e.g. on NUMA. It certainly would save quite some memory if the tree levels are allocated on demand only. However breaking it up might also cost more TLB misses, but those could be eliminated by preallocating the tree in the same way as the hash today. Don't know if it's needed or not. I guess someone needs to code it up and try it. -Andi
Re: [PATCH 1/2][TCP] YeAH-TCP: algorithm implementation
The patch. Angelo P. Castellani wrote: From: Angelo P. Castellani <[EMAIL PROTECTED]> YeAH-TCP is a sender-side high-speed enabled TCP congestion control algorithm, which uses a mixed loss/delay approach to compute the congestion window. Its design goals target high efficiency, internal, RTT and Reno fairness, resilience to link loss while keeping network elements load as low as possible. For further details look here: http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf Signed-off-by: Angelo P. Castellani <[EMAIL PROTECTED]> --- This is the YeAH-TCP implementation of the algorithm presented to PFLDnet2007 (http://wil.cs.caltech.edu/pfldnet2007/). Regards, Angelo P. Castellani Kconfig| 14 ++ Makefile |1 tcp_yeah.c | 288 + tcp_yeah.h | 134 4 files changed, 437 insertions(+) diff -uprN linux-2.6.20-a/net/ipv4/Kconfig linux-2.6.20-b/net/ipv4/Kconfig --- linux-2.6.20-a/net/ipv4/Kconfig 2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-b/net/ipv4/Kconfig 2007-02-19 10:52:46.0 +0100 @@ -574,6 +574,20 @@ config TCP_CONG_VENO loss packets. See http://www.ntu.edu.sg/home5/ZHOU0022/papers/CPFu03a.pdf +config TCP_CONG_YEAH + tristate "YeAH TCP" + depends on EXPERIMENTAL + default n + ---help--- + YeAH-TCP is a sender-side high-speed enabled TCP congestion control + algorithm, which uses a mixed loss/delay approach to compute the + congestion window. Its design goals target high efficiency, + internal, RTT and Reno fairness, resilience to link loss while + keeping network elements load as low as possible. 
+ + For further details look here: + http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf + choice prompt "Default TCP congestion control" default DEFAULT_CUBIC diff -uprN linux-2.6.20-a/net/ipv4/Makefile linux-2.6.20-b/net/ipv4/Makefile --- linux-2.6.20-a/net/ipv4/Makefile 2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-b/net/ipv4/Makefile 2007-02-19 10:52:46.0 +0100 @@ -49,6 +49,7 @@ obj-$(CONFIG_TCP_CONG_VEGAS) += tcp_vega obj-$(CONFIG_TCP_CONG_VENO) += tcp_veno.o obj-$(CONFIG_TCP_CONG_SCALABLE) += tcp_scalable.o obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o +obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o obj-$(CONFIG_NETLABEL) += cipso_ipv4.o obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \ diff -uprN linux-2.6.20-a/net/ipv4/tcp_yeah.c linux-2.6.20-b/net/ipv4/tcp_yeah.c --- linux-2.6.20-a/net/ipv4/tcp_yeah.c 1970-01-01 01:00:00.0 +0100 +++ linux-2.6.20-b/net/ipv4/tcp_yeah.c 2007-02-19 10:52:46.0 +0100 @@ -0,0 +1,288 @@ +/* + * + * YeAH TCP + * + * For further details look at: + *http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf + * + */ + +#include "tcp_yeah.h" + +/* Default values of the Vegas variables, in fixed-point representation + * with V_PARAM_SHIFT bits to the right of the binary point. 
+ */ +#define V_PARAM_SHIFT 1 + +#define TCP_YEAH_ALPHA 80 //lin number of packets queued at the bottleneck +#define TCP_YEAH_GAMMA 1 //lin fraction of queue to be removed per rtt +#define TCP_YEAH_DELTA 3 //log minimum fraction of cwnd to be removed on loss +#define TCP_YEAH_EPSILON 1 //log maximum fraction to be removed on early decongestion +#define TCP_YEAH_PHY 8 //lin maximum delta from base +#define TCP_YEAH_RHO 16 //lin minimum number of consecutive rtt to consider competition on loss +#define TCP_YEAH_ZETA 50 //lin minimum number of state switches to reset reno_count + +#define TCP_SCALABLE_AI_CNT 100U + +/* YeAH variables */ +struct yeah { + /* Vegas */ + u32 beg_snd_nxt; /* right edge during last RTT */ + u32 beg_snd_una; /* left edge during last RTT */ + u32 beg_snd_cwnd; /* saves the size of the cwnd */ + u8 doing_vegas_now; /* if true, do vegas for this RTT */ + u16 cntRTT; /* # of RTTs measured within last RTT */ + u32 minRTT; /* min of RTTs measured within last RTT (in usec) */ + u32 baseRTT; /* the min of all Vegas RTT measurements seen (in usec) */ + + /* YeAH */ + u32 lastQ; + u32 doing_reno_now; + + u32 reno_count; + u32 fast_count; + + u32 pkts_acked; +}; + +static void tcp_yeah_init(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct yeah *yeah = inet_csk_ca(sk); + + tcp_vegas_init(sk); + + yeah->doing_reno_now = 0; + yeah->lastQ = 0; + + yeah->reno_count = 2; + + /* Ensure the MD arithmetic works. This is somewhat pedantic, + * since I don't think we will see a cwnd this large. :) */ + tp->snd_cwnd_clamp = min_t(u32, tp->snd_cwnd_clamp, 0xffffffff/128); + +} + + +static void tcp_yeah_pkts_acked(struct sock *sk, u32 pkts_acked) +{ + const struct inet_connection_sock *icsk = inet_csk(sk); + struct yeah *yeah = inet_csk_ca(sk); + + if (icsk->icsk_ca_state == TCP_CA_Open) + yeah->pkts_acked = pkts_acked; +} + +/* 64bit divisor, dividend and result. dynamic precision */ +static inline u64 div64_64(u64 dividend, u64 divisor) +{
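For context on how constants like TCP_YEAH_ALPHA and the Vegas minRTT/baseRTT fields are used: YeAH estimates how many packets the flow itself keeps queued at the bottleneck from the spread between the recent minimum RTT and the lowest RTT ever observed, and backs off early ("precautionary decongestion") when that estimate exceeds ALPHA. A hedged standalone sketch; the function names are illustrative, not the driver's actual callbacks:

```c
/* Illustrative sketch of YeAH's queue-backlog estimate; not the
 * driver's actual code, just the arithmetic it is built around. */
#include <assert.h>

#define TCP_YEAH_ALPHA 80 /* packets allowed in the bottleneck queue */

/* Backlog estimate: Q = (minRTT - baseRTT) * cwnd / minRTT,
 * with RTTs in microseconds and cwnd in packets. The RTT excess over
 * the base propagation delay is assumed to be queueing delay. */
static unsigned int yeah_queue_estimate(unsigned int min_rtt_us,
					unsigned int base_rtt_us,
					unsigned int snd_cwnd)
{
	unsigned long long excess = min_rtt_us - base_rtt_us;

	return (unsigned int)(excess * snd_cwnd / min_rtt_us);
}

/* Trigger early decongestion once the backlog exceeds ALPHA packets. */
static int yeah_should_decongest(unsigned int min_rtt_us,
				 unsigned int base_rtt_us,
				 unsigned int snd_cwnd)
{
	return yeah_queue_estimate(min_rtt_us, base_rtt_us, snd_cwnd)
	       > TCP_YEAH_ALPHA;
}
```

With a 100 ms base RTT inflated to 150 ms and cwnd of 300 packets, the estimate is 100 queued packets, above the ALPHA threshold of 80, so the window would be reduced before any loss occurs.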
[PATCH 1/2][TCP] YeAH-TCP: algorithm implementation
From: Angelo P. Castellani <[EMAIL PROTECTED]> YeAH-TCP is a sender-side high-speed enabled TCP congestion control algorithm, which uses a mixed loss/delay approach to compute the congestion window. Its design goals target high efficiency, internal, RTT and Reno fairness, resilience to link loss while keeping network elements load as low as possible. For further details look here: http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf Signed-off-by: Angelo P. Castellani <[EMAIL PROTECTED]> --- This is the YeAH-TCP implementation of the algorithm presented to PFLDnet2007 (http://wil.cs.caltech.edu/pfldnet2007/). Regards, Angelo P. Castellani Kconfig| 14 ++ Makefile |1 tcp_yeah.c | 288 + tcp_yeah.h | 134 4 files changed, 437 insertions(+) diff -uprN linux-2.6.20-a/net/ipv4/Kconfig linux-2.6.20-b/net/ipv4/Kconfig --- linux-2.6.20-a/net/ipv4/Kconfig 2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-b/net/ipv4/Kconfig 2007-02-19 10:52:46.0 +0100 @@ -574,6 +574,20 @@ config TCP_CONG_VENO loss packets. See http://www.ntu.edu.sg/home5/ZHOU0022/papers/CPFu03a.pdf +config TCP_CONG_YEAH + tristate "YeAH TCP" + depends on EXPERIMENTAL + default n + ---help--- + YeAH-TCP is a sender-side high-speed enabled TCP congestion control + algorithm, which uses a mixed loss/delay approach to compute the + congestion window. Its design goals target high efficiency, + internal, RTT and Reno fairness, resilience to link loss while + keeping network elements load as low as possible. 
+ + For further details look here: + http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf + choice prompt "Default TCP congestion control" default DEFAULT_CUBIC diff -uprN linux-2.6.20-a/net/ipv4/Makefile linux-2.6.20-b/net/ipv4/Makefile --- linux-2.6.20-a/net/ipv4/Makefile 2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-b/net/ipv4/Makefile 2007-02-19 10:52:46.0 +0100 @@ -49,6 +49,7 @@ obj-$(CONFIG_TCP_CONG_VEGAS) += tcp_vega obj-$(CONFIG_TCP_CONG_VENO) += tcp_veno.o obj-$(CONFIG_TCP_CONG_SCALABLE) += tcp_scalable.o obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o +obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o obj-$(CONFIG_NETLABEL) += cipso_ipv4.o obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \ diff -uprN linux-2.6.20-a/net/ipv4/tcp_yeah.c linux-2.6.20-b/net/ipv4/tcp_yeah.c --- linux-2.6.20-a/net/ipv4/tcp_yeah.c 1970-01-01 01:00:00.0 +0100 +++ linux-2.6.20-b/net/ipv4/tcp_yeah.c 2007-02-19 10:52:46.0 +0100 @@ -0,0 +1,288 @@ +/* + * + * YeAH TCP + * + * For further details look at: + *http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf + * + */ + +#include "tcp_yeah.h" + +/* Default values of the Vegas variables, in fixed-point representation + * with V_PARAM_SHIFT bits to the right of the binary point. 
+ */ +#define V_PARAM_SHIFT 1 + +#define TCP_YEAH_ALPHA 80 //lin number of packets queued at the bottleneck +#define TCP_YEAH_GAMMA 1 //lin fraction of queue to be removed per rtt +#define TCP_YEAH_DELTA 3 //log minimum fraction of cwnd to be removed on loss +#define TCP_YEAH_EPSILON 1 //log maximum fraction to be removed on early decongestion +#define TCP_YEAH_PHY 8 //lin maximum delta from base +#define TCP_YEAH_RHO 16 //lin minimum number of consecutive rtt to consider competition on loss +#define TCP_YEAH_ZETA 50 //lin minimum number of state switches to reset reno_count + +#define TCP_SCALABLE_AI_CNT 100U + +/* YeAH variables */ +struct yeah { + /* Vegas */ + u32 beg_snd_nxt; /* right edge during last RTT */ + u32 beg_snd_una; /* left edge during last RTT */ + u32 beg_snd_cwnd; /* saves the size of the cwnd */ + u8 doing_vegas_now; /* if true, do vegas for this RTT */ + u16 cntRTT; /* # of RTTs measured within last RTT */ + u32 minRTT; /* min of RTTs measured within last RTT (in usec) */ + u32 baseRTT; /* the min of all Vegas RTT measurements seen (in usec) */ + + /* YeAH */ + u32 lastQ; + u32 doing_reno_now; + + u32 reno_count; + u32 fast_count; + + u32 pkts_acked; +}; + +static void tcp_yeah_init(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct yeah *yeah = inet_csk_ca(sk); + + tcp_vegas_init(sk); + + yeah->doing_reno_now = 0; + yeah->lastQ = 0; + + yeah->reno_count = 2; + + /* Ensure the MD arithmetic works. This is somewhat pedantic, + * since I don't think we will see a cwnd this large. :) */ + tp->snd_cwnd_clamp = min_t(u32, tp->snd_cwnd_clamp, 0xffffffff/128); + +} + + +static void tcp_yeah_pkts_acked(struct sock *sk, u32 pkts_acked) +{ + const struct inet_connection_sock *icsk = inet_csk(sk); + struct yeah *yeah = inet_csk_ca(sk); + + if (icsk->icsk_ca_state == TCP_CA_Open) + yeah->pkts_acked = pkts_acked; +} + +/* 64bit divisor, dividend and result. 
dynamic precision */ +static inline u64 div64_64(u64 dividend, u64 divisor) +{ + u32 d = divisor; + + if (divisor > 0xf
Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function
Forgot the patch. Angelo P. Castellani wrote: From: Angelo P. Castellani <[EMAIL PROTECTED]> RFC3742: limited slow start See http://www.ietf.org/rfc/rfc3742.txt Signed-off-by: Angelo P. Castellani <[EMAIL PROTECTED]> --- To allow code reuse, I've added the limited slow start procedure as an exported symbol of the Linux TCP congestion control code. On large BDP networks canonical slow start should be avoided because it requires large packet losses to converge, whereas at lower BDPs slow start and limited slow start are identical. Large BDP is defined through the max_ssthresh variable. I think limited slow start could safely replace the canonical slow start procedure in Linux. Regards, Angelo P. Castellani p.s.: the attached patch adds an exported function currently used only by YeAH TCP include/net/tcp.h |1 + net/ipv4/tcp_cong.c | 23 +++ 2 files changed, 24 insertions(+) diff -uprN linux-2.6.20-a/include/net/tcp.h linux-2.6.20-c/include/net/tcp.h --- linux-2.6.20-a/include/net/tcp.h 2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-c/include/net/tcp.h 2007-02-19 10:54:10.0 +0100 @@ -669,6 +669,7 @@ extern void tcp_get_allowed_congestion_c extern int tcp_set_allowed_congestion_control(char *allowed); extern int tcp_set_congestion_control(struct sock *sk, const char *name); extern void tcp_slow_start(struct tcp_sock *tp); +extern void tcp_limited_slow_start(struct tcp_sock *tp); extern struct tcp_congestion_ops tcp_init_congestion_ops; extern u32 tcp_reno_ssthresh(struct sock *sk); diff -uprN linux-2.6.20-a/net/ipv4/tcp_cong.c linux-2.6.20-c/net/ipv4/tcp_cong.c --- linux-2.6.20-a/net/ipv4/tcp_cong.c 2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-c/net/ipv4/tcp_cong.c 2007-02-19 10:54:10.0 +0100 @@ -297,6 +297,29 @@ void tcp_slow_start(struct tcp_sock *tp) } EXPORT_SYMBOL_GPL(tcp_slow_start); +void tcp_limited_slow_start(struct tcp_sock *tp) +{ + /* RFC3742: limited slow start + * the window is increased by 1/K MSS for each arriving ACK, + * for K = int(cwnd/(0.5 
max_ssthresh)) + */ + + const int max_ssthresh = 100; + + if (max_ssthresh > 0 && tp->snd_cwnd > max_ssthresh) { + u32 k = max(tp->snd_cwnd / (max_ssthresh >> 1), 1U); + if (++tp->snd_cwnd_cnt >= k) { + if (tp->snd_cwnd < tp->snd_cwnd_clamp) + tp->snd_cwnd++; + tp->snd_cwnd_cnt = 0; + } + } else { + if (tp->snd_cwnd < tp->snd_cwnd_clamp) + tp->snd_cwnd++; + } +} +EXPORT_SYMBOL_GPL(tcp_limited_slow_start); + /* * TCP Reno congestion control * This is special case used for fallback as well.
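The RFC3742 rule in the patch above can be exercised outside the kernel with a small simulation: below max_ssthresh the window grows by one segment per ACK exactly as in canonical slow start, while above it growth slows to 1/K per ACK with K = cwnd / (max_ssthresh / 2). A hedged sketch, where the struct is a pared-down stand-in for struct tcp_sock:

```c
/* Standalone sketch of RFC3742 limited slow start, mirroring the
 * tcp_limited_slow_start() patch; the struct is a hypothetical
 * stand-in for struct tcp_sock. */
#include <assert.h>

struct lss_sock {
	unsigned int snd_cwnd;       /* congestion window, in packets */
	unsigned int snd_cwnd_cnt;   /* ACKs counted toward next growth */
	unsigned int snd_cwnd_clamp; /* hard upper bound on cwnd */
};

static void limited_slow_start(struct lss_sock *tp, unsigned int max_ssthresh)
{
	if (max_ssthresh > 0 && tp->snd_cwnd > max_ssthresh) {
		/* K = cwnd / (max_ssthresh / 2): grow 1 segment per K ACKs */
		unsigned int k = tp->snd_cwnd / (max_ssthresh >> 1);

		if (k < 1)
			k = 1;
		if (++tp->snd_cwnd_cnt >= k) {
			if (tp->snd_cwnd < tp->snd_cwnd_clamp)
				tp->snd_cwnd++;
			tp->snd_cwnd_cnt = 0;
		}
	} else if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
		/* At or below max_ssthresh: canonical slow start */
		tp->snd_cwnd++;
	}
}
```

With max_ssthresh = 100 as in the patch, a cwnd of 200 gives K = 4, so the window needs four ACKs to grow by one segment; per-RTT growth is thus capped near max_ssthresh/2 segments instead of doubling.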
Re: Converting network devices from class devices causes namespace pollution
Greg KH <[EMAIL PROTECTED]> writes: > We need our own namespace for these devices, and we have it today > already. Look if you enable CONFIG_SYSFS_DEPRECATED, or on a pre-2.6.19 > machine at what shows up in the pci device directories: > -r--r--r-- 1 root root 4096 2007-02-18 13:06 vendor Interesting. I hadn't noticed that before. > So, all we need to do is rename these devices back to the "net:eth0" > name, and everything will be fine. I'll work on fixing that tomorrow as > it will take a bit of hacking on the kobject symlink function and the > driver core code (but it gets us rid of a symlink in "compatibility > mode", which is always a nice win...) Ok. I'm groaning a little bit at what a nuisance this is going to be to get support for multiple network namespaces in there after your fix goes in; directories can be easier to deal with. But once you figure this part out I will figure something out. For me the nasty case is one PCI device that has multiple ethernet devices coming from it (I think IB devices have this property today), each showing up in a different network namespace, so they might all have the same name. Ugh. Eric