Re: skge dysfunction on Amd X2 machine with 4GB memory
On Sun, Feb 11, 2007 at 04:57:55PM +0200, Matti Aarnio wrote:
> With the skge driver there seems to be some sort of problem working
> in a system with memory above the 4 GB of PCI address space.

The chipset (apparently) doesn't handle bus addresses above 4GB even though the MAC does. I guess the right way to fix this in the long term is to detect systems with these chipsets and mask the dma_mask globally (or, if you're clever, per bus)?
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 8042] New: Cisco VPN Client cannot connect using TCP with Intel 82573L NIC
On Mon, 19 Feb 2007 15:55:19 -0800 [EMAIL PROTECTED] wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=8042
>
>            Summary: Cisco VPN Client cannot connect using TCP with Intel
>                     82573L NIC
>     Kernel Version: 2.6.18.6
>             Status: NEW
>           Severity: normal
>              Owner: [EMAIL PROTECTED]
>          Submitter: [EMAIL PROTECTED]
>
> Most recent kernel where this bug did *NOT* occur: -
> Distribution: Ubuntu, Debian
> Hardware Environment: Lenovo Thinkpad T60p
> Software Environment: -
> Problem Description:
>
> I have an issue with the Cisco VPN client
> (vpnclient-linux-x86_64-4.8.00.0490-k9.tar.gz) that appears to be related to
> packet fragmentation and the e1000 driver (the hardware is an 82573L; I don't
> believe this issue affects earlier chips).
>
> When I try to connect to a VPN using Cisco's TCP tunneling feature, I am
> unable to connect to the VPN concentrator.
>
> If I recompile the e1000 module, setting the option
>
> CONFIG_E1000_DISABLE_PACKET_SPLIT=y
>
> then I am able to connect without issue.
>
> I have experienced this problem with the following kernels:
>
> ubuntu edgy 2.6.16-11-generic
> debian sid 2.6.18-4-686 (based on 2.6.18.6 w/hand-picked later patches)
> kernel.org 2.6.18.6
>
> There was a perhaps-related bug resolved for UDP recently; see this changelog
> entry:
>
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=753eab76a3337863a0d86ce045fa4eb6c3cbeef9
>
> You can also see some discussion surrounding the issue (I had initially
> believed it was related to another issue with the 82573L), starting from this
> comment:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=6929#c9
>
> Please let me know if there is anything else I can do to better explain the
> problem.
> Steps to reproduce:
>
> It's not possible to reproduce this issue without:
>
> - An 82573L-based network card
> - A Cisco VPN concentrator you can access using TCP tunneling
> - The Cisco VPN client ()
>
> I have all of these, and would be more than pleased to reproduce the problem,
> provide packet captures, etc. If you want to reproduce the problem yourself,
> and have the above equipment, try to open a TCP-encapsulated connection to
> the VPN concentrator; you should not be able to unless you have compiled
> e1000 with CONFIG_E1000_DISABLE_PACKET_SPLIT=y.
Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function
Angelo P. Castellani wrote:
> John Heffner wrote:
>> Note the patch is compile-tested only!
>
> I can do some real testing if you'd like to apply this, Dave.
>
> The date you read on the patch is due to the fact that I've split this
> patchset into 2 diff files. This isn't compile-tested only; I've used this
> piece of code for about 3 months. Sorry for the confusion.

The patch I attached to my message was compile-tested only.

Thanks,
  -John
Re: [RFC][PATCH][IPSEC][2/3] IPv6 over IPv4 IPsec tunnel
Hi,

A further fix is needed for __xfrm6_bundle_create().

Signed-off-by: Noriaki TAKAMIYA <[EMAIL PROTECTED]>
Acked-by: Masahide NAKAMURA <[EMAIL PROTECTED]>

--
Fixed to set fl_tunnel.fl6_src correctly in __xfrm6_bundle_create().
---
 net/ipv6/xfrm6_policy.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index b1133f2..d8a585b 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -189,7 +189,7 @@ __xfrm6_bundle_create(struct xfrm_policy
 		case AF_INET6:
 			ipv6_addr_copy(&fl_tunnel.fl6_dst, __xfrm6_bundle_addr_remote(xfrm[i], &fl->fl6_dst));
-			ipv6_addr_copy(&fl_tunnel.fl6_src, __xfrm6_bundle_addr_remote(xfrm[i], &fl->fl6_src));
+			ipv6_addr_copy(&fl_tunnel.fl6_src, __xfrm6_bundle_addr_local(xfrm[i], &fl->fl6_src));
 			break;
 		default:
 			BUG_ON(1);

--
Noriaki TAKAMIYA
[PATCH 3/3] remove irq_sem from ixgb
From: Chris Snook <[EMAIL PROTECTED]>

Remove irq_sem from ixgb. Currently untested, but similar to tested patches on atl1 and e1000.

Signed-off-by: Chris Snook <[EMAIL PROTECTED]>

--
diff -urp linux-2.6.20-git14.orig/drivers/net/ixgb/ixgb.h linux-2.6.20-git14/drivers/net/ixgb/ixgb.h
--- linux-2.6.20-git14.orig/drivers/net/ixgb/ixgb.h	2007-02-19 14:32:16.0 -0500
+++ linux-2.6.20-git14/drivers/net/ixgb/ixgb.h	2007-02-19 15:04:50.0 -0500
@@ -161,7 +161,6 @@ struct ixgb_adapter {
 	uint16_t link_speed;
 	uint16_t link_duplex;
 	spinlock_t tx_lock;
-	atomic_t irq_sem;

 	struct work_struct tx_timeout_task;
 	struct timer_list blink_timer;
diff -urp linux-2.6.20-git14.orig/drivers/net/ixgb/ixgb_main.c linux-2.6.20-git14/drivers/net/ixgb/ixgb_main.c
--- linux-2.6.20-git14.orig/drivers/net/ixgb/ixgb_main.c	2007-02-19 14:32:16.0 -0500
+++ linux-2.6.20-git14/drivers/net/ixgb/ixgb_main.c	2007-02-19 15:06:52.0 -0500
@@ -201,7 +201,6 @@ module_exit(ixgb_exit_module);
 static void
 ixgb_irq_disable(struct ixgb_adapter *adapter)
 {
-	atomic_inc(&adapter->irq_sem);
 	IXGB_WRITE_REG(&adapter->hw, IMC, ~0);
 	IXGB_WRITE_FLUSH(&adapter->hw);
 	synchronize_irq(adapter->pdev->irq);
@@ -215,12 +214,10 @@ ixgb_irq_disable(struct ixgb_adapter *ad
 static void
 ixgb_irq_enable(struct ixgb_adapter *adapter)
 {
-	if(atomic_dec_and_test(&adapter->irq_sem)) {
-		IXGB_WRITE_REG(&adapter->hw, IMS,
-			       IXGB_INT_RXT0 | IXGB_INT_RXDMT0 | IXGB_INT_TXDW |
-			       IXGB_INT_LSC);
-		IXGB_WRITE_FLUSH(&adapter->hw);
-	}
+	IXGB_WRITE_REG(&adapter->hw, IMS,
+		       IXGB_INT_RXT0 | IXGB_INT_RXDMT0 | IXGB_INT_TXDW |
+		       IXGB_INT_LSC);
+	IXGB_WRITE_FLUSH(&adapter->hw);
 }

 int
@@ -584,7 +581,6 @@ ixgb_sw_init(struct ixgb_adapter *adapte
 	/* enable flow control to be programmed */
 	hw->fc.send_xon = 1;

-	atomic_set(&adapter->irq_sem, 1);
 	spin_lock_init(&adapter->tx_lock);

 	return 0;
@@ -1755,7 +1751,6 @@ ixgb_intr(int irq, void *data)
 	   of the posted write is intentionally left out.
 	 */
-	atomic_inc(&adapter->irq_sem);
 	IXGB_WRITE_REG(&adapter->hw, IMC, ~0);
 	__netif_rx_schedule(netdev);
 }
[PATCH 2/3] remove irq_sem from e1000
From: Chris Snook <[EMAIL PROTECTED]>

Remove unnecessary irq_sem accounting from e1000. Tested with no problems.

Signed-off-by: Chris Snook <[EMAIL PROTECTED]>

--
diff -urp linux-2.6.20-git14.orig/drivers/net/e1000/e1000.h linux-2.6.20-git14/drivers/net/e1000/e1000.h
--- linux-2.6.20-git14.orig/drivers/net/e1000/e1000.h	2007-02-19 14:32:15.0 -0500
+++ linux-2.6.20-git14/drivers/net/e1000/e1000.h	2007-02-19 15:07:37.0 -0500
@@ -252,7 +252,6 @@ struct e1000_adapter {
 #ifdef CONFIG_E1000_NAPI
 	spinlock_t tx_queue_lock;
 #endif
-	atomic_t irq_sem;
 	unsigned int total_tx_bytes;
 	unsigned int total_tx_packets;
 	unsigned int total_rx_bytes;
diff -urp linux-2.6.20-git14.orig/drivers/net/e1000/e1000_main.c linux-2.6.20-git14/drivers/net/e1000/e1000_main.c
--- linux-2.6.20-git14.orig/drivers/net/e1000/e1000_main.c	2007-02-19 14:32:15.0 -0500
+++ linux-2.6.20-git14/drivers/net/e1000/e1000_main.c	2007-02-19 15:09:28.0 -0500
@@ -349,7 +349,6 @@ static void e1000_free_irq(struct e1000_
 static void
 e1000_irq_disable(struct e1000_adapter *adapter)
 {
-	atomic_inc(&adapter->irq_sem);
 	E1000_WRITE_REG(&adapter->hw, IMC, ~0);
 	E1000_WRITE_FLUSH(&adapter->hw);
 	synchronize_irq(adapter->pdev->irq);
@@ -363,10 +362,8 @@ e1000_irq_disable(struct e1000_adapter *
 static void
 e1000_irq_enable(struct e1000_adapter *adapter)
 {
-	if (likely(atomic_dec_and_test(&adapter->irq_sem))) {
-		E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK);
-		E1000_WRITE_FLUSH(&adapter->hw);
-	}
+	E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK);
+	E1000_WRITE_FLUSH(&adapter->hw);
 }

 static void
@@ -1336,7 +1333,6 @@ e1000_sw_init(struct e1000_adap
 	spin_lock_init(&adapter->tx_queue_lock);
 #endif

-	atomic_set(&adapter->irq_sem, 1);
 	spin_lock_init(&adapter->stats_lock);

 	set_bit(__E1000_DOWN, &adapter->flags);
@@ -3758,11 +3754,6 @@ e1000_intr_msi(int irq, void *data)
 #endif
 	uint32_t icr = E1000_READ_REG(hw, ICR);

-#ifdef CONFIG_E1000_NAPI
-	/* read ICR disables interrupts using IAM, so keep up with our
-	 * enable/disable accounting */
-	atomic_inc(&adapter->irq_sem);
-#endif
 	if (icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC)) {
 		hw->get_link_status = 1;
 		/* 80003ES2LAN workaround-- For packet buffer work-around on
@@ -3832,13 +3823,6 @@ e1000_intr(int irq, void *data)
 	if (unlikely(hw->mac_type >= e1000_82571 &&
 		     !(icr & E1000_ICR_INT_ASSERTED)))
 		return IRQ_NONE;
-
-	/* Interrupt Auto-Mask...upon reading ICR,
-	 * interrupts are masked. No need for the
-	 * IMC write, but it does mean we should
-	 * account for it ASAP. */
-	if (likely(hw->mac_type >= e1000_82571))
-		atomic_inc(&adapter->irq_sem);
 #endif

 	if (unlikely(icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))) {
@@ -3862,7 +3846,6 @@ e1000_intr(int irq, void *data)
 #ifdef CONFIG_E1000_NAPI
 	if (unlikely(hw->mac_type < e1000_82571)) {
 		/* disable interrupts, without the synchronize_irq bit */
-		atomic_inc(&adapter->irq_sem);
 		E1000_WRITE_REG(hw, IMC, ~0);
 		E1000_WRITE_FLUSH(hw);
 	}
@@ -3888,7 +3871,6 @@ e1000_intr(int irq, void *data)
 	 * de-assertion state.
 	 */
 	if (hw->mac_type == e1000_82547 || hw->mac_type == e1000_82547_rev_2) {
-		atomic_inc(&adapter->irq_sem);
 		E1000_WRITE_REG(hw, IMC, ~0);
 	}
Re: [PATCH 0/3] remove irq_sem cruft from e1000 and derivatives
Chris Snook wrote:
> Hey folks --
>
> While digging through the atl1 source, I was troubled by the code using
> irq_sem. I did some digging and found the same code in e1000 and ixgb. I'm
> not entirely sure what it was originally intended to do, but it doesn't seem
> to be doing anything useful now, except possibly locking interrupts off if
> NAPI is flipped on and off enough times to cause an integer overflow. The
> following patches completely remove irq_sem from each of the drivers. This
> has been tested successfully on atl1 and e1000. If someone would like to
> send me ixgb hardware I'd be glad to test that, otherwise you'll have to
> test it yourself. :)
>
> -- Chris

I'm not yet seeing patch 1/3 appear, but I'll certainly take a look at the patches and have them tested in our labs for e1000 and ixgb once they appear.

Cheers,
  Auke
[PATCH 1/3] remove irq_sem from atl1
From: Chris Snook <[EMAIL PROTECTED]>

Remove unnecessary irq_sem code from atl1 driver. Tested with no problems.

Signed-off-by: Chris Snook <[EMAIL PROTECTED]>
Signed-off-by: Jay Cliburn <[EMAIL PROTECTED]>

--
diff -urp linux-2.6.20-git14.orig/drivers/net/atl1/atl1.h linux-2.6.20-git14/drivers/net/atl1/atl1.h
--- linux-2.6.20-git14.orig/drivers/net/atl1/atl1.h	2007-02-19 14:32:15.0 -0500
+++ linux-2.6.20-git14/drivers/net/atl1/atl1.h	2007-02-19 15:10:07.0 -0500
@@ -236,7 +236,6 @@ struct atl1_adapter {
 	u16 link_speed;
 	u16 link_duplex;
 	spinlock_t lock;
-	atomic_t irq_sem;
 	struct work_struct tx_timeout_task;
 	struct work_struct link_chg_task;
 	struct work_struct pcie_dma_to_rst_task;
diff -urp linux-2.6.20-git14.orig/drivers/net/atl1/atl1_main.c linux-2.6.20-git14/drivers/net/atl1/atl1_main.c
--- linux-2.6.20-git14.orig/drivers/net/atl1/atl1_main.c	2007-02-19 14:32:15.0 -0500
+++ linux-2.6.20-git14/drivers/net/atl1/atl1_main.c	2007-02-19 15:10:44.0 -0500
@@ -163,7 +163,6 @@ static int __devinit atl1_sw_init(struct
 	hw->cmb_tx_timer = 1;	/* about 2us */
 	hw->smb_timer = 10;	/* about 200ms */

-	atomic_set(&adapter->irq_sem, 0);
 	spin_lock_init(&adapter->lock);
 	spin_lock_init(&adapter->mb_lock);

@@ -272,8 +271,7 @@ err_nomem:
 */
 static void atl1_irq_enable(struct atl1_adapter *adapter)
 {
-	if (likely(!atomic_dec_and_test(&adapter->irq_sem)))
-		iowrite32(IMR_NORMAL_MASK, adapter->hw.hw_addr + REG_IMR);
+	iowrite32(IMR_NORMAL_MASK, adapter->hw.hw_addr + REG_IMR);
 }

 static void atl1_clear_phy_int(struct atl1_adapter *adapter)
@@ -1205,7 +1203,6 @@ static u32 atl1_configure(struct atl1_ad
 */
 static void atl1_irq_disable(struct atl1_adapter *adapter)
 {
-	atomic_inc(&adapter->irq_sem);
 	iowrite32(0, adapter->hw.hw_addr + REG_IMR);
 	ioread32(adapter->hw.hw_addr + REG_IMR);
 	synchronize_irq(adapter->pdev->irq);
[PATCH 0/3] remove irq_sem cruft from e1000 and derivatives
Hey folks -- While digging through the atl1 source, I was troubled by the code using irq_sem. I did some digging and found the same code in e1000 and ixgb. I'm not entirely sure what it was originally intended to do, but it doesn't seem to be doing anything useful now, except possibly locking interrupts off if NAPI is flipped on and off enough times to cause an integer overflow. The following patches completely remove irq_sem from each of the drivers. This has been tested successfully on atl1 and e1000. If someone would like to send me ixgb hardware I'd be glad to test that, otherwise you'll have to test it yourself. :) -- Chris
[PATCH 3/3] forcedeth: fix checksum feature in mcp65
This patch removes the checksum offload feature from the MCP65 chipsets, as it is not supported in hardware.

Signed-Off-By: Ayaz Abdulla <[EMAIL PROTECTED]>

--- orig/drivers/net/forcedeth.c	2007-02-19 09:17:41.0 -0500
+++ new/drivers/net/forcedeth.c	2007-02-19 09:19:43.0 -0500
@@ -5374,19 +5374,19 @@
 	},
 	{	/* MCP65 Ethernet Controller */
 		PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_20),
-		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
+		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
 	},
 	{	/* MCP65 Ethernet Controller */
 		PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_21),
-		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
+		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
 	},
 	{	/* MCP65 Ethernet Controller */
 		PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_22),
-		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
+		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
 	},
 	{	/* MCP65 Ethernet Controller */
 		PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_23),
-		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
+		.driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA|DEV_HAS_POWER_CNTRL|DEV_HAS_MSI|DEV_HAS_PAUSEFRAME_TX|DEV_HAS_STATISTICS_V2|DEV_HAS_TEST_EXTENDED|DEV_HAS_MGMT_UNIT,
 	},
 	{	/* MCP67 Ethernet Controller */
 		PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_24),
Re: forcedeth problems on 2.6.20-rc6-mm3
Robert Hancock wrote:
> Ayaz Abdulla wrote:
>> For all those who are having issues, please try out the attached patch.
>>
>> Ayaz
>>
>> --- orig/drivers/net/forcedeth.c	2007-02-08 21:41:59.0 -0500
>> +++ new/drivers/net/forcedeth.c	2007-02-08 21:44:53.0 -0500
>> @@ -3104,13 +3104,17 @@
>>  	struct fe_priv *np = netdev_priv(dev);
>>  	u8 __iomem *base = get_hwbase(dev);
>>  	unsigned long flags;
>> +	u32 retcode;
>>
>> -	if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2)
>> +	if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) {
>>  		pkts = nv_rx_process(dev, limit);
>> -	else
>> +		retcode = nv_alloc_rx(dev);
>> +	} else {
>>  		pkts = nv_rx_process_optimized(dev, limit);
>> +		retcode = nv_alloc_rx_optimized(dev);
>> +	}
>>
>> -	if (nv_alloc_rx(dev)) {
>> +	if (retcode) {
>>  		spin_lock_irqsave(&np->lock, flags);
>>  		if (!np->in_shutdown)
>>  			mod_timer(&np->oom_kick, jiffies + OOM_REFILL);
>
> Did anyone push this patch into mainline? forcedeth on 2.6.20-git14 is
> still completely broken without this patch.

I have submitted the patch to the netdev mailing list.
[PATCH 2/3] forcedeth: disable msix
There seems to be an issue when both MSI-X is enabled and NAPI is configured. This patch disables MSI-X until the issue is root-caused.

Signed-Off-By: Ayaz Abdulla <[EMAIL PROTECTED]>

--- orig/drivers/net/forcedeth.c	2007-02-19 09:17:02.0 -0500
+++ new/drivers/net/forcedeth.c	2007-02-19 09:17:07.0 -0500
@@ -839,7 +839,7 @@
 	NV_MSIX_INT_DISABLED,
 	NV_MSIX_INT_ENABLED
 };
-static int msix = NV_MSIX_INT_ENABLED;
+static int msix = NV_MSIX_INT_DISABLED;

 /*
  * DMA 64bit
[PATCH 1/3] forcedeth: fixed missing call in napi poll
The napi poll routine was refilling the rx ring with the legacy allocator even on the optimized path. This patch adds the missing call to the optimized refill routine for that path.

Signed-Off-By: Ayaz Abdulla <[EMAIL PROTECTED]>

--- orig/drivers/net/forcedeth.c	2007-02-19 09:13:10.0 -0500
+++ new/drivers/net/forcedeth.c	2007-02-19 09:13:46.0 -0500
@@ -3104,13 +3104,17 @@
 	struct fe_priv *np = netdev_priv(dev);
 	u8 __iomem *base = get_hwbase(dev);
 	unsigned long flags;
+	u32 retcode;

-	if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2)
+	if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) {
 		pkts = nv_rx_process(dev, limit);
-	else
+		retcode = nv_alloc_rx(dev);
+	} else {
 		pkts = nv_rx_process_optimized(dev, limit);
+		retcode = nv_alloc_rx_optimized(dev);
+	}

-	if (nv_alloc_rx(dev)) {
+	if (retcode) {
 		spin_lock_irqsave(&np->lock, flags);
 		if (!np->in_shutdown)
 			mod_timer(&np->oom_kick, jiffies + OOM_REFILL);
Re: forcedeth problems on 2.6.20-rc6-mm3
Ayaz Abdulla wrote:
> For all those who are having issues, please try out the attached patch.
>
> Ayaz
>
> --- orig/drivers/net/forcedeth.c	2007-02-08 21:41:59.0 -0500
> +++ new/drivers/net/forcedeth.c	2007-02-08 21:44:53.0 -0500
> @@ -3104,13 +3104,17 @@
>  	struct fe_priv *np = netdev_priv(dev);
>  	u8 __iomem *base = get_hwbase(dev);
>  	unsigned long flags;
> +	u32 retcode;
>
> -	if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2)
> +	if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) {
>  		pkts = nv_rx_process(dev, limit);
> -	else
> +		retcode = nv_alloc_rx(dev);
> +	} else {
>  		pkts = nv_rx_process_optimized(dev, limit);
> +		retcode = nv_alloc_rx_optimized(dev);
> +	}
>
> -	if (nv_alloc_rx(dev)) {
> +	if (retcode) {
>  		spin_lock_irqsave(&np->lock, flags);
>  		if (!np->in_shutdown)
>  			mod_timer(&np->oom_kick, jiffies + OOM_REFILL);

Did anyone push this patch into mainline? forcedeth on 2.6.20-git14 is still completely broken without this patch.

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
[2.6 patch] kill net/rxrpc/rxrpc_syms.c
This patch moves the EXPORT_SYMBOLs from net/rxrpc/rxrpc_syms.c to the files containing the actual functions.

Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

---

This patch was already sent on:
- 26 Nov 2006

 net/rxrpc/Makefile     |    1 -
 net/rxrpc/call.c       |    5 +
 net/rxrpc/connection.c |    2 ++
 net/rxrpc/rxrpc_syms.c |   34 --
 net/rxrpc/transport.c  |    4
 5 files changed, 11 insertions(+), 35 deletions(-)

--- linux-2.6.19-rc6-mm1/net/rxrpc/Makefile.old	2006-11-26 04:49:25.0 +0100
+++ linux-2.6.19-rc6-mm1/net/rxrpc/Makefile	2006-11-26 04:50:08.0 +0100
@@ -12,7 +12,6 @@
 	krxtimod.o \
 	main.o \
 	peer.o \
-	rxrpc_syms.o \
 	transport.o

 ifeq ($(CONFIG_PROC_FS),y)
--- linux-2.6.19-rc6-mm1/net/rxrpc/call.c.old	2006-11-26 04:50:51.0 +0100
+++ linux-2.6.19-rc6-mm1/net/rxrpc/call.c	2006-11-26 04:51:58.0 +0100
@@ -314,6 +314,7 @@
 	_leave(" = %d", ret);
 	return ret;
 } /* end rxrpc_create_call() */
+EXPORT_SYMBOL(rxrpc_create_call);
 /*/
 /*
@@ -465,6 +466,7 @@
 	_leave(" [destroyed]");
 } /* end rxrpc_put_call() */
+EXPORT_SYMBOL(rxrpc_put_call);
 /*/
 /*
@@ -923,6 +925,7 @@
 	return __rxrpc_call_abort(call, error);
 } /* end rxrpc_call_abort() */
+EXPORT_SYMBOL(rxrpc_call_abort);
 /*/
 /*
@@ -1910,6 +1913,7 @@
 	}
 } /* end rxrpc_call_read_data() */
+EXPORT_SYMBOL(rxrpc_call_read_data);
 /*/
 /*
@@ -2076,6 +2080,7 @@
 	return ret;
 } /* end rxrpc_call_write_data() */
+EXPORT_SYMBOL(rxrpc_call_write_data);
 /*/
 /*
--- linux-2.6.19-rc6-mm1/net/rxrpc/connection.c.old	2006-11-26 04:52:08.0 +0100
+++ linux-2.6.19-rc6-mm1/net/rxrpc/connection.c	2006-11-26 04:52:32.0 +0100
@@ -207,6 +207,7 @@
 	spin_unlock(&peer->conn_gylock);
 	goto make_active;
 } /* end rxrpc_create_connection() */
+EXPORT_SYMBOL(rxrpc_create_connection);
 /*/
 /*
@@ -411,6 +412,7 @@
 	_leave(" [killed]");
 } /* end rxrpc_put_connection() */
+EXPORT_SYMBOL(rxrpc_put_connection);
 /*/
 /*
--- linux-2.6.19-rc6-mm1/net/rxrpc/transport.c.old	2006-11-26 04:52:43.0 +0100
+++ linux-2.6.19-rc6-mm1/net/rxrpc/transport.c	2006-11-26 04:53:36.0 +0100
@@ -146,6 +146,7 @@
 	_leave(" = %d", ret);
 	return ret;
 } /* end rxrpc_create_transport() */
+EXPORT_SYMBOL(rxrpc_create_transport);
 /*/
 /*
@@ -196,6 +197,7 @@
 	_leave("");
 } /* end rxrpc_put_transport() */
+EXPORT_SYMBOL(rxrpc_put_transport);
 /*/
 /*
@@ -231,6 +233,7 @@
 	_leave("= %d", ret);
 	return ret;
 } /* end rxrpc_add_service() */
+EXPORT_SYMBOL(rxrpc_add_service);
 /*/
 /*
@@ -248,6 +251,7 @@
 	_leave("");
 } /* end rxrpc_del_service() */
+EXPORT_SYMBOL(rxrpc_del_service);
 /*/
 /*
--- linux-2.6.19-rc6-mm1/net/rxrpc/rxrpc_syms.c	2006-09-20 05:42:06.0 +0200
+++ /dev/null	2006-09-19 00:45:31.0 +0200
@@ -1,34 +0,0 @@
-/* rxrpc_syms.c: exported Rx RPC layer interface symbols
- *
- * Copyright (C) 2002 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells ([EMAIL PROTECTED])
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-
-#include
-
-#include
-#include
-#include
-#include
-
-/* call.c */
-EXPORT_SYMBOL(rxrpc_create_call);
-EXPORT_SYMBOL(rxrpc_put_call);
-EXPORT_SYMBOL(rxrpc_call_abort);
-EXPORT_SYMBOL(rxrpc_call_read_data);
-EXPORT_SYMBOL(rxrpc_call_write_data);
-
-/* connection.c */
-EXPORT_SYMBOL(rxrpc_create_connection);
-EXPORT_SYMBOL(rxrpc_put_connection);
-
-/* transport.c */
-EXPORT_SYMBOL(rxrpc_create_transport);
-EXPORT_SYMBOL(rxrpc_put_transport);
-EXPORT_SYMBOL(rxrpc_add_service);
-EXPORT_SYMBOL(rxrpc_del_service);
[-mm patch] drivers/net/vioc/: possible cleanups
On Thu, Feb 15, 2007 at 05:14:08AM -0800, Andrew Morton wrote: >... > Changes since 2.6.20-rc6-mm3: >... > +Fabric7-VIOC-driver.patch >... > netdev stuff >... This patch contains the following possible cleanups: - remove dead #ifdef EXPORT_SYMTAB code - no "inline" functions in C files - gcc knows best whether or not to inline static functions - move the vioc_ethtool_ops prototype to a header file - make needlessly global code static - #if 0 unused code - vioc_irq.c: remove the unused vioc_driver_lock Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> --- drivers/net/vioc/f7/sppapi.h |2 - drivers/net/vioc/khash.h | 10 - drivers/net/vioc/spp.c| 60 +++--- drivers/net/vioc/vioc_api.c |5 ++ drivers/net/vioc/vioc_api.h |3 + drivers/net/vioc/vioc_driver.c|3 - drivers/net/vioc/vioc_ethtool.c |2 - drivers/net/vioc/vioc_irq.c |9 +--- drivers/net/vioc/vioc_provision.c | 16 +--- drivers/net/vioc/vioc_receive.c |2 - drivers/net/vioc/vioc_spp.c |6 +-- drivers/net/vioc/vioc_transmit.c | 40 +++- 12 files changed, 85 insertions(+), 73 deletions(-) --- linux-2.6.20-mm1/drivers/net/vioc/vioc_driver.c.old 2007-02-18 01:14:31.0 +0100 +++ linux-2.6.20-mm1/drivers/net/vioc/vioc_driver.c 2007-02-18 01:14:35.0 +0100 @@ -868,6 +868,3 @@ module_init(vioc_module_init); module_exit(vioc_module_exit); -#ifdef EXPORT_SYMTAB -EXPORT_SYMBOL(vioc_viocdev); -#endif /* EXPORT_SYMTAB */ --- linux-2.6.20-mm1/drivers/net/vioc/khash.h.old 2007-02-18 01:16:28.0 +0100 +++ linux-2.6.20-mm1/drivers/net/vioc/khash.h 2007-02-18 01:25:29.0 +0100 @@ -52,14 +52,4 @@ }; -struct shash_t *hashT_create(u32, size_t, size_t, u32(*)(unsigned char *, unsigned long), int(*)(void *, void *), unsigned int); -int hashT_delete(struct shash_t * , void *); -struct hash_elem_t *hashT_lookup(struct shash_t * , void *); -struct hash_elem_t *hashT_add(struct shash_t *, void *); -void hashT_destroy(struct shash_t *); -/* Accesors */ -void **hashT_getkeys(struct shash_t *); -size_t hashT_tablesize(struct shash_t *); -size_t 
hashT_size(struct shash_t *); - #endif --- linux-2.6.20-mm1/drivers/net/vioc/f7/sppapi.h.old 2007-02-18 01:26:44.0 +0100 +++ linux-2.6.20-mm1/drivers/net/vioc/f7/sppapi.h 2007-02-18 01:26:57.0 +0100 @@ -234,7 +234,5 @@ extern void spp_msg_unregister(u32 key_facility); -extern int read_spp_regbank32(int vioc, int bank, char *buffer); - #endif /* _SPPAPI_H_ */ --- linux-2.6.20-mm1/drivers/net/vioc/spp.c.old 2007-02-18 01:19:34.0 +0100 +++ linux-2.6.20-mm1/drivers/net/vioc/spp.c 2007-02-18 01:27:13.0 +0100 @@ -50,6 +50,15 @@ c -= a; c -= b; c ^= (b >> 15); \ } +static struct hash_elem_t *hashT_add(struct shash_t *htable, void *key); +static struct shash_t *hashT_create(u32 sizehint, size_t keybuf_size, +size_t databuf_size, +u32(*hfunc) (unsigned char *, +unsigned long), +int (*cfunc) (void *, void *), +unsigned int flags); +static void hashT_destroy(struct shash_t *htable); +static struct hash_elem_t *hashT_lookup(struct shash_t *htable, void *key); struct shash_t { /* Common fields for all hash tables types */ @@ -65,7 +74,9 @@ }; struct hash_ops { +#if 0 int (*delete) (struct shash_t *, void *); +#endif /* 0 */ struct hash_elem_t *(*lookup) (struct shash_t *, void *); void (*destroy) (struct shash_t *); struct hash_elem_t *(*add) (struct shash_t *, void *); @@ -143,6 +154,7 @@ return ((htable->hash_fn(key, len)) & (htable->tsize - 1)); } +#if 0 /* Data associated to this key MUST be freed by the caller */ static int ch_delete(struct shash_t *htable, void *key) { @@ -181,6 +193,7 @@ return -1; } +#endif /* 0 */ static void ch_destroy(struct shash_t *htable) { @@ -232,16 +245,21 @@ } /* Accesors **/ -inline size_t hashT_tablesize(struct shash_t * htable) + +#if 0 + +size_t hashT_tablesize(struct shash_t * htable) { return htable->tsize; } -inline size_t hashT_size(struct shash_t * htable) +size_t hashT_size(struct shash_t * htable) { return htable->nelems; } +#endif /* 0 */ + static struct hash_elem_t *ch_lookup(struct shash_t *htable, void *key) { u32 idx; @@ 
-330,15 +348,17 @@ return 1; } -struct hash_ops ch_ops = { +static struct hash_ops ch_ops = { +#if 0 .delete = ch_delete, +#endif /* 0 */ .lookup = ch_lookup, .destroy = ch_destroy, .getkeys = ch_getkeys, .add = ch_add }; -struct facility fTable[FACILITY_CNT];
[2.6 patch] net/irda/: proper prototypes
On Mon, Feb 05, 2007 at 06:01:42PM -0800, David Miller wrote:
> From: [EMAIL PROTECTED]
> Date: Mon, 05 Feb 2007 16:30:53 -0800
>
>> From: Adrian Bunk <[EMAIL PROTECTED]>
>>
>> Add proper prototypes for some functions in include/net/irda/irda.h
>>
>> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
>> Acked-by: Samuel Ortiz <[EMAIL PROTECTED]>
>> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
>
> I NAK'd this so that Adrian would go add "extern" to the
> function declarations in the header file.
>
> Please drop this, Adrian will resend once he fixes it up.

Sorry, I should have sent this earlier. Updated patch below.

cu
Adrian

<-- snip -->

This patch adds proper prototypes for some functions in include/net/irda/irda.h

Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

---

 include/net/irda/irda.h |   16
 net/irda/irmod.c        |   13 -
 2 files changed, 16 insertions(+), 13 deletions(-)

--- linux-2.6.20-rc1-mm1/include/net/irda/irda.h.old	2006-12-18 02:49:02.0 +0100
+++ linux-2.6.20-rc1-mm1/include/net/irda/irda.h	2006-12-18 02:58:02.0 +0100
@@ -113,4 +113,20 @@
 #define IAS_IRCOMM_ID 0x2343
 #define IAS_IRLPT_ID 0x9876

+struct net_device;
+struct packet_type;
+
+extern void irda_proc_register(void);
+extern void irda_proc_unregister(void);
+
+extern int irda_sysctl_register(void);
+extern void irda_sysctl_unregister(void);
+
+extern int irsock_init(void);
+extern void irsock_cleanup(void);
+
+extern int irlap_driver_rcv(struct sk_buff *skb, struct net_device *dev,
+			    struct packet_type *ptype,
+			    struct net_device *orig_dev);
+
 #endif /* NET_IRDA_H */
--- linux-2.6.20-rc1-mm1/net/irda/irmod.c.old	2006-12-18 02:52:18.0 +0100
+++ linux-2.6.20-rc1-mm1/net/irda/irmod.c	2006-12-18 02:53:59.0 +0100
@@ -42,19 +42,6 @@
 #include 	/* irttp_init */
 #include 	/* irda_device_init */

-/* irproc.c */
-extern void irda_proc_register(void);
-extern void irda_proc_unregister(void);
-/* irsysctl.c */
-extern int irda_sysctl_register(void);
-extern void irda_sysctl_unregister(void);
-/* af_irda.c */
-extern int irsock_init(void);
-extern void irsock_cleanup(void);
-/* irlap_frame.c */
-extern int irlap_driver_rcv(struct sk_buff *, struct net_device *,
-			    struct packet_type *, struct net_device *);
-
 /*
  * Module parameters
 */
[RFC: 2.6 patch] zd1211rw: possible cleanups
This patch contains the following possible cleanups: - make needlessly global functions static - #if 0 the following unused global functions: - zd_chip.c: zd_ioread16() - zd_chip.c: zd_ioread32() - zd_chip.c: zd_iowrite16() - zd_chip.c: zd_ioread32v() - zd_chip.c: zd_read_mac_addr() - zd_chip.c: zd_set_beacon_interval() - zd_util.c: zd_hexdump() Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> --- drivers/net/wireless/zd1211rw/zd_chip.c | 27 +++- drivers/net/wireless/zd1211rw/zd_chip.h | 26 +-- drivers/net/wireless/zd1211rw/zd_mac.h |6 - drivers/net/wireless/zd1211rw/zd_util.c |5 +--- drivers/net/wireless/zd1211rw/zd_util.h |6 - 5 files changed, 30 insertions(+), 40 deletions(-) --- linux-2.6.19-rc6-mm1/drivers/net/wireless/zd1211rw/zd_chip.h.old 2006-11-26 00:18:00.0 +0100 +++ linux-2.6.19-rc6-mm1/drivers/net/wireless/zd1211rw/zd_chip.h 2006-11-26 00:26:41.0 +0100 @@ -709,15 +709,6 @@ return zd_usb_ioread16(&chip->usb, value, addr); } -int zd_ioread32v_locked(struct zd_chip *chip, u32 *values, - const zd_addr_t *addresses, unsigned int count); - -static inline int zd_ioread32_locked(struct zd_chip *chip, u32 *value, -const zd_addr_t addr) -{ - return zd_ioread32v_locked(chip, value, (const zd_addr_t *)&addr, 1); -} - static inline int zd_iowrite16_locked(struct zd_chip *chip, u16 value, zd_addr_t addr) { @@ -747,9 +738,6 @@ return _zd_iowrite32v_locked(chip, &ioreq, 1); } -int zd_iowrite32a_locked(struct zd_chip *chip, -const struct zd_ioreq32 *ioreqs, unsigned int count); - static inline int zd_rfwrite_locked(struct zd_chip *chip, u32 value, u8 bits) { ZD_ASSERT(mutex_is_locked(&chip->mutex)); @@ -766,12 +754,7 @@ /* Locking functions for reading and writing registers. * The different parameters are intentional. 
*/ -int zd_ioread16(struct zd_chip *chip, zd_addr_t addr, u16 *value); -int zd_iowrite16(struct zd_chip *chip, zd_addr_t addr, u16 value); -int zd_ioread32(struct zd_chip *chip, zd_addr_t addr, u32 *value); int zd_iowrite32(struct zd_chip *chip, zd_addr_t addr, u32 value); -int zd_ioread32v(struct zd_chip *chip, const zd_addr_t *addresses, - u32 *values, unsigned int count); int zd_iowrite32a(struct zd_chip *chip, const struct zd_ioreq32 *ioreqs, unsigned int count); @@ -783,7 +766,6 @@ u8 zd_chip_get_channel(struct zd_chip *chip); int zd_read_regdomain(struct zd_chip *chip, u8 *regdomain); void zd_get_e2p_mac_addr(struct zd_chip *chip, u8 *mac_addr); -int zd_read_mac_addr(struct zd_chip *chip, u8 *mac_addr); int zd_write_mac_addr(struct zd_chip *chip, const u8 *mac_addr); int zd_chip_switch_radio_on(struct zd_chip *chip); int zd_chip_switch_radio_off(struct zd_chip *chip); @@ -794,20 +776,24 @@ int zd_chip_enable_hwint(struct zd_chip *chip); int zd_chip_disable_hwint(struct zd_chip *chip); +#if 0 static inline int zd_get_encryption_type(struct zd_chip *chip, u32 *type) { return zd_ioread32(chip, CR_ENCRYPTION_TYPE, type); } +#endif /* 0 */ static inline int zd_set_encryption_type(struct zd_chip *chip, u32 type) { return zd_iowrite32(chip, CR_ENCRYPTION_TYPE, type); } +#if 0 static inline int zd_chip_get_basic_rates(struct zd_chip *chip, u16 *cr_rates) { return zd_ioread16(chip, CR_BASIC_RATE_TBL, cr_rates); } +#endif /* 0 */ int zd_chip_set_basic_rates(struct zd_chip *chip, u16 cr_rates); @@ -827,12 +813,12 @@ int zd_chip_control_leds(struct zd_chip *chip, enum led_status status); -int zd_set_beacon_interval(struct zd_chip *chip, u32 interval); - +#if 0 static inline int zd_get_beacon_interval(struct zd_chip *chip, u32 *interval) { return zd_ioread32(chip, CR_BCN_INTERVAL, interval); } +#endif /* 0 */ struct rx_status; --- linux-2.6.19-rc6-mm1/drivers/net/wireless/zd1211rw/zd_chip.c.old 2006-11-26 00:18:10.0 +0100 +++ 
linux-2.6.19-rc6-mm1/drivers/net/wireless/zd1211rw/zd_chip.c 2006-11-26 00:37:13.0 +0100 @@ -87,8 +87,8 @@ /* Read a variable number of 32-bit values. Parameter count is not allowed to * exceed USB_MAX_IOREAD32_COUNT. */ -int zd_ioread32v_locked(struct zd_chip *chip, u32 *values, const zd_addr_t *addr, -unsigned int count) +static int zd_ioread32v_locked(struct zd_chip *chip, u32 *values, + const zd_addr_t *addr, unsigned int count) { int r; int i; @@ -135,6 +135,12 @@ return r; } +static int zd_ioread32_locked(struct zd_chip *chip, u32 *value, + const zd_addr_t addr) +{ + return zd_ioread32v_locked(chip, value, (const zd_addr_t *)&addr, 1); +} + int _zd_iowrite32v_locked(struct zd_chip *chip, const struct zd_ioreq32 *ioreqs, u
Re: MediaGX/GeodeGX1 requires X86_OOSTORE.
On Tue, Feb 20, 2007 at 08:56:39AM +0900, takada wrote:
> /proc/cpuinfo with MediaGXm :
>
> processor       : 0
> vendor_id       : CyrixInstead
> cpu family      : 5
> model           : 5
> model name      : Cyrix MediaGXtm MMXtm Enhanced
> stepping        : 2
> cpu MHz         : 199.750
> cache size      : 16 KB
> fdiv_bug        : no
> hlt_bug         : no
> f00f_bug        : no
> coma_bug        : no
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 2
> wp              : yes
> flags           : fpu tsc msr cx8 cmov mmx cxmmx
> bogomips        : 401.00
> clflush size    : 32

Hmm with 2.6.18 I am seeing:

processor       : 0
vendor_id       : CyrixInstead
cpu family      : 5
model           : 9
model name      : Geode(TM) Integrated Processor by National Semi
stepping        : 1
cpu MHz         : 266.648
cache size      : 16 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu tsc msr cx8 cmov mmx cxmmx
bogomips        : 534.50

Similar, but the last line isn't there. It looks like 2.6.18 doesn't actually have code to print that information though.

--
Len Sorensen
Re: 2.6.19-rc6-mm1: drivers/net/chelsio/: unused code
On Tue, Nov 28, 2006 at 11:47:19PM -0800, Andrew Morton wrote:
> On Wed, 29 Nov 2006 08:36:09 +0100
> Adrian Bunk <[EMAIL PROTECTED]> wrote:
>
> > On Mon, Nov 27, 2006 at 10:24:55AM -0800, Stephen Hemminger wrote:
> > > On Fri, 24 Nov 2006 01:17:31 +0100
> > > Adrian Bunk <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Thu, Nov 23, 2006 at 02:17:03AM -0800, Andrew Morton wrote:
> > > > >...
> > > > > Changes since 2.6.19-rc5-mm2:
> > > > >...
> > > > > +chelsio-22-driver.patch
> > > > >...
> > > > > netdev updates
> > > >
> > > > It is suspicious that the following newly added code is completely
> > > > unused:
> > > >   drivers/net/chelsio/ixf1010.o
> > > >     t1_ixf1010_ops
> > > >   drivers/net/chelsio/mac.o
> > > >     t1_chelsio_mac_ops
> > > >   drivers/net/chelsio/vsc8244.o
> > > >     t1_vsc8244_ops
> > > >
> > > > cu
> > > > Adrian
> > >
> > > All that is gone in later version. I reposted new patches
> > > after -mm2 was done.
> >
> > It seems these patches didn't make it into 2.6.19-rc6-mm2 ?
>
> I dropped that patch and picked up Francois's tree instead.

These structs are still both present and unused as of 2.6.20-mm1.

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
Re: Re: Strange connection slowdown on pcnet32
On Mon, Feb 19, 2007 at 06:45:48PM -0500, Lennart Sorensen wrote:
> It seems the problem actually occurs when the receive descriptor ring
> is full. This seems to generate one (or sometimes more) descriptors in
> the ring which claim to be owned by the MAC, but at the head of the
> receive ring as far as the driver is concerned. I see some note in the
> driver about an SP3G chipset sometimes causing this. How would one
> identify this and clear such descriptors out of the way? Getting stuck
> until the next time the MAC gets around to the descriptor and overwrites
> it is not good, since it causes delays, and out of order packets.

I am also noticing the receive error count going up, and the source is this code:

	if (status & 0x01)			/* Only count a general error at the */
		lp->stats.rx_errors++;		/* end of a packet. */

It appears this means I am receiving a frame marked with "End Of Packet" but without "Start of Packet". I have no idea how that happens, but it shouldn't be able to make the driver and MAC stop processing the receive ring.

--
Len Sorensen
Re: MediaGX/GeodeGX1 requires X86_OOSTORE.
From: Roland Dreier <[EMAIL PROTECTED]>
Subject: Re: MediaGX/GeodeGX1 requires X86_OOSTORE.
Date: Mon, 19 Feb 2007 11:48:27 -0800

> > Does anyone know if there is any way to flush a cache line of the cpu to
> > force rereading system memory for a given address or address range?
>
> There is the "clflush" instruction, but not all x86 CPUs support it.
> You need to check the CPUID flag to know for sure (/proc/cpuinfo will
> show a "clflush" flag if it is supported).

/proc/cpuinfo with MediaGXm :

processor       : 0
vendor_id       : CyrixInstead
cpu family      : 5
model           : 5
model name      : Cyrix MediaGXtm MMXtm Enhanced
stepping        : 2
cpu MHz         : 199.750
cache size      : 16 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu tsc msr cx8 cmov mmx cxmmx
bogomips        : 401.00
clflush size    : 32
Re: nonblocking UDPv4 recvfrom() taking 4usec @ 3GHz?
On Tue, 20 Feb 2007 00:14:47 +0100 bert hubert <[EMAIL PROTECTED]> wrote: > Hi people, > > I'm trying to save people the cost of buying extra servers by making > PowerDNS (GPL) ever faster, but I've hit a rather fundamental problem. > > Linux 2.6.20-rc4 appears to take 4 microseconds on my P4 3GHz for a > non-blocking UDPv4 recvfrom() call, both on loopback and ethernet. > > Linux 2.6.18 on my 64 bit Athlon64 3200+ takes a similar amount of time. > > This seems like rather a lot for a 50 byte datagram, but perhaps I'm > overestimating your abilities :-) > > The program is unthreaded, and I measure like this: > > #define RDTSC(qp) \ > do { \ > unsigned long lowPart, highPart;\ > __asm__ __volatile__("rdtsc" : "=a" (lowPart), "=d" (highPart)); \ > qp = (((unsigned long long) highPart) << 32) | lowPart; \ > } while (0) > > ... > > uint64_t tsc1, tsc2; > RDTSC(tsc1); > > if((len=recvfrom(fd, data, sizeof(data), 0, (sockaddr *)&fromaddr, &addrlen)) > >= 0) { > RDTSC(tsc2); > printf("%f\n", (tsc2-tsc1)/3000.0); // 3GHz P4 > } > > gdb generates the following dump from the actual program, > x=_Z20handleNewUDPQuestioniRN5boost3anyE, I see nothing untoward happening > between the two 'rdtsc' opcodes. 
> > 0x08091de0 : push %ebp > 0x08091de1 : mov%esp,%ebp > 0x08091de3 : push %edi > 0x08091de4 : push %esi > 0x08091de5 : push %ebx > 0x08091de6 : sub$0x78c,%esp > 0x08091dec : mov%gs:0x14,%eax > 0x08091df2 : mov%eax,0xffe4(%ebp) > 0x08091df5 : xor%eax,%eax > 0x08091df7 : movw $0x2,0xffac(%ebp) > 0x08091dfd : movl $0x0,0xffb0(%ebp) > 0x08091e04 : movw $0x0,0xffae(%ebp) > 0x08091e0a : movl $0x1c,0xf8f4(%ebp) > 0x08091e14 : rdtsc > 0x08091e16 : mov%edx,%ebx > 0x08091e18 : mov0x8(%ebp),%edx > 0x08091e1b : mov%eax,%esi > 0x08091e1d : lea0xf8f4(%ebp),%eax > 0x08091e23 : mov%eax,0x14(%esp) > 0x08091e27 : lea0xffac(%ebp),%ecx > 0x08091e2a : lea0xf950(%ebp),%eax > 0x08091e30 : mov%ecx,0x10(%esp) > 0x08091e34 : movl $0x0,0xc(%esp) > 0x08091e3c : movl $0x5dc,0x8(%esp) > 0x08091e44 :mov%eax,0x4(%esp) > 0x08091e48 :mov%edx,(%esp) > 0x08091e4b :call 0x8192110 > 0x08091e50 :test %eax,%eax > 0x08091e52 :mov%eax,0xf8b0(%ebp) > 0x08091e58 :js 0x8092168 > 0x08091e5e :mov%ebx,%eax > 0x08091e60 :xor%edx,%edx > 0x08091e62 :mov%eax,%edx > 0x08091e64 :mov$0x0,%eax > 0x08091e69 :mov%esi,%ecx > 0x08091e6b :mov%eax,%esi > 0x08091e6d :or %ecx,%esi > 0x08091e6f :mov%edx,%edi > 0x08091e71 :rdtsc > 0x08091e73 :mov%eax,0xf8a0(%ebp) > 0x08091e79 :mov0xf8a0(%ebp),%eax > 0x08091e7f :mov%edx,%ecx > 0x08091e81 :xor%ebx,%ebx > 0x08091e83 :mov%ecx,%ebx > > recvfrom itself is a tad worrisome, x=recvfrom. I didn't ask for the > 'libc_enable_asynccancel' stuff. I'm trying to isolate the actual syscall > but it is proving hard work for an assemnly newbie like me - socketcall > doesn't make things easier. 
> > 0xb7d62410 :cmpl $0x0,%gs:0xc > 0xb7d62418 :jne0xb7d62439 > 0xb7d6241a : mov%ebx,%edx > 0xb7d6241c : mov$0x66,%eax > 0xb7d62421 : mov$0xc,%ebx > 0xb7d62426 : lea0x4(%esp),%ecx > 0xb7d6242a : call *%gs:0x10 > 0xb7d62431 : mov%edx,%ebx > 0xb7d62433 : cmp$0xff83,%eax > 0xb7d62436 : jae0xb7d62469 > 0xb7d62438 : ret > 0xb7d62439 : push %esi > 0xb7d6243a : call 0xb7d6ddd0 <__libc_enable_asynccancel> > 0xb7d6243f : mov%eax,%esi > 0xb7d62441 : mov%ebx,%edx > 0xb7d62443 : mov$0x66,%eax > 0xb7d62448 : mov$0xc,%ebx > 0xb7d6244d : lea0x8(%esp),%ecx > 0xb7d62451 : call *%gs:0x10 > 0xb7d62458 : mov%edx,%ebx > 0xb7d6245a : xchg %eax,%esi > 0xb7d6245b : call 0xb7d6dd90 <__libc_disable_asynccancel> > 0xb7d62460 : mov%esi,%eax > 0xb7d62462 : pop%esi > 0xb7d62463 : cmp$0xff83,%eax > 0xb7d62466 : jae0xb7d62469 > 0xb7d62468 : ret > 0xb7d62469 : call 0xb7d998f8 <__i686.get_pc_thunk.cx> > 0xb7d6246e : add$0x61b86,%ecx > 0xb7d62474 : mov0xff2c(%ecx),%ecx > 0xb7d6247a : xor%edx,%edx > 0xb7d6247c : sub%eax,%edx > 0xb7d6247e : mov%edx,%gs:(%ecx) > 0xb7d62481 : or $0x,%eax > 0xb7d62484 : jmp0xb7d62438 > > Any clues? > Use oprofile to find the hotspot. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function
John Heffner wrote:
> Note the patch is compile-tested only! I can do some real testing if
> you'd like to apply this Dave.

The date you read on the patch is due to the fact that I've split this patchset into 2 diff files.

This isn't compile-tested only; I've used this piece of code for about 3 months. However, more testing is good and welcome.

Regards,
Angelo P. Castellani
Re: Re: Strange connection slowdown on pcnet32
On Mon, Feb 19, 2007 at 05:29:20PM -0500, Lennart Sorensen wrote:
> I just noticed, it seems almost all these problems occur right at the
> start of transfers when the tcp window size is still being worked out
> for the connection speed, and I am seeing the error count go up in
> ifconfig for the port when it happens too. Is it possible for an error
> to get flagged in a receive descriptor without the owner bit being
> updated?

It seems the problem actually occurs when the receive descriptor ring is full. This seems to generate one (or sometimes more) descriptors in the ring which claim to be owned by the MAC, but at the head of the receive ring as far as the driver is concerned. I see some note in the driver about an SP3G chipset sometimes causing this. How would one identify this and clear such descriptors out of the way? Getting stuck until the next time the MAC gets around to the descriptor and overwrites it is not good, since it causes delays and out-of-order packets.

--
Len Sorensen
nonblocking UDPv4 recvfrom() taking 4usec @ 3GHz?
Hi people,

I'm trying to save people the cost of buying extra servers by making PowerDNS (GPL) ever faster, but I've hit a rather fundamental problem.

Linux 2.6.20-rc4 appears to take 4 microseconds on my P4 3GHz for a non-blocking UDPv4 recvfrom() call, both on loopback and ethernet.

Linux 2.6.18 on my 64 bit Athlon64 3200+ takes a similar amount of time.

This seems like rather a lot for a 50 byte datagram, but perhaps I'm overestimating your abilities :-)

The program is unthreaded, and I measure like this:

#define RDTSC(qp) \
do { \
    unsigned long lowPart, highPart; \
    __asm__ __volatile__("rdtsc" : "=a" (lowPart), "=d" (highPart)); \
    qp = (((unsigned long long) highPart) << 32) | lowPart; \
} while (0)

...

uint64_t tsc1, tsc2;
RDTSC(tsc1);

if((len=recvfrom(fd, data, sizeof(data), 0, (sockaddr *)&fromaddr, &addrlen)) >= 0) {
    RDTSC(tsc2);
    printf("%f\n", (tsc2-tsc1)/3000.0); // 3GHz P4
}

gdb generates the following dump from the actual program, x=_Z20handleNewUDPQuestioniRN5boost3anyE; I see nothing untoward happening between the two 'rdtsc' opcodes.
0x08091de0 : push %ebp
0x08091de1 : mov %esp,%ebp
0x08091de3 : push %edi
0x08091de4 : push %esi
0x08091de5 : push %ebx
0x08091de6 : sub $0x78c,%esp
0x08091dec : mov %gs:0x14,%eax
0x08091df2 : mov %eax,0xffe4(%ebp)
0x08091df5 : xor %eax,%eax
0x08091df7 : movw $0x2,0xffac(%ebp)
0x08091dfd : movl $0x0,0xffb0(%ebp)
0x08091e04 : movw $0x0,0xffae(%ebp)
0x08091e0a : movl $0x1c,0xf8f4(%ebp)
0x08091e14 : rdtsc
0x08091e16 : mov %edx,%ebx
0x08091e18 : mov 0x8(%ebp),%edx
0x08091e1b : mov %eax,%esi
0x08091e1d : lea 0xf8f4(%ebp),%eax
0x08091e23 : mov %eax,0x14(%esp)
0x08091e27 : lea 0xffac(%ebp),%ecx
0x08091e2a : lea 0xf950(%ebp),%eax
0x08091e30 : mov %ecx,0x10(%esp)
0x08091e34 : movl $0x0,0xc(%esp)
0x08091e3c : movl $0x5dc,0x8(%esp)
0x08091e44 : mov %eax,0x4(%esp)
0x08091e48 : mov %edx,(%esp)
0x08091e4b : call 0x8192110
0x08091e50 : test %eax,%eax
0x08091e52 : mov %eax,0xf8b0(%ebp)
0x08091e58 : js 0x8092168
0x08091e5e : mov %ebx,%eax
0x08091e60 : xor %edx,%edx
0x08091e62 : mov %eax,%edx
0x08091e64 : mov $0x0,%eax
0x08091e69 : mov %esi,%ecx
0x08091e6b : mov %eax,%esi
0x08091e6d : or %ecx,%esi
0x08091e6f : mov %edx,%edi
0x08091e71 : rdtsc
0x08091e73 : mov %eax,0xf8a0(%ebp)
0x08091e79 : mov 0xf8a0(%ebp),%eax
0x08091e7f : mov %edx,%ecx
0x08091e81 : xor %ebx,%ebx
0x08091e83 : mov %ecx,%ebx

recvfrom itself is a tad worrisome, x=recvfrom. I didn't ask for the 'libc_enable_asynccancel' stuff. I'm trying to isolate the actual syscall but it is proving hard work for an assembly newbie like me - socketcall doesn't make things easier.
0xb7d62410 : cmpl $0x0,%gs:0xc
0xb7d62418 : jne 0xb7d62439
0xb7d6241a : mov %ebx,%edx
0xb7d6241c : mov $0x66,%eax
0xb7d62421 : mov $0xc,%ebx
0xb7d62426 : lea 0x4(%esp),%ecx
0xb7d6242a : call *%gs:0x10
0xb7d62431 : mov %edx,%ebx
0xb7d62433 : cmp $0xff83,%eax
0xb7d62436 : jae 0xb7d62469
0xb7d62438 : ret
0xb7d62439 : push %esi
0xb7d6243a : call 0xb7d6ddd0 <__libc_enable_asynccancel>
0xb7d6243f : mov %eax,%esi
0xb7d62441 : mov %ebx,%edx
0xb7d62443 : mov $0x66,%eax
0xb7d62448 : mov $0xc,%ebx
0xb7d6244d : lea 0x8(%esp),%ecx
0xb7d62451 : call *%gs:0x10
0xb7d62458 : mov %edx,%ebx
0xb7d6245a : xchg %eax,%esi
0xb7d6245b : call 0xb7d6dd90 <__libc_disable_asynccancel>
0xb7d62460 : mov %esi,%eax
0xb7d62462 : pop %esi
0xb7d62463 : cmp $0xff83,%eax
0xb7d62466 : jae 0xb7d62469
0xb7d62468 : ret
0xb7d62469 : call 0xb7d998f8 <__i686.get_pc_thunk.cx>
0xb7d6246e : add $0x61b86,%ecx
0xb7d62474 : mov 0xff2c(%ecx),%ecx
0xb7d6247a : xor %edx,%edx
0xb7d6247c : sub %eax,%edx
0xb7d6247e : mov %edx,%gs:(%ecx)
0xb7d62481 : or $0x,%eax
0xb7d62484 : jmp 0xb7d62438

Any clues?

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services
Re: Kernel bug in bcm43xx-d80211
On Mon, 2007-02-19 at 17:30 -0500, Pavel Roskin wrote:
> Johannes, would it be possible to commit patches faster, please? Now
> that I told Michael about git-update-server-info, his changes are
> downloadable as soon as he makes a commit. wireless-dev.git, on the
> other hand, is a mess and has been for some time (since Friday, I
> believe).

I don't commit to wireless-dev, John does. I'd love it if the patches were in already ;) And I think he even said he had committed them but they didn't show up, so something must have gone wrong (forgot to push out to kernel.org, maybe).

johannes
Re: Kernel bug in bcm43xx-d80211
On Mon, 2007-02-19 at 23:12 +0100, Johannes Berg wrote:
> On Mon, 2007-02-19 at 13:48 -0800, Alex Davis wrote:
> > I got the following Oops with the latest wireless-dev git when starting
> > wpa_supplicant:
> >
> > Feb 19 16:17:42 boss kernel: [ 377.359573] BUG: unable to handle kernel
> > NULL pointer dereference at virtual address 0002
>
> Probably caused by my recent changes that accidentally broke d80211
> pretty much completely. Patches are on the linux-wireless mailing list.

Johannes, would it be possible to commit patches faster, please? Now that I told Michael about git-update-server-info, his changes are downloadable as soon as he makes a commit. wireless-dev.git, on the other hand, is a mess and has been for some time (since Friday, I believe).

It is a problem for projects like DadWifi that recommend using the top of wireless-dev.git.

Yes, I know, breakage is unavoidable to a certain degree, but it shouldn't come to the situation when the patches are known, nobody objects, yet the repository stays broken and all newcomers have to be told about the problem.

That's not to offend you or anyone. It's just something that would help a lot.

--
Regards,
Pavel Roskin
Re: Re: Strange connection slowdown on pcnet32
On Mon, Feb 19, 2007 at 05:18:45PM -0500, Lennart Sorensen wrote: > On Mon, Feb 19, 2007 at 03:11:36PM -0500, Lennart Sorensen wrote: > > I have been poking at things with firescope to see if the MAC is > > actually writing to system memory or not. > > > > The entry that it gets stuch on is _always_ entry 0 in the rx_ring. > > There does not appear to be any exceptions to this. > > > > Here is my firescope (slightly modified for this purpose) dump of the > > rx_ring of eth1: > > > > Descriptor:Address: /--base---\ /buf\ /sta\ /-message-\ /reserved-\ > > : : | | |len| |tus| | length | | | > > RXdesc[00]:6694000: 12 18 5f 05 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[01]:6694010: 12 78 15 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[02]:6694020: 12 a0 52 06 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[03]:6694030: 12 f8 c2 04 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[04]:6694040: 12 70 15 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[05]:6694050: 12 e8 37 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[06]:6694060: 12 e0 37 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[07]:6694070: 12 e8 d5 04 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[08]:6694080: 12 e0 d5 04 fa f9 40 03 ee 05 00 00 00 00 00 00 > > RXdesc[09]:6694090: 12 d8 d1 05 fa f9 40 03 46 00 00 00 00 00 00 00 > > RXdesc[10]:66940a0: 12 d0 d1 05 fa f9 40 03 4e 00 00 00 00 00 00 00 > > RXdesc[11]:66940b0: 12 d8 02 05 fa f9 10 03 40 00 00 00 00 00 00 00 > > RXdesc[12]:66940c0: 12 d0 02 05 fa f9 40 03 46 00 00 00 00 00 00 00 > > RXdesc[13]:66940d0: 12 38 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[14]:66940e0: 12 30 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[15]:66940f0: 12 78 2c 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[16]:6694100: 12 a0 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[17]:6694110: 12 b0 04 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[18]:6694120: 12 b8 04 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[19]:6694130: 12 70 2c 05 
fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[20]:6694140: 12 f8 56 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[21]:6694150: 12 c8 29 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[22]:6694160: 12 20 03 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > > RXdesc[23]:6694170: 12 60 4c 05 fa f9 00 80 87 05 00 00 00 00 00 00 > > RXdesc[24]:6694180: 12 98 53 05 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[25]:6694190: 12 b0 cc 04 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[26]:66941a0: 12 a8 3f 05 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[27]:66941b0: 12 58 e8 04 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[28]:66941c0: 12 b0 4d 06 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[29]:66941d0: 12 38 ef 04 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[30]:66941e0: 12 98 1f 05 fa f9 00 80 40 00 00 00 00 00 00 00 > > RXdesc[31]:66941f0: 12 28 f1 04 fa f9 00 80 40 00 00 00 00 00 00 00 > > > > I only ever see entry 0 as status 0080 (0x8000 which is owned by mac), > > and this is while the driver is checking entry 0 every time it tries to > > check for any waiting packets. > > > > Running tcpdump while pinging gives the interesting result that some > > packets are ariving out of order making it seem like the driver is > > processing the packets out of order. Perhaps the driver is wrong to be > > looking at entry 0, and should be looking at entry 1 and is hence stuck > > until the whole receive ring has been filled again? 
> > > > 15:06:04.112812 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 1 > > 15:06:05.119799 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 2 > > 15:06:05.120159 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 2 > > 15:06:05.127045 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 1 > > 15:06:06.119862 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 3 > > 15:06:07.119921 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 4 > > 15:06:08.119994 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 5 > > 15:06:08.426400 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 3 > > 15:06:08.427915 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 4 > > 15:06:08.429033 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 5 > > 15:06:09.120053 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 6 > > 15:06:10.120109 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 7 > > 15:06:10.705332 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 6 > > 15:06:10.707258 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 7 > > 15:06:11.120175 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 8 > > 15:06:12.120233 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 9 > > 15:06:13.120297 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 10 > > 15:06:14.120359 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 11 > > 15:06:14.120737 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 11 >
Re: Kernel bug in bcm43xx-d80211
On Mon, 2007-02-19 at 13:48 -0800, Alex Davis wrote:
> I got the following Oops with the latest wireless-dev git when starting
> wpa_supplicant:

Wireless topics moved from this list to [EMAIL PROTECTED]
Broadcom drivers are discussed in [EMAIL PROTECTED]

wireless-dev is horribly broken, and the fixes haven't been merged yet. The current Broadcom driver can be loaded from http://bu3sch.de/git/wireless-dev.git (please load it on top of wireless-dev.git to save bandwidth). It doesn't include the latest breakage from wireless-dev, but it does include some important fixes.

Although I haven't seen a problem like yours, I strongly suggest that you try the above repository and post your results to the bcm43xx-dev list. Even if the results are more positive :)

--
Regards,
Pavel Roskin
Re: Re: Strange connection slowdown on pcnet32
On Mon, Feb 19, 2007 at 03:11:36PM -0500, Lennart Sorensen wrote: > I have been poking at things with firescope to see if the MAC is > actually writing to system memory or not. > > The entry that it gets stuch on is _always_ entry 0 in the rx_ring. > There does not appear to be any exceptions to this. > > Here is my firescope (slightly modified for this purpose) dump of the > rx_ring of eth1: > > Descriptor:Address: /--base---\ /buf\ /sta\ /-message-\ /reserved-\ > : : | | |len| |tus| | length | | | > RXdesc[00]:6694000: 12 18 5f 05 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[01]:6694010: 12 78 15 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[02]:6694020: 12 a0 52 06 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[03]:6694030: 12 f8 c2 04 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[04]:6694040: 12 70 15 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[05]:6694050: 12 e8 37 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[06]:6694060: 12 e0 37 05 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[07]:6694070: 12 e8 d5 04 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[08]:6694080: 12 e0 d5 04 fa f9 40 03 ee 05 00 00 00 00 00 00 > RXdesc[09]:6694090: 12 d8 d1 05 fa f9 40 03 46 00 00 00 00 00 00 00 > RXdesc[10]:66940a0: 12 d0 d1 05 fa f9 40 03 4e 00 00 00 00 00 00 00 > RXdesc[11]:66940b0: 12 d8 02 05 fa f9 10 03 40 00 00 00 00 00 00 00 > RXdesc[12]:66940c0: 12 d0 02 05 fa f9 40 03 46 00 00 00 00 00 00 00 > RXdesc[13]:66940d0: 12 38 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[14]:66940e0: 12 30 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[15]:66940f0: 12 78 2c 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[16]:6694100: 12 a0 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[17]:6694110: 12 b0 04 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[18]:6694120: 12 b8 04 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[19]:6694130: 12 70 2c 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[20]:6694140: 12 f8 56 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[21]:6694150: 12 
c8 29 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[22]:6694160: 12 20 03 05 fa f9 00 80 ee 05 00 00 00 00 00 00 > RXdesc[23]:6694170: 12 60 4c 05 fa f9 00 80 87 05 00 00 00 00 00 00 > RXdesc[24]:6694180: 12 98 53 05 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[25]:6694190: 12 b0 cc 04 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[26]:66941a0: 12 a8 3f 05 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[27]:66941b0: 12 58 e8 04 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[28]:66941c0: 12 b0 4d 06 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[29]:66941d0: 12 38 ef 04 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[30]:66941e0: 12 98 1f 05 fa f9 00 80 40 00 00 00 00 00 00 00 > RXdesc[31]:66941f0: 12 28 f1 04 fa f9 00 80 40 00 00 00 00 00 00 00 > > I only ever see entry 0 as status 0080 (0x8000 which is owned by mac), > and this is while the driver is checking entry 0 every time it tries to > check for any waiting packets. > > Running tcpdump while pinging gives the interesting result that some > packets are ariving out of order making it seem like the driver is > processing the packets out of order. Perhaps the driver is wrong to be > looking at entry 0, and should be looking at entry 1 and is hence stuck > until the whole receive ring has been filled again? 
> > 15:06:04.112812 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 1 > 15:06:05.119799 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 2 > 15:06:05.120159 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 2 > 15:06:05.127045 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 1 > 15:06:06.119862 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 3 > 15:06:07.119921 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 4 > 15:06:08.119994 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 5 > 15:06:08.426400 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 3 > 15:06:08.427915 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 4 > 15:06:08.429033 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 5 > 15:06:09.120053 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 6 > 15:06:10.120109 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 7 > 15:06:10.705332 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 6 > 15:06:10.707258 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 7 > 15:06:11.120175 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 8 > 15:06:12.120233 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 9 > 15:06:13.120297 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 10 > 15:06:14.120359 IP 10.128.10.254 > 10.128.10.1: icmp 64: echo request seq 11 > 15:06:14.120737 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 11 > 15:06:14.127064 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 8 > 15:06:14.127700 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo reply seq 9 > 15:06:14.128268 IP 10.128.10.1 > 10.128.10.254: icmp 64: echo
Re: Kernel bug in bcm43xx-d80211
On Mon, 2007-02-19 at 13:48 -0800, Alex Davis wrote:
> I got the following Oops with the latest wireless-dev git when starting
> wpa_supplicant:
>
> Feb 19 16:17:42 boss kernel: [ 377.359573] BUG: unable to handle kernel NULL
> pointer dereference at virtual address 0002

Probably caused by my recent changes that accidentally broke d80211
pretty much completely. Patches are on the linux-wireless mailing list.

johannes
Kernel bug in bcm43xx-d80211
I go the following Oops with the latest wireless-dev git when starting wpa_supplicant: Feb 19 16:17:42 boss kernel: [ 377.359573] BUG: unable to handle kernel NULL pointer dereference at virtual address 0002 Feb 19 16:17:42 boss kernel: [ 377.359641] printing eip: Feb 19 16:17:42 boss kernel: [ 377.359670] f8b2a3c3 Feb 19 16:17:42 boss kernel: [ 377.359672] *pde = Feb 19 16:17:42 boss kernel: [ 377.359702] Oops: 0002 [#1] Feb 19 16:17:42 boss kernel: [ 377.359730] SMP Feb 19 16:17:42 boss kernel: [ 377.359799] Modules linked in: af_packet arc4 ecb blkcipher rc80211_simple bcm43xx_d80211 80211 cfg80211 snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd _pcm_oss snd_mixer_oss ipv6 usbhid hid usbmouse snd_intel8x0 snd_ac97_codec b44 ssb ehci_hcd uhci_hcd intel_agp yenta_socket pcmcia ac97_bus serio_raw usbcore agpgart rsrc_nonstatic ohci1394 snd_pcm ide_cd pc mcia_core 8250_pci evdev firmware_class 8250 ieee1394 serial_core snd_timer cdrom snd crc32 soundcore snd_page_alloc unix Feb 19 16:17:42 boss kernel: [ 377.360945] CPU:0 Feb 19 16:17:42 boss kernel: [ 377.360946] EIP:0060:[]Not tainted VLI Feb 19 16:17:42 boss kernel: [ 377.360947] EFLAGS: 00010246 (2.6.20 #1) Feb 19 16:17:42 boss kernel: [ 377.361048] EIP is at do_mark_unused+0x0/0x7 [bcm43xx_d80211] Feb 19 16:17:42 boss kernel: [ 377.361080] eax: f71d7000 ebx: ecx: edx: Feb 19 16:17:42 boss kernel: [ 377.361113] esi: edi: f71d7000 ebp: f8b2a3c3 esp: c192dee0 Feb 19 16:17:42 boss kernel: [ 377.361146] ds: 007b es: 007b ss: 0068 Feb 19 16:17:42 boss kernel: [ 377.361176] Process events/0 (pid: 6, ti=c192c000 task=c191ca70 task.ti=c192c000) Feb 19 16:17:42 boss kernel: [ 377.361210] Stack: f8b28629 c0103587 Feb 19 16:17:42 boss kernel: [ 377.361433]0282 f8b2a3d8 f71d7000 f8b19db7 f71d7000 f8b19f50 f89b64a0 38058a67 Feb 19 16:17:42 boss kernel: [ 377.361655]f71d7000 f8b1a0dd 0011 f71d7274 f71d7270 c18fd2c0 0246 c012a392 Feb 19 16:17:42 boss kernel: [ 377.361878] Call Trace: Feb 19 16:17:42 
boss kernel: [ 377.361932] [] bcm43xx_call_for_each_loctl+0x30/0x9b [bcm43xx_d80211] Feb 19 16:17:42 boss kernel: [ 377.362003] [] common_interrupt+0x23/0x28 Feb 19 16:17:42 boss kernel: [ 377.362060] [] bcm43xx_loctl_mark_all_unused+0xe/0x17 [bcm43xx_d80211] Feb 19 16:17:42 boss kernel: [ 377.362129] [] bcm43xx_periodic_every60sec+0x8/0x2e [bcm43xx_d80211] Feb 19 16:17:42 boss kernel: [ 377.362197] [] do_periodic_work+0xb4/0xe9 [bcm43xx_d80211] Feb 19 16:17:42 boss kernel: [ 377.362258] [] bcm43xx_periodic_work_handler+0xb5/0x16f [bcm43xx_d80211] Feb 19 16:17:42 boss kernel: [ 377.362327] [] run_workqueue+0x7e/0x14e Feb 19 16:17:42 boss kernel: [ 377.362381] [] bcm43xx_periodic_work_handler+0x0/0x16f [bcm43xx_d80211] Feb 19 16:17:42 boss kernel: [ 377.362449] [] worker_thread+0x14e/0x16d Feb 19 16:17:42 boss kernel: [ 377.362503] [] default_wake_function+0x0/0xc Feb 19 16:17:42 boss kernel: [ 377.362558] [] default_wake_function+0x0/0xc Feb 19 16:17:42 boss kernel: [ 377.362613] [] worker_thread+0x0/0x16d Feb 19 16:17:42 boss kernel: [ 377.362665] [] kthread+0xa0/0xd1 Feb 19 16:17:42 boss kernel: [ 377.362717] [] kthread+0x0/0xd1 Feb 19 16:17:42 boss kernel: [ 377.362769] [] kernel_thread_helper+0x7/0x10 Feb 19 16:17:42 boss kernel: [ 377.362823] === Feb 19 16:17:42 boss kernel: [ 377.362852] Code: 04 00 00 00 80 e3 01 0f 44 c1 88 44 24 02 8d 8a 72 03 00 00 89 f0 8d 54 24 02 e8 9f e1 ff ff 8b 5c 24 04 8b 74 24 08 83 c4 0c c3 <80> 62 02 fe 31 c0 c3 53 ba c3 a3 b2 f8 8b 58 6c e8 21 e2 ff ff Feb 19 16:17:42 boss kernel: [ 377.364227] EIP: [] do_mark_unused+0x0/0x7 [bcm43xx_d80211] SS:ESP 0068:c192dee0 lspci -v 02:03.0 Network controller: Broadcom Corporation BCM4309 802.11a/b/g (rev 03) Subsystem: Dell Truemobile 1450 MiniPCI Flags: bus master, fast devsel, latency 32, IRQ 18 Memory at faff6000 (32-bit, non-prefetchable) [size=8K] wpa_supplicant is version 0.4.9: I was trying to connect to a Linksys WRT54G using WEP encryption. 
Relevant part of .config CONFIG_BCM43XX=m CONFIG_BCM43XX_DEBUG=y CONFIG_BCM43XX_DMA=y CONFIG_BCM43XX_PIO=y CONFIG_BCM43XX_DMA_AND_PIO_MODE=y # CONFIG_BCM43XX_DMA_MODE is not set # CONFIG_BCM43XX_PIO_MODE is not set # CONFIG_ZD1211RW is not set CONFIG_BCM43XX_D80211=m CONFIG_BCM43XX_D80211_PCI=y CONFIG_BCM43XX_D80211_PCMCIA=y CONFIG_BCM43XX_D80211_DEBUG=y CONFIG_BCM43XX_D80211_DMA=y CONFIG_BCM43XX_D80211_PIO=y CONFIG_BCM43XX_D80211_DMA_AND_PIO_MODE=y # CONFIG_BCM43XX_D80211_DMA_MODE is not set # CONFIG_BCM43XX_D80211_PIO_MODE is not set # CONFIG_RT2X00 is not set # CONFIG_ADM8211 is not set # CONFIG_P54_COMMON is not set # CONFIG_ZD1211RW_D80211 is not set CONFIG_NET_WIRELESS=y Machine is a Dell Inspiron 9100 laptop with an HT-enabled Pe
Re: [Bugme-new] [Bug 7974] New: BUG: scheduling while atomic: swapper/0x10000100/0
On Thu, Feb 15, 2007 at 03:45:23PM -0800, Jay Vosburgh wrote:
> For the short term, yes, I don't have any disagreement with
> switching the timer based stuff over to workqueues. Basically a one for
> one replacement to get the functions in a process context and tweak the
> locking.

I did some testing of my patch last week and it definitely has some
issues. I'm running into a problem that is similar to the thread started
last week titled "[BUG] RTNL and flush_scheduled_work deadlocks" but I
think I can patch around that if needed.

> I do think we're having a little confusion over details of
> terminology; if I'm not mistaken, you're thinking that workqueue means
> single threaded: even though each individual "monitor thingie" is a
> separate piece of work, they still can't collide.
>
> That's true, but (unless I've missed a call somewhere) there
> isn't a "wq_pause_for_a_bit" type of call (that, e.g., waits for
> anything running to stop, then doesn't run any further work until we
> later tell it to), so suspending all of the periodic things running for
> the bond is more hassle than if there's just one schedulable work thing,
> which internally calls the right functions to do the various things.
> This is also single threaded, but easier to stop and start. It seems to
> be simpler to have multiple link monitors running in such a system as
> well (without having them thrashing the link state as would happen now).

I see by looking at your patch that you keep a list of timers and only
schedule work for the event that will happen next. I've seen timer
implementations like this before and feel it's reasonable. It would be
good to account for skew, but other than that it seems like a reasonable
solution -- though it is too bad that workqueues and their behavior seem
like somewhat of a mystery to most and cause people to code around them
(I don't blame you one bit).

I also plan to start testing your patch later this week and will let you
know what I find.
-andy - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Extensible hashing and RCU
Evgeniy Polyakov <[EMAIL PROTECTED]> writes:
> My experiment shows almost 400 nsecs without _any_ locks - they are
> removed completely - it is pure hash selection/list traverse time.

Are you sure you're not measuring TLB misses too? In user space you
likely use 4K pages. The kernel would use 2MB pages.

I would suggest putting the tables into hugetlbfs-allocated memory in
your test program.

-Andi
Re: [PATCH 3/4] 8139too: RTNL and flush_scheduled_work deadlock
Cc: list trimmed.

Jarek Poplawski <[EMAIL PROTECTED]> :
> On Fri, Feb 16, 2007 at 09:20:34PM +0100, Francois Romieu wrote:
[...]
> > Btw, the thread runs every 3*HZ at most.
>
> You are right (mostly)! But I think rtnl_lock is special
> and should be spared (even this 3*HZ) and here it's used
> for some mainly internal purpose (close synchronization).
> And it looks like mainly for this internal reason holding
> of rtnl_lock is increased. And because rtnl_lock is quite
> popular you have to take into consideration that after
> this 3*HZ it could spend some time waiting for the lock.
> So, maybe it would be nicer to check this netif_running
> twice (after rtnl_lock where needed), but maybe it's a
> matter of taste only, and yours is better, as well.

The region protected by RTNL has been widened to include a tx_timeout
handler. It is supposed to handle an occasional error, something that
should not even happen at 3*HZ. Optimizing it is useless, especially on
a high-end performer like the 8139.

> (Btw. I didn't verify this, but I hope you checked that
> places not under rtnl_lock before the patch are safe from
> some locking problems now.)

I did. It is not a reason to trust the patch though.

-- 
Ueimor
Re: [linux-pm] [Ipw2100-devel] [RFC] Runtime power management on ipw2100
On Thursday 08 February 2007 1:01 am, Zhu Yi wrote:
> A generic requirement for dynamic power management is the hardware
> resource should not be touched when you put it in a low power state.

That is in no way a "generic" requirement. It might apply specifically
to one ipw2100 low power state ... but "in general" devices may support
more than one low power state, with different levels of functionality.
Not all of those levels necessarily disallow touching the hardware.

> But I think
> freeing the irq handler before suspend should be the right way to go.

Some folk like that model a lot for shared IRQs. It shouldn't matter
for non-sharable ones.

- Dave
Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function
I'd prefer to make it apply automatically across all congestion controls
that do slow-start, and also make the max_ssthresh parameter
controllable via sysctl. This patch (attached) should implement this.
Note the default value for sysctl_tcp_max_ssthresh = 0, which disables
limited slow-start. This should make ABC apply during LSS as well.

Note the patch is compile-tested only! I can do some real testing if
you'd like to apply this, Dave.

Thanks,
-John

Angelo P. Castellani wrote:
Forgot the patch..

Angelo P. Castellani wrote:
From: Angelo P. Castellani <[EMAIL PROTECTED]>

RFC3742: limited slow start

See http://www.ietf.org/rfc/rfc3742.txt

Signed-off-by: Angelo P. Castellani <[EMAIL PROTECTED]>
---
To allow code reutilization I've added the limited slow start procedure
as an exported symbol of linux tcp congestion control.

On large BDP networks canonical slow start should be avoided because it
requires large packet losses to converge, whereas at lower BDPs slow
start and limited slow start are identical. Large BDP is defined through
the max_ssthresh variable.

I think limited slow start could safely replace the canonical slow start
procedure in Linux.

Regards,
Angelo P. 
Castellani p.s.: in the attached patch is added an exported function currently used only by YeAH TCP include/net/tcp.h |1 + net/ipv4/tcp_cong.c | 23 +++ 2 files changed, 24 insertions(+) diff -uprN linux-2.6.20-a/include/net/tcp.h linux-2.6.20-c/include/net/tcp.h --- linux-2.6.20-a/include/net/tcp.h2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-c/include/net/tcp.h2007-02-19 10:54:10.0 +0100 @@ -669,6 +669,7 @@ extern void tcp_get_allowed_congestion_c extern int tcp_set_allowed_congestion_control(char *allowed); extern int tcp_set_congestion_control(struct sock *sk, const char *name); extern void tcp_slow_start(struct tcp_sock *tp); +extern void tcp_limited_slow_start(struct tcp_sock *tp); extern struct tcp_congestion_ops tcp_init_congestion_ops; extern u32 tcp_reno_ssthresh(struct sock *sk); diff -uprN linux-2.6.20-a/net/ipv4/tcp_cong.c linux-2.6.20-c/net/ipv4/tcp_cong.c --- linux-2.6.20-a/net/ipv4/tcp_cong.c 2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-c/net/ipv4/tcp_cong.c 2007-02-19 10:54:10.0 +0100 @@ -297,6 +297,29 @@ void tcp_slow_start(struct tcp_sock *tp) } EXPORT_SYMBOL_GPL(tcp_slow_start); +void tcp_limited_slow_start(struct tcp_sock *tp) +{ + /* RFC3742: limited slow start +* the window is increased by 1/K MSS for each arriving ACK, +* for K = int(cwnd/(0.5 max_ssthresh)) +*/ + + const int max_ssthresh = 100; + + if (max_ssthresh > 0 && tp->snd_cwnd > max_ssthresh) { + u32 k = max(tp->snd_cwnd / (max_ssthresh >> 1), 1U); + if (++tp->snd_cwnd_cnt >= k) { + if (tp->snd_cwnd < tp->snd_cwnd_clamp) + tp->snd_cwnd++; + tp->snd_cwnd_cnt = 0; + } + } else { + if (tp->snd_cwnd < tp->snd_cwnd_clamp) + tp->snd_cwnd++; + } +} +EXPORT_SYMBOL_GPL(tcp_limited_slow_start); + /* * TCP Reno congestion control * This is special case used for fallback as well. Add RFC3742 Limited Slow-Start, controlled by variable sysctl_tcp_max_ssthresh. 
Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- commit 97033fa201705e6cfc68ce66f34ede3277c3d645 tree 5df4607728abce93aa05b31015a90f2ce369abff parent 8a03d9a498eaf02c8a118752050a5154852c13bf author John Heffner <[EMAIL PROTECTED]> Mon, 19 Feb 2007 15:52:16 -0500 committer John Heffner <[EMAIL PROTECTED]> Mon, 19 Feb 2007 15:52:16 -0500 include/linux/sysctl.h |1 + include/net/tcp.h |1 + net/ipv4/sysctl_net_ipv4.c |8 net/ipv4/tcp_cong.c| 33 +++-- 4 files changed, 33 insertions(+), 10 deletions(-) diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h index 2c5fb38..a2dce72 100644 --- a/include/linux/sysctl.h +++ b/include/linux/sysctl.h @@ -438,6 +438,7 @@ enum NET_CIPSOV4_RBM_STRICTVALID=121, NET_TCP_AVAIL_CONG_CONTROL=122, NET_TCP_ALLOWED_CONG_CONTROL=123, + NET_TCP_MAX_SSTHRESH=124, }; enum { diff --git a/include/net/tcp.h b/include/net/tcp.h index 5c472f2..521da28 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -230,6 +230,7 @@ extern int sysctl_tcp_mtu_probing; extern int sysctl_tcp_base_mss; extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; +extern int sysctl_tcp_max_ssthresh; extern atomic_t tcp_memory_allocated; extern atomic_t tcp_sockets_allocated; diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 0aa3047..d68effe 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -803,6
[patch 1/2] natsemi: Add support for using MII port with no PHY
This patch provides code paths which allow the natsemi driver to use the external MII port on the chip but ignore any PHYs that may be attached to it. The link state will be left as it was when the driver started and can be configured via ethtool. Any PHYs that are present can be accessed via the MII ioctl()s. This is useful for systems where the device is connected without a PHY or where either information or actions outside the scope of the driver are required in order to use the PHYs. Signed-Off-By: Mark Brown <[EMAIL PROTECTED]> --- This revision of the patch fixes some issues brought up during review. Previous versions of this patch exposed the new functionality as a module option. This has been removed. Any hardware that needs this should be identifiable by a quirk since it unlikely to behave correctly with an unmodified driver. Index: linux/drivers/net/natsemi.c === --- linux.orig/drivers/net/natsemi.c2007-02-19 10:10:40.0 + +++ linux/drivers/net/natsemi.c 2007-02-19 10:20:45.0 + @@ -568,6 +568,8 @@ u32 intr_status; /* Do not touch the nic registers */ int hands_off; + /* Don't pay attention to the reported link state. */ + int ignore_phy; /* external phy that is used: only valid if dev->if_port != PORT_TP */ int mii; int phy_addr_external; @@ -696,7 +698,10 @@ struct netdev_private *np = netdev_priv(dev); u32 tmp; - netif_carrier_off(dev); + if (np->ignore_phy) + netif_carrier_on(dev); + else + netif_carrier_off(dev); /* get the initial settings from hardware */ tmp= mdio_read(dev, MII_BMCR); @@ -806,8 +811,10 @@ np->hands_off = 0; np->intr_status = 0; np->eeprom_size = natsemi_pci_info[chip_idx].eeprom_size; + np->ignore_phy = 0; /* Initial port: +* - If configured to ignore the PHY set up for external. * - If the nic was configured to use an external phy and if find_mii * finds a phy: use external port, first phy that replies. * - Otherwise: internal port. 
@@ -815,7 +822,7 @@ * The address would be used to access a phy over the mii bus, but * the internal phy is accessed through mapped registers. */ - if (readl(ioaddr + ChipConfig) & CfgExtPhy) + if (np->ignore_phy || readl(ioaddr + ChipConfig) & CfgExtPhy) dev->if_port = PORT_MII; else dev->if_port = PORT_TP; @@ -825,7 +832,9 @@ if (dev->if_port != PORT_TP) { np->phy_addr_external = find_mii(dev); - if (np->phy_addr_external == PHY_ADDR_NONE) { + /* If we're ignoring the PHY it doesn't matter if we can't +* find one. */ + if (!np->ignore_phy && np->phy_addr_external == PHY_ADDR_NONE) { dev->if_port = PORT_TP; np->phy_addr_external = PHY_ADDR_INTERNAL; } @@ -891,6 +900,8 @@ printk("%02x, IRQ %d", dev->dev_addr[i], irq); if (dev->if_port == PORT_TP) printk(", port TP.\n"); + else if (np->ignore_phy) + printk(", port MII, ignoring PHY\n"); else printk(", port MII, phy ad %d.\n", np->phy_addr_external); } @@ -1571,9 +1582,13 @@ { struct netdev_private *np = netdev_priv(dev); void __iomem * ioaddr = ns_ioaddr(dev); - int duplex; + int duplex = np->duplex; u16 bmsr; + /* If we are ignoring the PHY then don't try reading it. */ + if (np->ignore_phy) + goto propagate_state; + /* The link status field is latched: it remains low after a temporary * link failure until it's read. We need the current link status, * thus read twice. @@ -1585,7 +1600,7 @@ if (netif_carrier_ok(dev)) { if (netif_msg_link(np)) printk(KERN_NOTICE "%s: link down.\n", - dev->name); + dev->name); netif_carrier_off(dev); undo_cable_magic(dev); } @@ -1609,6 +1624,7 @@ duplex = 1; } +propagate_state: /* if duplex is set then bit 28 must be set, too */ if (duplex ^ !!(np->rx_config & RxAcceptTx)) { if (netif_msg_link(np)) @@ -2819,6 +2835,15 @@ } /* +* If we're ignoring the PHY then autoneg and the internal +* transciever are really not going to work so don't let the +* user select them. +*/ + if (np->ignore_phy && (ecmd->autoneg == AUTONEG_ENABLE || + ecmd
[patch 2/2] natsemi: Support Aculab E1/T1 PMXc cPCI carrier cards
Aculab E1/T1 PMXc cPCI carrier cards present a natsemi on the cPCI bus
with an oversized EEPROM using a direct MII<->MII connection with no
PHY. This patch adds a new device table entry supporting these cards.

Signed-Off-By: Mark Brown <[EMAIL PROTECTED]>
---
This revision removes extra braces from the previous version.

Index: linux/drivers/net/natsemi.c
===
--- linux.orig/drivers/net/natsemi.c2007-02-19 10:16:50.0 +
+++ linux/drivers/net/natsemi.c 2007-02-19 10:18:25.0 +
@@ -244,6 +244,9 @@
 MII_EN_SCRM = 0x0004, /* enable scrambler (tp) */
 };

+enum {
+ NATSEMI_FLAG_IGNORE_PHY = 0x1,
+};

 /* array of board data directly indexed by pci_tbl[x].driver_data */
 static const struct {
@@ -251,10 +254,12 @@
 unsigned long flags;
 unsigned int eeprom_size;
 } natsemi_pci_info[] __devinitdata = {
+ { "Aculab E1/T1 PMXc cPCI carrier card", NATSEMI_FLAG_IGNORE_PHY, 128 },
 { "NatSemi DP8381[56]", 0, 24 },
 };

 static const struct pci_device_id natsemi_pci_tbl[] __devinitdata = {
+ { PCI_VENDOR_ID_NS, 0x0020, 0x12d9, 0x000c, 0, 0, 0 },
 { PCI_VENDOR_ID_NS, 0x0020, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 },
 { } /* terminate list */
 };
@@ -811,7 +816,10 @@
 np->hands_off = 0;
 np->intr_status = 0;
 np->eeprom_size = natsemi_pci_info[chip_idx].eeprom_size;
- np->ignore_phy = 0;
+ if (natsemi_pci_info[chip_idx].flags & NATSEMI_FLAG_IGNORE_PHY)
+ np->ignore_phy = 1;
+ else
+ np->ignore_phy = 0;

 /* Initial port:
 * - If configured to ignore the PHY set up for external.

-- 
"You grabbed my hand and we fell into it, like a daydream - or a fever."
[patch 0/2] natsemi: Support Aculab E1/T1 cPCI carrier cards
These patches add support for the Aculab E1/T1 cPCI carrier card to the
natsemi driver. The first patch provides support for using the MII port
with no PHY and the second adds the quirks required to detect and
configure the card.

This revision should address the issues raised by Jeff over the weekend.
Apologies if I've missed anything.

-- 
"You grabbed my hand and we fell into it, like a daydream - or a fever."
Re: Re: Strange connection slowdown on pcnet32
On Fri, Feb 16, 2007 at 04:01:57PM -0500, Lennart Sorensen wrote:
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: pcnet32_poll: pcnet32_rx() got 16 packets
> eth1: base: 0x05215812 status: 0310 next->status: 0310
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: netif_receive_skb(skb)
> eth1: pcnet32_poll: pcnet32_rx() got 16 packets
> eth1: base: 0x04c51812 status: 8000 next->status: 0310
> eth1: pcnet32_poll: pcnet32_rx() got 0 packets
> eth1: interrupt csr0=0x6f3 new csr=0x33, csr3=0x.
> eth1: exiting interrupt, csr0=0x0033, csr3=0x5f00.
> eth1: base: 0x04c51812 status: 8000 next->status: 0310
> eth1: pcnet32_poll: pcnet32_rx() got 0 packets
> eth1: interrupt csr0=0x4f3 new csr=0x33, csr3=0x.
> eth1: exiting interrupt, csr0=0x0033, csr3=0x5f00.
> eth1: base: 0x04c51812 status: 8000 next->status: 0310
> eth1: pcnet32_poll: pcnet32_rx() got 0 packets
> eth1: interrupt csr0=0x4f3 new csr=0x33, csr3=0x.
> eth1: exiting interrupt, csr0=0x0433, csr3=0x5f00.
> eth1: base: 0x04c51812 status: 8000 next->status: 0310
> eth1: pcnet32_poll: pcnet32_rx() got 0 packets
>
> So somehow it ends up that when it reads the status of the descriptor at
> address 0x04c51812, it sees the status as 0x8000 (which means owned by
> the MAC I believe), even though the next descriptor in the ring has a
> sensible status, indicating that the descriptor is ready to be handled
> by the driver.
> Since the descriptor isn't ready, we exit without
> handling anything and NAPI reschedules us the next time we get an
> interrupt, and after some random number of tries, we finally see the
> right status and handle the packet, along with a bunch of other packets
> waiting in the descriptor ring. Then we seem to hit the exact same
> descriptor address again, with the same problem in the status we read,
> and again we are stuck for a while, until finally we see the right
> status, and another pile of packets get handled, and we again hit the
> same descriptor address and get stuck.

I have been poking at things with firescope to see if the MAC is
actually writing to system memory or not. The entry that it gets stuck
on is _always_ entry 0 in the rx_ring. There does not appear to be any
exceptions to this.

Here is my firescope (slightly modified for this purpose) dump of the
rx_ring of eth1:

Descriptor:Address: /--base---\ /buf\ /sta\ /-message-\ /reserved-\
          :       : |         | |len| |tus| | length  | |         |
RXdesc[00]:6694000: 12 18 5f 05 fa f9 00 80 40 00 00 00 00 00 00 00
RXdesc[01]:6694010: 12 78 15 05 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[02]:6694020: 12 a0 52 06 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[03]:6694030: 12 f8 c2 04 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[04]:6694040: 12 70 15 05 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[05]:6694050: 12 e8 37 05 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[06]:6694060: 12 e0 37 05 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[07]:6694070: 12 e8 d5 04 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[08]:6694080: 12 e0 d5 04 fa f9 40 03 ee 05 00 00 00 00 00 00
RXdesc[09]:6694090: 12 d8 d1 05 fa f9 40 03 46 00 00 00 00 00 00 00
RXdesc[10]:66940a0: 12 d0 d1 05 fa f9 40 03 4e 00 00 00 00 00 00 00
RXdesc[11]:66940b0: 12 d8 02 05 fa f9 10 03 40 00 00 00 00 00 00 00
RXdesc[12]:66940c0: 12 d0 02 05 fa f9 40 03 46 00 00 00 00 00 00 00
RXdesc[13]:66940d0: 12 38 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[14]:66940e0: 12 30 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[15]:66940f0: 12 78 2c 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[16]:6694100: 12 a0 58 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[17]:6694110: 12 b0 04 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[18]:6694120: 12 b8 04 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[19]:6694130: 12 70 2c 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[20]:6694140: 12 f8 56 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[21]:6694150: 12 c8 29 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[22]:6694160: 12 20 03 05 fa f9 00 80 ee 05 00 00 00 00 00 00
RXdesc[23]:6694170: 12 60 4c 05 fa f9 00 80 87 05 00 00 00 00 00 00
RXdesc[24]:6694180: 12 98 53 05 fa f9 00 80 40 00 00 00 00 00 00 00
RXdesc[25]:6694190: 12 b0 cc 04 fa f9 00 80 40 00 00 00 00 00 00 00
RXdesc[26]:66941a0: 12 a8 3f 05 fa f9 00 80 40 00 00 00 00 00 00 00
RXdesc[27]:66941b0: 12 58 e8 04 fa f9 00 80 40 00 00 00 00 00 00 00
RXdesc[28]:66941c0: 12 b0 4d 06 fa f9 00 80 40 00 00 00 00 00 00 00
RXdesc[29]:66941d0: 12 38 ef 04 fa f9 00 80 40 00 00 00 00 00 00 00
RXdesc[30]:66941e0: 12 98 1f 05 fa f9 00 80 40 00
Re: MediaGX/GeodeGX1 requires X86_OOSTORE.
On Mon, Feb 19, 2007 at 11:48:27AM -0800, Roland Dreier wrote:
> > Does anyone know if there is any way to flush a cache line of the cpu to
> > force rereading system memory for a given address or address range?
>
> There is the "clflush" instruction, but not all x86 CPUs support it.
> You need to check the CPUID flag to know for sure (/proc/cpuinfo will
> show a "clflush" flag if it is supported).

Well I will check for that. Of course it is still possible that it is
actually the network chip screwing up somehow.

-- 
Len Sorensen
Re: MediaGX/GeodeGX1 requires X86_OOSTORE.
> Does anyone know if there is any way to flush a cache line of the cpu to
> force rereading system memory for a given address or address range?

There is the "clflush" instruction, but not all x86 CPUs support it.
You need to check the CPUID flag to know for sure (/proc/cpuinfo will
show a "clflush" flag if it is supported).
Re: Extensible hashing and RCU
On 19 Feb 2007 13:04:12 +0100, Andi Kleen <[EMAIL PROTECTED]> wrote: LRU tends to be hell for caches in MP systems, because it writes to the cache lines too and makes them exclusive and more expensive. That's why you let the hardware worry about LRU. You don't write to the upper layers of the splay tree when you don't have to. It's the mere traversal of the upper layers that keeps them in cache, causing the cache hierarchy to mimic the data structure hierarchy. RCU changes the whole game, of course, because you don't write to the old copy at all; you have to clone the altered node and all its ancestors and swap out the root node itself under a spinlock. Except you don't use a spinlock; you have a ring buffer of root nodes and atomically increment the writer index. That atomically incremented index is the only thing on which there's any write contention. (Obviously you need a completion flag on the new root node for the next writer to poll on, so the sequence is atomic-increment ... copy and alter from leaf to root ... wmb() ... mark new root complete.) When you share TCP sessions among CPUs, and packets associated with the same session may hit softirq in any CPU, you are going to eat a lot of interconnect bandwidth keeping the sessions coherent. (The only way out of this is to partition the tuple space by CPU at the NIC layer with separate per-core, or perhaps per-cache, receive queues; at which point the NIC is so smart that you might as well put the DDoS handling there.) But at least it's cache coherency protocol bandwidth and not bandwidth to and from DRAM, which has much nastier latencies. The only reason the data structure matters _at_all_ is that DDoS attacks threaten to evict the working set of real sessions out of cache. That's why you add new sessions at the leaves and don't rotate them up until they're hit a second time. Of course the leaf layer can't be RCU, but it doesn't have to be; it's just a bucket of tuples. 
You need an auxiliary structure to hold the session handshake trackers
for the leaf layer, but you assume that you're always hitting cold cache
when diving into this structure and ration accesses accordingly. Maybe
you even explicitly evict entries from cache after sending the SYNACK,
so they don't crowd other stuff out; they go to DRAM and get pulled into
the new CPU (and rotated up) if and when the next packet in the session
arrives. (I'm assuming T/TCP here, so you can't skimp much on session
tracker size during the handshake.)

Every software firewall I've seen yet falls over under DDoS. If you want
to change that, you're going to need more than the back-of-the-napkin
calculations that show that session lookup bandwidth exceeds frame
throughput for min-size packets. You're going to need to strategize
around exploiting the cache hierarchy already present in your commodity
processor to implicitly partition real traffic from the DDoS storm. It's
not a trivial problem, even in the mathematician's sense (in which all
problems are either trivial or unsolved).

Cheers,
- Michael
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 01:26:42PM -0500, Benjamin LaHaise wrote:
> On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet wrote:
> > So even with a lazy hash function, 89 % of lookups are satisfied with less
> > than 6 compares.
>
> Which sucks, as those are typically going to be cache misses (costing many
> hundreds of cpu cycles). Hash chains fare very poorly under DoS conditions,
> and must be removed under a heavy load. Worst case handling is very
> important next to common case.

I should clarify. Back of the napkin calculations show that there are
only 157 cycles on a 3GHz processor in which to decide what happens to a
packet, which means 1 cache miss is more than enough. In theory we can
get pretty close to line rate with quad core processors, but it
definitely needs some of the features that newer chipsets have for
stuffing packets directly into the cache. I would venture a guess that
we also need to intelligently partition packets so that we make the most
use of available cache resources.

-ben

-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet wrote: > So even with a lazy hash function, 89% of lookups are satisfied with less > than 6 compares. Which sucks, as those are typically going to be cache misses (costing many hundreds of cpu cycles). Hash chains fare very poorly under DoS conditions, and must be removed under a heavy load. Worst case handling is very important next to common case. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>.
Re: Extensible hashing and RCU
On Monday 19 February 2007 16:14, Eric Dumazet wrote: > > Because O(1) is different from O(log(N)) ? > if N = 2^20, it certainly makes a difference. > Yes, 1% of chains might have a length > 10, but yet 99% of the lookups are > touching less than 4 cache lines. > With a binary tree, log2(2^20) is 20. Or maybe not? If you tell me it's 4, > I will be very pleased.

Here is the tcp ehash chain length distribution on a real server :

ehash_addr=0x81047600 ehash_size=1048576
333835 used chains, 3365 used twchains

Distribution of sockets/chain length
[chain length]:number of chains  (cumulative % of sockets)
[1]:221019   37.4645%
[2]:56590    56.6495%
[3]:21250    67.4556%
[4]:12534    75.9541%
[5]:8677     83.3082%
[6]:5862     89.2701%
[7]:3640     93.5892%
[8]:2219     96.5983%
[9]:1083     98.2505%
[10]:539     99.1642%
[11]:244     99.6191%
[12]:112     99.8469%
[13]:39      99.9329%
[14]:16      99.9708%
[15]:6       99.9861%
[16]:3       99.9942%
[17]:2       100%
total : 589942 sockets

So even with a lazy hash function, 89% of lookups are satisfied with less than 6 compares.
Re: Extensible hashing and RCU
On Monday 19 February 2007 15:25, Evgeniy Polyakov wrote: > On Mon, Feb 19, 2007 at 03:14:02PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > > > Forget about cache misses and cache lines - we have a hash table, only > > > part of which is used (part for time-wait sockets, part for established > > > ones). > > > > No, you did not read my mail. Current ehash is not as described by you. > > I did. > And I also said that my tests do not have timewait sockets at all - I > removed sk_for_each and so on, which should effectively increase lookup > time twice on a busy system with lots of created/removed sockets per > timeframe (that is theory from my side already). > Anyway, I ran the same test with an increased table too. > > > > Anyway, even with 2^20 (i.e. when the whole table is only used for > > > established sockets) search time is about 360-370 nsec on 3.7 GHz Core > > > Duo (only one CPU is used) with 2 GB of ram. > > > > Your tests are user land, so unfortunately are biased... > > > > (Unless you use hugetlb data ?) > > No I do not. But the same can be applied to the trie test - it is also > performed in userspace and thus suffers from possible swapping/cache > flushing and so on. > > And I doubt moving the test into the kernel will suddenly end up with 10 times > increased rates. At least some architectures pay a high price using vmalloc() instead of kmalloc(), and TLB misses mean something for them. Not everybody has the latest Intel cpu. Normally, the ehash table is using huge pages. > > Anyway, the trie test (broken implementation) is two times slower than the hash > table (resized already), and it does not include locking issues of the > hash table access and further scalability issues. > You mix apples and oranges. We already know locking has nothing to do with hashing or trie-ing. We *can* put RCU on top of the existing ehash. We also can add hash resizing if we really care.
> I think I need to fix my trie implementation to fully show its > potential, but the original question was why a tree/trie based implementation > is not considered at all although it promises better performance and > scalability. Because you mix performance and scalability. That's not exactly the same thing. Sometimes, high performance means *suboptimal* scalability. Because O(1) is different from O(log(N)) ? if N = 2^20, it certainly makes a difference. Yes, 1% of chains might have a length > 10, but yet 99% of the lookups are touching less than 4 cache lines. With a binary tree, log2(2^20) is 20. Or maybe not? If you tell me it's 4, I will be very pleased.
Re: [PATCH] - drivers/net/hamradio remove local random function, use random32()
On Fri, 2007-02-16 at 09:42 -0800, Joe Perches wrote: > Signed-off-by: Joe Perches <[EMAIL PROTECTED]> Acked-By: Thomas Sailer <[EMAIL PROTECTED]> Thanks a lot! Tom
Re: MediaGX/GeodeGX1 requires X86_OOSTORE.
On Sat, Feb 17, 2007 at 11:11:13PM +0900, takada wrote: > does that mean it doesn't help even without calling set_cx86_reorder()? > this function disables reordering in the range 0x4000: to 0x:. > does pcnet32 access anything outside that range? > > --- arch/i386/Kconfig.cpu~2007-02-05 03:44:54.0 +0900 > +++ arch/i386/Kconfig.cpu 2007-02-17 21:25:52.0 +0900 > @@ -322,7 +322,7 @@ config X86_USE_3DNOW > > config X86_OOSTORE > bool > - depends on (MWINCHIP3D || MWINCHIP2 || MWINCHIPC6) && MTRR > + depends on (MWINCHIP3D || MWINCHIP2 || MWINCHIPC6) && MTRR || MGEODEGX1 > default y > > config X86_TSC Well it turns out that enabling OOSTORE doesn't eliminate the problem, but it does make it go from occurring within seconds to occurring within many hours. I am off to investigate some more. Does anyone know if there is any way to flush a cache line of the cpu to force rereading system memory for a given address or address range? -- Len Sorensen
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 03:14:02PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > > Forget about cache misses and cache lines - we have a hash table, only > > part of which is used (part for time-wait sockets, part for established > > ones). > > No, you did not read my mail. Current ehash is not as described by you. I did. And I also said that my tests do not have timewait sockets at all - I removed sk_for_each and so on, which should effectively increase lookup time twice on a busy system with lots of created/removed sockets per timeframe (that is theory from my side already). Anyway, I ran the same test with an increased table too. > > Anyway, even with 2^20 (i.e. when the whole table is only used for > established sockets) search time is about 360-370 nsec on 3.7 GHz Core > Duo (only one CPU is used) with 2 GB of ram. > > Your tests are user land, so unfortunately are biased... > > (Unless you use hugetlb data ?) No I do not. But the same can be applied to the trie test - it is also performed in userspace and thus suffers from possible swapping/cache flushing and so on. And I doubt moving the test into the kernel will suddenly end up with 10 times increased rates. Anyway, the trie test (broken implementation) is two times slower than the hash table (resized already), and it does not include locking issues of the hash table access and further scalability issues. I think I need to fix my trie implementation to fully show its potential, but the original question was why a tree/trie based implementation is not considered at all although it promises better performance and scalability. -- Evgeniy Polyakov
Re: [Bug 8013] New: select for write hangs on a socket after write returned ECONNRESET
On 17-02-2007 17:25, Evgeniy Polyakov wrote: > On Fri, Feb 16, 2007 at 09:34:27PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: >> Otherwise we can extend the select output mask to include hangup too >> (taking into account that hangup is actually an output event). > > This is another possible way to fix select after write after connection > reset. I hope you know what you are doing and that this will change functionality for some users. In my opinion it looks like a problem of interpretation and not a bug. From tcp.c: " * Some poll() documentation says that POLLHUP is incompatible * with the POLLOUT/POLLWR flags, so somebody should check this * all. But careful, it tends to be safer to return too many * bits than too few, and you can easily break real applications * if you don't tell them that something has hung up! ... * Actually, it is interesting to look how Solaris and DUX * solve this dilemma. I would prefer, if PULLHUP were maskable, * then we could set it on SND_SHUTDOWN. BTW examples given * in Stevens' books assume exactly this behaviour, it explains * why PULLHUP is incompatible with POLLOUT.--ANK * * NOTE. Check for TCP_CLOSE is added. The goal is to prevent * blocking on fresh not-connected or disconnected socket. --ANK */" So it seems ANK hesitated and somebody chose not to do this - maybe for some reason... Regards, Jarek P.
Re: Extensible hashing and RCU
On Monday 19 February 2007 14:56, Evgeniy Polyakov wrote: > On Mon, Feb 19, 2007 at 02:38:13PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > > On Monday 19 February 2007 12:41, Evgeniy Polyakov wrote: > > > > 1 microsecond ? Are you kidding ? We want no more than 50 ns. > > > > > > Theory again. > > > > Theory is nice, but I personally prefer oprofile :) > > I base my comments on real facts. > > We *want* 50 ns tcp lookups (2 cache line misses, one with reader intent, > > one for exclusive access intent) > > I said that your words are theory in previous mails :) > > Current code works 10 times worse than you expect. > > > > Existing table does not scale that good - I created (1<<20)/2 (to cover > > > only the established part) entries table and filled it with 1 million of > > > random entries - search time is about half of a microsecond. > > > > I use exactly 2^20 slots, not 2^19 (see commit > > dbca9b2750e3b1ee6f56a616160ccfc12e8b161f , where I changed the layout of the > > ehash table so that two chains (established/timewait) are on the same > > cache line. every cache miss *counts*) > > Forget about cache misses and cache lines - we have a hash table, only > part of which is used (part for time-wait sockets, part for established > ones). No, you did not read my mail. Current ehash is not as described by you. > > Anyway, even with 2^20 (i.e. when the whole table is only used for > established sockets) search time is about 360-370 nsec on 3.7 GHz Core > Duo (only one CPU is used) with 2 GB of ram. Your tests are user land, so unfortunately are biased... (Unless you use hugetlb data ?)
Re: Extensible hashing and RCU
Actually for socket code any other binary tree will work perfectly ok - socket code does not have wildcards (except listening sockets), so it is possible to combine all values into one search key used in a flat one-dimensional tree - it scales as hell and still allows very fast lookups. As for cache usage - such trees can be combined with different protocols to increase cache locality. The only reason I implemented a trie is that netchannels support wildcards, that is how netfilter is implemented on top of them. A tree with lazy deletion (i.e. without deletion at all) can be moved to RCU very easily. -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 02:38:13PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > On Monday 19 February 2007 12:41, Evgeniy Polyakov wrote: > > > > 1 microsecond ? Are you kidding ? We want no more than 50 ns. > > > > Theory again. > > Theory is nice, but I personally prefer oprofile :) > I base my comments on real facts. > We *want* 50 ns tcp lookups (2 cache line misses, one with reader intent, one > for exclusive access intent) I said that your words are theory in previous mails :) Current code works 10 times worse than you expect. > > Existing table does not scale that good - I created (1<<20)/2 (to cover > > only the established part) entries table and filled it with 1 million of random > > entries - search time is about half of a microsecond. > > I use exactly 2^20 slots, not 2^19 (see commit > dbca9b2750e3b1ee6f56a616160ccfc12e8b161f , where I changed the layout of the ehash > table so that two chains (established/timewait) are on the same cache line. > every cache miss *counts*) Forget about cache misses and cache lines - we have a hash table, only part of which is used (part for time-wait sockets, part for established ones). Anyway, even with 2^20 (i.e. when the whole table is only used for established sockets) search time is about 360-370 nsec on 3.7 GHz Core Duo (only one CPU is used) with 2 GB of ram. > http://www.mail-archive.com/netdev@vger.kernel.org/msg31096.html > > (Of course, you may have to change MAX_ORDER to 14 or else the hash table hits > the MAX_ORDER limit) > > Search time under 100 ns, for real traffic (kind of random... but not quite) > Most of this time is taken by the rwlock, so expect 50 ns once RCU is finally > in... My experiment shows almost 400 nsecs without _any_ locks - they are removed completely - it is pure hash selection/list traverse time. > In your tests, please make sure a User process is actually doing real work on > each CPU, ie evicting cpu caches every ms...
> > The rule is : On a normal machine, cpu caches contain UserMode data, not > kernel data. (as a typical machine spends 15% of its cpu time in kernel land, > and 85% in User land). You can assume kernel text is in cache, but even this > assumption may be wrong. In my tests _only_ hash tables are in memory (well with some bits of other stuff) - I use exactly the same approach for both trie and hash table tests - the table/trie is allocated, filled, and lookup of random values is performed in a loop. It is done in userspace - I just moved list.h, inet_hashtables.h and other needed files into a separate project and compiled them (with removed locks, atomic operations and other pure kernel stuff). So actual time is even higher for the hash table - at least it requires locks while the trie implementation works with RCU. -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Monday 19 February 2007 12:41, Evgeniy Polyakov wrote: > > 1 microsecond ? Are you kidding ? We want no more than 50 ns. > > Theory again. Theory is nice, but I personally prefer oprofile :) I base my comments on real facts. We *want* 50 ns tcp lookups (2 cache line misses, one with reader intent, one for exclusive access intent) > > Existing table does not scale that good - I created (1<<20)/2 (to cover > only the established part) entries table and filled it with 1 million of random > entries - search time is about half of a microsecond. I use exactly 2^20 slots, not 2^19 (see commit dbca9b2750e3b1ee6f56a616160ccfc12e8b161f , where I changed the layout of the ehash table so that two chains (established/timewait) are on the same cache line. every cache miss *counts*) http://www.mail-archive.com/netdev@vger.kernel.org/msg31096.html (Of course, you may have to change MAX_ORDER to 14 or else the hash table hits the MAX_ORDER limit) Search time under 100 ns, for real traffic (kind of random... but not quite) Most of this time is taken by the rwlock, so expect 50 ns once RCU is finally in... In your tests, please make sure a User process is actually doing real work on each CPU, ie evicting cpu caches every ms... The rule is : On a normal machine, cpu caches contain UserMode data, not kernel data. (as a typical machine spends 15% of its cpu time in kernel land, and 85% in User land). You can assume kernel text is in cache, but even this assumption may be wrong.
Re: Extensible hashing and RCU
Andi Kleen writes: > > If not, you lose. > > It all depends on if the higher levels of the trie are small > enough to be kept in cache. Even with two cache misses it might > still break even, but have better scalability. Yes, the trick is to keep the root large to allow a very flat tree and few cache misses. Stefan Nilsson (author of LC-trie) and I were able to improve the LC-trie quite a bit; we called this trie+hash -> trash. The paper discusses trie/hash... (you've seen it) http://www.nada.kth.se/~snilsson/public/papers/trash/ > Another advantage would be to eliminate the need for large memory > blocks, which cause problems too e.g. on NUMA. It certainly would > save quite some memory if the tree levels are allocated on demand > only. However breaking it up might also cost more TLB misses, > but those could be eliminated by preallocating the tree in > the same way as the hash today. Don't know if it's needed or not. > > I guess someone needs to code it up and try it. I've implemented trie/trash as a replacement for the dst hash with full key lookup for ipv4 (unicache) to start with. It is still focusing on the nasty parts, packet forwarding, as we don't want to break this. So the benefits of full flow lookup are not accounted for. I.e. the full flow lookup could give the socket at no cost and do some conntrack support like Evgeniy did in the netchannels patches. Below, some recent comparisons and profiles for packet forwarding. Input: 2 * 65k concurrent flows, eth0->eth1, eth2->eth3 in forwarding, on separate CPUs. Opteron 2218 (2.6 GHz), net-2.6.21 git. Numbers are very approximative but should still be representative. Profiles are collected.
Performance comparison
Table below holds: dst-entries in use, lookup hits + slow path = total kpps

Flowlen 40
250k  1020 + 21  = 1041 kpps   Vanilla rt_hash=32k
1M     950 + 29  =  979 kpps   Vanilla rt_hash=131k
260k   980 + 24  = 1004 kpps   Unicache

Flowlen 4 (rdos)
290k   560 + 162 =  722 kpps   Vanilla rt_hash=32k
1M     400 + 165 =  565 kpps   Vanilla rt_hash=131k
230k   570 + 170 =  740 kpps   Unicache

unicache flen=4 pkts
c02df84f 5257 7.72078  tkey_extract_bits
c023151a 5230 7.68112  e1000_clean_rx_irq
c02df908 3306 4.85541  tkey_equals
c014cf31 3296 4.84072  kfree
c02f8c3b 3067 4.5044   ip_route_input
c02fbdf0 2948 4.32963  ip_forward
c023024e 2809 4.12548  e1000_xmit_frame
c02e06f1 2792 4.10052  trie_lookup
c02fd764 2159 3.17085  ip_output
c032591c 1965 2.88593  fn_trie_lookup
c014cd82 1456 2.13838  kmem_cache_alloc
c02fa941 1337 1.96361  ip_rcv
c014ced0 1334 1.9592   kmem_cache_free
c02e1538 1289 1.89311  unicache_tcp_establish
c02e2d70 1218 1.78884  dev_queue_xmit
c02e31af 1074 1.57735  netif_receive_skb
c02f8484 1053 1.54651  ip_route_input_slow
c02db552  987 1.44957  __alloc_skb
c02e626f  913 1.34089  dst_alloc
c02edaad  828 1.21606  __qdisc_run
c0321ccf  810 1.18962  fib_get_table
c02e14c1  782 1.1485   match_pktgen
c02e6375  766 1.125    dst_destroy
c02e10e8  728 1.06919  unicache_hash_code
c0231242  647 0.950227 e1000_clean_tx_irq
c02f7d23  625 0.917916 ipv4_dst_destroy

unicache flen=40 pkts
c023151a 6742 10.3704  e1000_clean_rx_irq
c02df908 4553 7.00332  tkey_equals
c02fbdf0 4455 6.85258  ip_forward
c02e06f1 4067 6.25577  trie_lookup
c02f8c3b 3951 6.07734  ip_route_input
c02df84f 3929 6.0435   tkey_extract_bits
c023024e 3538 5.44207  e1000_xmit_frame
c014cf31 3152 4.84834  kfree
c02fd764 2711 4.17     ip_output
c02e1538 1930 2.96868  unicache_tcp_establish
c02fa941 1696 2.60875  ip_rcv
c02e31af 1466 2.25497  netif_receive_skb
c02e2d70 1412 2.17191  dev_queue_xmit
c014cd82 1397 2.14883  kmem_cache_alloc
c02db552 1394 2.14422  __alloc_skb
c02edaad 1032 1.5874   __qdisc_run
c02ed6b8  957 1.47204  eth_header
c02e15dd  904 1.39051  unicache_garbage_collect_active
c02db94e  861 1.32437  kfree_skb
c0231242  794 1.22131  e1000_clean_tx_irq
c022fd58  778 1.1967   e1000_tx_map
c014ce73  756 1.16286  __kmalloc
c014ced0  740 1.13825  kmem_cache_free
c02e14c1  701 1.07826  match_pktgen
c023002c  621 0.955208 e1000_tx_queue
c02e78fa  519 0.798314 neigh_resolve_output

Vanilla w. flen=4 pkts rt_hash=32k
c02f6852 15704 22.9102 ip_route_input
c023151a  5324 7.76705 e1000_clean_rx_irq
c02f84a1  4457 6.5022  ip_rcv
c02f9950  3065 4.47145 ip_forward
c023024e  2630 3.83684 e1000_xmit_frame
c0323380  2343 3.41814 fn_trie_lookup
c02fb2c4  2181 3.1818  ip_output
c02f4a3b  1839 2.68287 rt_intern_hash
c02f4480  1762 2.57054 rt_may_expire
c02f60
Re: [PATCH 3/4] 8139too: RTNL and flush_scheduled_work deadlock
On Fri, Feb 16, 2007 at 09:20:34PM +0100, Francois Romieu wrote: > Jarek Poplawski <[EMAIL PROTECTED]> : ... > > > @@ -1603,18 +1605,21 @@ static void rtl8139_thread (struct work_struct *work) > > > struct net_device *dev = tp->mii.dev; > > > unsigned long thr_delay = next_tick; > > > > > > + rtnl_lock(); > > > + > > > + if (!netif_running(dev)) > > > + goto out_unlock; > > > > I wonder why you don't do netif_running before > > rtnl_lock ? It's an atomic operation. And I'm not sure if increasing > > the rtnl_lock range is really needed here. > > threadA: netif_running() > user task B: rtnl_lock() > user task B: dev->close() > user task B: rtnl_unlock() > threadA: rtnl_lock() > threadA: mess with closed device > > Btw, the thread runs every 3*HZ at most. You are right (mostly)! But I think rtnl_lock is special and should be spared (even this 3*HZ) and here it's used for a mainly internal purpose (close synchronization). And it looks like the holding of rtnl_lock is increased mainly for this internal reason. And because rtnl_lock is quite popular, you have to take into consideration that after this 3*HZ it could spend some time waiting for the lock. So maybe it would be nicer to check netif_running twice (again after rtnl_lock where needed), but maybe it's a matter of taste only, and yours is better, as well. (Btw. I didn't verify this, but I hope you checked that places not under rtnl_lock before the patch are safe from locking problems now.) Jarek P.
Re: Extensible hashing and RCU
On Sun, Feb 18, 2007 at 09:21:30PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > Evgeniy Polyakov a écrit : > >On Sun, Feb 18, 2007 at 07:46:22PM +0100, Eric Dumazet > >([EMAIL PROTECTED]) wrote: > >>>Why does anyone not want to use a trie - for socket-like loads it has > >>>exactly constant search/insert/delete time and scales as hell. > >>> > >>Because we want to be *very* fast. You cannot beat a hash table. > >> > >>Say you have 1.000.000 tcp connections, with 50.000 incoming packets per > >>second to *random* streams... > > > >What is really good in a trie is that you may have up to 2^32 connections > >without _any_ difference in lookup performance of random streams. > > So are you speaking of one memory cache miss per lookup ? > If not, you lose. With a trie a big part of it _does_ live in cache, compared to a hash table where similar addresses end up in completely different hash entries. > >>With a 2^20 hashtable, a lookup uses one cache line (the hash head > >>pointer) plus one cache line to get the socket (you need it to access its > >>refcounter) > >> > >>Several attempts were done in the past to add RCU to the ehash table (last > >>done by Benjamin LaHaise last March). I believe this was delayed a bit, > >>because David would like to be able to resize the hash table... > > > >This is a theory. > > Not theory, but actual practice, on a real machine. > > # cat /proc/net/sockstat > sockets: used 918944 > TCP: inuse 925413 orphan 7401 tw 4906 alloc 926292 mem 304759 > UDP: inuse 9 > RAW: inuse 0 > FRAG: inuse 9 memory 18360 Theory is speculation about performance. A highly cache-optimized bubble sort is still much worse than a cache-unoptimized binary tree. > >Practice includes the cost of hashing, locking, and list traversal > >(each pointer is in its own cache line btw, which must be fetched) and plus > >the same for time wait sockets (if we are unlucky).
> > > >No need to talk about the price of a cache miss when there might be more > >serious problems - for example the length of the linked list to traverse each > >time a new packet is received. > > > >For example lookup time in a trie with 1.6 million random 3-dimensional > >32bit (saddr/daddr/ports) entries is about 1 microsecond on an amd athlon64 > >3500 cpu (test was run in a userspace emulator though). > > 1 microsecond ? Are you kidding ? We want no more than 50 ns. Theory again. The existing table does not scale that good - I created a (1<<20)/2 (to cover only the established part) entries table and filled it with 1 million random entries - search time is about half of a microsecond. Wanna see the code? I copied the Linux hash table magic into userspace and ran the same inet_hash() and inet_lookup() in a loop. Result above. Trie is still 2 times worse, but I've just found a bug in my implementation. -- Evgeniy Polyakov
[PATCH 9/18] [TCP] FRTO: Response should reset also snd_cwnd_cnt
Since the purpose is to reduce CWND, we prevent immediate growth. This is not a major issue, nor is "the correct way" specified anywhere. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 2679279..9637abd 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2490,6 +2490,7 @@ static int tcp_ack_update_window(struct static void tcp_conservative_spur_to_response(struct tcp_sock *tp) { tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh); + tp->snd_cwnd_cnt = 0; tcp_moderate_cwnd(tp); } -- 1.4.2
[PATCH 10/18] [TCP]: Don't enter to fast recovery while using FRTO
Because TCP is not in Loss state during FRTO recovery, fast recovery could be triggered by accident. Non-SACK FRTO is more robust than the not yet included SACK-enhanced version (which can receive a high number of duplicate ACKs with SACK blocks during FRTO), at least with unidirectional transfers, but under extraordinary patterns fast recovery can be incorrectly triggered, e.g., data loss + ACK losses => cumulative ACK with enough SACK blocks to meet the sacked_out >= dupthresh condition. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 4 + 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9637abd..309da3e 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1547,6 +1547,10 @@ static int tcp_time_to_recover(struct so { __u32 packets_out; + /* Do not perform any recovery during FRTO algorithm */ + if (tp->frto_counter) + return 0; + /* Trick#1: The loss is proven. */ if (tp->lost_out) return 1; -- 1.4.2
[PATCH 15/18] [TCP] FRTO: Fake cwnd for ssthresh callback
TCP without FRTO would be in Loss state with small cwnd. FRTO, however, leaves cwnd (typically) to a larger value which causes ssthresh to become too large in case RTO is triggered again compared to what conventional recovery would do. Because consecutive RTOs result in only a single ssthresh reduction, RTO+cumulative ACK+RTO pattern is required to trigger this event. A large comment is included for congestion control module writers trying to figure out what CA_EVENT_FRTO handler should do because there exists a remote possibility of incompatibility between FRTO and module defined ssthresh functions. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 26 +- 1 files changed, 25 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 2c0b387..5d935b1 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1289,7 +1289,31 @@ void tcp_enter_frto(struct sock *sk) ((icsk->icsk_ca_state == TCP_CA_Loss || tp->frto_counter) && !icsk->icsk_retransmits)) { tp->prior_ssthresh = tcp_current_ssthresh(sk); - tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); + /* Our state is too optimistic in ssthresh() call because cwnd +* is not reduced until tcp_enter_frto_loss() when previous FRTO +* recovery has not yet completed. Pattern would be this: RTO, +* Cumulative ACK, RTO (2xRTO for the same segment does not end +* up here twice). +* RFC4138 should be more specific on what to do, even though +* RTO is quite unlikely to occur after the first Cumulative ACK +* due to back-off and complexity of triggering events ... +*/ + if (tp->frto_counter) { + u32 stored_cwnd; + stored_cwnd = tp->snd_cwnd; + tp->snd_cwnd = 2; + tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); + tp->snd_cwnd = stored_cwnd; + } else { + tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); + } + /* ... 
in theory, cong.control module could do "any tricks" in +* ssthresh(), which means that ca_state, lost bits and lost_out +* counter would have to be faked before the call occurs. We +* consider that too expensive, unlikely and hacky, so modules +* using these in ssthresh() must deal these incompatibility +* issues if they receives CA_EVENT_FRTO and frto_counter != 0 +*/ tcp_ca_event(sk, CA_EVENT_FRTO); } -- 1.4.2
[PATCH 18/18] [TCP] FRTO: Sysctl documentation for SACK enhanced version
The description is overly verbose to avoid ambiguity between "SACK enabled" and "SACK enhanced FRTO" Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- Documentation/networking/ip-sysctl.txt |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index a0f6842..d66777b 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -178,7 +178,10 @@ tcp_frto - BOOLEAN Enables F-RTO, an enhanced recovery algorithm for TCP retransmission timeouts. It is particularly beneficial in wireless environments where packet loss is typically due to random radio interference - rather than intermediate router congestion. + rather than intermediate router congestion. If set to 1, basic + version is enabled. 2 enables SACK enhanced FRTO, which is + EXPERIMENTAL. The basic version can be used also when SACK is + enabled for a flow through tcp_sack sysctl. tcp_keepalive_time - INTEGER How often TCP sends out keepalive messages when keepalive is enabled. -- 1.4.2
[PATCH 14/18] [TCP] FRTO: Reverse RETRANS bit clearing logic
Previously RETRANS bits were cleared on the entry to FRTO. We postpone that into tcp_enter_frto_loss, which is really the place where the clearing should be done anyway. This allows simplification of the logic from a clearing loop to clearing of the head skb only. Besides, the other changes made in the previous patches to tcp_use_frto made it impossible for the non-SACKed FRTO to be entered if anything other than the head has been rexmitted. With SACK-enhanced FRTO (and Appendix B), however, there can be a number of retransmissions in flight when RTO expires (the same thing could happen before this patchset also with non-SACK FRTO). To avoid introducing any jumpiness into the packet counting during FRTO, instead of clearing RETRANS bits from skbs during entry, do it later on. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 35 +++ 1 files changed, 23 insertions(+), 12 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 8f0aa9d..2c0b387 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1268,7 +1268,11 @@ int tcp_use_frto(struct sock *sk) /* RTO occurred, but do not yet enter Loss state. Instead, defer RTO * recovery a bit and use heuristics in tcp_process_frto() to detect if - * the RTO was spurious. + * the RTO was spurious. Only clear SACKED_RETRANS of the head here to + * keep retrans_out counting accurate (with SACK F-RTO, other than head + * may still have that bit set); TCPCB_LOST and remaining SACKED_RETRANS + * bits are handled if the Loss state is really to be entered (in + * tcp_enter_frto_loss). * * Do like tcp_enter_loss() would; when RTO expires the second time it * does: @@ -1289,17 +1293,13 @@ void tcp_enter_frto(struct sock *sk) tcp_ca_event(sk, CA_EVENT_FRTO); } - /* Have to clear retransmission markers here to keep the bookkeeping -* in shape, even though we are not yet in Loss state. -* If something was really lost, it is eventually caught up -* in tcp_enter_frto_loss. 
-*/ - tp->retrans_out = 0; tp->undo_marker = tp->snd_una; tp->undo_retrans = 0; - sk_stream_for_retrans_queue(skb, sk) { + skb = skb_peek(&sk->sk_write_queue); + if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) { TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS; + tp->retrans_out -= tcp_skb_pcount(skb); } tcp_sync_left_out(tp); @@ -1313,7 +1313,7 @@ void tcp_enter_frto(struct sock *sk) * which indicates that we should follow the traditional RTO recovery, * i.e. mark everything lost and do go-back-N retransmission. */ -static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments) +static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments, int flag) { struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; @@ -1322,10 +1322,21 @@ static void tcp_enter_frto_loss(struct s tp->sacked_out = 0; tp->lost_out = 0; tp->fackets_out = 0; + tp->retrans_out = 0; sk_stream_for_retrans_queue(skb, sk) { cnt += tcp_skb_pcount(skb); - TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST; + /* +* Count the retransmission made on RTO correctly (only when +* waiting for the first ACK and did not get it)... +*/ + if ((tp->frto_counter == 1) && !(flag&FLAG_DATA_ACKED)) { + tp->retrans_out += tcp_skb_pcount(skb); + /* ...enter this if branch just for the first segment */ + flag |= FLAG_DATA_ACKED; + } else { + TCP_SKB_CB(skb)->sacked &= ~(TCPCB_LOST|TCPCB_SACKED_RETRANS); + } if (!(TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_ACKED)) { /* Do not mark those segments lost that were @@ -2550,7 +2561,7 @@ static int tcp_process_frto(struct sock inet_csk(sk)->icsk_retransmits = 0; if (!before(tp->snd_una, tp->frto_highmark)) { - tcp_enter_frto_loss(sk, tp->frto_counter + 1); + tcp_enter_frto_loss(sk, tp->frto_counter + 1, flag); return 1; } @@ -2562,7 +2573,7 @@ static int tcp_process_frto(struct sock return 1; if (!(flag&FLAG_DATA_ACKED)) { - tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 0 : 3)); + tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 
0 : 3), flag); return 1; } -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 17/18] [TCP]: SACK enhanced FRTO
Implements the SACK-enhanced FRTO given in RFC4138 using the variant given in Appendix B. RFC4138, Appendix B: "This means that in order to declare timeout spurious, the TCP sender must receive an acknowledgment for non-retransmitted segment between SND.UNA and RecoveryPoint in algorithm step 3. RecoveryPoint is defined in conservative SACK-recovery algorithm [RFC3517]" The basic version of the FRTO algorithm can still be used when SACK is enabled. To enable the SACK-enhanced version, set the tcp_frto sysctl to 2. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 76 +++--- 1 files changed, 65 insertions(+), 11 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 356de02..3ce4019 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -100,6 +100,7 @@ #define FLAG_DATA_SACKED0x20 /* New SAC #define FLAG_ECE 0x40 /* ECE in this ACK */ #define FLAG_DATA_LOST 0x80 /* SACK detected data lossage. */ #define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/ +#define FLAG_ONLY_ORIG_SACKED 0x200 /* SACKs only non-rexmit sent before RTO */ #define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED) #define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED) @@ -110,6 +111,8 @@ #define IsReno(tp) ((tp)->rx_opt.sack_ok #define IsFack(tp) ((tp)->rx_opt.sack_ok & 2) #define IsDSack(tp) ((tp)->rx_opt.sack_ok & 4) +#define IsSackFrto() (sysctl_tcp_frto == 0x2) + #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH) /* Adapt the MSS value used to make delayed ack decision to the @@ -1159,6 +1162,18 @@ tcp_sacktag_write_queue(struct sock *sk, /* clear lost hint */ tp->retransmit_skb_hint = NULL; } + /* SACK enhanced F-RTO detection. +* Set flag if and only if non-rexmitted +* segments below frto_highmark are +* SACKed (RFC4138; Appendix B). 
+* Clearing correct due to in-order walk +*/ + if (after(end_seq, tp->frto_highmark)) { + flag &= ~FLAG_ONLY_ORIG_SACKED; + } else { + if (!(sacked & TCPCB_RETRANS)) + flag |= FLAG_ONLY_ORIG_SACKED; + } } TCP_SKB_CB(skb)->sacked |= TCPCB_SACKED_ACKED; @@ -1240,7 +1255,8 @@ #endif /* F-RTO can only be used if these conditions are satisfied: * - there must be some unsent new data * - the advertised window should allow sending it - * - TCP has never retransmitted anything other than head + * - TCP has never retransmitted anything other than head (SACK enhanced + *variant from Appendix B of RFC4138 is more robust here) */ int tcp_use_frto(struct sock *sk) { @@ -1252,6 +1268,9 @@ int tcp_use_frto(struct sock *sk) tp->snd_una + tp->snd_wnd)) return 0; + if (IsSackFrto()) + return 1; + /* Avoid expensive walking of rexmit queue if possible */ if (tp->retrans_out > 1) return 0; @@ -1328,9 +1347,18 @@ void tcp_enter_frto(struct sock *sk) } tcp_sync_left_out(tp); + /* Earlier loss recovery underway (see RFC4138; Appendix B). +* The last condition is necessary at least in tp->frto_counter case. +*/ + if (IsSackFrto() && (tp->frto_counter || + ((1 << icsk->icsk_ca_state) & (TCPF_CA_Recovery|TCPF_CA_Loss))) && + after(tp->high_seq, tp->snd_una)) { + tp->frto_highmark = tp->high_seq; + } else { + tp->frto_highmark = tp->snd_nxt; + } tcp_set_ca_state(sk, TCP_CA_Disorder); tp->high_seq = tp->snd_nxt; - tp->frto_highmark = tp->snd_nxt; tp->frto_counter = 1; } @@ -2566,6 +2594,10 @@ static void tcp_conservative_spur_to_res * Rationale: if the RTO was spurious, new ACKs should arrive from the * original window even after we transmit two new data segments. * + * SACK version: + * on first step, wait until first cumulative ACK arrives, then move to + * the second step. In second step, the next ACK decides. 
+ * * F-RTO is implemented (mainly) in four functions: * - tcp_use_frto() is used to determine if TCP is can use F-RTO * - tcp_enter_frto() prepares TCP state on RTO if F-RTO is used, it is @@ -
[PATCH 16/18] [TCP]: Prevent reordering adjustments during FRTO
To be honest, I'm not too sure how the reord stuff works in the first place but this seems necessary. When FRTO has been active, the one and only retransmission could be unnecessary but the state and sending order might not be what the sacktag code expects it to be (to work correctly). Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 5d935b1..356de02 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1224,7 +1224,8 @@ tcp_sacktag_write_queue(struct sock *sk, tp->left_out = tp->sacked_out + tp->lost_out; - if ((reord < tp->fackets_out) && icsk->icsk_ca_state != TCP_CA_Loss) + if ((reord < tp->fackets_out) && icsk->icsk_ca_state != TCP_CA_Loss && + (tp->frto_highmark && after(tp->snd_una, tp->frto_highmark))) tcp_update_reordering(sk, ((tp->fackets_out + 1) - reord), 0); #if FASTRETRANS_DEBUG > 0 -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 8/18] [TCP] FRTO: fixes fallback to conventional recovery
The FRTO detection did not account for how the ACK pattern affects the cwnd calculation of conventional recovery. This caused cwnd to be set incorrectly when the fallback became necessary. The knowledge tcp_process_frto() has about the incoming ACK is now passed on to tcp_enter_frto_loss() in the allowed_segments parameter, which gives the number of segments that must be added to packets-in-flight while calculating the new cwnd. Instead of snd_una we use FLAG_DATA_ACKED in duplicate ACK detection because RFC4138 states (in Section 2.2): If the first acknowledgment after the RTO retransmission does not acknowledge all of the data that was retransmitted in step 1, the TCP sender reverts to the conventional RTO recovery. Otherwise, a malicious receiver acknowledging partial segments could cause the sender to declare the timeout spurious in a case where data was lost. If the next ACK after RTO is a duplicate, we do not retransmit anything, which is equal to what conservative conventional recovery does in such a case. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 14 +- 1 files changed, 9 insertions(+), 5 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 5831daa..2679279 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1296,7 +1296,7 @@ void tcp_enter_frto(struct sock *sk) * which indicates that we should follow the traditional RTO recovery, * i.e. mark everything lost and do go-back-N retransmission. 
*/ -static void tcp_enter_frto_loss(struct sock *sk) +static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments) { struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; @@ -1326,7 +1326,7 @@ static void tcp_enter_frto_loss(struct s } tcp_sync_left_out(tp); - tp->snd_cwnd = tp->frto_counter + tcp_packets_in_flight(tp)+1; + tp->snd_cwnd = tcp_packets_in_flight(tp) + allowed_segments; tp->snd_cwnd_cnt = 0; tp->snd_cwnd_stamp = tcp_time_stamp; tp->undo_marker = 0; @@ -2527,6 +2527,11 @@ static void tcp_process_frto(struct sock if (flag&FLAG_DATA_ACKED) inet_csk(sk)->icsk_retransmits = 0; + if (!before(tp->snd_una, tp->frto_highmark)) { + tcp_enter_frto_loss(sk, tp->frto_counter + 1); + return; + } + /* RFC4138 shortcoming in step 2; should also have case c): ACK isn't * duplicate nor advances window, e.g., opposite dir data, winupdate */ @@ -2534,9 +2539,8 @@ static void tcp_process_frto(struct sock !(flag&FLAG_FORWARD_PROGRESS)) return; - if (tp->snd_una == prior_snd_una || - !before(tp->snd_una, tp->frto_highmark)) { - tcp_enter_frto_loss(sk); + if (!(flag&FLAG_DATA_ACKED)) { + tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 0 : 3)); return; } -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/18] [TCP] FRTO: Moved tcp_use_frto from tcp.h to tcp_input.c
In addition, removed inline. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- include/net/tcp.h| 14 +- net/ipv4/tcp_input.c | 13 + 2 files changed, 14 insertions(+), 13 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 5c472f2..572a77b 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -341,6 +341,7 @@ extern struct sock *tcp_check_req(stru extern int tcp_child_process(struct sock *parent, struct sock *child, struct sk_buff *skb); +extern int tcp_use_frto(const struct sock *sk); extern voidtcp_enter_frto(struct sock *sk); extern voidtcp_enter_loss(struct sock *sk, int how); extern voidtcp_clear_retrans(struct tcp_sock *tp); @@ -1033,19 +1034,6 @@ static inline int tcp_paws_check(const s #define TCP_CHECK_TIMER(sk) do { } while (0) -static inline int tcp_use_frto(const struct sock *sk) -{ - const struct tcp_sock *tp = tcp_sk(sk); - - /* F-RTO must be activated in sysctl and there must be some -* unsent new data, and the advertised window should allow -* sending it. -*/ - return (sysctl_tcp_frto && sk->sk_send_head && - !after(TCP_SKB_CB(sk->sk_send_head)->end_seq, - tp->snd_una + tp->snd_wnd)); -} - static inline void tcp_mib_init(void) { /* See RFC 2012 */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c5be3d0..294cb44 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1236,6 +1236,19 @@ #endif return flag; } +int tcp_use_frto(const struct sock *sk) +{ + const struct tcp_sock *tp = tcp_sk(sk); + + /* F-RTO must be activated in sysctl and there must be some +* unsent new data, and the advertised window should allow +* sending it. +*/ + return (sysctl_tcp_frto && sk->sk_send_head && + !after(TCP_SKB_CB(sk->sk_send_head)->end_seq, + tp->snd_una + tp->snd_wnd)); +} + /* RTO occurred, but do not yet enter loss state. Instead, transmit two new * segments to see from the next ACKs whether any data was really missing. * If the RTO was spurious, new ACKs should arrive. 
-- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 13/18] [TCP] FRTO: Entry is allowed only during (New)Reno like recovery
This interpretation comes from RFC4138: "If the sender implements some loss recovery algorithm other than Reno or NewReno [FHG04], the F-RTO algorithm SHOULD NOT be entered when earlier fast recovery is underway." I think the RFC means to say (especially in the light of Appendix B) that ...recovery is underway (not just fast recovery) or was underway when it was interrupted by an earlier (F-)RTO that hasn't yet been resolved (snd_una has not advanced enough). Thus, my interpretation is that whenever TCP has ever retransmitted anything other than the head, the basic version cannot be used, because then the ordering assumptions FRTO is based on do not hold. NewReno has only the head segment retransmitted at a time. Therefore, walk up to the first segment that has not been SACKed; if neither that segment nor anything before it has been retransmitted, we know for sure that nothing after the non-SACKed segment has been either. This assumption is valid because TCPCB_EVER_RETRANS does not leave holes but each non-SACKed segment is rexmitted in-order. The check for retrans_out > 1 avoids the more expensive walk through the skb list, as we can know the result beforehand: F-RTO will not be allowed. A SACKed skb can turn into a non-SACKed one only in the extremely rare case of SACK reneging; in this case we might fail to detect retransmissions of anything other than the head. To get rid of that possibility, the whole rexmit queue would have to be walked (always), or FRTO would have to be prevented when SACK reneging happens. Of course RTO should still trigger after reneging, which makes this issue even less likely to show up. And as long as the response is as conservative as it is now, nothing bad happens even then. 
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- include/net/tcp.h|2 +- net/ipv4/tcp_input.c | 25 + 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 572a77b..7fd6b77 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -341,7 +341,7 @@ extern struct sock *tcp_check_req(stru extern int tcp_child_process(struct sock *parent, struct sock *child, struct sk_buff *skb); -extern int tcp_use_frto(const struct sock *sk); +extern int tcp_use_frto(struct sock *sk); extern voidtcp_enter_frto(struct sock *sk); extern voidtcp_enter_loss(struct sock *sk, int how); extern voidtcp_clear_retrans(struct tcp_sock *tp); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 5e952f0..8f0aa9d 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1239,14 +1239,31 @@ #endif /* F-RTO can only be used if these conditions are satisfied: * - there must be some unsent new data * - the advertised window should allow sending it + * - TCP has never retransmitted anything other than head */ -int tcp_use_frto(const struct sock *sk) +int tcp_use_frto(struct sock *sk) { const struct tcp_sock *tp = tcp_sk(sk); + struct sk_buff *skb; - return (sysctl_tcp_frto && sk->sk_send_head && - !after(TCP_SKB_CB(sk->sk_send_head)->end_seq, - tp->snd_una + tp->snd_wnd)); + if (!sysctl_tcp_frto || !sk->sk_send_head || + after(TCP_SKB_CB(sk->sk_send_head)->end_seq, + tp->snd_una + tp->snd_wnd)) + return 0; + + /* Avoid expensive walking of rexmit queue if possible */ + if (tp->retrans_out > 1) + return 0; + + skb = skb_peek(&sk->sk_write_queue)->next; /* Skips head */ + sk_stream_for_retrans_queue_from(skb, sk) { + if (TCP_SKB_CB(skb)->sacked&TCPCB_RETRANS) + return 0; + /* Short-circuit when first non-SACKed skb has been checked */ + if (!(TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_ACKED)) + break; + } + return 1; } /* RTO occurred, but do not yet enter Loss state. 
Instead, defer RTO -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 11/18] [TCP] FRTO: frto_counter modulo-op converted to two assignments
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 309da3e..9fc7f66 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2551,11 +2551,11 @@ static void tcp_process_frto(struct sock if (tp->frto_counter == 1) { tp->snd_cwnd = tcp_packets_in_flight(tp) + 2; + tp->frto_counter = 2; } else /* frto_counter == 2 */ { tcp_conservative_spur_to_response(tp); + tp->frto_counter = 0; } - - tp->frto_counter = (tp->frto_counter + 1) % 3; } /* This routine deals with incoming acks, but not outgoing ones. */ -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 12/18] [TCP]: Prevent unrelated cwnd adjustment while using FRTO
FRTO controls cwnd when it still processes the ACK input or it has just reverted back to conventional RTO recovery; the normal rules apply when FRTO has reverted to standard congestion control. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 18 +++--- 1 files changed, 11 insertions(+), 7 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9fc7f66..5e952f0 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2522,7 +2522,7 @@ static void tcp_conservative_spur_to_res * to prove that the RTO is indeed spurious. It transfers the control * from F-RTO to the conventional RTO recovery */ -static void tcp_process_frto(struct sock *sk, u32 prior_snd_una, int flag) +static int tcp_process_frto(struct sock *sk, u32 prior_snd_una, int flag) { struct tcp_sock *tp = tcp_sk(sk); @@ -2534,7 +2534,7 @@ static void tcp_process_frto(struct sock if (!before(tp->snd_una, tp->frto_highmark)) { tcp_enter_frto_loss(sk, tp->frto_counter + 1); - return; + return 1; } /* RFC4138 shortcoming in step 2; should also have case c): ACK isn't @@ -2542,20 +2542,22 @@ static void tcp_process_frto(struct sock */ if ((tp->snd_una == prior_snd_una) && (flag&FLAG_NOT_DUP) && !(flag&FLAG_FORWARD_PROGRESS)) - return; + return 1; if (!(flag&FLAG_DATA_ACKED)) { tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 0 : 3)); - return; + return 1; } if (tp->frto_counter == 1) { tp->snd_cwnd = tcp_packets_in_flight(tp) + 2; tp->frto_counter = 2; + return 1; } else /* frto_counter == 2 */ { tcp_conservative_spur_to_response(tp); tp->frto_counter = 0; } + return 0; } /* This routine deals with incoming acks, but not outgoing ones. */ @@ -2569,6 +2571,7 @@ static int tcp_ack(struct sock *sk, stru u32 prior_in_flight; s32 seq_rtt; int prior_packets; + int frto_cwnd = 0; /* If the ack is newer than sent or older than previous acks * then we can probably ignore it. 
@@ -2631,15 +2634,16 @@ static int tcp_ack(struct sock *sk, stru flag |= tcp_clean_rtx_queue(sk, &seq_rtt); if (tp->frto_counter) - tcp_process_frto(sk, prior_snd_una, flag); + frto_cwnd = tcp_process_frto(sk, prior_snd_una, flag); if (tcp_ack_is_dubious(sk, flag)) { /* Advance CWND, if state allows this. */ - if ((flag & FLAG_DATA_ACKED) && tcp_may_raise_cwnd(sk, flag)) + if ((flag & FLAG_DATA_ACKED) && !frto_cwnd && + tcp_may_raise_cwnd(sk, flag)) tcp_cong_avoid(sk, ack, seq_rtt, prior_in_flight, 0); tcp_fastretrans_alert(sk, prior_snd_una, prior_packets, flag); } else { - if ((flag & FLAG_DATA_ACKED)) + if ((flag & FLAG_DATA_ACKED) && !frto_cwnd) tcp_cong_avoid(sk, ack, seq_rtt, prior_in_flight, 1); } -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/18] [TCP] FRTO: Ignore some uninteresting ACKs
Handles RFC4138 shortcoming (in step 2); it should also have case c) which ignores ACKs that are not duplicates nor advance window (opposite dir data, winupdate). Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 13 ++--- 1 files changed, 10 insertions(+), 3 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index d1e731f..5831daa 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2495,9 +2495,9 @@ static void tcp_conservative_spur_to_res /* F-RTO spurious RTO detection algorithm (RFC4138) * - * F-RTO affects during two new ACKs following RTO. State (ACK number) is kept - * in frto_counter. When ACK advances window (but not to or beyond highest - * sequence sent before RTO): + * F-RTO affects during two new ACKs following RTO (well, almost, see inline + * comments). State (ACK number) is kept in frto_counter. When ACK advances + * window (but not to or beyond highest sequence sent before RTO): * On First ACK, send two new segments out. * On Second ACK, RTO was likely spurious. Do spurious response (response * algorithm is not part of the F-RTO detection algorithm @@ -2527,6 +2527,13 @@ static void tcp_process_frto(struct sock if (flag&FLAG_DATA_ACKED) inet_csk(sk)->icsk_retransmits = 0; + /* RFC4138 shortcoming in step 2; should also have case c): ACK isn't +* duplicate nor advances window, e.g., opposite dir data, winupdate +*/ + if ((tp->snd_una == prior_snd_una) && (flag&FLAG_NOT_DUP) && + !(flag&FLAG_FORWARD_PROGRESS)) + return; + if (tp->snd_una == prior_snd_una || !before(tp->snd_una, tp->frto_highmark)) { tcp_enter_frto_loss(sk); -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/18] [TCP] FRTO: Consecutive RTOs keep prior_ssthresh and ssthresh
In case a latency spike causes more than one RTO, the later ones should not cause the already reduced ssthresh to propagate into prior_ssthresh, since FRTO declares all such RTOs spurious at once or none of them. In the treatment of ssthresh, we mimic what tcp_enter_loss() does. The previous state (in frto_counter) must remain available until we have checked it in tcp_enter_frto(), and likewise the ACK information flag in tcp_process_frto(). Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 20 ++-- 1 files changed, 14 insertions(+), 6 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index f645c3e..c846beb 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1252,6 +1252,10 @@ int tcp_use_frto(const struct sock *sk) /* RTO occurred, but do not yet enter Loss state. Instead, defer RTO * recovery a bit and use heuristics in tcp_process_frto() to detect if * the RTO was spurious. + * + * Do like tcp_enter_loss() would; when RTO expires the second time it + * does: + * "Reduce ssthresh if it has not yet been made inside this window." */ void tcp_enter_frto(struct sock *sk) { @@ -1259,11 +1263,10 @@ void tcp_enter_frto(struct sock *sk) struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; - tp->frto_counter = 1; - - if (icsk->icsk_ca_state <= TCP_CA_Disorder || + if ((!tp->frto_counter && icsk->icsk_ca_state <= TCP_CA_Disorder) || tp->snd_una == tp->high_seq || - (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) { + ((icsk->icsk_ca_state == TCP_CA_Loss || tp->frto_counter) && +!icsk->icsk_retransmits)) { tp->prior_ssthresh = tcp_current_ssthresh(sk); tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); tcp_ca_event(sk, CA_EVENT_FRTO); @@ -1285,6 +1288,7 @@ void tcp_enter_frto(struct sock *sk) tcp_set_ca_state(sk, TCP_CA_Open); tp->frto_highmark = tp->snd_nxt; + tp->frto_counter = 1; } /* Enter Loss state after F-RTO was applied. 
Dupack arrived after RTO, @@ -2513,12 +2517,16 @@ static void tcp_conservative_spur_to_res * to prove that the RTO is indeed spurious. It transfers the control * from F-RTO to the conventional RTO recovery */ -static void tcp_process_frto(struct sock *sk, u32 prior_snd_una) +static void tcp_process_frto(struct sock *sk, u32 prior_snd_una, int flag) { struct tcp_sock *tp = tcp_sk(sk); tcp_sync_left_out(tp); + /* Duplicate the behavior from Loss state (fastretrans_alert) */ + if (flag&FLAG_DATA_ACKED) + inet_csk(sk)->icsk_retransmits = 0; + if (tp->snd_una == prior_snd_una || !before(tp->snd_una, tp->frto_highmark)) { tcp_enter_frto_loss(sk); @@ -2607,7 +2615,7 @@ static int tcp_ack(struct sock *sk, stru flag |= tcp_clean_rtx_queue(sk, &seq_rtt); if (tp->frto_counter) - tcp_process_frto(sk, prior_snd_una); + tcp_process_frto(sk, prior_snd_una, flag); if (tcp_ack_is_dubious(sk, flag)) { /* Advance CWND, if state allows this. */ -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/18] [TCP] FRTO: Use Disorder state during operation instead of Open
Retransmission counter assumptions are about to be changed, and a forcing reason to do this exists: using the sysctl in the check would be racy as soon as FRTO starts to ignore some ACKs (done in the following patches). Userspace may disable it at any moment, giving a nice oops if the timing is right. frto_counter would be inaccessible from userspace, but with SACK-enhanced FRTO retrans_out can include segments other than the head, possibly leaving it non-zero after a spurious RTO; boom again. Luckily, the solution seems rather simple: never go directly to Open state but use Disorder instead. This does not really change much, since TCP could anyway change its state to Disorder during FRTO using the path tcp_fastretrans_alert -> tcp_try_to_open (e.g., when a SACK block makes the ACK dubious). Besides, Disorder seems to be the state where TCP should be if not recovering (in Recovery or Loss state) while having some retransmissions in-flight (see tcp_try_to_open), which is exactly what happens with FRTO. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c846beb..d1e731f 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1286,7 +1286,8 @@ void tcp_enter_frto(struct sock *sk) } tcp_sync_left_out(tp); - tcp_set_ca_state(sk, TCP_CA_Open); + tcp_set_ca_state(sk, TCP_CA_Disorder); + tp->high_seq = tp->snd_nxt; tp->frto_highmark = tp->snd_nxt; tp->frto_counter = 1; } @@ -2014,8 +2015,7 @@ tcp_fastretrans_alert(struct sock *sk, u /* E. Check state exit conditions. State can be terminated *when high_seq is ACKed. 
*/ if (icsk->icsk_ca_state == TCP_CA_Open) { - if (!sysctl_tcp_frto) - BUG_TRAP(tp->retrans_out == 0); + BUG_TRAP(tp->retrans_out == 0); tp->retrans_stamp = 0; } else if (!before(tp->snd_una, tp->high_seq)) { switch (icsk->icsk_ca_state) { -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/18] [TCP] FRTO: Comment cleanup & improvement
Moved comments out from the body of process_frto() to the head (preferred way; see Documentation/CodingStyle). Bonus: it's much easier to read in this compacted form. FRTO algorithm and implementation is described in greater detail. For interested reader, more information is available in RFC4138. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 49 - 1 files changed, 32 insertions(+), 17 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 294cb44..f645c3e 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1236,22 +1236,22 @@ #endif return flag; } +/* F-RTO can only be used if these conditions are satisfied: + * - there must be some unsent new data + * - the advertised window should allow sending it + */ int tcp_use_frto(const struct sock *sk) { const struct tcp_sock *tp = tcp_sk(sk); - /* F-RTO must be activated in sysctl and there must be some -* unsent new data, and the advertised window should allow -* sending it. -*/ return (sysctl_tcp_frto && sk->sk_send_head && !after(TCP_SKB_CB(sk->sk_send_head)->end_seq, tp->snd_una + tp->snd_wnd)); } -/* RTO occurred, but do not yet enter loss state. Instead, transmit two new - * segments to see from the next ACKs whether any data was really missing. - * If the RTO was spurious, new ACKs should arrive. +/* RTO occurred, but do not yet enter Loss state. Instead, defer RTO + * recovery a bit and use heuristics in tcp_process_frto() to detect if + * the RTO was spurious. */ void tcp_enter_frto(struct sock *sk) { @@ -2489,6 +2489,30 @@ static void tcp_conservative_spur_to_res tcp_moderate_cwnd(tp); } +/* F-RTO spurious RTO detection algorithm (RFC4138) + * + * F-RTO affects during two new ACKs following RTO. State (ACK number) is kept + * in frto_counter. When ACK advances window (but not to or beyond highest + * sequence sent before RTO): + * On First ACK, send two new segments out. + * On Second ACK, RTO was likely spurious. 
Do spurious response (response + * algorithm is not part of the F-RTO detection algorithm + * given in RFC4138 but can be selected separately). + * Otherwise (basically on duplicate ACK), RTO was (likely) caused by a loss + * and TCP falls back to conventional RTO recovery. + * + * Rationale: if the RTO was spurious, new ACKs should arrive from the + * original window even after we transmit two new data segments. + * + * F-RTO is implemented (mainly) in four functions: + * - tcp_use_frto() is used to determine if TCP is can use F-RTO + * - tcp_enter_frto() prepares TCP state on RTO if F-RTO is used, it is + * called when tcp_use_frto() showed green light + * - tcp_process_frto() handles incoming ACKs during F-RTO algorithm + * - tcp_enter_frto_loss() is called if there is not enough evidence + * to prove that the RTO is indeed spurious. It transfers the control + * from F-RTO to the conventional RTO recovery + */ static void tcp_process_frto(struct sock *sk, u32 prior_snd_una) { struct tcp_sock *tp = tcp_sk(sk); @@ -2497,25 +2521,16 @@ static void tcp_process_frto(struct sock if (tp->snd_una == prior_snd_una || !before(tp->snd_una, tp->frto_highmark)) { - /* RTO was caused by loss, start retransmitting in -* go-back-N slow start -*/ tcp_enter_frto_loss(sk); return; } if (tp->frto_counter == 1) { - /* First ACK after RTO advances the window: allow two new -* segments out. -*/ tp->snd_cwnd = tcp_packets_in_flight(tp) + 2; - } else { + } else /* frto_counter == 2 */ { tcp_conservative_spur_to_response(tp); } - /* F-RTO affects on two new ACKs following RTO. -* At latest on third ACK the TCP behavior is back to normal. -*/ tp->frto_counter = (tp->frto_counter + 1) % 3; } -- 1.4.2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHSET 0/18] FRTO: fixes and small changes + SACK enhanced version
Hi, Here is a set of patches that fix most of the flaws in the current FRTO implementation (specified in RFC4138); in addition, the last two patches add SACK-enhanced FRTO (not enabled unless the frto sysctl is set to 2, which allows using the basic version also with SACK). There are some dependencies on the earlier patches in the set (hard to list all the cases I've thought of, but not all combinations are good ones even if they apply cleanly).

Documentation/networking/ip-sysctl.txt |5 - include/net/tcp.h | 14 -- net/ipv4/tcp_input.c | 265 ++-- 3 files changed, 221 insertions(+), 63 deletions(-)

(At least) one interpretation issue exists, see patch "FRTO: Entry is allowed only during (New)Reno like recovery". Besides that, these things should/could be solved (later on):
- Setting undo_marker when the RTO is not spurious (FRTO has been clearing it, which disabled DSACK undos for conventional recovery).
- Interaction with Eifel
- Different responses (a new sysctl to select them?)
- When a cumulative ACK arrives at frto_highseq during FRTO, it could be useful to go directly to CA_Open, because duplicate ACKs for that segment could then be used to initiate recovery if it was lost. Most of the time, the duplicate ACKs won't be false ones (we might have made too many unnecessary retransmissions, but that's less likely with FRTO, and it could be considered while making the state decision).
- Maybe frto_highmark should be reset somewhere during a connection due to wrapping of seqnos (the reord adjustment relies on it having a valid after relation...)?
- tcp_use_frto and tcp_enter_loss now both scan the skb list from the beginning; it might be possible to take advantage of this either by combining them or by passing the skb from the use_frto iteration to tcp_enter_loss.

I did some tests with FACK + SACK FRTO; the results seemed to be correct, but the conservative response had really poor performance. 
I'm more familiar with time-seq graphs of more aggressive responses, and in a couple of cases I was really wondering whether this thing works at all, but yes, after tracing I found that it worked, although the result was not a very good-looking one due to interaction with rate halving; maybe a "rate-halving aware" response could do much better (or alternatively one that does a more aggressive undo).

# Test 1: normal TCP
# Test 2: spurious RTO
# Test 3: drop the segment
# Test 4: drop a delayed segment
# Test 5: drop the next segment
# Test 6: drop in window segment
# Test 7: drop the segment and the next segment
# Test 8: drop the segment and in window segment
# Test 9: delay the first and next (spurious RTOs, for different segments)
# Test 10: delay the first excessively (two spurious RTOs)
# Test n+1: drop rexmission
# Test n+2: delay rexmission (spurious RTO also after frto_highmark)
# Test n+3: delay rexmission (spurious RTO also after highmark), drop RTO seg
# Test n+4: drop the segment and rexmit
# Test n+5: drop the segment and first new data
# Test n+6: drop the segment and second new data

The tests were run on 2.6.18; I have quite a lot of my own modifications included, but they were disabled using sysctls, except for a change in mark_head_lost: the if condition from !TAGBITS -> !(TAGBITS & ~SACKED_RETRANS), but AFAICT it shouldn't matter, and if it does, it should be included (if you received this mail from the previous send attempt: I mistakenly claimed that SACKED_ACKED was the bit that was excluded and had parenthesized it incorrectly here). I couldn't come up with a scenario in mainline-only code where SACKED_RETRANS would be set for an skb when LOST has not been set, except for the head by FRTO itself, which will not be a problem. I have checked that the FRTO parts used in the tests were identical to the result of this patchset. Compile tested against net-2.6 (also the intermediate steps). -- i. ps. 
I'm sorry if you receive these twice; the previous attempt had some charset problems and was rejected at least by netdev.
[PATCH 1/18] [TCP] FRTO: Incorrectly clears TCPCB_EVER_RETRANS bit
FRTO was slightly too brave... Should only clear TCPCB_SACKED_RETRANS bit. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 1a14191..b21e232 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1266,7 +1266,7 @@ void tcp_enter_frto(struct sock *sk) tp->undo_retrans = 0; sk_stream_for_retrans_queue(skb, sk) { - TCP_SKB_CB(skb)->sacked &= ~TCPCB_RETRANS; + TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS; } tcp_sync_left_out(tp); -- 1.4.2
[PATCH 2/18] [TCP] FRTO: Separated response from FRTO detection algorithm
FRTO's spurious RTO detection algorithm (RFC4138) does not include the response to a detected spurious RTO; different response algorithms can be used. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 16 ++-- 1 files changed, 10 insertions(+), 6 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index b21e232..c5be3d0 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2467,6 +2467,15 @@ static int tcp_ack_update_window(struct return flag; } +/* A very conservative spurious RTO response algorithm: reduce cwnd and + * continue in congestion avoidance. + */ +static void tcp_conservative_spur_to_response(struct tcp_sock *tp) +{ + tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh); + tcp_moderate_cwnd(tp); +} + static void tcp_process_frto(struct sock *sk, u32 prior_snd_una) { struct tcp_sock *tp = tcp_sk(sk); @@ -2488,12 +2497,7 @@ static void tcp_process_frto(struct sock */ tp->snd_cwnd = tcp_packets_in_flight(tp) + 2; } else { - /* Also the second ACK after RTO advances the window. -* The RTO was likely spurious. Reduce cwnd and continue -* in congestion avoidance -*/ - tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh); - tcp_moderate_cwnd(tp); + tcp_conservative_spur_to_response(tp); } /* F-RTO affects on two new ACKs following RTO. -- 1.4.2
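As a quick illustration of the response being factored out here: it pulls cwnd back to the slow-start threshold (never raising it) and lets the sender continue in congestion avoidance. A minimal standalone sketch; the struct is a hypothetical stand-in for struct tcp_sock, and the burst-limiting tcp_moderate_cwnd() call of the real code is omitted:

```c
/* Sketch of the conservative spurious-RTO response; the struct is a
 * pared-down stand-in for struct tcp_sock, not the kernel's type. */
#include <assert.h>

struct cc_state {
	unsigned int snd_cwnd;     /* congestion window, in packets */
	unsigned int snd_ssthresh; /* slow-start threshold */
};

static void conservative_spur_to_response(struct cc_state *tp)
{
	/* cwnd = min(cwnd, ssthresh): strictly a clamp, never a raise */
	if (tp->snd_cwnd > tp->snd_ssthresh)
		tp->snd_cwnd = tp->snd_ssthresh;
}
```

Note the asymmetry: a flow already below ssthresh is left alone, so the response can only make the sender less aggressive after a spurious timeout.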
Re: Extensible hashing and RCU
"Michael K. Edwards" <[EMAIL PROTECTED]> writes: > A better data structure for RCU, even with a fixed key space, is > probably a splay tree. Much less vulnerable to cache eviction DDoS > than a hash, because the hot connections get rotated up into non-leaf > layers and get traversed enough to keep them in the LRU set. LRU tends to be hell for caches in MP systems, because it writes to the cache lines too and makes them exclusive and more expensive. -Andi
Re: Extensible hashing and RCU
Eric Dumazet <[EMAIL PROTECTED]> writes: > > So are you speaking of one memory cache miss per lookup ? Actually two: if the trie'ing allows RCUing you would save the spinlock cache line too. This would increase the break-even budget for the trie. > If not, you loose. It all depends on if the higher levels on the trie are small enough to be kept in cache. Even with two cache misses it might still break even, but have better scalability. Another advantage would be to eliminate the need for large memory blocks, which cause problems too e.g. on NUMA. It certainly would save quite some memory if the tree levels are allocated on demand only. However breaking it up might also cost more TLB misses, but those could be eliminated by preallocating the tree in the same way as the hash today. Don't know if it's needed or not. I guess someone needs to code it up and try it. -Andi
Re: [PATCH 1/2][TCP] YeAH-TCP: algorithm implementation
The patch. Angelo P. Castellani wrote: From: Angelo P. Castellani <[EMAIL PROTECTED]> YeAH-TCP is a sender-side high-speed enabled TCP congestion control algorithm, which uses a mixed loss/delay approach to compute the congestion window. Its design goals target high efficiency, internal, RTT and Reno fairness, resilience to link loss while keeping network elements load as low as possible. For further details look here: http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf Signed-off-by: Angelo P. Castellani <[EMAIL PROTECTED]> --- This is the YeAH-TCP implementation of the algorithm presented to PFLDnet2007 (http://wil.cs.caltech.edu/pfldnet2007/). Regards, Angelo P. Castellani Kconfig| 14 ++ Makefile |1 tcp_yeah.c | 288 + tcp_yeah.h | 134 4 files changed, 437 insertions(+) diff -uprN linux-2.6.20-a/net/ipv4/Kconfig linux-2.6.20-b/net/ipv4/Kconfig --- linux-2.6.20-a/net/ipv4/Kconfig 2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-b/net/ipv4/Kconfig 2007-02-19 10:52:46.0 +0100 @@ -574,6 +574,20 @@ config TCP_CONG_VENO loss packets. See http://www.ntu.edu.sg/home5/ZHOU0022/papers/CPFu03a.pdf +config TCP_CONG_YEAH + tristate "YeAH TCP" + depends on EXPERIMENTAL + default n + ---help--- + YeAH-TCP is a sender-side high-speed enabled TCP congestion control + algorithm, which uses a mixed loss/delay approach to compute the + congestion window. Its design goals target high efficiency, + internal, RTT and Reno fairness, resilience to link loss while + keeping network elements load as low as possible. 
+ + For further details look here: + http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf + choice prompt "Default TCP congestion control" default DEFAULT_CUBIC diff -uprN linux-2.6.20-a/net/ipv4/Makefile linux-2.6.20-b/net/ipv4/Makefile --- linux-2.6.20-a/net/ipv4/Makefile 2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-b/net/ipv4/Makefile 2007-02-19 10:52:46.0 +0100 @@ -49,6 +49,7 @@ obj-$(CONFIG_TCP_CONG_VEGAS) += tcp_vega obj-$(CONFIG_TCP_CONG_VENO) += tcp_veno.o obj-$(CONFIG_TCP_CONG_SCALABLE) += tcp_scalable.o obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o +obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o obj-$(CONFIG_NETLABEL) += cipso_ipv4.o obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \ diff -uprN linux-2.6.20-a/net/ipv4/tcp_yeah.c linux-2.6.20-b/net/ipv4/tcp_yeah.c --- linux-2.6.20-a/net/ipv4/tcp_yeah.c 1970-01-01 01:00:00.0 +0100 +++ linux-2.6.20-b/net/ipv4/tcp_yeah.c 2007-02-19 10:52:46.0 +0100 @@ -0,0 +1,288 @@ +/* + * + * YeAH TCP + * + * For further details look at: + *http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf + * + */ + +#include "tcp_yeah.h" + +/* Default values of the Vegas variables, in fixed-point representation + * with V_PARAM_SHIFT bits to the right of the binary point. 
+ */ +#define V_PARAM_SHIFT 1 + +#define TCP_YEAH_ALPHA 80 //lin number of packets queued at the bottleneck +#define TCP_YEAH_GAMMA 1 //lin fraction of queue to be removed per rtt +#define TCP_YEAH_DELTA 3 //log minimum fraction of cwnd to be removed on loss +#define TCP_YEAH_EPSILON 1 //log maximum fraction to be removed on early decongestion +#define TCP_YEAH_PHY 8 //lin maximum delta from base +#define TCP_YEAH_RHO 16 //lin minimum number of consecutive rtt to consider competition on loss +#define TCP_YEAH_ZETA 50 //lin minimum number of state switches to reset reno_count + +#define TCP_SCALABLE_AI_CNT 100U + +/* YeAH variables */ +struct yeah { + /* Vegas */ + u32 beg_snd_nxt; /* right edge during last RTT */ + u32 beg_snd_una; /* left edge during last RTT */ + u32 beg_snd_cwnd; /* saves the size of the cwnd */ + u8 doing_vegas_now; /* if true, do vegas for this RTT */ + u16 cntRTT; /* # of RTTs measured within last RTT */ + u32 minRTT; /* min of RTTs measured within last RTT (in usec) */ + u32 baseRTT; /* the min of all Vegas RTT measurements seen (in usec) */ + + /* YeAH */ + u32 lastQ; + u32 doing_reno_now; + + u32 reno_count; + u32 fast_count; + + u32 pkts_acked; +}; + +static void tcp_yeah_init(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct yeah *yeah = inet_csk_ca(sk); + + tcp_vegas_init(sk); + + yeah->doing_reno_now = 0; + yeah->lastQ = 0; + + yeah->reno_count = 2; + + /* Ensure the MD arithmetic works. This is somewhat pedantic, + * since I don't think we will see a cwnd this large. :) */ + tp->snd_cwnd_clamp = min_t(u32, tp->snd_cwnd_clamp, 0xffffffff/128); + +} + + +static void tcp_yeah_pkts_acked(struct sock *sk, u32 pkts_acked) +{ + const struct inet_connection_sock *icsk = inet_csk(sk); + struct yeah *yeah = inet_csk_ca(sk); + + if (icsk->icsk_ca_state == TCP_CA_Open) + yeah->pkts_acked = pkts_acked; +} + +/* 64bit divisor, dividend and result. dynamic precision */ +static inline u64 div64_64(u64 dividend, u64 divisor) +{
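For context on how constants like TCP_YEAH_ALPHA and the Vegas minRTT/baseRTT fields are used: YeAH estimates how many packets the flow itself keeps queued at the bottleneck from the spread between the recent minimum RTT and the lowest RTT ever observed, and backs off early ("precautionary decongestion") when that estimate exceeds ALPHA. A hedged standalone sketch; the function names are illustrative, not the driver's actual callbacks:

```c
/* Illustrative sketch of YeAH's queue-backlog estimate; not the
 * driver's actual code, just the arithmetic it is built around. */
#include <assert.h>

#define TCP_YEAH_ALPHA 80 /* packets allowed in the bottleneck queue */

/* Backlog estimate: Q = (minRTT - baseRTT) * cwnd / minRTT,
 * with RTTs in microseconds and cwnd in packets. The RTT excess over
 * the base propagation delay is assumed to be queueing delay. */
static unsigned int yeah_queue_estimate(unsigned int min_rtt_us,
					unsigned int base_rtt_us,
					unsigned int snd_cwnd)
{
	unsigned long long excess = min_rtt_us - base_rtt_us;

	return (unsigned int)(excess * snd_cwnd / min_rtt_us);
}

/* Trigger early decongestion once the backlog exceeds ALPHA packets. */
static int yeah_should_decongest(unsigned int min_rtt_us,
				 unsigned int base_rtt_us,
				 unsigned int snd_cwnd)
{
	return yeah_queue_estimate(min_rtt_us, base_rtt_us, snd_cwnd)
	       > TCP_YEAH_ALPHA;
}
```

With a 100 ms base RTT inflated to 150 ms and cwnd of 300 packets, the estimate is 100 queued packets, above the ALPHA threshold of 80, so the window would be reduced before any loss occurs.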
[PATCH 1/2][TCP] YeAH-TCP: algorithm implementation
From: Angelo P. Castellani <[EMAIL PROTECTED]> YeAH-TCP is a sender-side high-speed enabled TCP congestion control algorithm, which uses a mixed loss/delay approach to compute the congestion window. Its design goals target high efficiency, internal, RTT and Reno fairness, resilience to link loss while keeping network elements load as low as possible. For further details look here: http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf Signed-off-by: Angelo P. Castellani <[EMAIL PROTECTED]> --- This is the YeAH-TCP implementation of the algorithm presented to PFLDnet2007 (http://wil.cs.caltech.edu/pfldnet2007/). Regards, Angelo P. Castellani Kconfig| 14 ++ Makefile |1 tcp_yeah.c | 288 + tcp_yeah.h | 134 4 files changed, 437 insertions(+) diff -uprN linux-2.6.20-a/net/ipv4/Kconfig linux-2.6.20-b/net/ipv4/Kconfig --- linux-2.6.20-a/net/ipv4/Kconfig 2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-b/net/ipv4/Kconfig 2007-02-19 10:52:46.0 +0100 @@ -574,6 +574,20 @@ config TCP_CONG_VENO loss packets. See http://www.ntu.edu.sg/home5/ZHOU0022/papers/CPFu03a.pdf +config TCP_CONG_YEAH + tristate "YeAH TCP" + depends on EXPERIMENTAL + default n + ---help--- + YeAH-TCP is a sender-side high-speed enabled TCP congestion control + algorithm, which uses a mixed loss/delay approach to compute the + congestion window. Its design goals target high efficiency, + internal, RTT and Reno fairness, resilience to link loss while + keeping network elements load as low as possible. 
+ + For further details look here: + http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf + choice prompt "Default TCP congestion control" default DEFAULT_CUBIC diff -uprN linux-2.6.20-a/net/ipv4/Makefile linux-2.6.20-b/net/ipv4/Makefile --- linux-2.6.20-a/net/ipv4/Makefile 2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-b/net/ipv4/Makefile 2007-02-19 10:52:46.0 +0100 @@ -49,6 +49,7 @@ obj-$(CONFIG_TCP_CONG_VEGAS) += tcp_vega obj-$(CONFIG_TCP_CONG_VENO) += tcp_veno.o obj-$(CONFIG_TCP_CONG_SCALABLE) += tcp_scalable.o obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o +obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o obj-$(CONFIG_NETLABEL) += cipso_ipv4.o obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \ diff -uprN linux-2.6.20-a/net/ipv4/tcp_yeah.c linux-2.6.20-b/net/ipv4/tcp_yeah.c --- linux-2.6.20-a/net/ipv4/tcp_yeah.c 1970-01-01 01:00:00.0 +0100 +++ linux-2.6.20-b/net/ipv4/tcp_yeah.c 2007-02-19 10:52:46.0 +0100 @@ -0,0 +1,288 @@ +/* + * + * YeAH TCP + * + * For further details look at: + *http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf + * + */ + +#include "tcp_yeah.h" + +/* Default values of the Vegas variables, in fixed-point representation + * with V_PARAM_SHIFT bits to the right of the binary point. 
+ */ +#define V_PARAM_SHIFT 1 + +#define TCP_YEAH_ALPHA 80 //lin number of packets queued at the bottleneck +#define TCP_YEAH_GAMMA 1 //lin fraction of queue to be removed per rtt +#define TCP_YEAH_DELTA 3 //log minimum fraction of cwnd to be removed on loss +#define TCP_YEAH_EPSILON 1 //log maximum fraction to be removed on early decongestion +#define TCP_YEAH_PHY 8 //lin maximum delta from base +#define TCP_YEAH_RHO 16 //lin minimum number of consecutive rtt to consider competition on loss +#define TCP_YEAH_ZETA 50 //lin minimum number of state switches to reset reno_count + +#define TCP_SCALABLE_AI_CNT 100U + +/* YeAH variables */ +struct yeah { + /* Vegas */ + u32 beg_snd_nxt; /* right edge during last RTT */ + u32 beg_snd_una; /* left edge during last RTT */ + u32 beg_snd_cwnd; /* saves the size of the cwnd */ + u8 doing_vegas_now; /* if true, do vegas for this RTT */ + u16 cntRTT; /* # of RTTs measured within last RTT */ + u32 minRTT; /* min of RTTs measured within last RTT (in usec) */ + u32 baseRTT; /* the min of all Vegas RTT measurements seen (in usec) */ + + /* YeAH */ + u32 lastQ; + u32 doing_reno_now; + + u32 reno_count; + u32 fast_count; + + u32 pkts_acked; +}; + +static void tcp_yeah_init(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct yeah *yeah = inet_csk_ca(sk); + + tcp_vegas_init(sk); + + yeah->doing_reno_now = 0; + yeah->lastQ = 0; + + yeah->reno_count = 2; + + /* Ensure the MD arithmetic works. This is somewhat pedantic, + * since I don't think we will see a cwnd this large. :) */ + tp->snd_cwnd_clamp = min_t(u32, tp->snd_cwnd_clamp, 0xffffffff/128); + +} + + +static void tcp_yeah_pkts_acked(struct sock *sk, u32 pkts_acked) +{ + const struct inet_connection_sock *icsk = inet_csk(sk); + struct yeah *yeah = inet_csk_ca(sk); + + if (icsk->icsk_ca_state == TCP_CA_Open) + yeah->pkts_acked = pkts_acked; +} + +/* 64bit divisor, dividend and result. 
dynamic precision */ +static inline u64 div64_64(u64 dividend, u64 divisor) +{ + u32 d = divisor; + + if (divisor > 0xf
Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function
Forgot the patch. Angelo P. Castellani wrote: From: Angelo P. Castellani <[EMAIL PROTECTED]> RFC3742: limited slow start See http://www.ietf.org/rfc/rfc3742.txt Signed-off-by: Angelo P. Castellani <[EMAIL PROTECTED]> --- To allow code reuse, I've added the limited slow start procedure as an exported symbol of the Linux TCP congestion control code. On large BDP networks canonical slow start should be avoided because it requires large packet losses to converge, whereas at lower BDPs slow start and limited slow start are identical. Large BDP is defined through the max_ssthresh variable. I think limited slow start could safely replace the canonical slow start procedure in Linux. Regards, Angelo P. Castellani p.s.: the attached patch adds an exported function currently used only by YeAH TCP include/net/tcp.h |1 + net/ipv4/tcp_cong.c | 23 +++ 2 files changed, 24 insertions(+) diff -uprN linux-2.6.20-a/include/net/tcp.h linux-2.6.20-c/include/net/tcp.h --- linux-2.6.20-a/include/net/tcp.h 2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-c/include/net/tcp.h 2007-02-19 10:54:10.0 +0100 @@ -669,6 +669,7 @@ extern void tcp_get_allowed_congestion_c extern int tcp_set_allowed_congestion_control(char *allowed); extern int tcp_set_congestion_control(struct sock *sk, const char *name); extern void tcp_slow_start(struct tcp_sock *tp); +extern void tcp_limited_slow_start(struct tcp_sock *tp); extern struct tcp_congestion_ops tcp_init_congestion_ops; extern u32 tcp_reno_ssthresh(struct sock *sk); diff -uprN linux-2.6.20-a/net/ipv4/tcp_cong.c linux-2.6.20-c/net/ipv4/tcp_cong.c --- linux-2.6.20-a/net/ipv4/tcp_cong.c 2007-02-04 19:44:54.0 +0100 +++ linux-2.6.20-c/net/ipv4/tcp_cong.c 2007-02-19 10:54:10.0 +0100 @@ -297,6 +297,29 @@ void tcp_slow_start(struct tcp_sock *tp) } EXPORT_SYMBOL_GPL(tcp_slow_start); +void tcp_limited_slow_start(struct tcp_sock *tp) +{ + /* RFC3742: limited slow start + * the window is increased by 1/K MSS for each arriving ACK, + * for K = int(cwnd/(0.5 
max_ssthresh)) + */ + + const int max_ssthresh = 100; + + if (max_ssthresh > 0 && tp->snd_cwnd > max_ssthresh) { + u32 k = max(tp->snd_cwnd / (max_ssthresh >> 1), 1U); + if (++tp->snd_cwnd_cnt >= k) { + if (tp->snd_cwnd < tp->snd_cwnd_clamp) + tp->snd_cwnd++; + tp->snd_cwnd_cnt = 0; + } + } else { + if (tp->snd_cwnd < tp->snd_cwnd_clamp) + tp->snd_cwnd++; + } +} +EXPORT_SYMBOL_GPL(tcp_limited_slow_start); + /* * TCP Reno congestion control * This is special case used for fallback as well.
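The RFC3742 rule in the patch above can be exercised outside the kernel with a small simulation: below max_ssthresh the window grows by one segment per ACK exactly as in canonical slow start, while above it growth slows to 1/K per ACK with K = cwnd / (max_ssthresh / 2). A hedged sketch, where the struct is a pared-down stand-in for struct tcp_sock:

```c
/* Standalone sketch of RFC3742 limited slow start, mirroring the
 * tcp_limited_slow_start() patch; the struct is a hypothetical
 * stand-in for struct tcp_sock. */
#include <assert.h>

struct lss_sock {
	unsigned int snd_cwnd;       /* congestion window, in packets */
	unsigned int snd_cwnd_cnt;   /* ACKs counted toward next growth */
	unsigned int snd_cwnd_clamp; /* hard upper bound on cwnd */
};

static void limited_slow_start(struct lss_sock *tp, unsigned int max_ssthresh)
{
	if (max_ssthresh > 0 && tp->snd_cwnd > max_ssthresh) {
		/* K = cwnd / (max_ssthresh / 2): grow 1 segment per K ACKs */
		unsigned int k = tp->snd_cwnd / (max_ssthresh >> 1);

		if (k < 1)
			k = 1;
		if (++tp->snd_cwnd_cnt >= k) {
			if (tp->snd_cwnd < tp->snd_cwnd_clamp)
				tp->snd_cwnd++;
			tp->snd_cwnd_cnt = 0;
		}
	} else if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
		/* At or below max_ssthresh: canonical slow start */
		tp->snd_cwnd++;
	}
}
```

With max_ssthresh = 100 as in the patch, a cwnd of 200 gives K = 4, so the window needs four ACKs to grow by one segment; per-RTT growth is thus capped near max_ssthresh/2 segments instead of doubling.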
Re: Converting network devices from class devices causes namespace pollution
Greg KH <[EMAIL PROTECTED]> writes: > We need our own namespace for these devices, and we have it today > already. Look if you enable CONFIG_SYSFS_DEPRECATED, or on a pre-2.6.19 > machine at what shows up in the pci device directories: > -r--r--r-- 1 root root 4096 2007-02-18 13:06 vendor Interesting. I hadn't noticed that before. > So, all we need to do is rename these devices back to the "net:eth0" > name, and everything will be fine. I'll work on fixing that tomorrow as > it will take a bit of hacking on the kobject symlink function and the > driver core code (but it gets us rid of a symlink in "compatibility > mode", which is always a nice win...) Ok. I'm groaning a little bit at what a nuisance this is going to be to get support for multiple network namespaces in there after your fix goes in; directories can be easier to deal with. But once you figure this part out I will figure something out. For me the nasty case is one PCI device that has multiple ethernet devices coming from it (I think IB devices have this property today), each showing up in a different network namespace, so they might all have the same name. Ugh. Eric