Re: bge hangs on recent 7.3-STABLE
On Mon, Sep 13, 2010 at 11:04:47AM -0700, Pyun YongHyeon wrote: > On Mon, Sep 13, 2010 at 06:27:08PM +0400, Igor Sysoev wrote: > > On Thu, Sep 09, 2010 at 02:18:08PM -0700, Pyun YongHyeon wrote: > > > > > On Thu, Sep 09, 2010 at 01:10:50PM -0700, Pyun YongHyeon wrote: > > > > On Thu, Sep 09, 2010 at 02:28:26PM +0400, Igor Sysoev wrote: > > > > > Hi, > > > > > > > > > > I have several hosts running FreeBSD/amd64 7.2-STABLE updated on > > > > > 11.01.2010 > > > > > and 25.02.2010. Hosts process about 10K input and 10K output packets/s > > > > > without issues. One of them, however, is loaded more than others, so > > > > > it > > > > > processes 20K/20K packets/s. > > > > > > > > > > Recently, I have upgraded one host to 7.3-STABLE, 24.08.2010. > > > > > Then bge on this host hung two times. I was able to restart it from > > > > > console using: > > > > > /etc/rc.d/netif restart bge0 > > > > > > > > > > Then I have upgraded the most loaded (20K/20K) host to 7.3-STABLE, > > > > > 07.09.2010. > > > > > After reboot bge hung every several seconds. I was able to restart it, > > > > > but bge hung again after several seconds. > > > > > > > > > > Then I have downgraded this host to 7.3-STABLE, 14.08.2010, since > > > > > there > > > > > were several if_bge.c commits on 15.08.2010. The same hangs. > > > > > Then I have downgraded this host to 7.3-STABLE, 17.03.2010, before > > > > > the first if_bge.c commit after 25.02.2010. Now it runs without hangs. > > > > > > > > > > The hosts are amd64 dual core SMP with 4G machines. bge information: > > Thank you, it seems the patch has fixed the bug. > > BTW, I noticed the same hungs on FreeBSD 8.1, date=2010.09.06.23.59.59 > > I will apply the patch on all my updated hosts. > > > > Thanks for testing. I'm afraid bge(4) in HEAD, stable/8 and > stable/7(including 8.1-RELEASE and 7.3-RELEASE) may suffer from > this issue. Let me know what other hosts work with the patch. 
Currently I have patched only two hosts: 7.3, 24.08.2010 and 8.1, 06.09.2010. The 7.3 host now handles 20K/20K packets/s without issues. One host has been downgraded to 17.03.2010, as I already reported. The other hosts still run 7.x from January and February 2010. If there are no hangs, I will upgrade and patch the other hosts. -- Igor Sysoev http://sysoev.ru/en/ ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: bge hangs on recent 7.3-STABLE
On Thu, Sep 09, 2010 at 02:18:08PM -0700, Pyun YongHyeon wrote: > On Thu, Sep 09, 2010 at 01:10:50PM -0700, Pyun YongHyeon wrote: > > On Thu, Sep 09, 2010 at 02:28:26PM +0400, Igor Sysoev wrote: > > > Hi, > > > > > > I have several hosts running FreeBSD/amd64 7.2-STABLE updated on > > > 11.01.2010 > > > and 25.02.2010. Hosts process about 10K input and 10K output packets/s > > > without issues. One of them, however, is loaded more than others, so it > > > processes 20K/20K packets/s. > > > > > > Recently, I have upgraded one host to 7.3-STABLE, 24.08.2010. > > > Then bge on this host hung two times. I was able to restart it from > > > console using: > > > /etc/rc.d/netif restart bge0 > > > > > > Then I have upgraded the most loaded (20K/20K) host to 7.3-STABLE, > > > 07.09.2010. > > > After reboot bge hung every several seconds. I was able to restart it, > > > but bge hung again after several seconds. > > > > > > Then I have downgraded this host to 7.3-STABLE, 14.08.2010, since there > > > were several if_bge.c commits on 15.08.2010. The same hangs. > > > Then I have downgraded this host to 7.3-STABLE, 17.03.2010, before > > > the first if_bge.c commit after 25.02.2010. Now it runs without hangs. > > > > > > The hosts are amd64 dual core SMP with 4G machines. bge information: > > > > > > b...@pci0:4:0:0:class=0x02 card=0x165914e4 chip=0x165914e4 > > > rev=0x11 hdr=0x00 > > > vendor = 'Broadcom Corporation' > > > device = 'NetXtreme Gigabit Ethernet PCI Express (BCM5721)' > > > > > > bge0: > > 0x004101> mem 0xfe5f-0xfe5f irq 19 at device 0.0 on pci4 > > > miibus1: on bge0 > > > brgphy0: PHY 1 on miibus1 > > > brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, > > > 1000baseT-FDX, auto > > > bge0: Ethernet address: 00:e0:81:5f:6e:8a > > > > > > > Could you show me verbose boot message(bge part only)? > > Also show me the output of "pciconf -lcbv". > > > > Forgot to send a patch. Let me know whether attached patch fixes > the issue or not. 
> Index: sys/dev/bge/if_bge.c
> ===================================================================
> --- sys/dev/bge/if_bge.c	(revision 212341)
> +++ sys/dev/bge/if_bge.c	(working copy)
> @@ -3386,9 +3386,11 @@
> 	sc->bge_rx_saved_considx = rx_cons;
> 	bge_writembx(sc, BGE_MBX_RX_CONS0_LO, sc->bge_rx_saved_considx);
> 	if (stdcnt)
> -		bge_writembx(sc, BGE_MBX_RX_STD_PROD_LO, sc->bge_std);
> +		bge_writembx(sc, BGE_MBX_RX_STD_PROD_LO, (sc->bge_std +
> +		    BGE_STD_RX_RING_CNT - 1) % BGE_STD_RX_RING_CNT);
> 	if (jumbocnt)
> -		bge_writembx(sc, BGE_MBX_RX_JUMBO_PROD_LO, sc->bge_jumbo);
> +		bge_writembx(sc, BGE_MBX_RX_JUMBO_PROD_LO, (sc->bge_jumbo +
> +		    BGE_JUMBO_RX_RING_CNT - 1) % BGE_JUMBO_RX_RING_CNT);
> #ifdef notyet
> 	/*
> 	 * This register wraps very quickly under heavy packet drops.

Thank you, it seems the patch has fixed the bug.

BTW, I noticed the same hangs on FreeBSD 8.1, date=2010.09.06.23.59.59. I will apply the patch on all my updated hosts.

-- Igor Sysoev http://sysoev.ru/en/
Re: bge hangs on recent 7.3-STABLE
On Fri, Sep 10, 2010 at 07:39:15AM +0400, Igor Sysoev wrote: > On Thu, Sep 09, 2010 at 01:10:50PM -0700, Pyun YongHyeon wrote: > > > On Thu, Sep 09, 2010 at 02:28:26PM +0400, Igor Sysoev wrote: > > > Hi, > > > > > > I have several hosts running FreeBSD/amd64 7.2-STABLE updated on > > > 11.01.2010 > > > and 25.02.2010. Hosts process about 10K input and 10K output packets/s > > > without issues. One of them, however, is loaded more than others, so it > > > processes 20K/20K packets/s. > > > > > > Recently, I have upgraded one host to 7.3-STABLE, 24.08.2010. > > > Then bge on this host hung two times. I was able to restart it from > > > console using: > > > /etc/rc.d/netif restart bge0 > > > > > > Then I have upgraded the most loaded (20K/20K) host to 7.3-STABLE, > > > 07.09.2010. > > > After reboot bge hung every several seconds. I was able to restart it, > > > but bge hung again after several seconds. > > > > > > Then I have downgraded this host to 7.3-STABLE, 14.08.2010, since there > > > were several if_bge.c commits on 15.08.2010. The same hangs. > > > Then I have downgraded this host to 7.3-STABLE, 17.03.2010, before > > > the first if_bge.c commit after 25.02.2010. Now it runs without hangs. > > > > > > The hosts are amd64 dual core SMP with 4G machines. bge information: > > > > > > b...@pci0:4:0:0:class=0x02 card=0x165914e4 chip=0x165914e4 > > > rev=0x11 hdr=0x00 > > > vendor = 'Broadcom Corporation' > > > device = 'NetXtreme Gigabit Ethernet PCI Express (BCM5721)' > > > > > > bge0: > > 0x004101> mem 0xfe5f-0xfe5f irq 19 at device 0.0 on pci4 > > > miibus1: on bge0 > > > brgphy0: PHY 1 on miibus1 > > > brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, > > > 1000baseT-FDX, auto > > > bge0: Ethernet address: 00:e0:81:5f:6e:8a > > > > > > > Could you show me verbose boot message(bge part only)? > > Also show me the output of "pciconf -lcbv". > > Here is "pciconf -lcbv", I will send the "boot -v" part later. 
> 
> b...@pci0:4:0:0:	class=0x02 card=0x165914e4 chip=0x165914e4 rev=0x11 hdr=0x00
>     vendor     = 'Broadcom Corporation'
>     device     = 'NetXtreme Gigabit Ethernet PCI Express (BCM5721)'
>     class      = network
>     subclass   = ethernet
>     bar   [10] = type Memory, range 64, base 0xfe5f, size 65536, enabled
>     cap 01[48] = powerspec 2  supports D0 D3  current D0
>     cap 03[50] = VPD
>     cap 05[58] = MSI supports 8 messages, 64 bit
>     cap 10[d0] = PCI-Express 1 endpoint max data 128(128) link x1(x1)

Sorry for the delay. Here is the "boot -v" part. It is from another host, but that host hangs too:

pci4: on pcib4
pci4: domain=0, physical bus=4
found-> vendor=0x14e4, dev=0x1659, revid=0x11
	domain=0, bus=4, slot=0, func=0
	class=02-00-00, hdrtype=0x00, mfdev=0
	cmdreg=0x0006, statreg=0x0010, cachelnsz=8 (dwords)
	lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns)
	intpin=a, irq=5
	powerspec 2  supports D0 D3  current D0
	MSI supports 8 messages, 64 bit
	map[10]: type Memory, range 64, base 0xfe5f, size 16, enabled
pcib4: requested memory range 0xfe5f-0xfe5f: good
pcib0: matched entry for 0.13.INTA (src \_SB_.PCI0.APC4:0)
pcib0: slot 13 INTA routed to irq 19 via \_SB_.PCI0.APC4
pcib4: slot 0 INTA is routed to irq 19
pci0:4:0:0: bad VPD cksum, remain 14
bge0: mem 0xfe5f0000-0xfe5f irq 19 at device 0.0 on pci4
bge0: Reserved 0x1 bytes for rid 0x10 type 3 at 0xfe5f
bge0: CHIP ID 0x4101; ASIC REV 0x04; CHIP REV 0x41; PCI-E
miibus1: on bge0
brgphy0: PHY 1 on miibus1
brgphy0: OUI 0x000818, model 0x0018, rev. 0
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bge0: bpf attached
bge0: Ethernet address: 00:e0:81:5c:64:85
ioapic0: routing intpin 19 (PCI IRQ 19) to vector 54
bge0: [MPSAFE]
bge0: [ITHREAD]

-- Igor Sysoev http://sysoev.ru/en/
Re: bge hangs on recent 7.3-STABLE
On Thu, Sep 09, 2010 at 01:10:50PM -0700, Pyun YongHyeon wrote: > On Thu, Sep 09, 2010 at 02:28:26PM +0400, Igor Sysoev wrote: > > Hi, > > > > I have several hosts running FreeBSD/amd64 7.2-STABLE updated on 11.01.2010 > > and 25.02.2010. Hosts process about 10K input and 10K output packets/s > > without issues. One of them, however, is loaded more than others, so it > > processes 20K/20K packets/s. > > > > Recently, I have upgraded one host to 7.3-STABLE, 24.08.2010. > > Then bge on this host hung two times. I was able to restart it from > > console using: > > /etc/rc.d/netif restart bge0 > > > > Then I have upgraded the most loaded (20K/20K) host to 7.3-STABLE, > > 07.09.2010. > > After reboot bge hung every several seconds. I was able to restart it, > > but bge hung again after several seconds. > > > > Then I have downgraded this host to 7.3-STABLE, 14.08.2010, since there > > were several if_bge.c commits on 15.08.2010. The same hangs. > > Then I have downgraded this host to 7.3-STABLE, 17.03.2010, before > > the first if_bge.c commit after 25.02.2010. Now it runs without hangs. > > > > The hosts are amd64 dual core SMP with 4G machines. bge information: > > > > b...@pci0:4:0:0:class=0x02 card=0x165914e4 chip=0x165914e4 > > rev=0x11 hdr=0x00 > > vendor = 'Broadcom Corporation' > > device = 'NetXtreme Gigabit Ethernet PCI Express (BCM5721)' > > > > bge0: > > mem 0xfe5f-0xfe5f irq 19 at device 0.0 on pci4 > > miibus1: on bge0 > > brgphy0: PHY 1 on miibus1 > > brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, > > 1000baseT-FDX, auto > > bge0: Ethernet address: 00:e0:81:5f:6e:8a > > > > Could you show me verbose boot message(bge part only)? > Also show me the output of "pciconf -lcbv". Here is "pciconf -lcbv", I will send the "boot -v" part later. 
b...@pci0:4:0:0:	class=0x02 card=0x165914e4 chip=0x165914e4 rev=0x11 hdr=0x00
    vendor     = 'Broadcom Corporation'
    device     = 'NetXtreme Gigabit Ethernet PCI Express (BCM5721)'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 64, base 0xfe5f, size 65536, enabled
    cap 01[48] = powerspec 2  supports D0 D3  current D0
    cap 03[50] = VPD
    cap 05[58] = MSI supports 8 messages, 64 bit
    cap 10[d0] = PCI-Express 1 endpoint max data 128(128) link x1(x1)

-- Igor Sysoev http://sysoev.ru/en/
bge hangs on recent 7.3-STABLE
Hi,

I have several hosts running FreeBSD/amd64 7.2-STABLE updated on 11.01.2010 and 25.02.2010. Hosts process about 10K input and 10K output packets/s without issues. One of them, however, is loaded more than others, so it processes 20K/20K packets/s.

Recently, I have upgraded one host to 7.3-STABLE, 24.08.2010. Then bge on this host hung two times. I was able to restart it from the console using:

/etc/rc.d/netif restart bge0

Then I have upgraded the most loaded (20K/20K) host to 7.3-STABLE, 07.09.2010. After reboot bge hung every several seconds. I was able to restart it, but bge hung again after several seconds.

Then I have downgraded this host to 7.3-STABLE, 14.08.2010, since there were several if_bge.c commits on 15.08.2010. The same hangs. Then I have downgraded this host to 7.3-STABLE, 17.03.2010, before the first if_bge.c commit after 25.02.2010. Now it runs without hangs.

The hosts are amd64 dual-core SMP machines with 4G. bge information:

b...@pci0:4:0:0:class=0x02 card=0x165914e4 chip=0x165914e4 rev=0x11 hdr=0x00
vendor = 'Broadcom Corporation'
device = 'NetXtreme Gigabit Ethernet PCI Express (BCM5721)'

bge0: mem 0xfe5f-0xfe5f irq 19 at device 0.0 on pci4
miibus1: on bge0
brgphy0: PHY 1 on miibus1
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bge0: Ethernet address: 00:e0:81:5f:6e:8a

bge has 3 vlans:

bge0: flags=8943 metric 0 mtu 1500
	options=9b
	ether 00:e0:81:5f:6e:8a
	media: Ethernet autoselect (1000baseTX )
	status: active
vlan173: flags=8843 metric 0 mtu 1500
	options=3
	ether 00:e0:81:5f:6e:8a
	inet 192.168.173.101 netmask 0xff00 broadcast 192.168.173.255
	media: Ethernet autoselect (1000baseTX )
	status: active
	vlan: 173 parent interface: bge0
[ ... ]

-- Igor Sysoev http://sysoev.ru/en/
net.inet.tcp.slowstart_flightsize in 8-STABLE
It seems that net.inet.tcp.slowstart_flightsize does not work in 8-STABLE. For a long time I have used slowstart_flightsize=2 on FreeBSD 4, 6, and 7 hosts. However, FreeBSD 8 always starts with a single packet. I saw this on different versions of 8-STABLE from 8 Oct 2009 till 4 Apr 2010.

-- Igor Sysoev http://sysoev.ru/en/
Re: hw.bge.forced_collapse
On Thu, Jan 14, 2010 at 10:10:31AM -0800, Pyun YongHyeon wrote: > On Thu, Jan 14, 2010 at 07:03:33PM +0300, Igor Sysoev wrote: > > On Fri, Dec 04, 2009 at 12:22:13PM -0800, Pyun YongHyeon wrote: > > > > > On Fri, Dec 04, 2009 at 11:13:03PM +0300, Igor Sysoev wrote: > > > > On Fri, Dec 04, 2009 at 11:51:40AM -0800, Pyun YongHyeon wrote: > > > > > > > > > On Fri, Dec 04, 2009 at 10:11:14PM +0300, Igor Sysoev wrote: > > > > > > On Fri, Dec 04, 2009 at 09:32:43AM -0800, Pyun YongHyeon wrote: > > > > > > > > > > > > > On Fri, Dec 04, 2009 at 10:54:40AM +0300, Igor Sysoev wrote: > > > > > > > > I saw commit introducing hw.bge.forced_collapse loader tunable. > > > > > > > > Just intresting, why it can not be a sysctl ? > > > > > > > > > > > > > > I didn't think the sysctl variable would be frequently changed > > > > > > > in runtime except debugging driver so I took simple path. > > > > > > > > > > > > I do not think it's worth to reboot server just to look how various > > > > > > values affect on bandwidth and CPU usage, expecially in production. > > > > > > > > > > > > As I understand the change is trivial: > > > > > > > > > > > > - CTLFLAG_RD > > > > > > + CTLFLAG_RW > > > > > > > > > > > > since bge_forced_collapse is used atomically. > > > > > > > > > > > > > > > > I have no problem changing it to RW but that case I may have to > > > > > create actual sysctl node(e.g. dev.bge.0.forced_collapse) instead > > > > > of hw.bge.forced_collapse which may affect all bge(4) controllers > > > > > on system. Attached patch may be what you want. You can change the > > > > > value at any time. > > > > > > > > Thank you for the patch. Can it be installed on 8-STABLE ? > > > > > > > > > > bge(4) in HEAD has many fixes which were not MFCed to stable/8 so > > > I'm not sure that patch could be applied cleanly. But I guess you > > > can manually patch it. > > > I'll wait a couple of days for wider testing/review and commit the > > > patch. > > > > Sorry for the late response. 
> > We've tested bge.forced_collapse in December on HEAD and found that
> > values >1 froze connections with a big data amount, for example,
> > "top -Ss1" output. Connections with a small data amount, such as
> > short ssh commands, worked OK. Now I've tested modern 7.2-STABLE
> > and found that forced_collapse >1 freezes it too.
>
> Thanks for reporting! It seems I've incorrectly dropped mbuf chains
> when collapsing fails. Would you try attached patch?

BTW, it's strange that collapsing fails so often.

> Index: sys/dev/bge/if_bge.c
> ===================================================================
> --- sys/dev/bge/if_bge.c	(revision 202268)
> +++ sys/dev/bge/if_bge.c	(working copy)
> @@ -3940,11 +3940,8 @@
> 			m = m_defrag(m, M_DONTWAIT);
> 		else
> 			m = m_collapse(m, M_DONTWAIT, sc->bge_forced_collapse);
> -		if (m == NULL) {
> -			m_freem(*m_head);
> -			*m_head = NULL;
> -			return (ENOBUFS);
> -		}
> +		if (m == NULL)
> +			m = *m_head;
> 		*m_head = m;
> 	}

-- Igor Sysoev http://sysoev.ru/en/
Re: hw.bge.forced_collapse
On Thu, Jan 14, 2010 at 10:10:31AM -0800, Pyun YongHyeon wrote: > On Thu, Jan 14, 2010 at 07:03:33PM +0300, Igor Sysoev wrote: > > On Fri, Dec 04, 2009 at 12:22:13PM -0800, Pyun YongHyeon wrote: > > > > > On Fri, Dec 04, 2009 at 11:13:03PM +0300, Igor Sysoev wrote: > > > > On Fri, Dec 04, 2009 at 11:51:40AM -0800, Pyun YongHyeon wrote: > > > > > > > > > On Fri, Dec 04, 2009 at 10:11:14PM +0300, Igor Sysoev wrote: > > > > > > On Fri, Dec 04, 2009 at 09:32:43AM -0800, Pyun YongHyeon wrote: > > > > > > > > > > > > > On Fri, Dec 04, 2009 at 10:54:40AM +0300, Igor Sysoev wrote: > > > > > > > > I saw commit introducing hw.bge.forced_collapse loader tunable. > > > > > > > > Just intresting, why it can not be a sysctl ? > > > > > > > > > > > > > > I didn't think the sysctl variable would be frequently changed > > > > > > > in runtime except debugging driver so I took simple path. > > > > > > > > > > > > I do not think it's worth to reboot server just to look how various > > > > > > values affect on bandwidth and CPU usage, expecially in production. > > > > > > > > > > > > As I understand the change is trivial: > > > > > > > > > > > > - CTLFLAG_RD > > > > > > + CTLFLAG_RW > > > > > > > > > > > > since bge_forced_collapse is used atomically. > > > > > > > > > > > > > > > > I have no problem changing it to RW but that case I may have to > > > > > create actual sysctl node(e.g. dev.bge.0.forced_collapse) instead > > > > > of hw.bge.forced_collapse which may affect all bge(4) controllers > > > > > on system. Attached patch may be what you want. You can change the > > > > > value at any time. > > > > > > > > Thank you for the patch. Can it be installed on 8-STABLE ? > > > > > > > > > > bge(4) in HEAD has many fixes which were not MFCed to stable/8 so > > > I'm not sure that patch could be applied cleanly. But I guess you > > > can manually patch it. > > > I'll wait a couple of days for wider testing/review and commit the > > > patch. > > > > Sorry for the late response. 
> > We've tested bge.forced_collapse in December on HEAD and found that
> > values >1 froze connections with a big data amount, for example,
> > "top -Ss1" output. Connections with a small data amount, such as
> > short ssh commands, worked OK. Now I've tested modern 7.2-STABLE
> > and found that forced_collapse >1 freezes it too.
>
> Thanks for reporting! It seems I've incorrectly dropped mbuf chains
> when collapsing fails. Would you try attached patch?

Thank you, the patch fixes the bug.

> Index: sys/dev/bge/if_bge.c
> ===================================================================
> --- sys/dev/bge/if_bge.c	(revision 202268)
> +++ sys/dev/bge/if_bge.c	(working copy)
> @@ -3940,11 +3940,8 @@
> 			m = m_defrag(m, M_DONTWAIT);
> 		else
> 			m = m_collapse(m, M_DONTWAIT, sc->bge_forced_collapse);
> -		if (m == NULL) {
> -			m_freem(*m_head);
> -			*m_head = NULL;
> -			return (ENOBUFS);
> -		}
> +		if (m == NULL)
> +			m = *m_head;
> 		*m_head = m;
> 	}

-- Igor Sysoev http://sysoev.ru/en/
Re: hw.bge.forced_collapse
On Fri, Dec 04, 2009 at 12:22:13PM -0800, Pyun YongHyeon wrote: > On Fri, Dec 04, 2009 at 11:13:03PM +0300, Igor Sysoev wrote: > > On Fri, Dec 04, 2009 at 11:51:40AM -0800, Pyun YongHyeon wrote: > > > > > On Fri, Dec 04, 2009 at 10:11:14PM +0300, Igor Sysoev wrote: > > > > On Fri, Dec 04, 2009 at 09:32:43AM -0800, Pyun YongHyeon wrote: > > > > > > > > > On Fri, Dec 04, 2009 at 10:54:40AM +0300, Igor Sysoev wrote: > > > > > > I saw commit introducing hw.bge.forced_collapse loader tunable. > > > > > > Just intresting, why it can not be a sysctl ? > > > > > > > > > > I didn't think the sysctl variable would be frequently changed > > > > > in runtime except debugging driver so I took simple path. > > > > > > > > I do not think it's worth to reboot server just to look how various > > > > values affect on bandwidth and CPU usage, expecially in production. > > > > > > > > As I understand the change is trivial: > > > > > > > > - CTLFLAG_RD > > > > + CTLFLAG_RW > > > > > > > > since bge_forced_collapse is used atomically. > > > > > > > > > > I have no problem changing it to RW but that case I may have to > > > create actual sysctl node(e.g. dev.bge.0.forced_collapse) instead > > > of hw.bge.forced_collapse which may affect all bge(4) controllers > > > on system. Attached patch may be what you want. You can change the > > > value at any time. > > > > Thank you for the patch. Can it be installed on 8-STABLE ? > > > > bge(4) in HEAD has many fixes which were not MFCed to stable/8 so > I'm not sure that patch could be applied cleanly. But I guess you > can manually patch it. > I'll wait a couple of days for wider testing/review and commit the > patch. Sorry for the late response. We've tested bge.forced_collapse in December on HEAD and found that values >1 froze connections with big data amount, for example, "top -Ss1" output. Connection with small data amount such as short ssh commands worked OK. 
Now I've tested modern 7.2-STABLE and found that forced_collapse >1 freezes it too.

-- Igor Sysoev http://sysoev.ru/en/
Re: hw.bge.forced_collapse
On Fri, Dec 04, 2009 at 11:51:40AM -0800, Pyun YongHyeon wrote: > On Fri, Dec 04, 2009 at 10:11:14PM +0300, Igor Sysoev wrote: > > On Fri, Dec 04, 2009 at 09:32:43AM -0800, Pyun YongHyeon wrote: > > > > > On Fri, Dec 04, 2009 at 10:54:40AM +0300, Igor Sysoev wrote: > > > > I saw commit introducing hw.bge.forced_collapse loader tunable. > > > > Just interesting, why can it not be a sysctl? > > > > > > I didn't think the sysctl variable would be frequently changed > > > in runtime except debugging driver so I took simple path. > > > > I do not think it's worth rebooting a server just to see how various > > values affect bandwidth and CPU usage, especially in production. > > > > As I understand the change is trivial: > > > > - CTLFLAG_RD > > + CTLFLAG_RW > > > > since bge_forced_collapse is used atomically. > > > > I have no problem changing it to RW but in that case I may have to > create an actual sysctl node (e.g. dev.bge.0.forced_collapse) instead > of hw.bge.forced_collapse, which may affect all bge(4) controllers > on the system. Attached patch may be what you want. You can change the > value at any time.

Thank you for the patch. Can it be installed on 8-STABLE?

-- Igor Sysoev http://sysoev.ru/en/
Re: hw.bge.forced_collapse
On Fri, Dec 04, 2009 at 09:32:43AM -0800, Pyun YongHyeon wrote: > On Fri, Dec 04, 2009 at 10:54:40AM +0300, Igor Sysoev wrote: > > I saw commit introducing hw.bge.forced_collapse loader tunable. > > Just interesting, why can it not be a sysctl? > > I didn't think the sysctl variable would be frequently changed > in runtime except debugging driver so I took simple path.

I do not think it's worth rebooting a server just to see how various values affect bandwidth and CPU usage, especially in production.

As I understand, the change is trivial:

- CTLFLAG_RD
+ CTLFLAG_RW

since bge_forced_collapse is used atomically.

-- Igor Sysoev http://sysoev.ru/en/
hw.bge.forced_collapse
I saw the commit introducing the hw.bge.forced_collapse loader tunable. Just interesting: why can it not be a sysctl?

-- Igor Sysoev http://sysoev.ru/en/
Re: interface FIB
On Fri, Nov 27, 2009 at 09:12:37PM -0800, Julian Elischer wrote: > Igor Sysoev wrote: > > Currently only packets generated during encapsulation can use > > interface's FIB stored during interface creation: > > > > setfib 1 ifconfig gif0 ... > > setfib 1 ifconfig tun0 ... > > not sure if tun actually does this (in fact it shouldn't) > > but for gre and gif (and stf) these are tunnelling other things into > IP and thus it makes sense to be able to connect a routing table with > the generated envelopes.

I've got this from the 8.0 release notes:

    A packet generated on tunnel interfaces such as gif(4) and tun(4) will be encapsulated using the FIB of the process which set up the tunnel.

However, sys/net/if_tun.c really has no FIB-related changes.

> > is it possible to implement this feature for any interface: > > > > setfib 1 ifconfig vlan0 ... > > > > or > > > > ifconfig vlan0 setfib 1 ... > > these two things would mean different things. > and one of them wouldn't mean anything. > > setfib 1 ifconfig vlan0 would mean "what" exactly? > VLAN tagging is an L2/L1 operation and FIBs have no effect on this. > > as for ifconfig vlan0 setfib 1, or ifconfig em0 setfib 1 > > this will (shortly) mean that incoming packets through this interface > will by default be connected with fib 1 so that any return packets > (resets, icmp etc.) will use FIB 1 to go back to the sender.

This is exactly what I meant.

> That patch is in the works.

I'm ready to test the patch in production on 7/8-STABLE if the patch can be applied to it.

-- Igor Sysoev http://sysoev.ru/en/
interface FIB
Currently only packets generated during encapsulation can use the interface's FIB stored during interface creation:

setfib 1 ifconfig gif0 ...
setfib 1 ifconfig tun0 ...

Is it possible to implement this feature for any interface:

setfib 1 ifconfig vlan0 ...

or

ifconfig vlan0 setfib 1 ...

-- Igor Sysoev http://sysoev.ru/en/
Re: stuck TIME_WAIT sockets
On Fri, Oct 02, 2009 at 02:06:21PM -0400, Skip Ford wrote: > Igor Sysoev wrote: > > The TIME_WAIT sockets suddenly started to grow on a host running > > FreeBSD 7.2-STABLE, date=2009.09.06.23.59.59 > > Usually there are 3,000-5,000 TIME_WAIT sockets on the host. > > However, today they started to grow, have reached 110,000 sockets in an hour > > and still remain at this level. > > net.inet.tcp.msl is 3. > > The host uptime is 24 days, 21:53. > > Perhaps you need this patch? > > Author: peter > Date: Thu Aug 20 22:53:28 2009 > New Revision: 196410 > URL: http://svn.freebsd.org/changeset/base/196410 > > Log: > Fix signed comparison bug when ticks goes negative after 24 days of > uptime. This causes the tcp time_wait state code to fail to expire > sockets in timewait state. > > Approved by: re (kensmith) > > Modified: > head/sys/netinet/tcp_timewait.c

Thank you.

-- Igor Sysoev http://sysoev.ru/en/
Re: stuck TIME_WAIT sockets
On Fri, Oct 02, 2009 at 05:06:46PM +0400, Igor Sysoev wrote: > The TIME_WAIT sockets suddenly started to grow on a host running > FreeBSD 7.2-STABLE, date=2009.09.06.23.59.59 > Usually there are 3,000-5,000 TIME_WAIT sockets on the host. > However, today they started to grow, have reached 110,000 sockets in an hour > and still remain at this level. > net.inet.tcp.msl is 3. > The host uptime is 24 days, 21:53. > > I have saved a coredump and may try to help to debug the issue.

There are also 10 stuck LAST_ACK sockets.

"swi4: clock sio" is usually idle; however, if I run

netstat -an | grep TIME_WAIT | wc -l

then swi4 gets some CPU:

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
   11 root        1 171 ki31     0K    16K CPU1   1 112.0H 98.29% idle: cpu1
   12 root        1 171 ki31     0K    16K RUN    0 116.8H 94.78% idle: cpu0
   14 root        1 -32    -     0K    16K WAIT   0  13:11  1.66% swi4: clock
   26 root        1 -68    -     0K    16K WAIT   0 334:11  0.00% irq19: bge0

-- Igor Sysoev http://sysoev.ru/en/
stuck TIME_WAIT sockets
The TIME_WAIT sockets suddenly started to grow on a host running FreeBSD 7.2-STABLE, date=2009.09.06.23.59.59 Usually there are 3,000-5,000 TIME_WAIT sockets on the host. However, today they started to grow, reached 110,000 sockets in an hour and still remain at that level. net.inet.tcp.msl is 3. The host uptime is 24 days, 21:53. I have saved a coredump and may try to help to debug the issue. -- Igor Sysoev http://sysoev.ru/en/
Re: bge interrupt coalescing sysctls
On Thu, Jun 11, 2009 at 11:54:29AM +1000, Bruce Evans wrote: > On Wed, 10 Jun 2009, Igor Sysoev wrote: > > >For a long time I used Bruce Evans' patch to tune bge interrupt coalescing: > >http://lists.freebsd.org/pipermail/freebsd-net/2007-November/015956.html > >However, recent commit SVN r192478 in 7-STABLE (r192127 in HEAD) broke > >the patch. I'm not sure how to fix the conflict, and since I do not > >use dynamic tuning > > That commit looked ugly (lots of internal API changes and bloat in interrupt > handlers in many network drivers to support polling which mostly shouldn't > be supported at all and mostly doesn't use the interrupt handlers). > > >I have left only the static coalescing parameters in the patch > >and have added a loader tunable to set the number of receive descriptors and > >a read-only sysctl to read the tunable. I usually use these parameters: > > > >/boot/loader.conf: > >hw.bge.rxd=512 > > > >/etc/sysctl.conf: > >dev.bge.0.rx_coal_ticks=500 > >dev.bge.0.tx_coal_ticks=1 > >dev.bge.0.rx_max_coal_bds=64 > > These rx settings give too high a latency for me. Probably; however, I use this on a host that handles only 6,000 packets/s. > >dev.bge.0.tx_max_coal_bds=128 > ># apply the above parameters > >dev.bge.0.program_coal=1 > > > >Could anyone commit it? > > Not me, sorry. > > The patch is quite clean. If I committed then I would commit the > dynamic coalescing configuration separately anyway. So do you have any objections if someone else commits this patch? > You can probably make hw.bge.rxd a sysctl too (it would take a down/up > to get it changed, but that is already needed for too many parameters > in network drivers anyway). I should use a sysctl for the ifq length > too. This could be done at a high level for each driver. Limiting > queue lengths may be a good way to reduce cache misses, while increasing > them is sometimes good for reducing packet loss. 
Do you mean a simple command sequence (sysctl hw.bge.rxd=512; ifconfig bge0 down; ifconfig bge0 up), or a SYSCTL_ADD_PROC handler for hw.bge.rxd? -- Igor Sysoev http://sysoev.ru/en/
bge interrupt coalescing sysctls
For a long time I used Bruce Evans' patch to tune bge interrupt coalescing: http://lists.freebsd.org/pipermail/freebsd-net/2007-November/015956.html However, recent commit SVN r192478 in 7-STABLE (r192127 in HEAD) broke the patch. I'm not sure how to fix the conflict, and since I do not use dynamic tuning I have left only the static coalescing parameters in the patch and have added a loader tunable to set the number of receive descriptors and a read-only sysctl to read the tunable. I usually use these parameters: /boot/loader.conf: hw.bge.rxd=512 /etc/sysctl.conf: dev.bge.0.rx_coal_ticks=500 dev.bge.0.tx_coal_ticks=1 dev.bge.0.rx_max_coal_bds=64 dev.bge.0.tx_max_coal_bds=128 # apply the above parameters dev.bge.0.program_coal=1 Could anyone commit it? -- Igor Sysoev http://sysoev.ru/en/ --- sys/dev/bge/if_bge.c 2009-05-21 01:17:10.0 +0400 +++ sys/dev/bge/if_bge.c 2009-06-05 13:45:49.0 +0400 @@ -447,12 +447,16 @@ DRIVER_MODULE(miibus, bge, miibus_driver, miibus_devclass, 0, 0); static int bge_allow_asf = 0; +static int bge_rxd = BGE_SSLOTS; TUNABLE_INT("hw.bge.allow_asf", &bge_allow_asf); +TUNABLE_INT("hw.bge.rxd", &bge_rxd); SYSCTL_NODE(_hw, OID_AUTO, bge, CTLFLAG_RD, 0, "BGE driver parameters"); SYSCTL_INT(_hw_bge, OID_AUTO, allow_asf, CTLFLAG_RD, &bge_allow_asf, 0, "Allow ASF mode if available"); +SYSCTL_INT(_hw_bge, OID_AUTO, bge_rxd, CTLFLAG_RD, &bge_rxd, 0, + "Number of receive descriptors"); #define SPARC64_BLADE_1500_MODEL "SUNW,Sun-Blade-1500" #define SPARC64_BLADE_1500_PATH_BGE "/p...@1f,70/netw...@2" @@ -1008,21 +1012,15 @@ return (0); } -/* - * The standard receive ring has 512 entries in it. At 2K per mbuf cluster, - * that's 1MB or memory, which is a lot. For now, we fill only the first - * 256 ring entries and hope that our CPU is fast enough to keep up with - * the NIC. 
- */ static int bge_init_rx_ring_std(struct bge_softc *sc) { int i; - for (i = 0; i < BGE_SSLOTS; i++) { + for (i = 0; i < bge_rxd; i++) { if (bge_newbuf_std(sc, i, NULL) == ENOBUFS) return (ENOBUFS); - }; + } bus_dmamap_sync(sc->bge_cdata.bge_rx_std_ring_tag, sc->bge_cdata.bge_rx_std_ring_map, @@ -2383,6 +2381,52 @@ #endif static int +bge_sysctl_program_coal(SYSCTL_HANDLER_ARGS) +{ + struct bge_softc *sc; + int error, i, val; + + val = 0; + error = sysctl_handle_int(oidp, &val, 0, req); + if (error != 0 || req->newptr == NULL) + return (error); +sc = arg1; + BGE_LOCK(sc); + + /* XXX cut from bge_blockinit(): */ + + /* Disable host coalescing until we get it set up */ + CSR_WRITE_4(sc, BGE_HCC_MODE, 0x); + + /* Poll to make sure it's shut down. */ + for (i = 0; i < BGE_TIMEOUT; i++) { + if (!(CSR_READ_4(sc, BGE_HCC_MODE) & BGE_HCCMODE_ENABLE)) + break; + DELAY(10); + } + + if (i == BGE_TIMEOUT) { + device_printf(sc->bge_dev, + "host coalescing engine failed to idle\n"); + CSR_WRITE_4(sc, BGE_HCC_MODE, BGE_HCCMODE_ENABLE); + BGE_UNLOCK(sc); + return (ENXIO); + } + + /* Set up host coalescing defaults */ + CSR_WRITE_4(sc, BGE_HCC_RX_COAL_TICKS, sc->bge_rx_coal_ticks); + CSR_WRITE_4(sc, BGE_HCC_TX_COAL_TICKS, sc->bge_tx_coal_ticks); + CSR_WRITE_4(sc, BGE_HCC_RX_MAX_COAL_BDS, sc->bge_rx_max_coal_bds); + CSR_WRITE_4(sc, BGE_HCC_TX_MAX_COAL_BDS, sc->bge_tx_max_coal_bds); + + /* Turn on host coalescing state machine */ + CSR_WRITE_4(sc, BGE_HCC_MODE, BGE_HCCMODE_ENABLE); + + BGE_UNLOCK(sc); + return (0); +} + +static int bge_attach(device_t dev) { struct ifnet *ifp; @@ -4495,6 +4539,19 @@ ctx = device_get_sysctl_ctx(sc->bge_dev); children = SYSCTL_CHILDREN(device_get_sysctl_tree(sc->bge_dev)); + SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "program_coal", + CTLTYPE_INT | CTLFLAG_RW, + sc, 0, bge_sysctl_program_coal, "I", + "program bge coalescence values"); + SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "rx_coal_ticks", CTLFLAG_RW, + &sc->bge_rx_coal_ticks, 0, ""); + 
SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "tx_coal_ticks", CTLFLAG_RW, + &sc->bge_tx_coal_ticks, 0, ""); + SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "rx_max_coal_bds", CTLFLAG_RW, + &sc->bge_rx_max_coal_bds, 0, ""); + SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "tx_max_coal_bds", CTLFLAG_RW, +
Re: FIB MFC
On Thu, Jul 24, 2008 at 09:44:15AM -0700, Julian Elischer wrote: > Igor Sysoev wrote: > >On Thu, Jul 24, 2008 at 08:33:09AM -0700, Julian Elischer wrote: > > > > > >>I was thinking that it might be possible to tag a socket to accept the > >>fib of the packet coming in, but if we do this, we should decide > >>API to label a socket in this way.. > > > >I think it should be a sysctl to globally enable TCP FIB inheritance. > >An API already exists: sockopt(SO_SETFIB) for the listening socket. > > But a socket ALWAYS has a fib, even if you do nothing > because every process has a fib (usually 0) > so you need a new bit of state somewhere that means "inherit". > (I guess in the socket flags). I see. > Possibly the FIB value of -1 when applied on a socket option might > signify that behaviour. (thus save us a new sockopt). > But such a value would revert to that of the process if the socket was > not used as a listen socket. (or clear itself). -1 is a good variant. > I have some MRT enhancements in the pipeline and will include this if > I can. > > BTW could you send me the diff for ipfw(8)? > I'll compare it with the one I'm about to commit. This is exactly your already-committed 1.108.2.9. -- Igor Sysoev http://sysoev.ru/en/
Re: FIB MFC
On Thu, Jul 24, 2008 at 08:33:09AM -0700, Julian Elischer wrote: > Igor Sysoev wrote: > >Julian, thank you for FIB. I have tried it on FreeBSD-7. > > > >I've found that ipfw does not know about setfib: > >ipfw: invalid action setfib > > > > Oh I have not finished MFC.. > will finish today.. > > the svn server crashed last night .. :-/ > (or at least went very strange) while I was working on this so I > went to bed. > > > > >Therefore I've added the missing part from CURRENT. > >Then I have tried the following configuration: > > > >vlan1: 10.0.0.100 > >vlan2: 192.168.1.100 > > > >route add default 10.0.0.1 > >setfib 1 route add default 192.168.1.1 > >ipfw add setfib 1 ip from any to any in via vlan2 > > > >I expected that outgoing packets of a TCP connection established > >via vlan2 would be routed to 192.168.1.1, but this did not happen. > >The packets went to 10.0.0.1 via vlan1: > > no, while this does make sense, the fib is only used for outgoing > packets and the fib of local sockets is set by the process that opens > the socket. (either with setfib(2) or sockopt(SETFIB)) > > I was thinking that it might be possible to tag a socket to accept the > fib of the packet coming in, but if we do this, we should decide > API to label a socket in this way.. I think it should be a sysctl to globally enable TCP FIB inheritance. An API already exists: sockopt(SO_SETFIB) for the listening socket. > It is an excellent idea however, and I don't know why I didn't > do it already.. > > > > >tcp4 0 0 192.168.1.100.80 XX SYN_RCVD > >tcp4 0 0 192.168.1.100.80 XX SYN_RCVD > >tcp4 0 0 192.168.1.100.80 XX SYN_RCVD > > > >Can a TCP connection inherit the FIB from the first SYN packet or not? > > no but it is a good idea. -- Igor Sysoev http://sysoev.ru/en/
FIB MFC
Julian, thank you for FIB. I have tried it on FreeBSD-7. I've found that ipfw does not know about setfib: ipfw: invalid action setfib Therefore I've added the missing part from CURRENT. Then I have tried the following configuration: vlan1: 10.0.0.100 vlan2: 192.168.1.100 route add default 10.0.0.1 setfib 1 route add default 192.168.1.1 ipfw add setfib 1 ip from any to any in via vlan2 I expected that outgoing packets of a TCP connection established via vlan2 would be routed to 192.168.1.1, but this did not happen. The packets went to 10.0.0.1 via vlan1: tcp4 0 0 192.168.1.100.80 XX SYN_RCVD tcp4 0 0 192.168.1.100.80 XX SYN_RCVD tcp4 0 0 192.168.1.100.80 XX SYN_RCVD Can a TCP connection inherit the FIB from the first SYN packet or not? -- Igor Sysoev http://sysoev.ru/en/
Re: Multiple routing tables in action...
On Tue, Apr 29, 2008 at 12:11:03PM -0700, Julian Elischer wrote: > >Then you can export RIB entries, say > >you have 5 BGP peers and you want to export 2 or 3 or all of them into > >the 'main' routing instance you can set up a policy to add those learned > >routes into the main instance and v-v. > >Linux behaves a little bit differently as you have to make an 'ip rule' > >entry for it but it doesn't use the firewall. > > for now this code asks you to use a firewall to classify incoming > packets.. > > e.g. > 100 setfib 2 ip from any to any in recv em0 Is it possible to extend ifconfig to classify incoming packets? -- Igor Sysoev http://sysoev.ru/en/
Re: zonelimit issues...
On Mon, Apr 21, 2008 at 03:27:53PM +0400, Igor Sysoev wrote: > The problem is that FreeBSD has a small KVA space: only 2G, even on amd64 machines with 32G. > > So with > > vm.kmem_size=1G > # 64M KVA > kern.maxbcache=64M > # 4M KVA > kern.ipc.maxpipekva=4M > > > I can use something like this: > > # 256M KVA/KVM > kern.ipc.nmbjumbop=64000 > # 216M KVA/KVM > kern.ipc.nmbclusters=98304 > # 162M KVA/KVM > kern.ipc.maxsockets=163840 > # 8M KVA/KVM > net.inet.tcp.maxtcptw=163840 > # 24M KVA/KVM > kern.maxfiles=204800 Actually, on amd64 it is possible to increase KVM up to 1.8G without a boot-time panic: vm.kmem_size=1844M # 64M KVA kern.maxbcache=64M # 4M KVA kern.ipc.maxpipekva=4M Without decreasing kern.maxbcache (200M by default) and kern.ipc.maxpipekva (~40M by default) you can get only about 1.5G. So with 1.8G KVM I am able to set: # 4G phys, 2G KVA, 1.8G KVM # # 750M KVA/KVM kern.ipc.nmbjumbop=192000 # 504M KVA/KVM kern.ipc.nmbclusters=229376 # 334M KVA/KVM kern.ipc.maxsockets=204800 # 8M KVA/KVM net.inet.tcp.maxtcptw=163840 # 24M KVA/KVM kern.maxfiles=204800 Now KVA is split as: kernel code 8M, kmem_map 1844M, buffer_map 64M, pager_map 32M, exec_map 4.2M, pipe_map 4M, ??? 60M, vm.kvm_free 32M. I leave a spare 32M of free KVA (vm.kvm_free) because some map (unknown to me) after pipe_map may grow slightly. If vm.kvm_free drops to 0, the kernel will panic. -- Igor Sysoev http://sysoev.ru/en/
Re: bge loader tunables
On Tue, Apr 22, 2008 at 12:20:38AM +0400, Igor Sysoev wrote: > Finally I have tested your second (without debug stuff) patch in > a production environment (~45K in/out packets) on FreeBSD 7.0-STABLE. > I think it should be committed. > > I use my usual static settings in /etc/sysctl.conf: > > dev.bge.0.dyncoal_max_intr_freq=0 > # > dev.bge.0.rx_coal_ticks=500 > dev.bge.0.tx_coal_ticks=1 > dev.bge.0.rx_max_coal_bds=64 > dev.bge.0.tx_max_coal_bds=128 > # apply the above parameters > dev.bge.0.program_coal=0 > > and have only about 1700-1900 interrupts per second. > > The only issue was at boot time: > > dev.bge.0.dyncoal_max_intr_freq: 1 -> 0 > dev.bge.0.rx_coal_ticks: 0 -> 500 > dev.bge.0.tx_coal_ticks: 100 -> 1 > dev.bge.0.rx_max_coal_bds: 128 -> 64 > dev.bge.0.tx_max_coal_bds: 384 -> 128 > ... > bge0: flags =8843 metric 0 mtu 1500 > options=9b > ... > Local package initialization: > ... > dev.bge.0.rx_coal_ticks: 150 -> 500 > > Disabling dyncoal_max_intr_freq while bringing bge up resets rx_coal_ticks to > 150. I had to use dev.bge.0.program_coal=1 in /etc/sysctl.conf; otherwise /etc/rc.d/sysctl does not apply it at all. -- Igor Sysoev http://sysoev.ru/en/
Re: bge loader tunables
On Sat, Nov 17, 2007 at 09:13:50PM +1100, Bruce Evans wrote: > On Sat, 17 Nov 2007, Igor Sysoev wrote: > > >On Sat, Nov 17, 2007 at 08:30:58AM +1100, Bruce Evans wrote: > > > >>On Fri, 16 Nov 2007, Igor Sysoev wrote: > >> > >>>The attached patch creates the following bge loader tunables: > >> > >>I plan to commit old work to do this using sysctls. Tunables are > >>harder to use and aren't needed since changes to the defaults aren't > >>needed for booting. I also implemented dynamic tuning for rx coal > >>parameters so that the sysctls are mostly not needed. Ask for patches > >>if you want to test this extensively. > > > >Yes, I can test your patches on 6.2 and 7.0. > >Now bge sets the coalescing parameters at attach time. > >Do the sysctls allow changing them on the fly? > >How does rx dynamic tuning work? > >Can it be turned off? > > OK, the patch is enclosed at the end, in 2 versions: > - all my patches for bge (with lots of debugging cruft and half-baked > fixes for 5705+ sysctls). > - edited version with only the coalescing parameter changes. > > I haven't used it under 6.2, but have used a similar version in ~5.2, > and it should work in 6.2 except for the 5705+ sysctl fixes. > > bge actually sets parameters at init time, and it initializes whenever the > link is brought back up, so the parameters can be changed using > "ifconfig bgeN down up". Several network drivers have interrupt moderation > parameters that can be changed in this way, but it is painful to change > the link status like that, so I have a sysctl dev.bge.N.program_coal to > apply the current parameters to the hardware. The other sysctls to change > the parameters don't apply immediately, except the one for the rx tuning > max interrupt rate, since applying the changed parameters to the hardware > takes more code than a SYSCTL_INT(), and it is sometimes necessary to > change all the parameters together atomically. 
> > Dynamic tuning works by monitoring the current rx packet rate and > increasing the active rx_max_coal_bds so that the ratio <rx packet > rate> / rx_max_coal_bds is usually <= the specified max rx interrupt > rate. rx_coal_ticks is set to the constant value of the inverse of > the specified max rx interrupt rate (in ticks) on transition to dynamic > mode but IIRC is not changed when the dynamic rate is changed (not > always changing it automatically allows adjusting it independently of > the rate but is often not what is wanted). The transition has some > bias towards lower latency over too many interrupts, so that short > bursts don't increase the latency. I think this simple algorithm is > good enough provided the load (in rx packets/second) doesn't oscillate > rapidly. > > Dynamic tuning requires efficient reprogramming of at least one of the > hardware coal registers so that the tuning can respond rapidly to changes. > I have 2 methods for this: > - bge_careful_coal = 1 avoids using a potentially very long > busy-wait loop in the interrupt handler by giving up on reprogramming > the host coalescing engine (HCE) if the HCE seems to be busy. Docs > seem to require waiting for up to several milliseconds for the HCE > to stabilize, and it is not clear if it is possible for the HCE to > never stabilize because packets are streaming in. (I don't have > proper docs.) This seems to always work (the HCE is never busy) > for rx_max_coal_bds, but something near here didn't work for > changing rx_coal_ticks in an old version. > - bge_careful_coal = 0 avoids the loop by writing to the rx_max_coal_bds > register without waiting for the HCE. This seems to work too. It > isn't critical for the HCE to see the change immediately or even > for it to be seen at all (missed changes might do no more than give a > huge interrupt rate for too long), but it is important for the > change to not break the engine. > There is no sysctl for this or for some other hackish parameters. 
The > source must be edited to change this from 1 to 0. > > Dynamic tuning is turned off by setting the dynamic max interrupt > frequency to 0. Then rx_coal_ticks is reset to 150, and the active > rx_max_coal_bds is restored to the static value. Finally I have tested your second (without debug stuff) patch in a production environment (~45K in/out packets) on FreeBSD 7.0-STABLE. I think it should be committed. I use my usual static settings in /etc/sysctl.conf: dev.bge.0.dyncoal_max_intr_freq=0 # dev.bge.0.rx_coal_ticks=500 dev.bge.0.tx_coal_ticks=1 dev.bge.0.rx_max_coal_bds=64 dev.bge.0.tx_max_coal_bds=128 # apply the above parameters dev.bge.0.pro
Re: zonelimit issues...
On Mon, Apr 21, 2008 at 05:16:28PM +0900, [EMAIL PROTECTED] wrote: > At Mon, 21 Apr 2008 16:46:00 +0900, > [EMAIL PROTECTED] wrote: > > > > At Sun, 20 Apr 2008 10:32:25 +0100 (BST), > > rwatson wrote: > > > > > > > > > On Fri, 18 Apr 2008, [EMAIL PROTECTED] wrote: > > > > > > > I am wondering why this patch was never committed? > > > > > > > > http://people.freebsd.org/~delphij/misc/patch-zonelimit-workaround > > > > > > > > It does seem to address an issue I'm seeing where processes get into > > > > the > > > > zonelimit state through the use of mbufs (a high speed UDP packet > > > > receiver) > > > > but even after network pressure is reduced/removed the process never > > > > gets > > > > out of that state again. Applying the patch fixed the issue, but I'd > > > > like > > > > to have some discussion as to the general merits of the approach. > > > > > > > > Unfortunately the test that currently causes this is tied very tightly > > > > to > > > > code at work that I can't share, but I will hopefully be improving > > > > mctest to > > > > try to exhibit this behavior. > > > > > > When you take all load off the system, do mbufs and clusters get properly > > > freed back to UMA (as visible in netstat -m)? If not, continuing to bump > > > up > > > against the zonelimit would suggest an mbuf/cluster leak, in which case > > > we > > > need to track that bug. > > > > > > > This is unclear as the process that creates the issue opens 50 UDP > > multicast sockets with very large socket buffers. I am investigating > > this aspect some more. > > > > OK, yes, the clusters etc. go back to normal when the incoming > pressure is released. I do not believe we have a cluster/mbuf leak. There is no cluster/mbuf leak. The problem is that FreeBSD has a small KVA space: only 2G, even on amd64 machines with 32G. 
So with vm.kmem_size=1G # 64M KVA kern.maxbcache=64M # 4M KVA kern.ipc.maxpipekva=4M I can use something like this: # 256M KVA/KVM kern.ipc.nmbjumbop=64000 # 216M KVA/KVM kern.ipc.nmbclusters=98304 # 162M KVA/KVM kern.ipc.maxsockets=163840 # 8M KVA/KVM net.inet.tcp.maxtcptw=163840 # 24M KVA/KVM kern.maxfiles=204800 -- Igor Sysoev http://sysoev.ru/en/
Re: bge loader tunables
On Sat, Nov 17, 2007 at 08:30:58AM +1100, Bruce Evans wrote: > On Fri, 16 Nov 2007, Igor Sysoev wrote: > > >The attached patch creates the following bge loader tunables: > > I plan to commit old work to do this using sysctls. Tunables are > harder to use and aren't needed since changes to the defaults aren't > needed for booting. I also implemented dynamic tuning for rx coal > parameters so that the sysctls are mostly not needed. Ask for patches > if you want to test this extensively. Yes, I can test your patches on 6.2 and 7.0. Now bge sets the coalescing parameters at attach time. Do the sysctls allow changing them on the fly? How does rx dynamic tuning work? Can it be turned off? > >hw.bge.rxd=512 > > > >Number of standard receive descriptors allocated by the driver. > >The default value is 256. The maximum value is 512. > > I always use 512 for this. The corresponding value for jumbo buffers > is hard-coded (JSLOTS exists to tune the value at config time, like > SSLOTS does for this, but is no longer used). Only machines with a > small amount of memory should care about the wastage from always > allocating the max number of descriptors. I agree: the default jumbo rx ring takes 256*9216=2.3M, while the maximum standard rx ring takes 512*2048=1M; nevertheless, it is limited to 256*2048=512K. > >hw.bge.rx_int_delay=500 > > > >This value delays the generation of receive interrupts in microseconds. > >The default value is 150 microseconds. > > This is a good default. I normally use 100 (goes with dynamic tuning to > limit the rx interrupt rate to 10 kHz). > > >hw.bge.tx_int_delay=500 > > > >This value delays the generation of transmit interrupts in microseconds. > >The default value is 150 microseconds. > > I use 1 second. Infinity works right, except it wastes mbufs when the > tx is idle for a long time. 
It seems 1 second is good for me: I use sendfile() and a lot of mbuf clusters: kern.ipc.nmbclusters=196608 > >hw.bge.rx_coal_desc=64 > > > >This value delays the generation of receive interrupts until the specified > >number of packets has been received. The default value is 10. > > 64 is a good default. 10 is a bad default (it optimizes too much for > latency at a cost of efficiency to be good). I use 1 when optimizing > for latency. Dynamic tuning sets this to a value suitable for limiting > the rx interrupt rate to a specified frequency (10 kHz is a good limit). > > >hw.bge.tx_coal_desc=128 > > > >This value delays the generation of transmit interrupts until the specified > >number of packets has been transmitted. The default value is 10. > > 128 is a good default. I use 384. There are few latency issues here, so > the default of 10 mainly costs efficiency. Does 384 not delay tx if there is a shortage of free tx descriptors? -- Igor Sysoev http://sysoev.ru/en/
bge loader tunables
The attached patch creates the following bge loader tunables: hw.bge.rxd=512 Number of standard receive descriptors allocated by the driver. The default value is 256. The maximum value is 512. hw.bge.rx_int_delay=500 This value delays the generation of receive interrupts in microseconds. The default value is 150 microseconds. hw.bge.tx_int_delay=500 This value delays the generation of transmit interrupts in microseconds. The default value is 150 microseconds. hw.bge.rx_coal_desc=64 This value delays the generation of receive interrupts until the specified number of packets has been received. The default value is 10. hw.bge.tx_coal_desc=128 This value delays the generation of transmit interrupts until the specified number of packets has been transmitted. The default value is 10. -- Igor Sysoev http://sysoev.ru/en/ --- sys/dev/bge/if_bge.c 2007-09-30 15:05:14.0 +0400 +++ sys/dev/bge/if_bge.c 2007-11-15 23:01:57.0 +0300 @@ -426,8 +426,18 @@ DRIVER_MODULE(miibus, bge, miibus_driver, miibus_devclass, 0, 0); static int bge_allow_asf = 0; +static int bge_rxd = BGE_SSLOTS; +static int bge_rx_coal_ticks = 150; +static int bge_tx_coal_ticks = 150; +static int bge_rx_max_coal_bds = 10; +static int bge_tx_max_coal_bds = 10; TUNABLE_INT("hw.bge.allow_asf", &bge_allow_asf); +TUNABLE_INT("hw.bge.rxd", &bge_rxd); +TUNABLE_INT("hw.bge.rx_int_delay", &bge_rx_coal_ticks); +TUNABLE_INT("hw.bge.tx_int_delay", &bge_tx_coal_ticks); +TUNABLE_INT("hw.bge.rx_coal_desc", &bge_rx_max_coal_bds); +TUNABLE_INT("hw.bge.tx_coal_desc", &bge_tx_max_coal_bds); SYSCTL_NODE(_hw, OID_AUTO, bge, CTLFLAG_RD, 0, "BGE driver parameters"); SYSCTL_INT(_hw_bge, OID_AUTO, allow_asf, CTLFLAG_RD, &bge_allow_asf, 0, @@ -877,10 +887,10 @@ { int i; - for (i = 0; i < BGE_SSLOTS; i++) { + for (i = 0; i < bge_rxd; i++) { if (bge_newbuf_std(sc, i, NULL) == ENOBUFS) return (ENOBUFS); - }; + } bus_dmamap_sync(sc->bge_cdata.bge_rx_std_ring_tag, sc->bge_cdata.bge_rx_std_ring_map, @@ -2453,10 +2463,10 @@ /* Set default tuneable values. 
*/ sc->bge_stat_ticks = BGE_TICKS_PER_SEC; - sc->bge_rx_coal_ticks = 150; - sc->bge_tx_coal_ticks = 150; - sc->bge_rx_max_coal_bds = 10; - sc->bge_tx_max_coal_bds = 10; + sc->bge_rx_coal_ticks = bge_rx_coal_ticks; + sc->bge_tx_coal_ticks = bge_tx_coal_ticks; + sc->bge_rx_max_coal_bds = bge_rx_max_coal_bds; + sc->bge_tx_max_coal_bds = bge_tx_max_coal_bds; /* Set up ifnet structure */ ifp = sc->bge_ifp = if_alloc(IFT_ETHER);
setup_loopback() in /etc/rc.firewall
After rev. 1.49 of src/etc/rc.firewall, setup_loopback() is called for any firewall type, including a custom firewall defined by filename. I think setup_loopback() should be called only for the predefined firewall types. -- Igor Sysoev http://sysoev.ru/en/
Re: Add socket related statistics to netstat(1)?
On Wed, Aug 29, 2007 at 02:39:57PM +0100, Robert Watson wrote: > > On Wed, 29 Aug 2007, Igor Sysoev wrote: > > >On Wed, Aug 29, 2007 at 02:48:57PM +0800, LI Xin wrote: > > > >>Here is a proof-of-concept patch that adds sockets related statistics to > >>netstat(1)'s -m option, which could make SA's life easier. Inspired by a > >>local user's suggestion. > >> > >>Comments? > > > >I think socket info should be grouped together: > > The netstat -m output is getting quite cluttered these days, isn't it. I > wonder if we should be laying it out a bit more consistently, perhaps > something like: > > current cache total max > mbufs 2407 1058 3465 - > mbuf clusters 1117 797 1914 98304 > mbufs + clusters 1117 90 - - > 4k jumbo clusters 761 417 1178 0 > ... > > It's less compact but possibly quite a bit more readable... I agree, it's much better; however, someone may argue that it will break statistics scripts. Maybe we should use another switch. > >2407/1058/3465 mbufs in use (current/cache/total) > >1117/797/1914/98304 mbuf clusters in use (current/cache/total/max) > >1117/90 mbuf+clusters out of packet secondary zone in use (current/cache) > >761/417/1178/0 4k (page size) jumbo clusters in use > >(current/cache/total/max) > >0/0/0/0 9k jumbo clusters in use (current/cache/total/max) > >0/0/0/0 16k jumbo clusters in use (current/cache/total/max) > >5879K/3526K/9406K bytes allocated to network (current/cache/total) > >0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters) > >0/0/0 requests for jumbo clusters denied (4k/9k/16k) > >15333/15537/30870/204800 socket UMA in use (current/cache/total/max) > >5929K bytes allocated to socket > >0 request for socket UMA denied > >104/264/6656 sfbufs in use (current/peak/max) > >0 requests for sfbufs denied > >0 requests for sfbufs delayed > >135834 requests for I/O initiated by sendfile > >0 calls to protocol drain routines > > > >Second, I think the socket memory calculation should include > >tcpcb, udpcb, inpcb, unpcb and probably tcptw items. -- Igor Sysoev http://sysoev.ru/en/
Re: Add socket related statistics to netstat(1)?
On Wed, Aug 29, 2007 at 02:48:57PM +0800, LI Xin wrote: > Here is a proof-of-concept patch that adds sockets related statistics to > netstat(1)'s -m option, which could make SA's life easier. Inspired by > a local user's suggestion. > > Comments? I think socket info should be grouped together: 2407/1058/3465 mbufs in use (current/cache/total) 1117/797/1914/98304 mbuf clusters in use (current/cache/total/max) 1117/90 mbuf+clusters out of packet secondary zone in use (current/cache) 761/417/1178/0 4k (page size) jumbo clusters in use (current/cache/total/max) 0/0/0/0 9k jumbo clusters in use (current/cache/total/max) 0/0/0/0 16k jumbo clusters in use (current/cache/total/max) 5879K/3526K/9406K bytes allocated to network (current/cache/total) 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters) 0/0/0 requests for jumbo clusters denied (4k/9k/16k) 15333/15537/30870/204800 socket UMA in use (current/cache/total/max) 5929K bytes allocated to socket 0 request for socket UMA denied 104/264/6656 sfbufs in use (current/peak/max) 0 requests for sfbufs denied 0 requests for sfbufs delayed 135834 requests for I/O initiated by sendfile 0 calls to protocol drain routines Second, I think the socket memory calculation should include tcpcb, udpcb, inpcb, unpcb and probably tcptw items. -- Igor Sysoev http://sysoev.ru/en/
Re: maximum number of outgoing connections
On Mon, Aug 20, 2007 at 10:30:12PM +0400, Igor Sysoev wrote: > On Mon, Aug 20, 2007 at 09:53:55AM -0700, John-Mark Gurney wrote: > > > Igor Sysoev wrote this message on Mon, Aug 20, 2007 at 19:11 +0400: > > > It seems that FreeBSD can not make more than > > > > > > net.inet.ip.portrange.last - net.inet.ip.portrange.first > > > > > > simultaneous outgoing connections, i.e., no more than about 64k. > > > > > > If I made ~64000 connections 127.0.0.1: > 127.0.0.1:80, then > > > connect() to an external address returns EADDRNOTAVAIL. > > > > Isn't this more of a limitation of TCP/IP than FreeBSD? Because you > > need to treat the srcip/srcport/dstip/dstport as a unique value, and > > in your test, you are only changing one of the four... Have you tried > > running a second web server on port 8080 to see if you can connect > > another ~64000 connections to that port too? > > No, the TCP/IP limitation applies to 127.0.0.1: <> 127.0.0.1:80, > but FreeBSD limits all outgoing connections to the port range, i.e.
>
> local part          remote part
> 127.0.0.1:5000   <> 127.0.0.1:80
> 192.168.1.1:5000 <> 10.0.0.1:25
>
> can not exist simultaneously, if both connections were started from > the local host.

To be exact: if connect() was called on an unbound socket.

-- Igor Sysoev http://sysoev.ru/en/
Re: maximum number of outgoing connections
On Mon, Aug 20, 2007 at 09:53:55AM -0700, John-Mark Gurney wrote: > Igor Sysoev wrote this message on Mon, Aug 20, 2007 at 19:11 +0400: > > It seems that FreeBSD can not make more than > > > > net.inet.ip.portrange.last - net.inet.ip.portrange.first > > > > simultaneous outgoing connections, i.e., no more than about 64k. > > > > If I made ~64000 connections 127.0.0.1: > 127.0.0.1:80, then > > connect() to an external address returns EADDRNOTAVAIL. > > Isn't this more of a limitation of TCP/IP than FreeBSD? Because you > need to treat the srcip/srcport/dstip/dstport as a unique value, and > in your test, you are only changing one of the four... Have you tried > running a second web server on port 8080 to see if you can connect > another ~64000 connections to that port too?

No, the TCP/IP limitation applies to 127.0.0.1: <> 127.0.0.1:80, but FreeBSD limits all outgoing connections to the port range, i.e.

local part          remote part
127.0.0.1:5000   <> 127.0.0.1:80
192.168.1.1:5000 <> 10.0.0.1:25

can not exist simultaneously, if both connections were started from the local host.

I can not write a simple test-case program, but I can offer a simple setup:

cd /usr/ports/www/nginx && make install

Create a simple nginx.conf:

events {
    worker_connections  2;
}

http {
    server {
        listen       8080;
        server_name  test;

        location = /loop {
            proxy_pass http://127.0.0.1:8080;
            error_page 502 = /yahoo;
        }

        location = /yahoo {
            proxy_pass http://www.yahoo.com;
        }
    }
}

Set

sysctl net.inet.ip.portrange.randomized=0
sysctl net.inet.ip.portrange.first=1024
sysctl net.inet.ip.portrange.last=5000

to see the case with the default small number of files, sockets, etc., and run as root:

/usr/local/sbin/nginx -c ./nginx.conf

then request http://host:8080/loop in a browser.
nginx will cycle to itself, then after the first error

2007/08/20 22:05:16 [crit] 29669#0: *94165 connect() to 127.0.0.1:8080 failed (49: Can't assign requested address) while connecting to upstream, client: 127.0.0.1, server: test, URL: "/loop", upstream: "http://127.0.0.1:8080/loop", host: "127.0.0.1:8080"

you will see the second error:

2007/08/20 22:05:16 [crit] 29669#0: *94165 connect() to 87.248.113.14:80 failed (49: Can't assign requested address) while connecting to upstream, client: 127.0.0.1, server: test, URL: "/loop", upstream: "http://87.248.113.14:80/loop", host: "127.0.0.1:8080"

If you think it may be an nginx fault, run this under ktrace/truss and look at the syscalls.

-- Igor Sysoev http://sysoev.ru/en/
Re: maximum number of outgoing connections
On Mon, Aug 20, 2007 at 05:19:14PM +0100, Tom Judge wrote: > Igor Sysoev wrote: > >It seems that FreeBSD can not make more than > > > >net.inet.ip.portrange.last - net.inet.ip.portrange.first > > > >simultaneous outgoing connections, i.e., no more than about 64k. > > > >If I made ~64000 connections 127.0.0.1: > 127.0.0.1:80, then > >connect() to an external address returns EADDRNOTAVAIL. > > > >net.inet.ip.portrange.randomized is 0. > > > >sockets, etc. are enough:
> >
> >ITEM     SIZE  LIMIT   USED    FREE    REQUESTS   FAILURES
> >socket:  356,  204809, 13915,  146443, 148189452, 0
> >inpcb:   180,  204820, 20375,  137277, 147631805, 0
> >tcpcb:   464,  204800, 13882,  142102, 147631805, 0
> >tcptw:   48,   41028,  6493,   11213,  29804665,  0
> >
> >I saw it on 6.2-STABLE.
>
> In an ideal world (not sure if this is quite correct for FreeBSD) TCP > connections are tracked with a pair of tuples source-addr:src-port -> > dst-addr:dst-port.
>
> As you're always connecting to the same destination service 127.0.0.1:80 > and always from the same source IP 127.0.0.1, you only have one > variable left to change, the source port. If you were to use the whole > of the port range minus the reserved ports you would only > ever be able to make 64512 simultaneous connections. In order to make > more connections the first thing that you may want to start changing is > the source IP. If you added a second IP to your lo0 interface (say > 127.0.0.2) and used a round-robin approach to making your outbound > connections then you could make around 129k outbound connections.

Connections to 127.0.0.1 were via lo0, external connections are via bge0.

> I am not sure if there are any other constraints that need to be taken > into account such as the maximum number of sockets, RAM etc

No, there are no constraints in memory, sockets, mbufs, clusters, etc. If there's a constraint in memory, then FreeBSD simply panics. If there's a constraint in mbuf clusters, then the process gets stuck in the zonelimit state forever.
I suspect that the local address in in_pcbbind_setup() is 0.0.0.0, so there is a 64K limit.

-- Igor Sysoev http://sysoev.ru/en/
maximum number of outgoing connections
It seems that FreeBSD can not make more than

net.inet.ip.portrange.last - net.inet.ip.portrange.first

simultaneous outgoing connections, i.e., no more than about 64k.

If I made ~64000 connections 127.0.0.1: > 127.0.0.1:80, then connect() to an external address returns EADDRNOTAVAIL.

net.inet.ip.portrange.randomized is 0.

sockets, etc. are enough:

ITEM     SIZE  LIMIT   USED    FREE    REQUESTS   FAILURES
socket:  356,  204809, 13915,  146443, 148189452, 0
inpcb:   180,  204820, 20375,  137277, 147631805, 0
tcpcb:   464,  204800, 13882,  142102, 147631805, 0
tcptw:   48,   41028,  6493,   11213,  29804665,  0

I saw it on 6.2-STABLE.

-- Igor Sysoev http://sysoev.ru/en/
Re: syncookie in 6.x and 7.x
On Sun, Aug 19, 2007 at 04:42:51AM -0500, Mike Silbersack wrote: > On Thu, 16 Aug 2007, Igor Sysoev wrote: > > >I have looked at the sources and found that in early versions the sent counter > >was simply not incremented at all. The patch attached. > > The patch looks ready to commit to me. Do you want me to commit it, or do > you have another committer lined up?

Feel free to commit.

> >After the patch has been applied I have found that 6 always sends > >syncookies too, however, 6 unlike 7 never receives them. Why ? > > Have you tried patching 6 so that the syncache is non-functional and > forced it to rely on syncookies? Last I checked (which was a long time > ago), syncookies worked on 6. Adding a sysctl like 7's > net.inet.tcp.syncookies_only to 6 might not be a bad idea, as long as it's > behind #ifdef DIAGNOSTIC or INVARIANTS.

No, I have not tried.

> The question you may really be asking is: Why does 7 *think* that it is > receiving syncookies all the time? :) > > I haven't tried to answer that question yet.
I have found two 4.8's:

17460166 syncache entries added 106312 retransmitted 90435 dupsyn 0 dropped 17424177 completed 465 bucket overflow 0 cache overflow 21526 reset 13725 stale 0 aborted 0 badack 279 unreach 0 zone failures 0 cookies sent 6 cookies received

1671768 syncache entries added 63163 retransmitted 37566 dupsyn 0 dropped 1645430 completed 248 bucket overflow 0 cache overflow 13144 reset 12888 stale 0 aborted 0 badack 174 unreach 0 zone failures 0 cookies sent 116 cookies received

and 4.11's:

5643772 syncache entries added 45993 retransmitted 41452 dupsyn 0 dropped 5630013 completed 298 bucket overflow 0 cache overflow 7374 reset 6030 stale 0 aborted 0 badack 93 unreach 0 zone failures 0 cookies sent 36 cookies received

141791272 syncache entries added 280354 retransmitted 273529 dupsyn 0 dropped 141703800 completed 206 bucket overflow 0 cache overflow 9847 reset 35570 stale 36034 aborted 0 badack 5854 unreach 0 zone failures 0 cookies sent 40 cookies received

I have found one 6.1-PRERELEASE with 298 days of uptime:

2672792190 syncache entries added 83640383 retransmitted 77727918 dupsyn 282 dropped 2645872801 completed 0 bucket overflow 0 cache overflow 10974940 reset 15657014 stale 91 aborted 52 badack 287259 unreach 0 zone failures 0 cookies sent 8 cookies received

The 4.x hosts have uptimes from a week to a month. On other 6.x hosts with small uptime I do not see received cookies. And I have no 5.x at all.

Anyway, 7 receives many more cookies - here are the statistics from 3 days of uptime:

52175610 syncache entries added 2092809 retransmitted 2021384 dupsyn 0 dropped 51681903 completed 0 bucket overflow 0 cache overflow 181311 reset 258220 stale 4 aborted 0 badack 18384 unreach 0 zone failures 52175610 cookies sent 16238 cookies received

I have found that in 7 received cookies correlate with unreach.

-- Igor Sysoev http://sysoev.ru/en/
syncookie in 6.x and 7.x
During testing of 7.0-CURRENT I have found that it always sends syncookies, while on early FreeBSD versions "netstat -s -p tcp" always shows:

0 cookies sent
0 cookies received

I have looked at the sources and found that in early versions the sent counter was simply not incremented at all. The patch is attached.

After the patch has been applied I have found that 6 always sends syncookies too; however, 6 unlike 7 never receives them. Why?

Here is 6 statistics:

1046714 syncache entries added 28395 retransmitted 32879 dupsyn 0 dropped 1038153 completed 0 bucket overflow 0 cache overflow 4201 reset 3972 stale 0 aborted 0 badack 254 unreach 0 zone failures 1046714 cookies sent 0 cookies received

Here is 7 statistics:

76018 syncache entries added 2536 retransmitted 2574 dupsyn 0 dropped 75114 completed 0 bucket overflow 0 cache overflow 456 reset 267 stale 0 aborted 0 badack 20 unreach 0 zone failures 76018 cookies sent 24 cookies received

-- Igor Sysoev http://sysoev.ru/en/

--- sys/netinet/tcp_syncache.c	2006-02-16 04:06:22.0 +0300
+++ sys/netinet/tcp_syncache.c	2007-08-15 13:55:25.0 +0400
@@ -1323,6 +1323,7 @@
 	MD5Final((u_char *)&md5_buffer, &syn_ctx);
 	data ^= (md5_buffer[0] & ~SYNCOOKIE_WNDMASK);
 	*flowid = md5_buffer[1];
+	tcpstat.tcps_sc_sendcookie++;
 	return (data);
 }
Re: Improved TCP syncookie implementation
On Thu, 14 Sep 2006, Ruslan Ermilov wrote:

On Wed, Sep 13, 2006 at 10:31:43PM +0200, Andre Oppermann wrote: Igor Sysoev wrote: Well, suppose a protocol similar to SSH or SMTP: 1) the client calls connect(), it sends SYN; 2) the server receives SYN and sends SYN/ACK with cookie; 3) the client receives SYN/ACK and sends ACK; 4) the client returns successfully from connect() and calls read(); 5) the ACK is lost; 6) the server does not know about this connection, so the application can not accept() it, and it can not send() a HELO message; 7) the client gets ETIMEDOUT from read(). Where in this sequence may the client retransmit its ACK? Never.

You're correct. There is no data that would cause a retransmit if the application is waiting for a server prompt first. I shouldn't write wrong explanations when I'm tired, hungry and in between two tasks. ;) This problem is the reason why we don't switch entirely to syncookies and still keep the syncache as is.

Perhaps it would be a good idea to remove net.inet.tcp.syncookies_only then? In any case, please don't forget to update the syncache(4) manpage to reflect your changes, and if you decide not to remove this sysctl, please add a warning of its potential to break a protocol.

I think that enabling syncookies-only not globally, but on a per-port basis, say, for HTTP, would be helpful. Setting it for other protocols, e.g., SSH, rsync, SMTP, IMAP, POP3, may break them.

Igor Sysoev http://sysoev.ru/en/
Re: Improved TCP syncookie implementation
On Wed, 13 Sep 2006, Andre Oppermann wrote:

Igor Sysoev wrote: On Sun, 3 Sep 2006, Andre Oppermann wrote: I've pretty much rewritten our implementation of TCP syncookies to get rid of some locking in the TCP syncache and to improve their functionality. The RFC1323 timestamp option is used to carry the full TCP SYN+SYN/ACK optional feature information. This means that a FreeBSD host may run with syncookies only and not degrade TCP connections made through it. All important TCP connection setup negotiated options are preserved (send/receive window scaling, SACK, MSS) without storing any state on the host during the SYN-SYN/ACK phase. As a nice side effect the timestamps we respond with are randomized instead of directly using ticks (which reveals our uptime).

As I understand, the syncache is used to retransmit SYN/ACK. What would happen if 1) a client sent SYN, 2) we sent SYN/ACK with a cookie, 3) the client sent ACK, but the ACK was lost?

If the client sent ACK it will retry again after the normal retransmit timeout.

Well, suppose a protocol similar to SSH or SMTP:

1) the client calls connect(), it sends SYN;
2) the server receives SYN and sends SYN/ACK with cookie;
3) the client receives SYN/ACK and sends ACK;
4) the client returns successfully from connect() and calls read();
5) the ACK is lost;
6) the server does not know about this connection, so the application can not accept() it, and it can not send() a HELO message;
7) the client gets ETIMEDOUT from read().

Where in this sequence may the client retransmit its ACK?

If our SYN-ACK back to the client is lost we won't resend it with syncookies. The client then has to try again, which it does after the SYN retransmit timeout.

Yes.

Igor Sysoev http://sysoev.ru/en/
Re: Improved TCP syncookie implementation
On Sun, 3 Sep 2006, Andre Oppermann wrote:

I've pretty much rewritten our implementation of TCP syncookies to get rid of some locking in the TCP syncache and to improve their functionality. The RFC1323 timestamp option is used to carry the full TCP SYN+SYN/ACK optional feature information. This means that a FreeBSD host may run with syncookies only and not degrade TCP connections made through it. All important TCP connection setup negotiated options are preserved (send/receive window scaling, SACK, MSS) without storing any state on the host during the SYN-SYN/ACK phase. As a nice side effect the timestamps we respond with are randomized instead of directly using ticks (which reveals our uptime).

As I understand, the syncache is used to retransmit SYN/ACK. What would happen if 1) a client sent SYN, 2) we sent SYN/ACK with a cookie, 3) the client sent ACK, but the ACK was lost? I suppose the client will see a timed-out error.

Igor Sysoev http://sysoev.ru/en/
Re: strange timeout error returned by kevent() in 6.0
On Tue, 6 Dec 2005, John-Mark Gurney wrote:

Igor Sysoev wrote this message on Thu, Sep 01, 2005 at 18:26 +0400: On Thu, 1 Sep 2005, Igor Sysoev wrote: I found strange timeout errors returned by kevent() in 6.0 using my http server named nginx. The nginx's run on three machines: two 4.10-RELEASE and one 6.0-BETA3. All machines serve the same content (a simple cluster) and each handles about 200 requests/second. On 6.0 sometimes (2 or 3 times per hour) in the daytime kevent() returns EV_EOF in flags and ETIMEDOUT in fflags, nevertheless: 1) nginx does not set any kernel timeout for sockets; 2) the total request time for such failed requests is small, 30 or so seconds. I have changed the code to ignore the ETIMEDOUT error returned by kevent() and found that a subsequent sendfile() returned ENOTCONN. By the way, why may sendfile() return ENOTCONN? I saw this error code on 4.x too.

The reason that you are seeing ETIMEDOUT/ENOTCONN is that the connection probably ETIMEDOUT (aka timed out)... and so is ENOTCONN (no longer connected).. Can you also do a read or a write to the socket successfully?

At least recv() returns ETIMEDOUT. I could not test write() right now.

And sendfile(2) says: ERRORS [...] [ENOTCONN] The s argument points to an unconnected socket. And a glance at tcp(4) says: ERRORS [...] [ETIMEDOUT] when a connection was dropped due to excessive retransmissions; There's the answer...

Yes, it seems that ETIMEDOUT is a retransmission failure. I've seen it in an experiment. The strangeness is that I did not see this error on 4.10 - only on 6.0 and recently on 4.11. Maybe I will upgrade the cluster machine from 4.10 to 4.11 to see changes.

Igor Sysoev http://sysoev.ru/en/
Re: strange timeout error returned by kevent() in 6.0
On Thu, 1 Dec 2005, Igor Sysoev wrote: On Thu, 1 Sep 2005, Igor Sysoev wrote: On Thu, 1 Sep 2005, Igor Sysoev wrote: I found strange timeout errors returned by kevent() in 6.0 using my http server named nginx. The nginx's run on three machines: two 4.10-RELEASE and one 6.0-BETA3. All machines serve the same content (a simple cluster) and each handles about 200 requests/second. On 6.0 sometimes (2 or 3 times per hour) in the daytime kevent() returns EV_EOF in flags and ETIMEDOUT in fflags, nevertheless: 1) nginx does not set any kernel timeout for sockets; 2) the total request time for such failed requests is small, 30 or so seconds. I have changed the code to ignore the ETIMEDOUT error returned by kevent() and found that a subsequent sendfile() returned ENOTCONN. By the way, why may sendfile() return ENOTCONN? I saw this error code on 4.x too. Recently I've found that kevent() in FreeBSD 5.4 may wrongly return ETIMEDOUT too. Also I've found that recv() on FreeBSD 6.0 may return a wrong ETIMEDOUT error for a socket that has no kernel timeout. It seems this ETIMEDOUT error masks another error.

It seems that this ETIMEDOUT is caused by a retransmit failure, when data were retransmitted 12 times with a backoff timeout. The whole timeout is small, 30-50 seconds, because the initial RTO is very small: 5-10 ms.

Igor Sysoev http://sysoev.ru/en/
Re: strange timeout error returned by kevent() in 6.0
On Thu, 1 Sep 2005, Igor Sysoev wrote: On Thu, 1 Sep 2005, Igor Sysoev wrote: I found strange timeout errors returned by kevent() in 6.0 using my http server named nginx. The nginx's run on three machines: two 4.10-RELEASE and one 6.0-BETA3. All machines serve the same content (a simple cluster) and each handles about 200 requests/second. On 6.0 sometimes (2 or 3 times per hour) in the daytime kevent() returns EV_EOF in flags and ETIMEDOUT in fflags, nevertheless: 1) nginx does not set any kernel timeout for sockets; 2) the total request time for such failed requests is small, 30 or so seconds. I have changed the code to ignore the ETIMEDOUT error returned by kevent() and found that a subsequent sendfile() returned ENOTCONN. By the way, why may sendfile() return ENOTCONN? I saw this error code on 4.x too.

Recently I've found that kevent() in FreeBSD 5.4 may wrongly return ETIMEDOUT too. Also I've found that recv() on FreeBSD 6.0 may return a wrong ETIMEDOUT error for a socket that has no kernel timeout. It seems this ETIMEDOUT error masks another error.

Igor Sysoev http://sysoev.ru/en/
Re: strange timeout error returned by kevent() in 6.0
On Thu, 1 Sep 2005, Igor Sysoev wrote: I found strange timeout errors returned by kevent() in 6.0 using my http server named nginx. The nginx's run on three machines: two 4.10-RELEASE and one 6.0-BETA3. All machines serve the same content (a simple cluster) and each handles about 200 requests/second. On 6.0 sometimes (2 or 3 times per hour) in the daytime kevent() returns EV_EOF in flags and ETIMEDOUT in fflags, nevertheless: 1) nginx does not set any kernel timeout for sockets; 2) the total request time for such failed requests is small, 30 or so seconds.

I have changed the code to ignore the ETIMEDOUT error returned by kevent() and found that a subsequent sendfile() returned ENOTCONN. By the way, why may sendfile() return ENOTCONN? I saw this error code on 4.x too.

Igor Sysoev http://sysoev.ru/en/
strange timeout error returned by kevent() in 6.0
I found strange timeout errors returned by kevent() in 6.0 using my http server named nginx. The nginx's run on three machines: two 4.10-RELEASE and one 6.0-BETA3. All machines serve the same content (a simple cluster) and each handles about 200 requests/second.

On 6.0 sometimes (2 or 3 times per hour) in the daytime kevent() returns EV_EOF in flags and ETIMEDOUT in fflags, nevertheless: 1) nginx does not set any kernel timeout for sockets; 2) the total request time for such failed requests is small, 30 or so seconds.

Igor Sysoev http://sysoev.ru/en/
setsockopt() can not remove the accept filter
Hi,

man setsockopt(2) states that "passing in an optval of NULL will remove the filter"; however, setsockopt() always returns EINVAL in this case, because do_setopt_accept_filter() removes the filter if sopt == NULL, but not if sopt->val == NULL. The fix is easy:

-		if (sopt == NULL) {
+		if (sopt == NULL || sopt->val == NULL) {

By the way, would it be easy to add a timeout for the dataready and httpready filters? Now stale connections may live for a long time.

Igor Sysoev http://sysoev.ru/en/
Re: Very strange kevent problem possibly to do with vinum
On Wed, 8 Dec 2004, Kevin Day wrote: > I have a really really strange kevent problem(i think anyway) that has > really stumped me. > > Here's the scenario: > > Three mostly identical servers running 5.2.1 or 5.3 (problem exists on > both). All three running thttpd sending out large files to thousands of > clients. Thttpd internally uses kqueue/kevent and sendfile to send > files rather quickly. > > All three have the same configuration, are getting approximately the > same numbers of requests, and are sending approximately the same files. > (I can swap IP addresses between the servers to confirm that the > request distribution stays the same between the servers) > > Server #3 is able to send 400mbps or more of traffic through without > breaking a sweat. Thttpd is either in "RUN", "biord" "sfbufa" or > "*Giant" when I watch it in top, and I still have 80-90% idle time. > > Servers #1 and #2 seem to top out around 80mbps, and are constantly in > "RUN" or "CPUx" states. I don't get any errors anywhere, but they just > aren't capable of going any faster. > > Looking at ktrace on thttpd on all three servers, I see that server 3 > calls kevent, and gets 20-100 sockets in response back, that each get > serviced. Servers 1 and 2 never seem to get more than 1 socket back > from kevent. Even if the event is just that the socket was > disconnected, nothing needs to be done on it, and kevent can be called > again immediately, there's only 1 socket returned next time. I ran > ktrace on thttpd for more than 15 minutes and produced a humongous > ktrace file, and there were only a handful of times that kevent > returned more than one socket with something to do on it. Contrasting > that to server 3, where i never saw kevent returning less than a half > dozen sockets at a time when it had a few hundred mbps flowing through > it. > > The ONLY difference between servers 1 and 2 and server 3 is the disk > subsystem. Servers 1/2 use an "ahc" SCSI controller and vinum RAID5. 
> Server 3 uses an "aac" hardware RAID. However, disk activity is really > truly minimal on all of these servers. Most of the data remains cached, > since 99% of the requests are for the same handful of files. > systat/vmstat shows that the disks are busy less than 10% of the time, > and artificially creating a bunch of disk load on any of the servers > doesn't seem to affect anything. > > I'm not sure if the kevent difference is the cause of the problem > (thttpd doesn't seem to handle going through its event loop over and > over again for just one socket at a time, it makes some rather > expensive syscalls from that loop), or if it's just a symptom. Is > something in vinum possibly waking my process up somewhat prematurely? > Is that even possible if the files are being sent through sendfile?

What does "systat -vm" show on these machines?

Igor Sysoev http://sysoev.ru/en/
Re: using natd to load balance port 80 to multiple servers
On Sat, 23 Oct 2004, Stephane Raimbault wrote: > I'm currently using a freebsd box running natd to forward port 80 to several > (5) web servers on private IP's. > > I have discovered that natd doesn't handle many requests/second all that > well (seems to choke at about 200 req/second (educated guess)) > > There are other packet filtering options on FreeBSD and I wonder if I can > use them to do what I'm trying to do with natd. > > Would someone be able to point me to documentation or help me have either > ipf/ipfw/pf forward port 80 traffic to private space IP's? > > Is there a better way of splitting port 80 traffic across multiple webservers > that has eluded me? Other than a commercial content switch that is :) > > I've worked with the loadd port and ran into some problems, so I resorted to > simply using some natd syntax to forward port 80 traffic to multiple > servers... Now that seems to have run into its limitations and I'm wondering > if I can do the same thing with ipf/ipfw/pf as I believe that might be a bit > more efficient. > > Any feedback would be appreciated...

You could look at PF. Also you could use an http reverse proxy like nginx; look at the example configuration (the page is in Russian, but the configuration is in English :) http://sysoev.ru/nginx/docs/example.html

Currently, to proxy several servers you need to set up their IPs under one name in DNS. nginx would connect to them in round robin. If some server does not respond then nginx would try the next. You could set several reasons to try the next server:

proxy_next_upstream error timeout invalid_header http_500;

or even

proxy_next_upstream error timeout invalid_header http_500 http_404;

nginx was tested on several busy sites under FreeBSD (serving static files and proxying, using kqueue/select/poll), Linux (static and proxy, using epoll, rt signals) and Solaris (static only, using /dev/poll).
Igor Sysoev http://sysoev.ru/en/
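A hypothetical sketch of the setup described above; the upstream name `backends.example.com` and all values are placeholders. It assumes the name resolves to several backend IPs, which nginx then tries in round robin, moving to the next server according to proxy_next_upstream:

```nginx
# Hypothetical configuration sketch, not from the thread.
# "backends.example.com" is assumed to resolve to several A records.
events {
    worker_connections  1024;
}

http {
    server {
        listen       80;

        location / {
            proxy_pass          http://backends.example.com;
            proxy_next_upstream error timeout invalid_header http_500;
        }
    }
}
```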
Re: Error 49, socket problem?
On Sat, 23 Oct 2004, Stephane Raimbault wrote: > I was running out of ports in the 1024-5000 range and setting my last port > to 65535 via sysctl did solve my problem. > > In 4.10 what will sysctl -w net.inet.ip.portrange.randomized=0 do for me?

If you have too many quick connections between the proxy (4.10) and a backend, or between the http server (4.10) and the SQL server, then you may see occasional "Connection refused" errors in the logs. This is because 4.10 picks the port number randomly and there is a chance that the other side has a connection with the same port in the TIME_WAIT state. See, i.e., http://freebsd.rambler.ru/bsdmail/freebsd-stable_2004/msg02310.html

> Is there any danger of me setting the port range from 1024 - 65535 ?

I believe it is safe.

Igor Sysoev http://sysoev.ru/en/
Re: aio_connect ?
On Thu, 21 Oct 2004, Ronald F. Guilmette wrote: > >I believe if you want to build a more maintainable, more adaptable, > >more modularized program then you should avoid two things - the threads and > >the signals. If you like the callback behaviour of signals you could > >easily implement it without any signal. > > OK. I'll bite. How?

I'm sure you know it. Sorry, English is not my native language, so I can tell you only briefly.

You can use two notification models. The first is socket readiness for operations; the second is operation completion. In the first model you use the usual read()/write() operations and learn readiness using select()/poll()/kevent(). In the second model you use aio_read()/aio_write() operations and learn about their completion using aio_suspend()/aio_waitcomplete()/kevent().

After you have got the notifications you would call your callback handlers just as the kernel would call your signal handlers. The difference between your code and the kernel is that your code always calls the handlers in well-known places, which allows you to avoid various race conditions. The kernel may call a signal handler at any time if the signal is not blocked.

Igor Sysoev http://sysoev.ru/en/
Re: Error 49, socket problem?
On Fri, 22 Oct 2004, Stephane Raimbault wrote: > The servers are busier today and error 49 is coming up frequently now.

What does "netstat -n | grep 127.0.0.1 | wc -l" show?

You should probably try

sysctl -w net.inet.ip.portrange.first=49152
sysctl -w net.inet.ip.portrange.last=65535

or even

sysctl -w net.inet.ip.portrange.first=1024
sysctl -w net.inet.ip.portrange.last=65535

And after you upgrade to 4.10 do not forget to set

sysctl -w net.inet.ip.portrange.randomized=0

Igor Sysoev http://sysoev.ru/en/
Re: aio_connect ?
On Wed, 20 Oct 2004, Julian Elischer wrote: > Now that we have real threads, it should be possible to write an aio > library that is implemented by having a bunch of underlying threads..

Do you mean kernel-only threads, where a single-threaded user process has several threads in the kernel? As I understand, FreeBSD 4.x already has a similar AIO implementation. Or do you mean an implementation by user-level threads like in Solaris?

Igor Sysoev http://sysoev.ru/en/
RE: aio_connect ?
On Wed, 20 Oct 2004, Christopher M. Sedore wrote:

> > > > While the developing my server nginx, I found the POSIX aio_*
> > > > operations uncomfortable. I do not mean a different programming
> > > > style, I mean the aio_read() and aio_write() drawbacks - they have
> > > > no scatter-gather capabilities (aio_readv/aio_writev) and they
> > > > require too many syscalls. E.g, the reading requires
> > > > *) 3 syscalls for ready data: aio_read(), aio_error(), aio_return()
> > > > *) 5 syscalls for non-ready data: aio_read(), aio_error(),
> > > >    waiting for notification, then aio_error(), aio_return(),
> > > >    or if timeout occuired - aio_cancel(), aio_error().
> > >
> > > This is why I added aio_waitcomplete(). It reduces both cases to two
> > > syscalls.

Yes, aio_waitcomplete() can be used as the single waiting point, but then I cannot accept() connections. How would I learn about new connections?

> > As I understand aio_waitcomplete() returns aiocb of any complete AIO
> > operation but I need to know the state of the exact AIO, namely the
> > last aio_read().
>
> Correct, it won't poll, but what state can you get from calling
> aio_error() that you don't already know from aio_waitcomplete(). The
> operation has either completed (successfully or unsuccessfully) or it
> hasn't. If it hasn't you haven't "gotten it back" via aio_waitcomplete,
> and if it has, you did. I may be missing something, but how does
> aio_error() tell you something that you don't already know?

To aio_error() I pass (and indeed have to pass) the aiocb of the exact operation I am interested in, while aio_waitcomplete() returns the aiocb of any operation. If I have several operations in flight, there may be a race condition.

> > I use kqueue to get AIO notifications. If AIO operation would fail
> > at the start, will kqueue return notificaiton about this operation ?
> I don't think so--IIRC, if you have a parameter problem or the operation
> can't be queued, you'll get an error return from aio_read and no kqueue
> result. If it is queued, you'll get a kqueue notification.

Well, then I do not have to call aio_error() just after aio_read()/aio_write(). However, I cannot use aio_waitcomplete() instead of the aio_error()/aio_return() pair after kevent() reports the completion.

Igor Sysoev
http://sysoev.ru/en/
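For reference, the syscall sequence under discussion can be written down with plain POSIX AIO. This is a minimal sketch (the helper name `aio_read_all` is mine); to stay portable it uses aio_suspend() as the waiting point rather than kevent() or the FreeBSD-specific aio_waitcomplete():

```c
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Post an aio_read() and drive it to completion with the
 * aio_error()/aio_return() pair discussed above.  Reads from file
 * offset 0.  Returns the number of bytes read, or -1 on error. */
ssize_t aio_read_all(int fd, void *buf, size_t len)
{
    struct aiocb cb;
    const struct aiocb *list[1];

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;

    if (aio_read(&cb) == -1)        /* parameter problem: immediate error, */
        return -1;                  /* no completion notification follows  */

    list[0] = &cb;
    while (aio_error(&cb) == EINPROGRESS)   /* second syscall */
        aio_suspend(list, 1, NULL);         /* wait for completion */

    return aio_return(&cb);                 /* third syscall: fetch result */
}
```

This is exactly the "3 syscalls for ready data" path from the thread; the aio_suspend() call in the loop is the extra waiting step taken for non-ready data.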
Re: aio_connect ?
On Wed, 20 Oct 2004, Ronald F. Guilmette wrote:

> > and they require too many syscalls.
> >E.g, the reading requires
> >*) 3 syscalls for ready data: aio_read(), aio_error(), aio_return()
> >*) 5 syscalls for non-ready data: aio_read(), aio_error(),
> >   waiting for notification, then aio_error(), aio_return(),
> >   or if timeout occuired - aio_cancel(), aio_error().
>
> This assumes that one is _not_ using the signaling capabilities of the
> aio_*() functions in order to allow the kernel to dynamically signal the
> userland program upon completion of a previously scheduled async I/O
> operation. If however a programmer were to use _that_ approache to
> detecting I/O completions, then the number of syscals would be reduced
> accordingly.

Yes, nginx does not use the AIO signaling capabilities. With signals you avoid the syscall that waits for completion, but invoking a signal handler requires 3 context switches instead of the 2 a syscall needs.

> However this all misses the point. As I noted earlier in this thread,
> efficience _for the machines_ is not always one's highest engineering
> design goal. If I have a choice between building a more maintainable,
> more adaptable, more modularized program, or instead building a more
> machine-efficient program, I personally will almost always choose to
> build the clearly, more modularized program as opposed to trying to
> squeeze every last machine cycle out of the thing. In fact, that is
> why I program almost exclusively in higher level languages, even though
> I could almost certainly write assembly code that would almost always be
> faster. Machine time is worth something, but my time is worth more.

I believe that if you want to build a more maintainable, more adaptable, more modularized program, then you should avoid two things: threads and signals. If you want the callback behaviour of signals, you can easily implement it without any signal.
Igor Sysoev
http://sysoev.ru/en/
RE: aio_connect ?
On Wed, 20 Oct 2004, Christopher M. Sedore wrote:

> > While the developing my server nginx, I found the POSIX aio_*
> > operations uncomfortable. I do not mean a different programming style,
> > I mean the aio_read() and aio_write() drawbacks - they have no
> > scatter-gather capabilities (aio_readv/aio_writev) and they require
> > too many syscalls. E.g, the reading requires
> > *) 3 syscalls for ready data: aio_read(), aio_error(), aio_return()
> > *) 5 syscalls for non-ready data: aio_read(), aio_error(),
> >    waiting for notification, then aio_error(), aio_return(),
> >    or if timeout occuired - aio_cancel(), aio_error().
>
> This is why I added aio_waitcomplete(). It reduces both cases to two
> syscalls.

As I understand it, aio_waitcomplete() returns the aiocb of any completed AIO operation, but I need to know the state of one exact AIO operation, namely the last aio_read(). I use kqueue to get AIO notifications. If an AIO operation fails right at the start, will kqueue return a notification for it?

Igor Sysoev
http://sysoev.ru/en/
Re: aio_connect ?
On Sun, 17 Oct 2004, Ronald F. Guilmette wrote:

> I'm sitting here looking at that man pages for aio_read and aio_write,
> and the question occurs to me: ``Home come there is no such thing as
> an aio_connect function?''
>
> There are clearly cases in which one would like to perform reads
> asynchronously, but likewise, there are cases where one might like
> to also perform socket connects asynchronously. So how come no
> aio_connect?

In FreeBSD you can do connect() on a non-blocking socket, then set the socket back to blocking mode, and post aio_read() or aio_write() operations on it. FreeBSD allows posting AIO operations on a socket that is not yet connected; NT (and Windows 2000, I believe) does not, which is why ConnectEx() appeared in XP.

I do not know about other OSes, but I believe only FreeBSD and NT have a kernel-level AIO implementation for sockets, without emulation by user-level threads (Solaris) and without quietly falling back to synchronous behaviour (Linux).

While developing my server nginx, I found the POSIX aio_* operations inconvenient. I do not mean the different programming style; I mean the drawbacks of aio_read() and aio_write(): they have no scatter-gather capabilities (no aio_readv()/aio_writev()) and they require too many syscalls. E.g., reading requires:

*) 3 syscalls for ready data: aio_read(), aio_error(), aio_return();
*) 5 syscalls for non-ready data: aio_read(), aio_error(),
   waiting for notification, then aio_error(), aio_return(),
   or, if a timeout occurred, aio_cancel(), aio_error().

I think aio_* may be useful for zero-copy sockets; however, FreeBSD's aio_write() does not wait until the data is acknowledged by the peer and reports completion as soon as it passes the data to the network layer.

Igor Sysoev
http://sysoev.ru/en/
Re: "netstat -m" and sendfile(2) statistics in STABLE
On Fri, 18 Jun 2004, Mike Silbersack wrote:

> On Thu, 17 Jun 2004, Alfred Perlstein wrote:
>
> > I was going to suggest vmstat now that sfbufs are used for so many
> > other things than just "sendfile bufs".
> >
> > - Alfred Perlstein
>
> How about if we do this:
>
> 5.x: List sfbufs both in vmstat _and_ in netstat -m, as their status is
> relevant to both network and general memory usage.
>
> 4.x: MFC the vmstat implementation.
>
> This would preserve 4.x's behavior, but allow 5.x users (who have a new
> netstat -m output format anyway) to see sfbuf information without
> invoking multiple utilities.

In 4.x sfbufs are network buffers only, and I think it is handy to see all the network buffer statistics in one place, so I prefer "netstat -ms" or "netstat -m". I have nothing against an additional vmstat implementation.

Igor Sysoev
http://sysoev.ru/en/
"netstat -m" and sendfile(2) statistics in STABLE
Hi,

I read the objections in cvs-all@ about netstat's output after the MFC of the sendfile(2) statistics. How about "netstat -ms"? Right now this switch combination is treated as plain "-m" in both -STABLE and -CURRENT.

Igor Sysoev
http://sysoev.ru/en/
"thundering herd" problem in accept
Hi,

I noticed rev 1.123 of src/sys/kern/uipc_socket2.c and the two MFCs of the fix. Does this mean that the "thundering herd" problem in accept() has existed in FreeBSD again since 4.4-STABLE (after syncache was introduced)?

Igor Sysoev
http://sysoev.ru/en/
Re: sendfile returning ENOTCONN under heavy load
On Fri, 26 Mar 2004, Kevin Day wrote:

> I'm using thttpd on a server that pushes 300-400mbps of static content,
> using sendfile(2).
>
> Once the load reaches a certain point (around 800-1000 clients
> downloading, anywhere from 150-250mbps), sendfile() will start randomly
> returning ENOTCONN, and the client is disconnected. I've raised
> kern.ipc.nsfbufs pretty high and that hasn't made any difference. Is
> there any easy way to tell exactly why the sockets are being closed? I
> can't seem to find any obvious signs of memory exhaustion or anything.

This is a sendfile(2) peculiarity: it can return ENOTCONN instead of EPIPE. See the message

http://freebsd.rambler.ru/bsdmail/freebsd-hackers_2004/msg00019.html

and its follow-ups.

Igor Sysoev
http://sysoev.ru/en/
Re: mbuf tuning
On Mon, 19 Jan 2004, CHOI Junho wrote:

> From: Mike Silbersack <[EMAIL PROTECTED]>
> Subject: Re: mbuf tuning
> Date: Mon, 19 Jan 2004 01:12:08 -0600 (CST)
>
> > There are no good guidelines other than "don't set it too high." Andre
> > and I have talked about some ideas on how to make mbuf usage more
> > dynamic, I think that he has something in the works. But at present,
> > once you hit the wall, that's it.
> >
> > One way to reduce mbuf cluster usage is to use sendfile where possible.
> > Data sent via sendfile does not use mbuf clusters, and is more memory
> > efficient. If you run 5.2 or above, it's *much* more memory efficient,
> > due to change Alan Cox recently made. Apache 2 will use sendfile by
> > default, so if you're running apache 1, that may be one reason for an
> > upgrade.
>
> I am using custom version of thttpd. It allocates mmap() first(builtin
> method of thttpd), and it try to use sendfile() if mmap() fails(out of
> mmap memory). It really works good in normal status but the problem is
> that sendfile buffer is also easy to flood. I need more sendfile
> buffers but I don't know how to increase sendfile buffers either(I
> think it's hidden sysctl but it was more difficult to tune than
> nmbclusters). With higher traffic, thttpd sometimes stuck at "sfbufa"
> status when I run top(I guess it's "sendfile buffer allocation"
> status).

In 4.x you have to rebuild the kernel with

options NSFBUFS=16384

The default is (512 + maxusers * 16).

By the way, why do you want big net.inet.tcp.sendspace and net.inet.tcp.recvspace values? They make sense for Apache, but thttpd can easily work with small buffers, say 16K or even 8K.

> > > Increasing kern.ipc.nmbclusters caused frequent kernel panic
> > > under 4.7/4.8/4.9. How can I set more nmbclusters value with 64K tcp
> > > buffers? Or is any dependency for mbufclusters value? (e.g. RAM size,
> > > kern.maxusers value or etc)
> > >
> > > p.s. RAM is 2G, Xeon 2.0G x 1 or 2 machines.
> > You probably need to bump up KVA_PAGES to fit in all the extra mbuf
> > clusters you're allocating.
>
> Can you tell me in more detail?

From LINT:

---
#
# Change the size of the kernel virtual address space. Due to
# constraints in loader(8) on i386, this must be a multiple of 4.
# 256 = 1 GB of kernel address space. Increasing this also causes
# a reduction of the address space in user processes. 512 splits
# the 4GB cpu address space in half (2GB user, 2GB kernel).
#
options KVA_PAGES=260
---

The default KVA_PAGES value is 256.

Igor Sysoev
http://sysoev.ru/en/
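Putting the two tunables from this thread together, a 4.x kernel config fragment might look like the following sketch (values are illustrative only, taken from the defaults and examples quoted above):

```
# illustrative 4.x kernel config fragment
options         NSFBUFS=16384   # sendfile(2) buffers; default is 512 + maxusers * 16
options         KVA_PAGES=260   # must be a multiple of 4 on i386; 256 = 1 GB of KVA
```

After changing either option, the kernel must be rebuilt and the machine rebooted for the new values to take effect.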
Re: turning off TCP_NOPUSH
On Wed, 28 May 2003, Garrett Wollman wrote:

> < said:
>
> > always calls tcp_output() when TCP_NOPUSH is turned off. I think
> > tcp_output() should be called only if data in the send buffer is less
> > than MSS:
>
> I believe that this is intentional. The application had to explicitly
> enable TCP_NOPUSH, so if the application disables it explicitly, then
> we interpret that as meaning that the application wants to send a PSH
> segment immediately.

As I understand it, if the data in the send buffer is bigger than the MSS, the TCP stack has some reason of its own not to send it yet, and that reason is not the TF_NOPUSH flag. Am I wrong?

Igor Sysoev
http://sysoev.ru/en/
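For context, the application-side operation being debated is a plain setsockopt() call. A minimal sketch (the helper name `set_nopush` is mine; on Linux the closest analogue of the BSD option is TCP_CORK, so the sketch falls back to it to stay compilable there):

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_NOPUSH
#define TCP_NOPUSH TCP_CORK     /* Linux analogue of the BSD option */
#endif

/* Enable or disable "no push" mode on a TCP socket.  Clearing the
 * option is the operation whose tcp_output() behaviour this thread
 * discusses.  Returns 0 on success, -1 on error. */
int set_nopush(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NOPUSH, &on, sizeof(on));
}
```

A typical pattern is to enable the option, queue headers and body with several write() or sendfile() calls, and then disable it to let the stack push out whatever remains.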
turning off TCP_NOPUSH
The rev 1.53 fix

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/netinet/tcp_usrreq.c.diff?r1=1.52&r2=1.53

always calls tcp_output() when TCP_NOPUSH is turned off. I think tcp_output() should be called only if the data in the send buffer is less than the MSS:

 	tp->t_flags &= ~TF_NOPUSH;
-	error = tcp_output(tp);
+	if (so->so_snd.sb_cc < tp->t_maxseg) {
+		error = tcp_output(tp);
+	}

If the pending data is bigger than the MSS, it will be sent without significant delay anyway.

Igor Sysoev
http://sysoev.ru/en/