Re: bge hangs on recent 7.3-STABLE

2010-09-13 Thread Igor Sysoev
On Mon, Sep 13, 2010 at 11:04:47AM -0700, Pyun YongHyeon wrote:

> On Mon, Sep 13, 2010 at 06:27:08PM +0400, Igor Sysoev wrote:
> > On Thu, Sep 09, 2010 at 02:18:08PM -0700, Pyun YongHyeon wrote:
> > 
> > > On Thu, Sep 09, 2010 at 01:10:50PM -0700, Pyun YongHyeon wrote:
> > > > On Thu, Sep 09, 2010 at 02:28:26PM +0400, Igor Sysoev wrote:
> > > > > Hi,
> > > > > 
> > > > > I have several hosts running FreeBSD/amd64 7.2-STABLE updated on 
> > > > > 11.01.2010
> > > > > and 25.02.2010. Hosts process about 10K input and 10K output packets/s
> > > > > without issues. One of them, however, is loaded more than others, so 
> > > > > it
> > > > > processes 20K/20K packets/s.
> > > > > 
> > > > > Recently, I have upgraded one host to 7.3-STABLE, 24.08.2010.
> > > > > Then bge on this host hung two times. I was able to restart it from
> > > > > console using:
> > > > >   /etc/rc.d/netif restart bge0
> > > > > 
> > > > > Then I have upgraded the most loaded (20K/20K) host to 7.3-STABLE, 
> > > > > 07.09.2010.
> > > > > After reboot bge hung every several seconds. I was able to restart it,
> > > > > but bge hung again after several seconds.
> > > > > 
> > > > > Then I have downgraded this host to 7.3-STABLE, 14.08.2010, since 
> > > > > there
> > > > > were several if_bge.c commits on 15.08.2010. The same hangs.
> > > > > Then I have downgraded this host to 7.3-STABLE, 17.03.2010, before
> > > > > the first if_bge.c commit after 25.02.2010. Now it runs without hangs.
> > > > > 
> > > > > The hosts are amd64 dual core SMP with 4G machines. bge information:

> > Thank you, it seems the patch has fixed the bug.
> > BTW, I noticed the same hangs on FreeBSD 8.1, date=2010.09.06.23.59.59
> > I will apply the patch on all my updated hosts.
> > 
> 
> Thanks for testing. I'm afraid bge(4) in HEAD, stable/8, and
> stable/7 (including 8.1-RELEASE and 7.3-RELEASE) may suffer from
> this issue. Let me know how the other hosts work with the patch.

Currently I have patched only two hosts: 7.3 (24.08.2010) and 8.1 (06.09.2010).
The 7.3 host now handles 20K/20K packets/s without issues.

One host has been downgraded to 17.03.2010 as I already reported.
Other hosts still run 7.x, from January and February 2010.
If there are no hangs, I will upgrade and patch the other hosts.


-- 
Igor Sysoev
http://sysoev.ru/en/
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: bge hangs on recent 7.3-STABLE

2010-09-13 Thread Igor Sysoev
On Thu, Sep 09, 2010 at 02:18:08PM -0700, Pyun YongHyeon wrote:

> On Thu, Sep 09, 2010 at 01:10:50PM -0700, Pyun YongHyeon wrote:
> > On Thu, Sep 09, 2010 at 02:28:26PM +0400, Igor Sysoev wrote:
> > > Hi,
> > > 
> > > I have several hosts running FreeBSD/amd64 7.2-STABLE updated on 
> > > 11.01.2010
> > > and 25.02.2010. Hosts process about 10K input and 10K output packets/s
> > > without issues. One of them, however, is loaded more than others, so it
> > > processes 20K/20K packets/s.
> > > 
> > > Recently, I have upgraded one host to 7.3-STABLE, 24.08.2010.
> > > Then bge on this host hung two times. I was able to restart it from
> > > console using:
> > >   /etc/rc.d/netif restart bge0
> > > 
> > > Then I have upgraded the most loaded (20K/20K) host to 7.3-STABLE, 
> > > 07.09.2010.
> > > After reboot bge hung every several seconds. I was able to restart it,
> > > but bge hung again after several seconds.
> > > 
> > > Then I have downgraded this host to 7.3-STABLE, 14.08.2010, since there
> > > were several if_bge.c commits on 15.08.2010. The same hangs.
> > > Then I have downgraded this host to 7.3-STABLE, 17.03.2010, before
> > > the first if_bge.c commit after 25.02.2010. Now it runs without hangs.
> > > 
> > > The hosts are amd64 dual core SMP with 4G machines. bge information:
> > > 
> > > b...@pci0:4:0:0:class=0x02 card=0x165914e4 chip=0x165914e4 
> > > rev=0x11 hdr=0x00
> > > vendor = 'Broadcom Corporation'
> > > device = 'NetXtreme Gigabit Ethernet PCI Express (BCM5721)'
> > > 
> > > bge0:  > > 0x004101> mem 0xfe5f-0xfe5f irq 19 at device 0.0 on pci4
> > > miibus1:  on bge0
> > > brgphy0:  PHY 1 on miibus1
> > > brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
> > > 1000baseT-FDX, auto
> > > bge0: Ethernet address: 00:e0:81:5f:6e:8a
> > > 
> > 
> > Could you show me the verbose boot message (bge part only)?
> > Also show me the output of "pciconf -lcbv".
> > 
> 
> I forgot to send the patch. Let me know whether the attached patch fixes
> the issue or not.

> Index: sys/dev/bge/if_bge.c
> ===
> --- sys/dev/bge/if_bge.c  (revision 212341)
> +++ sys/dev/bge/if_bge.c  (working copy)
> @@ -3386,9 +3386,11 @@
>   sc->bge_rx_saved_considx = rx_cons;
>   bge_writembx(sc, BGE_MBX_RX_CONS0_LO, sc->bge_rx_saved_considx);
>   if (stdcnt)
> - bge_writembx(sc, BGE_MBX_RX_STD_PROD_LO, sc->bge_std);
> + bge_writembx(sc, BGE_MBX_RX_STD_PROD_LO, (sc->bge_std +
> + BGE_STD_RX_RING_CNT - 1) % BGE_STD_RX_RING_CNT);
>   if (jumbocnt)
> - bge_writembx(sc, BGE_MBX_RX_JUMBO_PROD_LO, sc->bge_jumbo);
> + bge_writembx(sc, BGE_MBX_RX_JUMBO_PROD_LO, (sc->bge_jumbo +
> + BGE_JUMBO_RX_RING_CNT - 1) % BGE_JUMBO_RX_RING_CNT);
>  #ifdef notyet
>   /*
>* This register wraps very quickly under heavy packet drops.

Thank you, it seems the patch has fixed the bug.
BTW, I noticed the same hangs on FreeBSD 8.1, date=2010.09.06.23.59.59
I will apply the patch on all my updated hosts.


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: bge hangs on recent 7.3-STABLE

2010-09-13 Thread Igor Sysoev
On Fri, Sep 10, 2010 at 07:39:15AM +0400, Igor Sysoev wrote:

> On Thu, Sep 09, 2010 at 01:10:50PM -0700, Pyun YongHyeon wrote:
> 
> > On Thu, Sep 09, 2010 at 02:28:26PM +0400, Igor Sysoev wrote:
> > > Hi,
> > > 
> > > I have several hosts running FreeBSD/amd64 7.2-STABLE updated on 
> > > 11.01.2010
> > > and 25.02.2010. Hosts process about 10K input and 10K output packets/s
> > > without issues. One of them, however, is loaded more than others, so it
> > > processes 20K/20K packets/s.
> > > 
> > > Recently, I have upgraded one host to 7.3-STABLE, 24.08.2010.
> > > Then bge on this host hung two times. I was able to restart it from
> > > console using:
> > >   /etc/rc.d/netif restart bge0
> > > 
> > > Then I have upgraded the most loaded (20K/20K) host to 7.3-STABLE, 
> > > 07.09.2010.
> > > After reboot bge hung every several seconds. I was able to restart it,
> > > but bge hung again after several seconds.
> > > 
> > > Then I have downgraded this host to 7.3-STABLE, 14.08.2010, since there
> > > were several if_bge.c commits on 15.08.2010. The same hangs.
> > > Then I have downgraded this host to 7.3-STABLE, 17.03.2010, before
> > > the first if_bge.c commit after 25.02.2010. Now it runs without hangs.
> > > 
> > > The hosts are amd64 dual core SMP with 4G machines. bge information:
> > > 
> > > b...@pci0:4:0:0:class=0x02 card=0x165914e4 chip=0x165914e4 
> > > rev=0x11 hdr=0x00
> > > vendor = 'Broadcom Corporation'
> > > device = 'NetXtreme Gigabit Ethernet PCI Express (BCM5721)'
> > > 
> > > bge0:  > > 0x004101> mem 0xfe5f-0xfe5f irq 19 at device 0.0 on pci4
> > > miibus1:  on bge0
> > > brgphy0:  PHY 1 on miibus1
> > > brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
> > > 1000baseT-FDX, auto
> > > bge0: Ethernet address: 00:e0:81:5f:6e:8a
> > > 
> > 
> > Could you show me the verbose boot message (bge part only)?
> > Also show me the output of "pciconf -lcbv".
> 
> Here is "pciconf -lcbv"; I will send the "boot -v" part later.
> 
> b...@pci0:4:0:0:  class=0x02 card=0x165914e4 chip=0x165914e4 rev=0x11 hdr=0x00
> vendor = 'Broadcom Corporation'
> device = 'NetXtreme Gigabit Ethernet PCI Express (BCM5721)'
> class  = network
> subclass   = ethernet
> bar   [10] = type Memory, range 64, base 0xfe5f, size 65536, enabled
> cap 01[48] = powerspec 2  supports D0 D3  current D0
> cap 03[50] = VPD
> cap 05[58] = MSI supports 8 messages, 64 bit 
> cap 10[d0] = PCI-Express 1 endpoint max data 128(128) link x1(x1)

Sorry for the delay. Here is the "boot -v" part. It is from another host,
but that host hangs too:

pci4:  on pcib4
pci4: domain=0, physical bus=4
found-> vendor=0x14e4, dev=0x1659, revid=0x11
domain=0, bus=4, slot=0, func=0
class=02-00-00, hdrtype=0x00, mfdev=0
cmdreg=0x0006, statreg=0x0010, cachelnsz=8 (dwords)
lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns)
intpin=a, irq=5
powerspec 2  supports D0 D3  current D0
MSI supports 8 messages, 64 bit
map[10]: type Memory, range 64, base 0xfe5f, size 16, enabled
pcib4: requested memory range 0xfe5f-0xfe5f: good
pcib0: matched entry for 0.13.INTA (src \_SB_.PCI0.APC4:0)
pcib0: slot 13 INTA routed to irq 19 via \_SB_.PCI0.APC4
pcib4: slot 0 INTA is routed to irq 19
pci0:4:0:0: bad VPD cksum, remain 14
bge0:  mem 0xfe5f0000-0xfe5f irq 19 at device 0.0 on pci4
bge0: Reserved 0x1 bytes for rid 0x10 type 3 at 0xfe5f
bge0: CHIP ID 0x4101; ASIC REV 0x04; CHIP REV 0x41; PCI-E
miibus1:  on bge0
brgphy0:  PHY 1 on miibus1
brgphy0: OUI 0x000818, model 0x0018, rev. 0
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
1000baseT-FDX, auto
bge0: bpf attached
bge0: Ethernet address: 00:e0:81:5c:64:85
ioapic0: routing intpin 19 (PCI IRQ 19) to vector 54
bge0: [MPSAFE]
bge0: [ITHREAD]


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: bge hangs on recent 7.3-STABLE

2010-09-09 Thread Igor Sysoev
On Thu, Sep 09, 2010 at 01:10:50PM -0700, Pyun YongHyeon wrote:

> On Thu, Sep 09, 2010 at 02:28:26PM +0400, Igor Sysoev wrote:
> > Hi,
> > 
> > I have several hosts running FreeBSD/amd64 7.2-STABLE updated on 11.01.2010
> > and 25.02.2010. Hosts process about 10K input and 10K output packets/s
> > without issues. One of them, however, is loaded more than others, so it
> > processes 20K/20K packets/s.
> > 
> > Recently, I have upgraded one host to 7.3-STABLE, 24.08.2010.
> > Then bge on this host hung two times. I was able to restart it from
> > console using:
> >   /etc/rc.d/netif restart bge0
> > 
> > Then I have upgraded the most loaded (20K/20K) host to 7.3-STABLE, 
> > 07.09.2010.
> > After reboot bge hung every several seconds. I was able to restart it,
> > but bge hung again after several seconds.
> > 
> > Then I have downgraded this host to 7.3-STABLE, 14.08.2010, since there
> > were several if_bge.c commits on 15.08.2010. The same hangs.
> > Then I have downgraded this host to 7.3-STABLE, 17.03.2010, before
> > the first if_bge.c commit after 25.02.2010. Now it runs without hangs.
> > 
> > The hosts are amd64 dual core SMP with 4G machines. bge information:
> > 
> > b...@pci0:4:0:0:class=0x02 card=0x165914e4 chip=0x165914e4 
> > rev=0x11 hdr=0x00
> > vendor = 'Broadcom Corporation'
> > device = 'NetXtreme Gigabit Ethernet PCI Express (BCM5721)'
> > 
> > bge0:  mem 0xfe5f-0xfe5f irq 19 at device 0.0 on pci4
> > miibus1:  on bge0
> > brgphy0:  PHY 1 on miibus1
> > brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
> > 1000baseT-FDX, auto
> > bge0: Ethernet address: 00:e0:81:5f:6e:8a
> > 
> 
> Could you show me the verbose boot message (bge part only)?
> Also show me the output of "pciconf -lcbv".

Here is "pciconf -lcbv"; I will send the "boot -v" part later.

b...@pci0:4:0:0: class=0x02 card=0x165914e4 chip=0x165914e4 rev=0x11 hdr=0x00
vendor = 'Broadcom Corporation'
device = 'NetXtreme Gigabit Ethernet PCI Express (BCM5721)'
class  = network
    subclass   = ethernet
bar   [10] = type Memory, range 64, base 0xfe5f, size 65536, enabled
cap 01[48] = powerspec 2  supports D0 D3  current D0
cap 03[50] = VPD
cap 05[58] = MSI supports 8 messages, 64 bit 
cap 10[d0] = PCI-Express 1 endpoint max data 128(128) link x1(x1)


-- 
Igor Sysoev
http://sysoev.ru/en/


bge hangs on recent 7.3-STABLE

2010-09-09 Thread Igor Sysoev
Hi,

I have several hosts running FreeBSD/amd64 7.2-STABLE updated on 11.01.2010
and 25.02.2010. Hosts process about 10K input and 10K output packets/s
without issues. One of them, however, is loaded more than others, so it
processes 20K/20K packets/s.

Recently, I have upgraded one host to 7.3-STABLE, 24.08.2010.
Then bge on this host hung two times. I was able to restart it from
console using:
  /etc/rc.d/netif restart bge0

Then I have upgraded the most loaded (20K/20K) host to 7.3-STABLE, 07.09.2010.
After reboot bge hung every several seconds. I was able to restart it,
but bge hung again after several seconds.

Then I have downgraded this host to 7.3-STABLE, 14.08.2010, since there
were several if_bge.c commits on 15.08.2010. The same hangs.
Then I have downgraded this host to 7.3-STABLE, 17.03.2010, before
the first if_bge.c commit after 25.02.2010. Now it runs without hangs.

The hosts are amd64 dual-core SMP machines with 4G of memory. bge information:

b...@pci0:4:0:0: class=0x02 card=0x165914e4 chip=0x165914e4 rev=0x11 hdr=0x00
vendor = 'Broadcom Corporation'
device = 'NetXtreme Gigabit Ethernet PCI Express (BCM5721)'

bge0:  mem 0xfe5f-0xfe5f irq 19 at device 0.0 on pci4
miibus1:  on bge0
brgphy0:  PHY 1 on miibus1
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
1000baseT-FDX, auto
bge0: Ethernet address: 00:e0:81:5f:6e:8a

bge has 3 vlans:

bge0: flags=8943 metric 0 mtu 1500
options=9b
ether 00:e0:81:5f:6e:8a
media: Ethernet autoselect (1000baseTX )
status: active

vlan173: flags=8843 metric 0 mtu 1500
options=3
ether 00:e0:81:5f:6e:8a
inet 192.168.173.101 netmask 0xff00 broadcast 192.168.173.255
media: Ethernet autoselect (1000baseTX )
status: active
vlan: 173 parent interface: bge0

[ ... ]


-- 
Igor Sysoev
http://sysoev.ru/en/


net.inet.tcp.slowstart_flightsize in 8-STABLE

2010-05-12 Thread Igor Sysoev
It seems that net.inet.tcp.slowstart_flightsize does not work in 8-STABLE.
For a long time I used slowstart_flightsize=2 on FreeBSD 4, 6, and 7 hosts.
However, FreeBSD 8 always starts with a single packet.
I saw this on different 8-STABLE versions from 8 Oct 2009 to 4 Apr 2010.


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: hw.bge.forced_collapse

2010-01-15 Thread Igor Sysoev
On Thu, Jan 14, 2010 at 10:10:31AM -0800, Pyun YongHyeon wrote:

> On Thu, Jan 14, 2010 at 07:03:33PM +0300, Igor Sysoev wrote:
> > On Fri, Dec 04, 2009 at 12:22:13PM -0800, Pyun YongHyeon wrote:
> > 
> > > On Fri, Dec 04, 2009 at 11:13:03PM +0300, Igor Sysoev wrote:
> > > > On Fri, Dec 04, 2009 at 11:51:40AM -0800, Pyun YongHyeon wrote:
> > > > 
> > > > > On Fri, Dec 04, 2009 at 10:11:14PM +0300, Igor Sysoev wrote:
> > > > > > On Fri, Dec 04, 2009 at 09:32:43AM -0800, Pyun YongHyeon wrote:
> > > > > > 
> > > > > > > On Fri, Dec 04, 2009 at 10:54:40AM +0300, Igor Sysoev wrote:
> > > > > > > > I saw commit introducing hw.bge.forced_collapse loader tunable.
> > > > > > > > Just curious, why can it not be a sysctl?
> > > > > > > 
> > > > > > > I didn't think the sysctl variable would be frequently changed
> > > > > > > in runtime except debugging driver so I took simple path.
> > > > > > 
> > > > > > I do not think it is worth rebooting a server just to see how various
> > > > > > values affect bandwidth and CPU usage, especially in production.
> > > > > > 
> > > > > > As I understand the change is trivial:
> > > > > > 
> > > > > > -  CTLFLAG_RD
> > > > > > +  CTLFLAG_RW
> > > > > > 
> > > > > > since bge_forced_collapse is used atomically.
> > > > > > 
> > > > > 
> > > > > I have no problem changing it to RW but that case I may have to
> > > > > create actual sysctl node(e.g. dev.bge.0.forced_collapse) instead
> > > > > of hw.bge.forced_collapse which may affect all bge(4) controllers
> > > > > on system. Attached patch may be what you want. You can change the
> > > > > value at any time.
> > > > 
> > > > Thank you for the patch. Can it be installed on 8-STABLE ?
> > > > 
> > > 
> > > bge(4) in HEAD has many fixes which were not MFCed to stable/8 so
> > > I'm not sure that patch could be applied cleanly. But I guess you
> > > can manually patch it.
> > > I'll wait a couple of days for wider testing/review and commit the
> > > patch.
> > 
> > Sorry for the late response. We've tested bge.forced_collapse in December
> > on HEAD and found that values >1 froze connections carrying a lot of data,
> > for example, "top -Ss1" output. Connections carrying little data, such as
> > short ssh commands, worked OK. Now I've tested a recent 7.2-STABLE and found
> > that forced_collapse >1 freezes it too.
> > 
> 
> Thanks for reporting! It seems I've incorrectly dropped mbuf chains
> when collapsing fails. Would you try attached patch?

BTW, it's strange that collapsing fails so often.

> Index: sys/dev/bge/if_bge.c
> ===
> --- sys/dev/bge/if_bge.c  (revision 202268)
> +++ sys/dev/bge/if_bge.c  (working copy)
> @@ -3940,11 +3940,8 @@
>   m = m_defrag(m, M_DONTWAIT);
>   else
>   m = m_collapse(m, M_DONTWAIT, sc->bge_forced_collapse);
> - if (m == NULL) {
> - m_freem(*m_head);
> - *m_head = NULL;
> - return (ENOBUFS);
> - }
> + if (m == NULL)
> + m = *m_head;
>   *m_head = m;
>   }
>  


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: hw.bge.forced_collapse

2010-01-15 Thread Igor Sysoev
On Thu, Jan 14, 2010 at 10:10:31AM -0800, Pyun YongHyeon wrote:

> On Thu, Jan 14, 2010 at 07:03:33PM +0300, Igor Sysoev wrote:
> > On Fri, Dec 04, 2009 at 12:22:13PM -0800, Pyun YongHyeon wrote:
> > 
> > > On Fri, Dec 04, 2009 at 11:13:03PM +0300, Igor Sysoev wrote:
> > > > On Fri, Dec 04, 2009 at 11:51:40AM -0800, Pyun YongHyeon wrote:
> > > > 
> > > > > On Fri, Dec 04, 2009 at 10:11:14PM +0300, Igor Sysoev wrote:
> > > > > > On Fri, Dec 04, 2009 at 09:32:43AM -0800, Pyun YongHyeon wrote:
> > > > > > 
> > > > > > > On Fri, Dec 04, 2009 at 10:54:40AM +0300, Igor Sysoev wrote:
> > > > > > > > I saw commit introducing hw.bge.forced_collapse loader tunable.
> > > > > > > > Just curious, why can it not be a sysctl?
> > > > > > > 
> > > > > > > I didn't think the sysctl variable would be frequently changed
> > > > > > > in runtime except debugging driver so I took simple path.
> > > > > > 
> > > > > > I do not think it is worth rebooting a server just to see how various
> > > > > > values affect bandwidth and CPU usage, especially in production.
> > > > > > 
> > > > > > As I understand the change is trivial:
> > > > > > 
> > > > > > -  CTLFLAG_RD
> > > > > > +  CTLFLAG_RW
> > > > > > 
> > > > > > since bge_forced_collapse is used atomically.
> > > > > > 
> > > > > 
> > > > > I have no problem changing it to RW but that case I may have to
> > > > > create actual sysctl node(e.g. dev.bge.0.forced_collapse) instead
> > > > > of hw.bge.forced_collapse which may affect all bge(4) controllers
> > > > > on system. Attached patch may be what you want. You can change the
> > > > > value at any time.
> > > > 
> > > > Thank you for the patch. Can it be installed on 8-STABLE ?
> > > > 
> > > 
> > > bge(4) in HEAD has many fixes which were not MFCed to stable/8 so
> > > I'm not sure that patch could be applied cleanly. But I guess you
> > > can manually patch it.
> > > I'll wait a couple of days for wider testing/review and commit the
> > > patch.
> > 
> > Sorry for the late response. We've tested bge.forced_collapse in December
> > on HEAD and found that values >1 froze connections carrying a lot of data,
> > for example, "top -Ss1" output. Connections carrying little data, such as
> > short ssh commands, worked OK. Now I've tested a recent 7.2-STABLE and found
> > that forced_collapse >1 freezes it too.
> > 
> 
> Thanks for reporting! It seems I've incorrectly dropped mbuf chains
> when collapsing fails. Would you try attached patch?

Thank you, the patch fixes the bug.

> Index: sys/dev/bge/if_bge.c
> ===
> --- sys/dev/bge/if_bge.c  (revision 202268)
> +++ sys/dev/bge/if_bge.c  (working copy)
> @@ -3940,11 +3940,8 @@
>   m = m_defrag(m, M_DONTWAIT);
>   else
>   m = m_collapse(m, M_DONTWAIT, sc->bge_forced_collapse);
> - if (m == NULL) {
> - m_freem(*m_head);
> - *m_head = NULL;
> - return (ENOBUFS);
> - }
> + if (m == NULL)
> + m = *m_head;
>   *m_head = m;
>   }
>  


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: hw.bge.forced_collapse

2010-01-14 Thread Igor Sysoev
On Fri, Dec 04, 2009 at 12:22:13PM -0800, Pyun YongHyeon wrote:

> On Fri, Dec 04, 2009 at 11:13:03PM +0300, Igor Sysoev wrote:
> > On Fri, Dec 04, 2009 at 11:51:40AM -0800, Pyun YongHyeon wrote:
> > 
> > > On Fri, Dec 04, 2009 at 10:11:14PM +0300, Igor Sysoev wrote:
> > > > On Fri, Dec 04, 2009 at 09:32:43AM -0800, Pyun YongHyeon wrote:
> > > > 
> > > > > On Fri, Dec 04, 2009 at 10:54:40AM +0300, Igor Sysoev wrote:
> > > > > > I saw commit introducing hw.bge.forced_collapse loader tunable.
> > > > > > Just curious, why can it not be a sysctl?
> > > > > 
> > > > > I didn't think the sysctl variable would be frequently changed
> > > > > in runtime except debugging driver so I took simple path.
> > > > 
> > > > I do not think it is worth rebooting a server just to see how various
> > > > values affect bandwidth and CPU usage, especially in production.
> > > > 
> > > > As I understand the change is trivial:
> > > > 
> > > > -  CTLFLAG_RD
> > > > +  CTLFLAG_RW
> > > > 
> > > > since bge_forced_collapse is used atomically.
> > > > 
> > > 
> > > I have no problem changing it to RW but that case I may have to
> > > create actual sysctl node(e.g. dev.bge.0.forced_collapse) instead
> > > of hw.bge.forced_collapse which may affect all bge(4) controllers
> > > on system. Attached patch may be what you want. You can change the
> > > value at any time.
> > 
> > Thank you for the patch. Can it be installed on 8-STABLE ?
> > 
> 
> bge(4) in HEAD has many fixes which were not MFCed to stable/8 so
> I'm not sure that patch could be applied cleanly. But I guess you
> can manually patch it.
> I'll wait a couple of days for wider testing/review and commit the
> patch.

Sorry for the late response. We've tested bge.forced_collapse in December
on HEAD and found that values >1 froze connections carrying a lot of data,
for example, "top -Ss1" output. Connections carrying little data, such as
short ssh commands, worked OK. Now I've tested a recent 7.2-STABLE and found
that forced_collapse >1 freezes it too.


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: hw.bge.forced_collapse

2009-12-04 Thread Igor Sysoev
On Fri, Dec 04, 2009 at 11:51:40AM -0800, Pyun YongHyeon wrote:

> On Fri, Dec 04, 2009 at 10:11:14PM +0300, Igor Sysoev wrote:
> > On Fri, Dec 04, 2009 at 09:32:43AM -0800, Pyun YongHyeon wrote:
> > 
> > > On Fri, Dec 04, 2009 at 10:54:40AM +0300, Igor Sysoev wrote:
> > > > I saw commit introducing hw.bge.forced_collapse loader tunable.
> > > > Just curious, why can it not be a sysctl?
> > > 
> > > I didn't think the sysctl variable would be frequently changed
> > > in runtime except debugging driver so I took simple path.
> > 
> > I do not think it is worth rebooting a server just to see how various
> > values affect bandwidth and CPU usage, especially in production.
> > 
> > As I understand the change is trivial:
> > 
> > -  CTLFLAG_RD
> > +  CTLFLAG_RW
> > 
> > since bge_forced_collapse is used atomically.
> > 
> 
> I have no problem changing it to RW but that case I may have to
> create actual sysctl node(e.g. dev.bge.0.forced_collapse) instead
> of hw.bge.forced_collapse which may affect all bge(4) controllers
> on system. Attached patch may be what you want. You can change the
> value at any time.

Thank you for the patch. Can it be installed on 8-STABLE?


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: hw.bge.forced_collapse

2009-12-04 Thread Igor Sysoev
On Fri, Dec 04, 2009 at 09:32:43AM -0800, Pyun YongHyeon wrote:

> On Fri, Dec 04, 2009 at 10:54:40AM +0300, Igor Sysoev wrote:
> > I saw commit introducing hw.bge.forced_collapse loader tunable.
> > Just curious, why can it not be a sysctl?
> 
> I didn't think the sysctl variable would be frequently changed
> in runtime except debugging driver so I took simple path.

I do not think it is worth rebooting a server just to see how various
values affect bandwidth and CPU usage, especially in production.

As I understand the change is trivial:

-  CTLFLAG_RD
+  CTLFLAG_RW

since bge_forced_collapse is used atomically.


-- 
Igor Sysoev
http://sysoev.ru/en/


hw.bge.forced_collapse

2009-12-03 Thread Igor Sysoev
I saw the commit introducing the hw.bge.forced_collapse loader tunable.
Just curious, why can it not be a sysctl?


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: interface FIB

2009-11-28 Thread Igor Sysoev
On Fri, Nov 27, 2009 at 09:12:37PM -0800, Julian Elischer wrote:

> Igor Sysoev wrote:
> > Currently only packets generated during encapsulation can use
> > the interface's FIB, stored at interface creation:
> > 
> > setfib 1 ifconfig gif0 ...
> > setfib 1 ifconfig tun0 ...
> 
> not sure if tun actually does this (in fact it shouldn't)
> 
> but for gre and gif (and stf) these are tunnelling other things into 
> IP and thus it makes sense to be able to connect a routing table with 
> the generated envelopes.

I've got this from 8.0 release notes:

   A packet generated on tunnel interfaces such as gif(4) and tun(4) will
   be encapsulated using the FIB of the process which set up the tunnel.

However, sys/net/if_tun.c really has no FIB-related changes.

> > Is it possible to implement this feature for any interface:
> > 
> > setfib 1 ifconfig vlan0 ...
> > 
> > or
> > 
> > ifconfig vlan0 setfib 1 ...
> 
> these two things would mean different things.
> and one of them wouldn't mean anything.
> 
> setfib 1 ifconfig vlan0 would mean "what" exactly?
> VLAN tagging is an L2/L1 operation and FIBS have no effect on this.
> 
> as for ifconfig vlan0 setfib 1, or  ifconfig em0 setfib 1
> 
> this will (shortly) mean that incoming packets through this interface
> will by default be connected with fib 1 so that any return packets
> (resets, ICMP, etc.) will use FIB 1 to go back to the sender.

This is exactly what I meant.

> That patch is in the works.

I'm ready to test the patch in production on 7/8-STABLE if it can be
applied there.


-- 
Igor Sysoev
http://sysoev.ru/en/


interface FIB

2009-11-27 Thread Igor Sysoev
Currently only packets generated during encapsulation can use
the interface's FIB, stored at interface creation:

setfib 1 ifconfig gif0 ...
setfib 1 ifconfig tun0 ...

Is it possible to implement this feature for any interface:

setfib 1 ifconfig vlan0 ...

or

ifconfig vlan0 setfib 1 ...


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: stuck TIME_WAIT sockets

2009-10-02 Thread Igor Sysoev
On Fri, Oct 02, 2009 at 02:06:21PM -0400, Skip Ford wrote:

> Igor Sysoev wrote:
> > The TIME_WAIT sockets suddenly started to grow on a host running
> > FreeBSD 7.2-STABLE, date=2009.09.06.23.59.59
> > Usually there are 3,000-5,000 TIME_WAIT sockets on the host.
> > However, today they started to grow, reached 110,000 sockets within an hour,
> > and still remain at that level.
> > net.inet.tcp.msl is 3.
> > The host uptime is 24 days, 21:53.
> 
> Perhaps you need this patch?
> 
> Author: peter
> Date: Thu Aug 20 22:53:28 2009
> New Revision: 196410
> URL: http://svn.freebsd.org/changeset/base/196410
> 
> Log:
>   Fix signed comparison bug when ticks goes negative after 24 days of
>   uptime.  This causes the tcp time_wait state code to fail to expire
>   sockets in timewait state.
> 
>   Approved by: re (kensmith)
> 
> Modified:
>   head/sys/netinet/tcp_timewait.c

Thank you.


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: stuck TIME_WAIT sockets

2009-10-02 Thread Igor Sysoev
On Fri, Oct 02, 2009 at 05:06:46PM +0400, Igor Sysoev wrote:

> The TIME_WAIT sockets suddenly started to grow on a host running
> FreeBSD 7.2-STABLE, date=2009.09.06.23.59.59
> Usually there are 3,000-5,000 TIME_WAIT sockets on the host.
> However, today they started to grow, reached 110,000 sockets within an hour,
> and still remain at this level.
> net.inet.tcp.msl is 3.
> The host uptime is 24 days, 21:53.
> 
> I have saved a coredump and may try to help to debug the issue.

There are also 10 stuck LAST_ACK sockets.

"swi4: clock sio" is usually idle, however, if I run

netstat -an | grep TIME_WAIT | wc -l

then swi4 gets some CPU:

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
   11 root        1 171 ki31     0K    16K CPU1   1 112.0H 98.29% idle: cpu1
   12 root        1 171 ki31     0K    16K RUN    0 116.8H 94.78% idle: cpu0
   14 root        1 -32    -     0K    16K WAIT   0  13:11  1.66% swi4: clock
   26 root        1 -68    -     0K    16K WAIT   0 334:11  0.00% irq19: bge0


-- 
Igor Sysoev
http://sysoev.ru/en/


stuck TIME_WAIT sockets

2009-10-02 Thread Igor Sysoev
The TIME_WAIT sockets suddenly started to grow on a host running
FreeBSD 7.2-STABLE, date=2009.09.06.23.59.59
Usually there are 3,000-5,000 TIME_WAIT sockets on the host.
However, today they started to grow, reached 110,000 sockets within an hour,
and still remain at this level.
net.inet.tcp.msl is 3.
The host uptime is 24 days, 21:53.

I have saved a coredump and may try to help to debug the issue.


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: bge interrupt coalescing sysctls

2009-06-18 Thread Igor Sysoev
On Thu, Jun 11, 2009 at 11:54:29AM +1000, Bruce Evans wrote:

> On Wed, 10 Jun 2009, Igor Sysoev wrote:
> 
> >For a long time I used Bruce Evans' patch to tune bge interrupt coalescing:
> >http://lists.freebsd.org/pipermail/freebsd-net/2007-November/015956.html
> >However, a recent commit, SVN r192478 in 7-STABLE (r192127 in HEAD), broke
> >the patch. I'm not sure how to resolve the conflict, and since I do not
> >use dynamic tuning
> 
> That commit looked ugly (lots of internal API changes and bloat in interrupt
> handlers in many network drivers to support polling which mostly shouldn't
> be supported at all and mostly doesn't use the interrupt handlers).
> 
> >I have left only the static coalescing parameters in the patch
> >and have added a loader tunable to set the number of receive descriptors and
> >a read-only sysctl to read the tunable. I usually use these parameters:
> >
> >/boot/loader.conf:
> >hw.bge.rxd=512
> >
> >/etc/sysctl.conf:
> >dev.bge.0.rx_coal_ticks=500
> >dev.bge.0.tx_coal_ticks=1
> >dev.bge.0.rx_max_coal_bds=64
> 
> These rx settings give too high a latency for me.

Probably; however, I use this on a host that handles only 6000 packets/s.

> >dev.bge.0.tx_max_coal_bds=128
> ># apply the above parameters
> >dev.bge.0.program_coal=1
> >
> >Could anyone commit it ?
> 
> Not me, sorry.
> 
> The patch is quite clean.  If I committed then I would commit the
> dynamic coalescing configuration separately anyway.

So do you have any objections if someone else commits this patch?

> You can probably make hw.bge.rxd a sysctl too (it would take a down/up
> to get it changed, but that is already needed for too many parameters
> in network drivers anyway).  I should use a sysctl for the ifq length
> too.  This could be done at a high level for each driver.  Limiting
> queue lengths may be a good way to reduce cache misses, while increasing
> them is sometimes good for reducing packet loss.

Do you mean a simple command sequence:

sysctl hw.bge.rxd=512
ifconfig bge0 down
ifconfig bge0 up

or SYSCTL_ADD_PROC for hw.bge.rxd ?


-- 
Igor Sysoev
http://sysoev.ru/en/


bge interrupt coalescing sysctls

2009-06-10 Thread Igor Sysoev
For a long time I used Bruce Evans' patch to tune bge interrupt coalescing:
http://lists.freebsd.org/pipermail/freebsd-net/2007-November/015956.html
However, a recent commit, SVN r192478 in 7-STABLE (r192127 in HEAD), broke
the patch. I'm not sure how to resolve the conflict, and since I do not
use dynamic tuning, I have left only the static coalescing parameters in the
patch and have added a loader tunable to set the number of receive descriptors
and a read-only sysctl to read the tunable. I usually use these parameters:

/boot/loader.conf:
hw.bge.rxd=512

/etc/sysctl.conf:
dev.bge.0.rx_coal_ticks=500
dev.bge.0.tx_coal_ticks=1
dev.bge.0.rx_max_coal_bds=64
dev.bge.0.tx_max_coal_bds=128
# apply the above parameters
dev.bge.0.program_coal=1

Could anyone commit it ?


-- 
Igor Sysoev
http://sysoev.ru/en/
--- sys/dev/bge/if_bge.c2009-05-21 01:17:10.0 +0400
+++ sys/dev/bge/if_bge.c2009-06-05 13:45:49.0 +0400
@@ -447,12 +447,16 @@
 DRIVER_MODULE(miibus, bge, miibus_driver, miibus_devclass, 0, 0);
 
 static int bge_allow_asf = 0;
+static int bge_rxd = BGE_SSLOTS;
 
 TUNABLE_INT("hw.bge.allow_asf", &bge_allow_asf);
+TUNABLE_INT("hw.bge.rxd", &bge_rxd);
 
 SYSCTL_NODE(_hw, OID_AUTO, bge, CTLFLAG_RD, 0, "BGE driver parameters");
 SYSCTL_INT(_hw_bge, OID_AUTO, allow_asf, CTLFLAG_RD, &bge_allow_asf, 0,
"Allow ASF mode if available");
+SYSCTL_INT(_hw_bge, OID_AUTO, rxd, CTLFLAG_RD, &bge_rxd, 0,
+   "Number of receive descriptors");
 
#define	SPARC64_BLADE_1500_MODEL	"SUNW,Sun-Blade-1500"
#define	SPARC64_BLADE_1500_PATH_BGE	"/p...@1f,70/netw...@2"
@@ -1008,21 +1012,15 @@
return (0);
 }
 
-/*
- * The standard receive ring has 512 entries in it. At 2K per mbuf cluster,
- * that's 1MB or memory, which is a lot. For now, we fill only the first
- * 256 ring entries and hope that our CPU is fast enough to keep up with
- * the NIC.
- */
 static int
 bge_init_rx_ring_std(struct bge_softc *sc)
 {
int i;
 
-   for (i = 0; i < BGE_SSLOTS; i++) {
+   for (i = 0; i < bge_rxd; i++) {
if (bge_newbuf_std(sc, i, NULL) == ENOBUFS)
return (ENOBUFS);
-   };
+   }
 
bus_dmamap_sync(sc->bge_cdata.bge_rx_std_ring_tag,
sc->bge_cdata.bge_rx_std_ring_map,
@@ -2383,6 +2381,52 @@
 #endif
 
 static int
+bge_sysctl_program_coal(SYSCTL_HANDLER_ARGS)
+{
+   struct bge_softc *sc;
+   int error, i, val;
+
+   val = 0;
+   error = sysctl_handle_int(oidp, &val, 0, req);
+   if (error != 0 || req->newptr == NULL)
+   return (error);
+	sc = arg1;
+   BGE_LOCK(sc);
+
+   /* XXX cut from bge_blockinit(): */
+
+   /* Disable host coalescing until we get it set up */
+   CSR_WRITE_4(sc, BGE_HCC_MODE, 0x00000000);
+
+   /* Poll to make sure it's shut down. */
+   for (i = 0; i < BGE_TIMEOUT; i++) {
+   if (!(CSR_READ_4(sc, BGE_HCC_MODE) & BGE_HCCMODE_ENABLE))
+   break;
+   DELAY(10);
+   }
+
+   if (i == BGE_TIMEOUT) {
+   device_printf(sc->bge_dev,
+   "host coalescing engine failed to idle\n");
+   CSR_WRITE_4(sc, BGE_HCC_MODE, BGE_HCCMODE_ENABLE);
+   BGE_UNLOCK(sc);
+   return (ENXIO);
+   }
+
+   /* Set up host coalescing defaults */
+   CSR_WRITE_4(sc, BGE_HCC_RX_COAL_TICKS, sc->bge_rx_coal_ticks);
+   CSR_WRITE_4(sc, BGE_HCC_TX_COAL_TICKS, sc->bge_tx_coal_ticks);
+   CSR_WRITE_4(sc, BGE_HCC_RX_MAX_COAL_BDS, sc->bge_rx_max_coal_bds);
+   CSR_WRITE_4(sc, BGE_HCC_TX_MAX_COAL_BDS, sc->bge_tx_max_coal_bds);
+
+   /* Turn on host coalescing state machine */
+   CSR_WRITE_4(sc, BGE_HCC_MODE, BGE_HCCMODE_ENABLE);
+
+   BGE_UNLOCK(sc);
+   return (0);
+}
+
+static int
 bge_attach(device_t dev)
 {
struct ifnet *ifp;
@@ -4495,6 +4539,19 @@
ctx = device_get_sysctl_ctx(sc->bge_dev);
children = SYSCTL_CHILDREN(device_get_sysctl_tree(sc->bge_dev));
 
+   SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "program_coal",
+   CTLTYPE_INT | CTLFLAG_RW,
+   sc, 0, bge_sysctl_program_coal, "I",
+   "program bge coalescence values");
+   SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "rx_coal_ticks", CTLFLAG_RW,
+   &sc->bge_rx_coal_ticks, 0, "");
+   SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "tx_coal_ticks", CTLFLAG_RW,
+   &sc->bge_tx_coal_ticks, 0, "");
+   SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "rx_max_coal_bds", CTLFLAG_RW,
+   &sc->bge_rx_max_coal_bds, 0, "");
+   SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "tx_max_coal_bds", CTLFLAG_RW,
+   &sc->bge_tx_max_coal_bds, 0, "");

Re: FIB MFC

2008-07-24 Thread Igor Sysoev
On Thu, Jul 24, 2008 at 09:44:15AM -0700, Julian Elischer wrote:

> Igor Sysoev wrote:
> >On Thu, Jul 24, 2008 at 08:33:09AM -0700, Julian Elischer wrote:
> >
> 
> 
> >>I was thinking that it might be possible to tag a socket to accept the 
> >>fib of the packet coming in, but if we do this, we should decide
> >>API to label a socket in this way..
> >
> >I think there should be a sysctl to globally enable TCP FIB inheritance.
> >The API already exists: sockopt(SO_SETFIB) for the listening socket.
> 
> But a socket ALWAYS has a fib, even if you do nothing
> because every process has a fib (usually 0)
> so you need a new bit of state somewhere that means "inherit".
> (I guess in the socket flags).

I see.

> Possibly the FIB value of -1 when applied on a socket option might
> signify that behaviour. (thus save us a new sockopt).
> But such a value would revert to that of the process if the socket was 
> not used as a listen socket. (or clear itself).

-1 is a good variant.

> I have some MRT enhancements in the pipeline and will include this if
> I can.
> 
> BTW could you send me the diff for ipfw(8)?
> I'll compare it with the one I'm about to commit.

This is exactly your already committed 1.108.2.9.


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: FIB MFC

2008-07-24 Thread Igor Sysoev
On Thu, Jul 24, 2008 at 08:33:09AM -0700, Julian Elischer wrote:

> Igor Sysoev wrote:
> >Julian, thank you for FIB. I have tried it on FreeBSD-7.
> >
> >I've found that ipfw does not know about setfib:
> >ipfw: invalid action setfib
> >
> 
> Oh I have not finished MFC..
> will finish today..
> 
> the svn server crashed last night .. :-/
> (or at least went very strange) while I was working on this so I
> went to bed.
> 
> 
> 
> >Therefore I've added missing part from CURRENT.
> >Then I have tried the following configuration:
> >
> >vlan1: 10.0.0.100
> >vlan2: 192.168.1.100
> >
> >route add default 10.0.0.1
> >setfib 1 route add default 192.168.1.1
> >ipfw add setfib 1 ip from any to any in via vlan2
> >
> >I expected that outgoing packets of a TCP connection established
> >via vlan2 would be routed to 192.168.1.1, but this did not happen.
> >The packets went to 10.0.0.1 via vlan1:
> 
> no, while this does make sense, the fib is only used for outgoing
> packets and the fib of local sockets is set by the process that opens 
> the socket. (either with setfib(2) or sockopt(SETFIB))
> 
> I was thinking that it might be possible to tag a socket to accept the 
> fib of the packet coming in, but if we do this, we should decide
> API to label a socket in this way..

I think there should be a sysctl to globally enable TCP FIB inheritance.
The API already exists: sockopt(SO_SETFIB) for the listening socket.

> It is an excellent idea however, and I don't know why I didn't
> do it already..
> 
> >
> >tcp4   0  0  192.168.1.100.80   XX  SYN_RCVD
> >tcp4   0  0  192.168.1.100.80   XX  SYN_RCVD
> >tcp4   0  0  192.168.1.100.80   XX  SYN_RCVD
> >
> >Can a TCP connection inherit the FIB from the first SYN packet or not?
> 
> no but it is a good idea.


-- 
Igor Sysoev
http://sysoev.ru/en/


FIB MFC

2008-07-24 Thread Igor Sysoev
Julian, thank you for FIB. I have tried it on FreeBSD-7.

I've found that ipfw does not know about setfib:
ipfw: invalid action setfib

Therefore I've added missing part from CURRENT.
Then I have tried the following configuration:

vlan1: 10.0.0.100
vlan2: 192.168.1.100

route add default 10.0.0.1
setfib 1 route add default 192.168.1.1
ipfw add setfib 1 ip from any to any in via vlan2

I expected that outgoing packets of a TCP connection established
via vlan2 would be routed to 192.168.1.1, but this did not happen.
The packets went to 10.0.0.1 via vlan1:

tcp4   0  0  192.168.1.100.80   XX  SYN_RCVD
tcp4   0  0  192.168.1.100.80   XX  SYN_RCVD
tcp4   0  0  192.168.1.100.80   XX  SYN_RCVD

Can a TCP connection inherit the FIB from the first SYN packet or not?


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: Multiple routing tables in action...

2008-05-11 Thread Igor Sysoev
On Tue, Apr 29, 2008 at 12:11:03PM -0700, Julian Elischer wrote:

> >Then you can export RIB entries , say 
> >you have 5 BGP peers and you want to export 2 or 3 or all of them into 
> >the 'main' routing instance you can set up a policy to add those learned 
> >routes into the main instance and v-v.
> >Linux behaves a little bit differently as you have to make an 'ip rule' 
> >entry for it but it doesn't use the firewall.
> 
> for now this code asks you to use a firewall to classify incoming 
> packets..
> 
> e.g.
> 100 setfib 2 ip from any to any in recv em0

Is it possible to extend ifconfig to classify incoming packets?


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: zonelimit issues...

2008-04-24 Thread Igor Sysoev
On Mon, Apr 21, 2008 at 03:27:53PM +0400, Igor Sysoev wrote:

> The problem is that FreeBSD has a small KVA space: only 2G even on amd64
> machines with 32G.
> 
> So with
> 
> vm.kmem_size=1G
> # 64M KVA
> kern.maxbcache=64M
> # 4M KVA
> kern.ipc.maxpipekva=4M
> 
> 
> I can use something like this:
> 
> # 256M KVA/KVM
> kern.ipc.nmbjumbop=64000
> # 216M KVA/KVM
> kern.ipc.nmbclusters=98304
> # 162M KVA/KVM
> kern.ipc.maxsockets=163840
> # 8M KVA/KVM
> net.inet.tcp.maxtcptw=163840
> # 24M KVA/KVM
> kern.maxfiles=204800

Actually, on amd64 it is possible to increase KVM up to 1.8G without
a boot-time panic:

vm.kmem_size=1844M
# 64M KVA
kern.maxbcache=64M
# 4M KVA
kern.ipc.maxpipekva=4M

Without decreasing kern.maxbcache (200M by default) and
kern.ipc.maxpipekva (~40M by default) you can get only about 1.5G.

So with 1.8G KVM I am able to set:

# 4G phys, 2G KVA, 1.8G KVM
#
# 750M KVA/KVM
kern.ipc.nmbjumbop=192000
# 504M KVA/KVM
kern.ipc.nmbclusters=229376
# 334M KVA/KVM
kern.ipc.maxsockets=204800
# 8M KVA/KVM
net.inet.tcp.maxtcptw=163840
# 24M KVA/KVM
kern.maxfiles=204800

Now KVA is split as

kernel code    8M
kmem_map       1844M
buffer_map     64M
pager_map      32M
exec_map       4.2M
pipe_map       4M
???            60M
vm.kvm_free    32M

I leave a spare 32M of free KVA (vm.kvm_free) because some map
(unknown to me) after pipe_map may grow slightly. If vm.kvm_free
becomes 0, the kernel will panic.


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: bge loader tunables

2008-04-22 Thread Igor Sysoev
On Tue, Apr 22, 2008 at 12:20:38AM +0400, Igor Sysoev wrote:

> Finally I have tested your second (without debug stuff) patch in
> production environment (~45K in/out packets/s) on FreeBSD 7.0-STABLE.
> I think it should be committed.
> 
> I use my usual static settings in /etc/sysctl.conf:
> 
> dev.bge.0.dyncoal_max_intr_freq=0
> #
> dev.bge.0.rx_coal_ticks=500
> dev.bge.0.tx_coal_ticks=1
> dev.bge.0.rx_max_coal_bds=64
> dev.bge.0.tx_max_coal_bds=128
> # apply the above parameters
> dev.bge.0.program_coal=0
> 
> and have about only 1700-1900 interrupts per second.
> 
> The only issue was at boot time:
> 
> dev.bge.0.dyncoal_max_intr_freq: 1 -> 0
> dev.bge.0.rx_coal_ticks: 0 -> 500
> dev.bge.0.tx_coal_ticks: 100 -> 1
> dev.bge.0.rx_max_coal_bds: 128 -> 64
> dev.bge.0.tx_max_coal_bds: 384 -> 128
> ...
> bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
> options=9b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM>
> ...
> Local package initialization:
> ...
> dev.bge.0.rx_coal_ticks: 150 -> 500
> 
> Disabling dyncoal_max_intr_freq while bringing bge up resets rx_coal_ticks
> to 150.

I had to use
dev.bge.0.program_coal=1

in /etc/sysctl.conf, otherwise /etc/rc.d/sysctl does not call it at all.


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: bge loader tunables

2008-04-21 Thread Igor Sysoev
On Sat, Nov 17, 2007 at 09:13:50PM +1100, Bruce Evans wrote:

> On Sat, 17 Nov 2007, Igor Sysoev wrote:
> 
> >On Sat, Nov 17, 2007 at 08:30:58AM +1100, Bruce Evans wrote:
> >
> >>On Fri, 16 Nov 2007, Igor Sysoev wrote:
> >>
> >>>The attached patch creates the following bge loader tunables:
> >>
> >>I plan to commit old work to do this using sysctls.  Tunables are
> >>harder to use and aren't needed since changes to the defaults aren't
> >>needed for booting.  I also implemented dynamic tuning for rx coal
> >>parameters so that the sysctls are mostly not needed.  Ask for patches
> >>if you want to test this extensively.
> >
> >Yes, I can test your patches on 6.2 and 7.0.
> >Now bge sets the coalescing parameters at attach time.
> >Do the sysctls allow changing them on the fly?
> >How does rx dynamic tuning work?
> >Could it be turned off?
> 
> OK, the patch is enclosed at the end, in 2 versions:
> - all my patches for bge (with lots of debugging cruft and half-baked
>   fixes for 5705+ sysctls.
> - edited version with only the coalescing parameter changes.
> 
> I haven't used it under 6.2, but have used a similar version in ~5.2,
> and it should work in 6.2 except for the 5705+ sysctl fixes.
> 
> bge actually sets parameters at init time, and it initializes whenever the
> link is brought back up, so the parameters can be changed using
> "ifconfig bgeN down up".  Several network drivers have interrupt moderation
> parameters that can be changed in this way, but it is painful to change
> the link status like that, so I have a sysctl dev.bge.N.program_coal to
> apply the current parameters to the hardware.  The other sysctls to change
> the parameters don't apply immediately, except the one for the rx tuning
> max interrupt rate, since applying the changed parameters to the hardware
> takes more code than a SYSCTL_INT(), and it is sometimes necessary to
> change all the parameters together atomically.
> 
> Dynamic tuning works by monitoring the current rx packet rate and
> increasing the active rx_max_coal_bds so that the ratio <current rx rate> /
> rx_max_coal_bds is usually <= the specified max rx interrupt
> rate.  rx_coal_ticks is set to the constant value of the inverse of
> the specified max rx interrupt rate (in ticks) on transition to dynamic
> mode but IIRC is not changed when the dynamic rate is changed (not
> always changing it automatically allows adjusting it independently of
> the rate but is often not what is wanted).  The transition has some
> bias towards lower latency over too many interrupts, so that short
> bursts don't increase the latency.  I think this simple algorithm is
> good enough provided the load (in rx packets/second) doesn't oscillate
> rapidly.
> 
> Dynamic tuning requires efficient reprogramming of at least one of the
> hardware coal registers so that the tuning can respond rapidly to changes.
> I have 2 methods for this:
> - bge_careful_coal = 1 avoids a potentially very long
>   busy-wait loop in the interrupt handler by giving up on reprogramming
>   the host coalescing engine (HCE) if the HCE seems to be busy.  Docs
>   seem to require waiting for up to several milliseconds for the HCE
>   to stablilize, and it is not clear if it is possible for the HCE to
>   never stabilize because packets are streaming in.  (I don't have
>   proper docs.)  This seems to always work (the HCE is never busy)
>   for rx_max_coal_bds, but something near here didn't work for
>   changing rx_coal_ticks in an old version.
> - bge_careful_coal = 0 avoids the loop by writing to the rx_max_coal_bds
>   register without waiting for the HCE.  This seems to work too.  It
>   isn't critical for the HCE to see the change immediately or even
> for it to be seen at all (missed changes might do no more than give a
>   huge interrupt rate for too long), but it is important for the
>   change to not break the engine.
> There is no sysctl for this or for some other hackish parameters.  The
> source must be edited to change this from 1 to 0.
> 
> Dynamic tuning is turned off by setting the dynamic max interrupt
> frequency to 0.  Then rx_coal_ticks is reset to 150, and the active
> rx_max_coal_bds is restored to the static value.

Finally I have tested your second (without debug stuff) patch in
production environment (~45K in/out packets/s) on FreeBSD 7.0-STABLE.
I think it should be committed.

I use my usual static settings in /etc/sysctl.conf:

dev.bge.0.dyncoal_max_intr_freq=0
#
dev.bge.0.rx_coal_ticks=500
dev.bge.0.tx_coal_ticks=1
dev.bge.0.rx_max_coal_bds=64
dev.bge.0.tx_max_coal_bds=128
# apply the above parameters
dev.bge.0.program_coal=0

Re: zonelimit issues...

2008-04-21 Thread Igor Sysoev
On Mon, Apr 21, 2008 at 05:16:28PM +0900, [EMAIL PROTECTED] wrote:

> At Mon, 21 Apr 2008 16:46:00 +0900,
> [EMAIL PROTECTED] wrote:
> > 
> > At Sun, 20 Apr 2008 10:32:25 +0100 (BST),
> > rwatson wrote:
> > > 
> > > 
> > > On Fri, 18 Apr 2008, [EMAIL PROTECTED] wrote:
> > > 
> > > > I am wondering why this patch was never committed?
> > > >
> > > > http://people.freebsd.org/~delphij/misc/patch-zonelimit-workaround
> > > >
> > > > It does seem to address an issue I'm seeing where processes get into 
> > > > the 
> > > > zonelimit state through the use of mbufs (a high speed UDP packet 
> > > > receiver) 
> > > > but even after network pressure is reduced/removed the process never 
> > > > gets 
> > > > out of that state again.  Applying the patch fixed the issue, but I'd 
> > > > like 
> > > > to have some discussion as to the general merits of the approach.
> > > >
> > > > Unfortunately the test that currently causes this is tied very tightly 
> > > > to 
> > > > code at work that I can't share, but I will hopefully be improving 
> > > > mctest to 
> > > > try to exhibit this behavior.
> > > 
> > > When you take all load off the system, do mbufs and clusters get properly 
> > > freed back to UMA (as visible in netstat -m)?  If not, continuing to bump 
> > > up 
> > > against the zonelimit would suggest an mbuf/cluster leak, in which case 
> > > we 
> > > need to track that bug.
> > > 
> > 
> > This is unclear as the process that creates the issue opens 50 UDP
> > multicast sockets with very large socket buffers.  I am investigating
> > this aspect some more.
> > 
> 
> OK, yes, the clusters etc. go back to normal when the incoming
> pressure is released.  I do not believe we have a cluster/mbuf leak.

There is no cluster/mbuf leak.

The problem is that FreeBSD has a small KVA space: only 2G even on amd64
machines with 32G.

So with

vm.kmem_size=1G
# 64M KVA
kern.maxbcache=64M
# 4M KVA
kern.ipc.maxpipekva=4M


I can use something like this:

# 256M KVA/KVM
kern.ipc.nmbjumbop=64000
# 216M KVA/KVM
kern.ipc.nmbclusters=98304
# 162M KVA/KVM
kern.ipc.maxsockets=163840
# 8M KVA/KVM
net.inet.tcp.maxtcptw=163840
# 24M KVA/KVM
kern.maxfiles=204800


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: bge loader tunables

2007-11-16 Thread Igor Sysoev
On Sat, Nov 17, 2007 at 08:30:58AM +1100, Bruce Evans wrote:

> On Fri, 16 Nov 2007, Igor Sysoev wrote:
> 
> >The attached patch creates the following bge loader tunables:
> 
> I plan to commit old work to do this using sysctls.  Tunables are
> harder to use and aren't needed since changes to the defaults aren't
> needed for booting.  I also implemented dynamic tuning for rx coal
> parameters so that the sysctls are mostly not needed.  Ask for patches
> if you want to test this extensively.

Yes, I can test your patches on 6.2 and 7.0.
Now bge sets the coalescing parameters at attach time.
Do the sysctls allow changing them on the fly?
How does rx dynamic tuning work?
Could it be turned off?

> >hw.bge.rxd=512
> >
> >Number of standard receive descriptors allocated by the driver.
> >The default value is 256. The maximum value is 512.
> 
> I always use 512 for this.  The corresponding value for jumbo buffers
> is hard-coded (JSLOTS exists to tune the value at config time, like
> SSLOTS does for this, but is no longer used).  Only machines with a
> small amount of memory should care about the wastage from always
> allocating the max number of descriptors.

I agree: the default jumbo rx ring takes 256*9216=2.3M, while the maximum
standard rx ring takes only 512*2048=1M; nevertheless, it is limited by
default to 256*2048=512K.

> >hw.bge.rx_int_delay=500
> >
> >This value delays the generation of receive interrupts in microseconds.
> >The default value is 150 microseconds.
> 
> This is a good default.  I normally use 100 (goes with dynamic tuning to
> limit the rx interrupt rate to 10 kHz).
> 
> >hw.bge.tx_int_delay=500
> >
> >This value delays the generation of transmit interrupts in microseconds.
> >The default value is 150 microseconds.
> 
> I use 1 second.  Infinity works right, except it wastes mbufs when the
> tx is idle for a long time.

It seems 1 second is good for me: I use sendfile() and a lot of mbuf clusters:
kern.ipc.nmbclusters=196608

> >hw.bge.rx_coal_desc=64
> >
> >This value delays the generation of receive interrupts until the specified
> >number of packets has been received. The default value is 10.
> 
> 64 is a good default.  10 is a bad default (it optimizes too much for
> latency at a cost of efficiency to be good). I use 1 when optimizing
> for latency.  Dynamic tuning sets this to a value suitable for limiting
> the rx interrupt rate to a specified frequency (10 kHz is a good limit).
> 
> >hw.bge.tx_coal_desc=128
> >
> >This value delays the generation of transmit interrupts until the specified
> >number of packets has been transmitted. The default value is 10.
> 
> 128 is a good default.  I use 384.  There are few latency issues here, so
> the default of 10 mainly costs efficiency.

Does 384 not delay tx if there is a shortage of free tx descriptors?


-- 
Igor Sysoev
http://sysoev.ru/en/


bge loader tunables

2007-11-16 Thread Igor Sysoev
The attached patch creates the following bge loader tunables:

hw.bge.rxd=512

Number of standard receive descriptors allocated by the driver.
The default value is 256. The maximum value is 512.

hw.bge.rx_int_delay=500

This value delays the generation of receive interrupts in microseconds.
The default value is 150 microseconds.

hw.bge.tx_int_delay=500

This value delays the generation of transmit interrupts in microseconds.
The default value is 150 microseconds.

hw.bge.rx_coal_desc=64

This value delays the generation of receive interrupts until the specified
number of packets has been received. The default value is 10.

hw.bge.tx_coal_desc=128

This value delays the generation of transmit interrupts until the specified
number of packets has been transmitted. The default value is 10.


-- 
Igor Sysoev
http://sysoev.ru/en/
--- sys/dev/bge/if_bge.c	2007-09-30 15:05:14.0 +0400
+++ sys/dev/bge/if_bge.c	2007-11-15 23:01:57.0 +0300
@@ -426,8 +426,18 @@
 DRIVER_MODULE(miibus, bge, miibus_driver, miibus_devclass, 0, 0);
 
 static int bge_allow_asf = 0;
+static int bge_rxd = BGE_SSLOTS;
+static int bge_rx_coal_ticks = 150;
+static int bge_tx_coal_ticks = 150;
+static int bge_rx_max_coal_bds = 10;
+static int bge_tx_max_coal_bds = 10;
 
 TUNABLE_INT("hw.bge.allow_asf", &bge_allow_asf);
+TUNABLE_INT("hw.bge.rxd", &bge_rxd);
+TUNABLE_INT("hw.bge.rx_int_delay", &bge_rx_coal_ticks);
+TUNABLE_INT("hw.bge.tx_int_delay", &bge_tx_coal_ticks);
+TUNABLE_INT("hw.bge.rx_coal_desc", &bge_rx_max_coal_bds);
+TUNABLE_INT("hw.bge.tx_coal_desc", &bge_tx_max_coal_bds);
 
 SYSCTL_NODE(_hw, OID_AUTO, bge, CTLFLAG_RD, 0, "BGE driver parameters");
 SYSCTL_INT(_hw_bge, OID_AUTO, allow_asf, CTLFLAG_RD, &bge_allow_asf, 0,
@@ -877,10 +887,10 @@
 {
 	int i;
 
-	for (i = 0; i < BGE_SSLOTS; i++) {
+	for (i = 0; i < bge_rxd; i++) {
 		if (bge_newbuf_std(sc, i, NULL) == ENOBUFS)
 			return (ENOBUFS);
-	};
+	}
 
 	bus_dmamap_sync(sc->bge_cdata.bge_rx_std_ring_tag,
 	sc->bge_cdata.bge_rx_std_ring_map,
@@ -2453,10 +2463,10 @@
 
 	/* Set default tuneable values. */
 	sc->bge_stat_ticks = BGE_TICKS_PER_SEC;
-	sc->bge_rx_coal_ticks = 150;
-	sc->bge_tx_coal_ticks = 150;
-	sc->bge_rx_max_coal_bds = 10;
-	sc->bge_tx_max_coal_bds = 10;
+	sc->bge_rx_coal_ticks = bge_rx_coal_ticks;
+	sc->bge_tx_coal_ticks = bge_tx_coal_ticks;
+	sc->bge_rx_max_coal_bds = bge_rx_max_coal_bds;
+	sc->bge_tx_max_coal_bds = bge_tx_max_coal_bds;
 
 	/* Set up ifnet structure */
 	ifp = sc->bge_ifp = if_alloc(IFT_ETHER);

setup_loopback() in /etc/rc.firewall

2007-10-19 Thread Igor Sysoev
Since revision 1.49 of src/etc/rc.firewall, setup_loopback() is called for any
firewall type, including a custom firewall defined by a filename.

I think setup_loopback() should be called only for the predefined firewall types.


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: Add socket related statistics to netstat(1)?

2007-08-29 Thread Igor Sysoev
On Wed, Aug 29, 2007 at 02:39:57PM +0100, Robert Watson wrote:

> 
> On Wed, 29 Aug 2007, Igor Sysoev wrote:
> 
> >On Wed, Aug 29, 2007 at 02:48:57PM +0800, LI Xin wrote:
> >
> >>Here is a proof-of-concept patch that adds sockets related statistics to 
> >>netstat(1)'s -m option, which could make SA's life easier.  Inspired by a 
> >>local user's suggestion.
> >>
> >>Comments?
> >
> >I think socket info should be grouped together:
> 
> The netstat -m output is getting quite cluttered these days, isn't it.  I 
> wonder if we should be laying it out a bit more consistently, perhaps 
> something like:
> 
>                     current   cache   total     max
> mbufs                  2407    1058    3465       -
> mbuf clusters          1117     797    1914   98304
> mbufs + clusters       1117      90       -       -
> 4k jumbo clusters       761     417    1178       0
> ...
> 
> It's less compact but possibly quite a bit more readable...

I agree - it's much better; however, someone may argue that it
will break statistics scripts.
Maybe we should use another switch.

> >2407/1058/3465 mbufs in use (current/cache/total)
> >1117/797/1914/98304 mbuf clusters in use (current/cache/total/max)
> >1117/90 mbuf+clusters out of packet secondary zone in use (current/cache)
> >761/417/1178/0 4k (page size) jumbo clusters in use 
> >(current/cache/total/max)
> >0/0/0/0 9k jumbo clusters in use (current/cache/total/max)
> >0/0/0/0 16k jumbo clusters in use (current/cache/total/max)
> >5879K/3526K/9406K bytes allocated to network (current/cache/total)
> >0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> >0/0/0 requests for jumbo clusters denied (4k/9k/16k)
> >15333/15537/30870/204800 socket UMA in use (current/cache/total/max)
> >5929K bytes allocated to socket
> >0 request for socket UMA denied
> >104/264/6656 sfbufs in use (current/peak/max)
> >0 requests for sfbufs denied
> >0 requests for sfbufs delayed
> >135834 requests for I/O initiated by sendfile
> >0 calls to protocol drain routines
> >
> >Second, I think socket memory calculation should include
> >tcpcb, udpcb, inpcb, unpcb and probably tcptw items.
> >
> >
> >-- 
> >Igor Sysoev
> >http://sysoev.ru/en/
> >___
> >freebsd-net@freebsd.org mailing list
> >http://lists.freebsd.org/mailman/listinfo/freebsd-net
> >To unsubscribe, send any mail to "[EMAIL PROTECTED]"
> >

-- 
Igor Sysoev
http://sysoev.ru/en/


Re: Add socket related statistics to netstat(1)?

2007-08-29 Thread Igor Sysoev
On Wed, Aug 29, 2007 at 02:48:57PM +0800, LI Xin wrote:

> Here is a proof-of-concept patch that adds socket-related statistics to
> netstat(1)'s -m option, which could make SA's life easier.  Inspired by
> a local user's suggestion.
> 
> Comments?

I think socket info should be grouped together:

2407/1058/3465 mbufs in use (current/cache/total)
1117/797/1914/98304 mbuf clusters in use (current/cache/total/max)
1117/90 mbuf+clusters out of packet secondary zone in use (current/cache)
761/417/1178/0 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/0 9k jumbo clusters in use (current/cache/total/max)
0/0/0/0 16k jumbo clusters in use (current/cache/total/max)
5879K/3526K/9406K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
15333/15537/30870/204800 socket UMA in use (current/cache/total/max)
5929K bytes allocated to socket
0 request for socket UMA denied
104/264/6656 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
135834 requests for I/O initiated by sendfile
0 calls to protocol drain routines

Second, I think socket memory calculation should include
tcpcb, udpcb, inpcb, unpcb and probably tcptw items.


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: maximum number of outgoing connections

2007-08-22 Thread Igor Sysoev
On Mon, Aug 20, 2007 at 10:30:12PM +0400, Igor Sysoev wrote:

> On Mon, Aug 20, 2007 at 09:53:55AM -0700, John-Mark Gurney wrote:
> 
> > Igor Sysoev wrote this message on Mon, Aug 20, 2007 at 19:11 +0400:
> > > It seems that FreeBSD can not make more than
> > > 
> > > net.inet.ip.portrange.last - net.inet.ip.portrange.first
> > > 
> > > simultaneous outgoing connections, i.e., no more than about 64k.
> > > 
> > > If I made ~64000 connections 127.0.0.1: > 127.0.0.1:80, then
> > > connect() to an external address returns EADDRNOTAVAIL.
> > 
> > Isn't this more of a limitation of TCP/IP than FreeBSD?  because you
> > need to treat the srcip/srcport/dstip/dstport as a unique value, and
> > in your test, you are only changing one of the four...  Have you tried
> > running a second web server on port 8080, and see if you can connect
> > another ~64000 connections to that port too?
> 
> No, TCP/IP limitation is for  in 127.0.0.1: <> 127.0.0.1:80,
> but FreeBSD limits all outgoing connections to the port range, i.e.
> 
> local part  remote part
>   127.0.0.1:5000 <> 127.0.0.1:80
> 192.168.1.1:5000 <> 10.0.0.1:25
> 
> can not exist simultaneously, if both connections were started from
> local host.

To be exact - if connect() was called on unbound socket.


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: maximum number of outgoing connections

2007-08-20 Thread Igor Sysoev
On Mon, Aug 20, 2007 at 09:53:55AM -0700, John-Mark Gurney wrote:

> Igor Sysoev wrote this message on Mon, Aug 20, 2007 at 19:11 +0400:
> > It seems that FreeBSD can not make more than
> > 
> > net.inet.ip.portrange.last - net.inet.ip.portrange.first
> > 
> > simultaneous outgoing connections, i.e., no more than about 64k.
> > 
> > If I made ~64000 connections 127.0.0.1: > 127.0.0.1:80, then
> > connect() to an external address returns EADDRNOTAVAIL.
> 
> Isn't this more of a limitation of TCP/IP than FreeBSD?  because you
> need to treat the srcip/srcport/dstip/dstport as a unique value, and
> in your test, you are only changing one of the four...  Have you tried
> running a second web server on port 8080, and see if you can connect
> another ~64000 connections to that port too?

No, TCP/IP limitation is for  in 127.0.0.1: <> 127.0.0.1:80,
but FreeBSD limits all outgoing connections to the port range, i.e.

local part  remote part
  127.0.0.1:5000 <> 127.0.0.1:80
192.168.1.1:5000 <> 10.0.0.1:25

can not exist simultaneously, if both connections were started from
local host.

I can not write a simple test-case program, but I can offer a simple setup:

cd /usr/ports/www/nginx && make install

create simple nginx.conf:


events {
worker_connections  2;
}

http {
server {
listen8080;
server_name   test;

location = /loop {
proxy_pass  http://127.0.0.1:8080;

error_page  502 = /yahoo;
}

location = /yahoo {
proxy_pass  http://www.yahoo.com;
}
}
}


set

sysctl net.inet.ip.portrange.randomized=0
sysctl net.inet.ip.portrange.first=1024
sysctl net.inet.ip.portrange.last=5000

to see the case with default small number of files, sockets, etc.

and run as root:

/usr/local/sbin/nginx -c ./nginx.conf

then request http://host:8080/loop in a browser. nginx will cycle to itself, and
after the first error

2007/08/20 22:05:16 [crit] 29669#0: *94165 connect() to 127.0.0.1:8080 failed 
(49: Can't assign requested address) while connecting to upstream, client: 
127.0.0.1, server: test, URL: "/loop", upstream: "http://127.0.0.1:8080/loop";, 
host: "127.0.0.1:8080"

you will see the second error:

2007/08/20 22:05:16 [crit] 29669#0: *94165 connect() to 87.248.113.14:80 failed 
(49: Can't assign requested address) while connecting to upstream, client: 
127.0.0.1, server: test, URL: "/loop", upstream: 
"http://87.248.113.14:80/loop";, host: "127.0.0.1:8080"

If you think it may be nginx fault, run this under ktrace/truss and see
syscalls.


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: maximum number of outgoing connections

2007-08-20 Thread Igor Sysoev
On Mon, Aug 20, 2007 at 05:19:14PM +0100, Tom Judge wrote:

> Igor Sysoev wrote:
> >It seems that FreeBSD can not make more than
> >
> >net.inet.ip.portrange.last - net.inet.ip.portrange.first
> >
> >simultaneous outgoing connections, i.e., no more than about 64k.
> >
> >If I made ~64000 connections 127.0.0.1: > 127.0.0.1:80, then
> >connect() to an external address returns EADDRNOTAVAIL.
> >
> >net.inet.ip.portrange.randomized is 0.
> >
> >sockets, etc. are enough:
> >
> >ITEMSIZE LIMIT  USED  FREE  REQUESTS  FAILURES
> >socket:  356,   204809,13915,   146443, 148189452,0
> >inpcb:   180,   204820,20375,   137277, 147631805,0
> >tcpcb:   464,   204800,13882,   142102, 147631805,0
> >tcptw:48,41028, 6493,11213, 29804665,0
> >
> >I saw it on 6.2-STABLE.
> >
> >
> 
> In an ideal world (not sure if this is quite correct for FreeBSD) TCP 
> connections are tracked with a pair of tuples  source-addr:src-port -> 
> dst-addr:dst-port
> 
> As you're always connecting to the same destination service 127.0.0.1:80 
> and always from the same source IP 127.0.0.1, you only have one 
> variable left to change: the source port.  If you were to use the whole 
> of the port range minus the reserved ports you would only 
> ever be able to make 64512 simultaneous connections.  In order to make 
> more connections the first thing that you may want to start changing is 
> the source IP.  If you added a second IP to your lo0 interface (say 
> 127.0.0.2) and used a round-robin approach to making your outbound 
> connections then you could make around 129k outbound connections.

Connections to 127.0.0.1 were via lo0, external connections are via bge0.

> I am not sure if there are any other constraints that need to be taken 
> into account such as the maximum number of sockets, RAM etc

No, there are no constraints on memory, sockets, mbufs, clusters, etc.
If there were a memory constraint, FreeBSD would simply panic.
If there were a constraint on mbuf clusters, the process would get stuck
in the zonelimit state forever.

I suspect that the local address in in_pcbbind_setup() is 0.0.0.0, hence
the 64K limit.


-- 
Igor Sysoev
http://sysoev.ru/en/


maximum number of outgoing connections

2007-08-20 Thread Igor Sysoev
It seems that FreeBSD can not make more than

net.inet.ip.portrange.last - net.inet.ip.portrange.first

simultaneous outgoing connections, i.e., no more than about 64k.

If I made ~64000 connections 127.0.0.1: > 127.0.0.1:80, then
connect() to an external address returns EADDRNOTAVAIL.

net.inet.ip.portrange.randomized is 0.

sockets, etc. are enough:

ITEMSIZE LIMIT  USED  FREE  REQUESTS  FAILURES
socket:  356,   204809,13915,   146443, 148189452,0
inpcb:   180,   204820,20375,   137277, 147631805,0
tcpcb:   464,   204800,13882,   142102, 147631805,0
tcptw:48,41028, 6493,11213, 29804665,0

I saw it on 6.2-STABLE.


-- 
Igor Sysoev
http://sysoev.ru/en/


Re: syncookie in 6.x and 7.x

2007-08-19 Thread Igor Sysoev
On Sun, Aug 19, 2007 at 04:42:51AM -0500, Mike Silbersack wrote:

> On Thu, 16 Aug 2007, Igor Sysoev wrote:
> 
> >I have looked at the sources and found that in early versions the sent counter
> >was simply not incremented at all. The patch is attached.
> 
> The patch looks ready to commit to me.  Do you want me to commit it, or do 
> you have another committer lined up?

Feel free to commit.

> >After the patch has been applied I have found that 6 always sends
> >syncookies too; however, 6, unlike 7, never receives them. Why?
> 
> Have you tried patching 6 so that the syncache is non-functional and 
> forced it to rely on syncookies?  Last I checked (which was a long time 
> ago), syncookies worked on 6.  Adding a sysctl like 7's 
> net.inet.tcp.syncookies_only to 6 might not be a bad idea, as long as it's 
> behind #ifdef DIAGNOSTIC or INVARIANTS.

No, I have not tried.

> The question you may really be asking is:  Why does 7 *think* that it is 
> receiving syncookies all the time? :)
> 
> I haven't tried to answer that question yet.

I have found two 4.8 hosts:

17460166 syncache entries added
106312 retransmitted
90435 dupsyn
0 dropped
17424177 completed
465 bucket overflow
0 cache overflow
21526 reset
13725 stale
0 aborted
0 badack
279 unreach
0 zone failures
0 cookies sent
6 cookies received

1671768 syncache entries added
63163 retransmitted
37566 dupsyn
0 dropped
1645430 completed
248 bucket overflow
0 cache overflow
13144 reset
12888 stale
0 aborted
0 badack
174 unreach
0 zone failures
0 cookies sent
116 cookies received

and two 4.11 hosts:

5643772 syncache entries added
45993 retransmitted
41452 dupsyn
0 dropped
5630013 completed
298 bucket overflow
0 cache overflow
7374 reset
6030 stale
0 aborted
0 badack
93 unreach
0 zone failures
0 cookies sent
36 cookies received

141791272 syncache entries added
280354 retransmitted
273529 dupsyn
0 dropped
141703800 completed
206 bucket overflow
0 cache overflow
9847 reset
35570 stale
36034 aborted
0 badack
5854 unreach
0 zone failures
0 cookies sent
40 cookies received

I have found one 6.1-PRERELEASE host with 298 days of uptime:

2672792190 syncache entries added
83640383 retransmitted
77727918 dupsyn
282 dropped
2645872801 completed
0 bucket overflow
0 cache overflow
10974940 reset
15657014 stale
91 aborted
52 badack
287259 unreach
0 zone failures
0 cookies sent
8 cookies received

The 4.x hosts have uptimes from a week to a month.
Other 6.x hosts have small uptimes, and I do not see received cookies on them.
And I have no 5.x hosts at all.

Anyway, 7 receives many more cookies - here are statistics from 3 days of uptime:

52175610 syncache entries added
2092809 retransmitted
2021384 dupsyn
0 dropped
51681903 completed
0 bucket overflow
0 cache overflow
181311 reset
258220 stale
4 aborted
0 badack
18384 unreach
0 zone failures
52175610 cookies sent
16238 cookies received

I have found that in 7 received cookies correlate with unreach.


-- 
Igor Sysoev
http://sysoev.ru/en/


syncookie in 6.x and 7.x

2007-08-16 Thread Igor Sysoev
While testing 7.0-CURRENT I have found that it always sends syncookies,
while on earlier FreeBSD versions "netstat -s -p tcp" always shows:

0 cookies sent
0 cookies received

I have looked at the sources and found that in early versions the sent counter
was simply not incremented at all. The patch is attached.

After the patch has been applied I have found that 6 always sends
syncookies too; however, 6, unlike 7, never receives them. Why?

Here is 6 statistics:

1046714 syncache entries added
28395 retransmitted
32879 dupsyn
0 dropped
1038153 completed
0 bucket overflow
0 cache overflow
4201 reset
3972 stale
0 aborted
0 badack
254 unreach
0 zone failures
1046714 cookies sent
0 cookies received

Here is 7 statistics:

76018 syncache entries added
2536 retransmitted
2574 dupsyn
0 dropped
75114 completed
0 bucket overflow
0 cache overflow
456 reset
267 stale
0 aborted
0 badack
20 unreach
0 zone failures
76018 cookies sent
24 cookies received


-- 
Igor Sysoev
http://sysoev.ru/en/
--- sys/netinet/tcp_syncache.c	2006-02-16 04:06:22.0 +0300
+++ sys/netinet/tcp_syncache.c	2007-08-15 13:55:25.0 +0400
@@ -1323,6 +1323,7 @@
 	MD5Final((u_char *)&md5_buffer, &syn_ctx);
 	data ^= (md5_buffer[0] & ~SYNCOOKIE_WNDMASK);
 	*flowid = md5_buffer[1];
+	tcpstat.tcps_sc_sendcookie++;
 	return (data);
 }
 

Re: Improved TCP syncookie implementation

2006-09-14 Thread Igor Sysoev

On Thu, 14 Sep 2006, Ruslan Ermilov wrote:


On Wed, Sep 13, 2006 at 10:31:43PM +0200, Andre Oppermann wrote:

Igor Sysoev wrote:

Well, suppose protocol similar to SSH or SMTP:

1) the client calls connect(), it sends SYN;
2) the server receives SYN and sends SYN/ACK with cookie;
3) the client receives SYN/ACK and sends ACK;
4) the client returns successfully from connect() and calls read();
5) the ACK is lost;
6) the server does not know about this connection, so the application can not
  accept() it, and it can not send() a HELO message.
7) the client gets ETIMEDOUT from read().

Where in this sequence may the client retransmit its ACK?


Never.  You're correct.  There is no data that would cause a retransmit
if the application is waiting for a server prompt first.  I shouldn't
write wrong explanations when I'm tired, hungry and in between two tasks. ;)

This problem is the reason why we don't switch entirely to syncookies
and still keep syncache as is.


Perhaps it would be a good idea to remove net.inet.tcp.syncookies_only
then?  In any case, please don't forget to update the syncache(4) manpage
to reflect your changes, and if you decide not to remove this sysctl,
please add a warning of its potential to break a protocol.


I think that setting syncookies-only not globally, but on a per-port basis,
say, for HTTP, would be helpful. Setting it for other protocols, e.g., SSH,
rsync, SMTP, IMAP, or POP3, may break them.


Igor Sysoev
http://sysoev.ru/en/


Re: Improved TCP syncookie implementation

2006-09-13 Thread Igor Sysoev

On Wed, 13 Sep 2006, Andre Oppermann wrote:


Igor Sysoev wrote:

On Sun, 3 Sep 2006, Andre Oppermann wrote:


I've pretty much rewritten our implementation of TCP syncookies to get
rid of some locking in TCP syncache and to improve their functionality.

The RFC1323 timestamp option is used to carry the full TCP SYN+SYN/ACK
optional feature information.  This means that a FreeBSD host may run
with syncookies only and not degrade TCP connections made through it.
All important TCP connection setup negotiated options are preserved
(send/receive window scaling, SACK, MSS) without storing any state on
the host during the SYN-SYN/ACK phase.  As a nice side effect the
timestamps we respond with are randomized instead of directly using
ticks (which reveals our uptime).


As I understand it, the syncache is used to retransmit the SYN/ACK.
What would happen if

1) a client sent SYN,
2) we sent SYN/ACK with cookie,
3) the client sent ACK, but the ACK was lost


If the client sent ACK it will retry again after the normal retransmit
timeout.


Well, suppose protocol similar to SSH or SMTP:

1) the client calls connect(), it sends SYN;
2) the server receives SYN and sends SYN/ACK with cookie;
3) the client receives SYN/ACK and sends ACK;
4) the client returns successfully from connect() and calls read();
5) the ACK is lost;
6) the server does not know about this connection, so the application can not
   accept() it, and it can not send() a HELO message.
7) the client gets ETIMEDOUT from read().

Where in this sequence may the client retransmit its ACK?


If our SYN/ACK back to the client is lost we won't resend it with syncookies.
The client then has to try again, which it does after the SYN retransmit
timeout.


Yes.


Igor Sysoev
http://sysoev.ru/en/


Re: Improved TCP syncookie implementation

2006-09-13 Thread Igor Sysoev

On Sun, 3 Sep 2006, Andre Oppermann wrote:


I've pretty much rewritten our implementation of TCP syncookies to get
rid of some locking in TCP syncache and to improve their functionality.

The RFC1323 timestamp option is used to carry the full TCP SYN+SYN/ACK
optional feature information.  This means that a FreeBSD host may run
with syncookies only and not degrade TCP connections made through it.
All important TCP connection setup negotiated options are preserved
(send/receive window scaling, SACK, MSS) without storing any state on
the host during the SYN-SYN/ACK phase.  As a nice side effect the
timestamps we respond with are randomized instead of directly using
ticks (which reveals our uptime).


As I understand it, the syncache is used to retransmit the SYN/ACK.
What would happen if

1) a client sent SYN,
2) we sent SYN/ACK with cookie,
3) the client sent ACK, but the ACK was lost

?

I suppose the client will see a timed-out error.


Igor Sysoev
http://sysoev.ru/en/


Re: strange timeout error returned by kevent() in 6.0

2005-12-06 Thread Igor Sysoev

On Tue, 6 Dec 2005, John-Mark Gurney wrote:


Igor Sysoev wrote this message on Thu, Sep 01, 2005 at 18:26 +0400:

On Thu, 1 Sep 2005, Igor Sysoev wrote:


I found strange timeout errors returned by kevent() in 6.0 using
my http server named nginx.  The nginx instances run on three machines:
two 4.10-RELEASE and one 6.0-BETA3.  All machines serve the same
content (a simple cluster) and each handles about 200 requests/second.

On 6.0 sometimes (2 or 3 times per hour) in the daytime kevent()
returns EV_EOF in flags and ETIMEDOUT in fflags, nevertheless:

1) nginx does not set any kernel timeout for sockets;
2) the total request time for such failed requests is small, 30 or so
seconds.


I have changed the code to ignore the ETIMEDOUT error returned by kevent()
and found that a subsequent sendfile() returned ENOTCONN.

By the way, why may sendfile() return ENOTCONN?
I saw this error code on 4.x too.


The reason that you are seeing ETIMEDOUT/ENOTCONN is that the connection
probably ETIMEDOUT (aka timed out)... and so is ENOTCONN (no longer
connected).. can you also do a read or a write to the socket successfully?


At least recv() returns ETIMEDOUT. I could not test write() right now.


and sendfile(3) says:
ERRORS
[...]

[ENOTCONN] The s argument points to an unconnected socket.

and a glance at tcp(4) says:
ERRORS
[...]

[ETIMEDOUT]when a connection was dropped due to excessive
   retransmissions;

There's the answers...


Yes, it seems that ETIMEDOUT is a retransmission failure. I've seen it in
an experiment.

The strange thing is that I did not see this error on 4.10, only on 6.0
and recently on 4.11. Maybe I will upgrade a cluster machine from 4.10
to 4.11 to see the changes.


Igor Sysoev
http://sysoev.ru/en/


Re: strange timeout error returned by kevent() in 6.0

2005-12-06 Thread Igor Sysoev

On Thu, 1 Dec 2005, Igor Sysoev wrote:


On Thu, 1 Sep 2005, Igor Sysoev wrote:


On Thu, 1 Sep 2005, Igor Sysoev wrote:


I found strange timeout errors returned by kevent() in 6.0 using
my http server named nginx.  The nginx instances run on three machines:
two 4.10-RELEASE and one 6.0-BETA3.  All machines serve the same
content (a simple cluster) and each handles about 200 requests/second.

On 6.0 sometimes (2 or 3 times per hour) in the daytime kevent()
returns EV_EOF in flags and ETIMEDOUT in fflags, nevertheless:

1) nginx does not set any kernel timeout for sockets;
2) the total request time for such failed requests is small, 30 or so
seconds.


I have changed the code to ignore the ETIMEDOUT error returned by kevent()
and found that a subsequent sendfile() returned ENOTCONN.

By the way, why may sendfile() return ENOTCONN?
I saw this error code on 4.x too.


Recently I've found that kevent() in FreeBSD 5.4 may wrongly return
ETIMEDOUT too.

Also I've found that recv() on FreeBSD 6.0 may return a wrong ETIMEDOUT
error for a socket that has no kernel timeout set. It seems this
ETIMEDOUT error masks another error.


It seems that this ETIMEDOUT is caused by a retransmit failure, when
data were retransmitted 12 times with a backoff timeout. The whole timeout
is small, 30-50 seconds, because the initial RTO is very small: 5-10 ms.


Igor Sysoev
http://sysoev.ru/en/


Re: strange timeout error returned by kevent() in 6.0

2005-12-01 Thread Igor Sysoev

On Thu, 1 Sep 2005, Igor Sysoev wrote:


On Thu, 1 Sep 2005, Igor Sysoev wrote:


I found strange timeout errors returned by kevent() in 6.0 using
my http server named nginx.  The nginx instances run on three machines:
two 4.10-RELEASE and one 6.0-BETA3.  All machines serve the same
content (a simple cluster) and each handles about 200 requests/second.

On 6.0 sometimes (2 or 3 times per hour) in the daytime kevent()
returns EV_EOF in flags and ETIMEDOUT in fflags, nevertheless:

1) nginx does not set any kernel timeout for sockets;
2) the total request time for such failed requests is small, 30 or so
seconds.


I have changed the code to ignore the ETIMEDOUT error returned by kevent()
and found that a subsequent sendfile() returned ENOTCONN.

By the way, why may sendfile() return ENOTCONN?
I saw this error code on 4.x too.


Recently I've found that kevent() in FreeBSD 5.4 may wrongly return
ETIMEDOUT too.

Also I've found that recv() on FreeBSD 6.0 may return a wrong ETIMEDOUT
error for a socket that has no kernel timeout set. It seems this
ETIMEDOUT error masks another error.


Igor Sysoev
http://sysoev.ru/en/


Re: strange timeout error returned by kevent() in 6.0

2005-09-01 Thread Igor Sysoev

On Thu, 1 Sep 2005, Igor Sysoev wrote:


I found strange timeout errors returned by kevent() in 6.0 using
my http server named nginx.  The nginx instances run on three machines:
two 4.10-RELEASE and one 6.0-BETA3.  All machines serve the same
content (a simple cluster) and each handles about 200 requests/second.

On 6.0 sometimes (2 or 3 times per hour) in the daytime kevent()
returns EV_EOF in flags and ETIMEDOUT in fflags, nevertheless:

1) nginx does not set any kernel timeout for sockets;
2) the total request time for such failed requests is small, 30 or so
seconds.


I have changed the code to ignore the ETIMEDOUT error returned by kevent()
and found that a subsequent sendfile() returned ENOTCONN.

By the way, why may sendfile() return ENOTCONN?
I saw this error code on 4.x too.


Igor Sysoev
http://sysoev.ru/en/


strange timeout error returned by kevent() in 6.0

2005-09-01 Thread Igor Sysoev

I found strange timeout errors returned by kevent() in 6.0 using
my http server named nginx.  The nginx instances run on three machines:
two 4.10-RELEASE and one 6.0-BETA3.  All machines serve the same
content (a simple cluster) and each handles about 200 requests/second.

On 6.0 sometimes (2 or 3 times per hour) in the daytime kevent()
returns EV_EOF in flags and ETIMEDOUT in fflags, nevertheless:

1) nginx does not set any kernel timeout for sockets;
2) the total request time for such failed requests is small, 30 or so seconds.


Igor Sysoev
http://sysoev.ru/en/


setsockopt() can not remove the accept filter

2005-06-10 Thread Igor Sysoev

Hi,

man setsockopt(2) states that "passing in an optval of NULL will remove
the filter"; however, setsockopt() always returns EINVAL in this case,
because do_setopt_accept_filter() removes the filter if sopt == NULL, but
not if sopt->val == NULL.  The fix is easy:

-if (sopt == NULL) {
+if (sopt == NULL || sopt->val == NULL) {


By the way, is it easy to add a timeout for the dataready and httpready
accept filters?  Currently stale connections may live for a long time.


Igor Sysoev
http://sysoev.ru/en/


Re: Very strange kevent problem possibly to do with vinum

2004-12-10 Thread Igor Sysoev
On Wed, 8 Dec 2004, Kevin Day wrote:

> I have a really really strange kevent problem (I think, anyway) that has
> really stumped me.
>
> Here's the scenario:
>
> Three mostly identical servers running 5.2.1 or 5.3 (problem exists on
> both). All three running thttpd sending out large files to thousands of
> clients. Thttpd internally uses kqueue/kevent and sendfile to send
> files rather quickly.
>
> All three have the same configuration, are getting approximately the
> same numbers of requests, and are sending approximately the same files.
> (I can swap IP addresses between the servers to confirm that the
> request distribution stays the same between the servers)
>
> Server #3 is able to send 400mbps or more of traffic through without
> breaking a sweat. Thttpd is either in "RUN", "biord" "sfbufa" or
> "*Giant" when I watch it in top, and I still have 80-90% idle time.
>
> Servers #1 and #2 seem to top out around 80mbps, and are constantly in
> "RUN" or "CPUx" states. I don't get any errors anywhere, but they just
> aren't capable of going any faster.
>
> Looking at ktrace on thttpd on all three servers, I see that server 3
> calls kevent, and gets 20-100 sockets in response back, that each get
> serviced. Servers 1 and 2 never seem to get more than 1 socket back
> from kevent. Even if the event is just that the socket was
> disconnected, nothing needs to be done on it, and kevent can be called
> again immediately, there's only 1 socket returned next time. I ran
> ktrace on thttpd for more than 15 minutes and produced a humongous
> ktrace file, and there were only a handful of times that kevent
> returned more than one socket with something to do on it. Contrasting
> that to server 3, where i never saw kevent returning less than a half
> dozen sockets at a time when it had a few hundred mbps flowing through
> it.
>
> The ONLY difference between servers 1 and 2 and server 3 is the disk
> subsystem.  Servers 1/2 use an "ahc" SCSI controller and vinum RAID5.
> Server 3 uses an "aac" hardware RAID. However, disk activity is really
> truly minimal on all of these servers. Most of the data remains cached,
> since 99% of the requests are for the same handful of files.
> systat/vmstat shows that the disks are busy less than 10% of the time,
> and artificially creating a bunch of disk load on any of the servers
> doesn't seem to affect anything.
>
> I'm not sure if the kevent difference is the cause of the problem
> (thttpd doesn't seem to handle going through its event loop over and
> over again for just one socket at a time, it makes some rather
> expensive syscalls from that loop), or if it's just a symptom. Is
> something in vinum possibly waking my process up somewhat prematurely?
> Is that even possible if the files are being sent through sendfile?

What does "systat -vm" show on these machines ?


Igor Sysoev
http://sysoev.ru/en/


Re: using natd to load balance port 80 to multiple servers

2004-10-23 Thread Igor Sysoev
On Sat, 23 Oct 2004, Stephane Raimbault wrote:

> I'm currently using a freebsd box running natd to forward port 80 to several
> (5) web servers on private IP's.
>
> I have discovered that natd doesn't handle many requests/second all that
> well (it seems to choke at about 200 req/second (educated guess))
>
> There are other packet filtering options on FreeBSD and I wonder if I can
> use them to do what I'm trying to do with natd.
>
> Would someone be able to point me to documentation or help me have either
> ipf/ipfw/pf forward port 80 traffic to private space IP's?
>
> Is there a better way of splitting port 80 traffic across multiple webservers
> that has eluded me?  Other than a commercial content switch, that is :)
>
> I've worked with the loadd port and ran into some problems, so I resorted to
> simply using some natd syntax to forward port 80 traffic to multiple
> servers... Now that seems to have run into its limitations and I'm wondering
> if I can do the same thing with ipf/ipfw/pf as I believe that might be a bit
> more efficient.
>
> Any feedback would be appreciated...

You could look at PF.

Also you could use an http reverse proxy like nginx; look at the example
configuration (the page is in Russian, but the configuration is in English :)
http://sysoev.ru/nginx/docs/example.html

Currently, to proxy to several servers you need to set up their IPs
under one name in DNS. nginx will connect to them in round-robin order.
If some server does not respond, nginx will try the next one. You
can set several reasons to try the next server:

proxy_next_upstream   error timeout invalid_header http_500;

or even

proxy_next_upstream   error timeout invalid_header http_500 http_404;

nginx has been tested on several busy sites under FreeBSD (serving static
files and proxying, using kqueue/select/poll), Linux (static and proxy,
using epoll and rt signals) and Solaris (static only, using /dev/poll).


Igor Sysoev
http://sysoev.ru/en/


Re: Error 49, socket problem?

2004-10-23 Thread Igor Sysoev
On Sat, 23 Oct 2004, Stephane Raimbault wrote:

> I was running out of ports in the 1024-5000 range and setting my last port
> to 65535 via sysctl did solve my problem.
>
> In 4.10 what will sysctl -w net.inet.ip.portrange.randomized=0 do for me?

If you have too many quick connections between a proxy (4.10) and a backend,
or between the http server (4.10) and the SQL server, then you may see
occasional "Connection refused" errors in the logs. This is because
4.10 picks port numbers randomly and there is a chance that the other side
has a connection with the same port in the TIME_WAIT state.
See, i.e., http://freebsd.rambler.ru/bsdmail/freebsd-stable_2004/msg02310.html

> Is there any danger of me setting the port range from 1024 - 65535 ?

I believe it is safe.


Igor Sysoev
http://sysoev.ru/en/


Re: aio_connect ?

2004-10-22 Thread Igor Sysoev
On Thu, 21 Oct 2004, Ronald F. Guilmette wrote:

> >I believe if you want to build a more maintainable, more adaptable,
> >more modularized program then you should avoid two things - the threads and
> >the signals. If you like to use a callback behaviour of the signals you could
> >easy implement it without any signal.
>
> OK.  I'll bite.  How?

I'm sure you know it.  Sorry, English is not my native language, so I can
explain only briefly.

You can use two notification models: the first is socket readiness
for operations, the second is operation completion.
In the first model you use the usual read()/write() operations and learn
about readiness using select()/poll()/kevent().
In the second model you use aio_read()/aio_write() operations and learn
about their completion using aio_suspend()/aio_waitcomplete()/kevent().

After you have received the notifications, you call your callback handlers
yourself, just as the kernel would call your signal handlers.

The difference between your code and the kernel is that your code always
calls the handlers at well-known places, which avoids various race
conditions.  The kernel may call a signal handler at any time while the
signal is not blocked.


Igor Sysoev
http://sysoev.ru/en/


Re: Error 49, socket problem?

2004-10-22 Thread Igor Sysoev
On Fri, 22 Oct 2004, Stephane Raimbault wrote:

> The servers are busier today and error 49 is comming up frequently now.

What does "netstat -n | grep 127.0.0.1 | wc -l" show ?

You should probably try

sysctl -w net.inet.ip.portrange.first=49152
sysctl -w net.inet.ip.portrange.last=65535

or even

sysctl -w net.inet.ip.portrange.first=1024
sysctl -w net.inet.ip.portrange.last=65535

And after you upgrade to 4.10 do not forget to set

sysctl -w net.inet.ip.portrange.randomized=0


Igor Sysoev
http://sysoev.ru/en/


Re: aio_connect ?

2004-10-20 Thread Igor Sysoev
On Wed, 20 Oct 2004, Julian Elischer wrote:

> Now that we have real threads, it shuld be possible to write an aio
> library that is
>  implemented by having a bunch of underlying threads..

Do you mean kernel-only threads, where a single-threaded user process
has several threads inside the kernel? As I understand it, FreeBSD 4.x
already has a similar AIO implementation.

Or do you mean an implementation using user-level threads, as in Solaris?


Igor Sysoev
http://sysoev.ru/en/


RE: aio_connect ?

2004-10-20 Thread Igor Sysoev
On Wed, 20 Oct 2004, Christopher M. Sedore wrote:

> > > > While the developing my server nginx, I found the POSIX aio_*
> > > > operations
> > > > uncomfortable. I do not mean a different programming style, I mean
> > > > the aio_read() and aio_write() drawbacks - they have no
> > scatter-gather
> > > > capabilities (aio_readv/aio_writev) and they require too many
> > > > syscalls.
> > > > E.g, the reading requires
> > > > *) 3 syscalls for ready data: aio_read(), aio_error(),
> > aio_return()
> > > > *) 5 syscalls for non-ready data: aio_read(), aio_error(),
> > > >waiting for notification, then aio_error(), aio_return(),
> > > >or if timeout occuired - aio_cancel(), aio_error().
> > >
> > > This is why I added aio_waitcomplete().  It reduces both
> > cases to two
> > > syscalls.

Yes, aio_waitcomplete() can be used as the single waiting point. But then
I cannot accept() connections. How would I learn about new connections?

> > As I understand aio_waitcomplete() returns aiocb of any complete AIO
> > operation but I need to know the state of the exact AIO,
> > namely the last
> > aio_read().
>
> Correct, it won't poll, but what state can you get from calling
> aio_error() that you don't already know from aio_waitcomplete().  The
> operation has either completed (successfully or unsuccessfully) or it
> hasn't.  If it hasn't you haven't "gotten it back" via aio_waitcomplete,
> and if it has, you did.  I may be missing something, but how does
> aio_error() tell you something that you don't already know?

With aio_error() I can (and in fact have to) pass the aiocb of the operation
I am interested in. aio_waitcomplete() returns the aiocb of any operation.
If I have several operations in flight, there may be a race condition.

> > I use kqueue to get AIO notifications. If AIO operation would fail
> > at the start, will kqueue return notificaiton about this operation ?
>
> I don't think so--IIRC, if you have a parameter problem or the operation
> can't be queued, you'll get an error return from aio_read and no kqueue
> result. If it is queued, you'll get a kqueue notification.

Well, then I do not need to call aio_error() just after aio_read()/aio_write().

However, I cannot use aio_waitcomplete() instead of the aio_error()/aio_return()
pair after kevent() reports the completion.


Igor Sysoev
http://sysoev.ru/en/


Re: aio_connect ?

2004-10-20 Thread Igor Sysoev
On Wed, 20 Oct 2004, Ronald F. Guilmette wrote:

> > and they require too many syscalls.
> >E.g, the reading requires
> >*) 3 syscalls for ready data: aio_read(), aio_error(), aio_return()
> >*) 5 syscalls for non-ready data: aio_read(), aio_error(),
> >   waiting for notification, then aio_error(), aio_return(),
> >   or if timeout occuired - aio_cancel(), aio_error().
>
> This assumes that one is _not_ using the signaling capabilities of the
> aio_*() functions in order to allow the kernel to dynamically signal the
> userland program upon completion of a previously scheduled async I/O
> operation.  If however a programmer were to use _that_ approache to de-
> tecting I/O completions, then the number of syscals would be reduced
> accordingly.

Yes, nginx does not use the AIO signaling capabilities. With signals
you avoid the syscall that waits for completion, but invoking a signal
handler requires 3 context switches instead of the 2 needed for a syscall.

> However this all misses the point.  As I noted earlier in this thread,
> efficience _for the machines_ is not always one's highest engineering
> design goal.  If I have a choice between building a more maintainable,
> more adaptable, more modularized program, or instead building a more
> machine-efficient program, I personally will almost always choose to
> build the clearly, more modularized program as opposed to trying to
> squeeze every last machine cycle out of the thing.  In fact, that is
> why I program almost exclusively in higher level languages, even though
> I could almost certainly write assembly code that would almost always be
> faster.  Machine time is worth something, but my time is worth more.

I believe that if you want to build a more maintainable, more adaptable,
more modularized program, then you should avoid two things: threads and
signals. If you like the callback behaviour of signals, you can easily
implement it without any signals.


Igor Sysoev
http://sysoev.ru/en/


RE: aio_connect ?

2004-10-20 Thread Igor Sysoev
On Wed, 20 Oct 2004, Christopher M. Sedore wrote:

> > While the developing my server nginx, I found the POSIX aio_*
> > operations
> > uncomfortable. I do not mean a different programming style, I mean
> > the aio_read() and aio_write() drawbacks - they have no scatter-gather
> > capabilities (aio_readv/aio_writev) and they require too many
> > syscalls.
> > E.g, the reading requires
> > *) 3 syscalls for ready data: aio_read(), aio_error(), aio_return()
> > *) 5 syscalls for non-ready data: aio_read(), aio_error(),
> >waiting for notification, then aio_error(), aio_return(),
> >or if timeout occuired - aio_cancel(), aio_error().
>
> This is why I added aio_waitcomplete().  It reduces both cases to two
> syscalls.

As I understand it, aio_waitcomplete() returns the aiocb of any completed
AIO operation, but I need to know the state of a specific AIO operation,
namely the last aio_read().

I use kqueue to get AIO notifications. If an AIO operation fails right
at submission, will kqueue return a notification for this operation?


Igor Sysoev
http://sysoev.ru/en/


Re: aio_connect ?

2004-10-20 Thread Igor Sysoev
On Sun, 17 Oct 2004, Ronald F. Guilmette wrote:

> I'm sitting here looking at that man pages for aio_read and aio_write,
> and the question occurs to me:  ``Home come there is no such thing as
> an aio_connect function?''
>
> There are clearly cases in which one would like to perform reads
> asynchronously, but likewise, there are cases where one might like
> to also perform socket connects asynchronously.  So how come no
> aio_connect?

In FreeBSD you can call connect() on a non-blocking socket, then set
the socket back to blocking mode, and post aio_read() or aio_write()
operations on it.

FreeBSD allows posting AIO operations on a not-yet-connected socket.
NT (and W2K, I believe) does not; this is why ConnectEx() appeared in XP.
I do not know about other OSes, but I believe only FreeBSD and NT have
a kernel-level socket AIO implementation that neither emulates it with
user-level threads (Solaris) nor quietly falls back to synchronous
behaviour (Linux).

While developing my server nginx, I found the POSIX aio_* operations
inconvenient. I do not mean the different programming style; I mean
the drawbacks of aio_read() and aio_write(): they have no scatter-gather
capabilities (no aio_readv()/aio_writev()) and they require too many syscalls.
E.g., reading requires:
*) 3 syscalls for ready data: aio_read(), aio_error(), aio_return()
*) 5 syscalls for non-ready data: aio_read(), aio_error(),
   waiting for notification, then aio_error(), aio_return(),
   or, if a timeout occurred, aio_cancel(), aio_error().

I think aio_* may be useful for zero-copy sockets; however, FreeBSD's
aio_write() does not wait until the data is acknowledged by the peer and
reports completion just after it passes the data to the network layer.


Igor Sysoev
http://sysoev.ru/en/


Re: "netstat -m" and sendfile(2) statistics in STABLE

2004-06-17 Thread Igor Sysoev
On Fri, 18 Jun 2004, Mike Silbersack wrote:

> On Thu, 17 Jun 2004, Alfred Perlstein wrote:
>
> > I was going to suggest vmstat now that sfbufs are used for so many
> > other things than just "sendfile bufs".
> >
> > --
> > - Alfred Perlstein
>
> How about if we do this:
>
> 5.x:  List sfbufs both in vmstat _and_ in netstat -m, as their status is
> relevant to both network and general memory usage.
>
> 4.x:  MFC the vmstat implementation.
>
> This would preserve 4.x's behavior, but allow 5.x users (who have a new
> netstat -m output format anyway) to see sfbuf information without invocing
> multiple utilities.

In 4.x sfbufs are network-only buffers, and I think it's handy to see
the network buffer statistics in one place. I would prefer netstat -ms
or netstat -m.

I have nothing against an additional vmstat implementation.


Igor Sysoev
http://sysoev.ru/en/


"netstat -m" and sendfile(2) statistics in STABLE

2004-06-17 Thread Igor Sysoev
Hi,

I read objections in cvs-all@ about netstat's output after MFC
of sendfile(2) statistics.

How about "netstat -ms" ?

Right now this switch combination is treated as a simple "-m" in both -STABLE
and -CURRENT.


Igor Sysoev
http://sysoev.ru/en/


"thundering herd" problem in accept

2004-05-20 Thread Igor Sysoev
Hi,

I noticed rev 1.123 of src/sys/kern/uipc_socket2.c and two MFCs of the fix.
Does it mean that the "thundering herd" problem in accept() has reappeared
in FreeBSD since 4.4-STABLE (after syncache was introduced)?


Igor Sysoev
http://sysoev.ru/en/


Re: sendfile returning ENOTCONN under heavy load

2004-03-27 Thread Igor Sysoev
On Fri, 26 Mar 2004, Kevin Day wrote:

> I'm using thttpd on a server that pushes 300-400mbps of static content, 
> using sendfile(2).
> 
> Once the load reaches a certain point (around 800-1000 clients 
> downloading, anywhere from 150-250mbps), sendfile() will start randomly 
> returning ENOTCONN, and the client is disconnected. I've raised 
> kern.ipc.nsfbufs pretty high and that hasn't made any difference. Is 
> there any easy way to tell exactly why the sockets are being closed? I 
> can't seem to find any obvious signs of memory exhaustion or anything.

It is a sendfile(2) quirk: it can return ENOTCONN instead of EPIPE.
See this message:
http://freebsd.rambler.ru/bsdmail/freebsd-hackers_2004/msg00019.html
and its follow-ups.


Igor Sysoev
http://sysoev.ru/en/



Re: mbuf tuning

2004-01-19 Thread Igor Sysoev
On Mon, 19 Jan 2004, CHOI Junho wrote:

> From: Mike Silbersack <[EMAIL PROTECTED]>
> Subject: Re: mbuf tuning
> Date: Mon, 19 Jan 2004 01:12:08 -0600 (CST)
> 
> > There are no good guidelines other than "don't set it too high."  Andre
> > and I have talked about some ideas on how to make mbuf usage more dynamic,
> > I think that he has something in the works.  But at present, once you hit
> > the wall, that's it.
> > 
> > One way to reduce mbuf cluster usage is to use sendfile where possible.
> > Data sent via sendfile does not use mbuf clusters, and is more memory
> > efficient.  If you run 5.2 or above, it's *much* more memory efficient,
> > due to change Alan Cox recently made.  Apache 2 will use sendfile by
> > default, so if you're running apache 1, that may be one reason for an
> > upgrade.
> 
> I am using custom version of thttpd. It allocates mmap() first(builtin
> method of thttpd), and it try to use sendfile() if mmap() fails(out of
> mmap memory). It really works good in normal status but the problem is
> that sendfile buffer is also easy to flood. I need more sendfile
> buffers but I don't know how to increase sendfile buffers either(I
> think it's hidden sysctl but it was more difficult to tune than
> nmbclusters). With higher traffic, thttpd sometimes stuck at "sfbufa"
> status when I run top(I guess it's "sendfile buffer allocation"
> status).

In 4.x you have to rebuild the kernel with

options  NSFBUFS=16384

By default it equals (512 + maxusers * 16).

By the way, why do you want to use big net.inet.tcp.sendspace and
net.inet.tcp.recvspace values? It makes sense for Apache, but thttpd can
easily work with small buffers, say, 16K or even 8K.

> > > Increasing kern.ipc.nmbclusters caused frequent kernel panic
> > > under 4.7/4.8/4.9. How can I set more nmbclusters value with 64K tcp
> > > buffers? Or is any dependency for mbufclusters value? (e.g. RAM size,
> > > kern.maxusers value or etc)
> > >
> > > p.s. RAM is 2G, Xeon 2.0G x 1 or 2 machines.
> > 
> > You probably need to bump up KVA_PAGES to fit in all the extra mbuf
> > clusters you're allocating.
> 
> Can you tell me in more detail?

From LINT:
---
#
# Change the size of the kernel virtual address space.  Due to
# constraints in loader(8) on i386, this must be a multiple of 4.
# 256 = 1 GB of kernel address space.  Increasing this also causes
# a reduction of the address space in user processes.  512 splits
# the 4GB cpu address space in half (2GB user, 2GB kernel).
#
options KVA_PAGES=260
---

The default KVA_PAGES is 256.


Igor Sysoev
http://sysoev.ru/en/



Re: turning off TCP_NOPUSH

2003-05-29 Thread Igor Sysoev
On Wed, 28 May 2003, Garrett Wollman wrote:

> < said:
> 
> > always calls tcp_output() when TCP_NOPUSH is turned off.  I think
> > tcp_output() should be called only if data in the send buffer is less
> > than MSS:
> 
> I believe that this is intentional.  The application had to explicitly
> enable TCP_NOPUSH, so if the application disables it explicitly, then
> we interpret that as meaning that the application wants to send a PSH
> segment immediately.

As I understand it, if the data in the send buffer is bigger than the MSS,
the TCP stack has some reason not to send it, and that reason is not the
TF_NOPUSH flag. Am I wrong?


Igor Sysoev
http://sysoev.ru/en/



turning off TCP_NOPUSH

2003-05-28 Thread Igor Sysoev

The 1.53 fix

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/netinet/tcp_usrreq.c.diff?r1=1.52&r2=1.53

always calls tcp_output() when TCP_NOPUSH is turned off.  I think
tcp_output() should be called only if the data in the send buffer is less
than the MSS:

tp->t_flags &= ~TF_NOPUSH;
-   error = tcp_output(tp);
+   if (so->so_snd.sb_cc < tp->t_maxseg) {
+   error = tcp_output(tp);
+   }

If the pending data is bigger than the MSS, it will be sent without
significant delay anyway.


Igor Sysoev
http://sysoev.ru/en/
