Re: kernel hangs by many connections (reproducable)

2010-10-05 Thread Jonathan Gray
On Tue, Oct 05, 2010 at 10:35:28AM +0200, Mike Belopuhov wrote:
> 
> there's a forgotten splx in the driver.  that explains why system
> appears to be hung.  OK?

While this is a bug and should be fixed, we never actually call that codepath
as only re_diag sets testmode and we never call that :)

> 
> Index: re.c
> ===
> RCS file: /home/cvs/src/sys/dev/ic/re.c,v
> retrieving revision 1.128
> diff -u -p -u -p -r1.128 re.c
> --- re.c  27 Aug 2010 17:08:00 -  1.128
> +++ re.c  5 Oct 2010 08:27:05 -
> @@ -2042,8 +2042,10 @@ re_init(struct ifnet *ifp)
>  	if (sc->sc_hwrev != RL_HWREV_8139CPLUS)
>  		CSR_WRITE_2(sc, RL_MAXRXPKTLEN, 16383);
>  
> -	if (sc->rl_testmode)
> +	if (sc->rl_testmode) {
> +		splx(s);
>  		return (0);
> +	}
>  
>  	mii_mediachg(&sc->sc_mii);



Re: kernel hangs by many connections (reproducable)

2010-10-05 Thread Mike Belopuhov
On Sun, Sep 12, 2010 at 11:26 +0600, Anton Maksimenkov wrote:
> Hello.
> I use my OBSD machine to test some server on another machine. They are
> connected by pathcord, 1Gbit network cards are used.
> Test program (uses kqueue) do many (I want thousands) connections to
> server. Write query, read answer.
> And it tries to keep that much connections by doing as much new
> connections as needed.
> 
> When number of connections kept below 100 - all ok. But if I raise
> them (upto about 500-1000) the program start these connections, do
> some write/read (show about 10-20 successful reads) and the kernel
> hangs. 1-2 sec after start.
> Tweaks - kern.maxfiles=16384 and openfiles-cur/max=8192 for my user.
> 
> Info from ddb (see dmesg below):
> 
> ddb> show panic
> the kernel did not panic
> 
> ddb> trace
> Debugger(0,3f8,0,0,1) at Debugger+0x4
> comintr(d1571000) at comintr+0x287
> Xrecurse_legacy4() at Xrecurse_legacy4+0xb3
> --- interrupt ---
> pool_do_get(d0a10b60,0,0,0,60) at pool_do_get+0x2c2
> pool_get(d0a10b60,0,8000,0,0) at pool_get+0x54
> m_gethdr(1,1,8000,369e99,0) at m_gethdr+0x39
> m_clget(0,1,d1526054,800,d03e1aeb) at m_clget+0x10a
> re_newbuf(d1526000,10,d999eb48,d02b30cc,d1526000) at re_newbuf+0x35
> re_rx_list_fill(d1526000,20,60,58,d1520010) at re_rx_list_fill+0x21
> re_rxeof(d1526000,d9799800,3e,10,10) at re_rxeof+0x37c
> re_intr(d1526000) at re_intr+0x12a
> Xrecurse_legacy11() at Xrecurse_legacy11+0xb7
> --- interrupt ---
> filt_soread(d9a5bdc0,0,0,d9a5bd98,d9a5bd98) at filt_soread+0x1
> selwakeup(d9a5bdbc,d9b08300,d9b08200,d9b08300,d9a5bd98) at selwakeup+0x22
> sowakeup(d9a5bd4c,d9a5bd98,14,d999ed24,1) at sowakeup+0x1d
> tcp_input(d9b08300,14,0,0,6) at tcp_input+0x26ac
> ipv4_input(d9b08300,0,d999ede8,d0202089,d03d0058) at ipv4_input+0x42a
> ipintr(d03d0058,d09e0010,10,d5d10010,d09e72c0) at ipintr+0x49
> Bad frame pointer: 0xd999ede8
> 

there's a forgotten splx in the driver.  that explains why system
appears to be hung.  OK?

Index: re.c
===
RCS file: /home/cvs/src/sys/dev/ic/re.c,v
retrieving revision 1.128
diff -u -p -u -p -r1.128 re.c
--- re.c	27 Aug 2010 17:08:00 -  1.128
+++ re.c	5 Oct 2010 08:27:05 -
@@ -2042,8 +2042,10 @@ re_init(struct ifnet *ifp)
 	if (sc->sc_hwrev != RL_HWREV_8139CPLUS)
 		CSR_WRITE_2(sc, RL_MAXRXPKTLEN, 16383);
 
-	if (sc->rl_testmode)
+	if (sc->rl_testmode) {
+		splx(s);
 		return (0);
+	}
 
 	mii_mediachg(&sc->sc_mii);
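
For readers unfamiliar with the pattern the diff corrects, here is a minimal
userland sketch of it. It is not driver code: splnet_sim(), splx_sim(),
init_buggy() and init_fixed() are hypothetical stand-ins, and the assumption
that the saved level "s" came from a splnet() call earlier in re_init() is
mine, not something shown in the diff. It only illustrates that an early
return which skips splx leaves the priority level raised.

```c
#include <assert.h>

/* Simulated interrupt priority levels; names are hypothetical. */
#define IPL_NONE_SIM	0
#define IPL_NET_SIM	1

static int cur_ipl = IPL_NONE_SIM;

/* Raise the simulated level to "net" and return the previous level. */
int
splnet_sim(void)
{
	int old = cur_ipl;

	if (cur_ipl < IPL_NET_SIM)
		cur_ipl = IPL_NET_SIM;
	return (old);
}

/* Restore a level previously returned by splnet_sim(). */
void
splx_sim(int s)
{
	assert(s >= IPL_NONE_SIM && s <= IPL_NET_SIM);
	cur_ipl = s;
}

/* Report the current simulated level. */
int
current_ipl(void)
{
	return (cur_ipl);
}

/* Early return without restoring: the level stays raised afterwards. */
int
init_buggy(int testmode)
{
	int s = splnet_sim();

	if (testmode)
		return (0);	/* bug: s is never passed to splx_sim() */

	splx_sim(s);
	return (0);
}

/* Every exit path restores the saved level. */
int
init_fixed(int testmode)
{
	int s = splnet_sim();

	if (testmode) {
		splx_sim(s);
		return (0);
	}

	splx_sim(s);
	return (0);
}
```

Calling init_fixed(1) leaves current_ipl() at the level it had before the
call, while init_buggy(1) leaves it raised. Whether that path is ever
reached in the real driver is a separate question.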



Re: kernel hangs by many connections (reproducable)

2010-09-19 Thread Anton Maksimenkov
2010/9/13 Anton Maksimenkov :
> 2010/9/13 Claudio Jeker :
>> When running with that many sockets a prominent warning about increasing
>> kern.maxclusters shows up. This is not just dmesg spam, running
>> out of mbuf clusters will stop your network stack.
>
> I've not seen any message neither on console nor in logs.
>
> I tried to set kern.maxclusters to 10, no success. Same "freeze",
> here is the ddb outputs.

Can anybody say something about this situation?
Is it an 're' driver problem, something wrong with the mbuf allocations, or what?

BTW, I tried it on FBSD, it works with '5000 connections' without problems.
-- 
antonvm



Re: kernel hangs by many connections (reproducable)

2010-09-13 Thread Anton Maksimenkov
2010/9/13 Claudio Jeker :
> When running with that many sockets a prominent warning about increasing
> kern.maxclusters shows up. This is not just dmesg spam, running
> out of mbuf clusters will stop your network stack.

I've not seen any such message, either on the console or in the logs.

I tried to set kern.maxclusters to 10, no success. Same "freeze",
here are the ddb outputs.

ddb> trace
Debugger(0,3f8,0,d0a10c40,0) at Debugger+0x4
comintr(d1571000) at comintr+0x287
Xrecurse_legacy4() at Xrecurse_legacy4+0xb3
--- interrupt ---
m_cldrop(0,1,d1526054,800,d03e1aeb) at m_cldrop
re_newbuf(d1526000,10,d9a237ac,d02b30cc,d1526000) at re_newbuf+0x35
re_rx_list_fill(d1526000,20,60,58,10) at re_rx_list_fill+0x21
re_rxeof(d1526000,d9799800,3e,10,10) at re_rxeof+0x37c
re_intr(d1526000) at re_intr+0x12a
Xrecurse_legacy11() at Xrecurse_legacy11+0xb7
--- interrupt ---
m_gethdr(1,2,0,d9a23904,2) at m_gethdr+0x78
tcp_output(d9b8ac88,d9b89550,14,d9a23a70,1) at tcp_output+0x754
tcp_input(d55c4e00,14,0,0,6) at tcp_input+0x2711
ipv4_input(d55c4e00,0,d9a23b34,d0202089,d5b40058) at ipv4_input+0x42a
ipintr(d5b40058,d9a20010,d9a20010,d0510010,3) at ipintr+0x49
Bad frame pointer: 0xd9a23b34

ddb> show registers
ds          0xd9a20010  end+0x8f5802c
es                0x10
fs          0xd9a20058  end+0x8f58074
gs          0xd1310010  end+0x84802c
edi         0xd150c960  end+0xa4497c
esi         0xd15750ac  end+0xaad0c8
ebp         0xd9a23640  end+0x8f5b65c
ebx               0xf9
edx              0x3f8
ecx         0xd1571000  end+0xaa901c
eax                0x1
eip         0xd05670b4  Debugger+0x4
cs                0x50
eflags           0x202
esp         0xd9a23640  end+0x8f5b65c
ss          0xd9a20010  end+0x8f5802c
Debugger+0x4:   popl    %ebp

ddb> show uvmexp
Current UVM status:
  pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
  126367 VM pages: 6421 active, 1015 inactive, 0 wired, 110020 free
  min  10% (25) anon, 10% (25) vnode, 5% (12) vtext
  pages  0 anon, 0 vnode, 0 vtext
  freemin=4212, free-target=5616, inactive-target=0, wired-max=42122
  faults=48741, traps=55367, intrs=230694, ctxswitch=32618 fpuswitch=183
  softint=67847, syscalls=372383, swapins=0, swapouts=0, kmapent=19
  fault counts:
noram=0, noanon=0, pgwait=0, pgrele=0
ok relocks(total)=2251(2251), anget(retries)=27555(0), amapcopy=12411
neighbor anon/obj pg=1481/24619, gets(lock/unlock)=10220/2251
cases: anon=23933, anoncow=3622, obj=9207, prcopy=1013, przero=10965
  daemon and swap counts:
woke=0, revs=0, scans=0, obscans=0, anscans=0
busy=0, freed=0, reactivate=0, deactivate=0
pageouts=0, pending=0, nswget=0
nswapdev=1, nanon=0, nanonneeded=0 nfreeanon=0
swpages=66267, swpginuse=0, swpgonly=0 paging=0
  kernel pointers:
objs(kern)=0xd09e7280

ddb> show all pools
Name  Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
inpcbpl228103740 934361 06161 0 80
plimitpl   148   1705 1 0 1 1 0 80
synpl  192303 1 0 1 1 0 81
tcpqepl 16  2880   69 1 0 1 1 0130
tcpcbpl400100490 9027   108 4   104   104 0 80
rtentpl116   3500 2 0 2 2 0 80
pfosfp  28  8140  407 3 0 3 3 0 80
pfosfpen   108 13920  69630111919 0 80
pfstateitempl 12  102620 202425 02525 0 80
pfstatekeypl 72   102620 2024   148 0   148   148 0 80
pfstatepl  212102620 1870   459 0   459   459 0   527   16
pfrulepl   1148  130   11 5 0 5 5 0 83
dirhash1024  2900 8 0 8 8 0   1280
dino1pl128 17290956 05656 0 80
ffsino 184 17290979 07979 0 80
nchpl   88 28380   2962 06262 0 80
vnodes 156 17400070 07070 0 80
namei  102466160 6616 3 0 3 3 0 83
wdcspl  96 14400 1440 1 0 1 1 0 81
sigapl 324  2360  207 3 0 3 3 0 80
knotepl 642004201804232 03232 0 80
kqueuepl   192504 1 0 1 1 0 80
kqueuepl   192504 1 0 1 1 0 80
fdescpl300  2370  207 3 0 3 3 0 80
filepl  881283901172825 02525 0 80
lockfpl 56402 1 0 

Re: kernel hangs by many connections (reproducable)

2010-09-13 Thread Henning Brauer
* Claudio Jeker  [2010-09-13 08:12]:
> Oh, that pool_get succeds since mbufs don't have a limit but the
> allocation of the cluster fails so the driver will reuse the old buffer on
> the queue.

well, the trace shows a pool_get from m_gethdr which is certainly not
the cluster.

> By breaking into ddb it just happend to be in there instead of
> somewhere else.

point.

and resolves my confusion why pool_get on mbpl would fail/sleep ;)

-- 
Henning Brauer, h...@bsws.de, henn...@openbsd.org
BS Web Services, http://bsws.de
Full-Service ISP - Secure Hosting, Mail and DNS Services
Dedicated Servers, Rootservers, Application Hosting



Re: kernel hangs by many connections (reproducable)

2010-09-12 Thread Claudio Jeker
On Mon, Sep 13, 2010 at 06:35:10AM +0200, Bret S. Lambert wrote:
> On Mon, Sep 13, 2010 at 10:12:44AM +0600, Anton Maksimenkov wrote:
> > 2010/9/13 Henning Brauer :
> > >> hangs. 1-2 sec after start.
> > >> --- interrupt ---
> > >> pool_do_get(d0a10b60,0,0,0,60) at pool_do_get+0x2c2
> > >> pool_get(d0a10b60,0,8000,0,0) at pool_get+0x54
> > >> m_gethdr(1,1,8000,369e99,0) at m_gethdr+0x39
> > > too me that simply looks like you are running out of memory in mbpl
> > > and the pool_get is a M_WAITOK one
> > 
> > But it not unfreezed even after minute. SSH connections dropped, com
> > console didn't response (but it can be dropped into ddb, of course).
> 
> yes, because you've soaked up all the memory that's available for
> handling incoming/outgoing network traffic; you've got a bunch of
> processes that try to grab a limited number of resources, fail to
> get all they need, and sleep while holding already-allocated mbufs,
> meaning that nobody else can get them, and none of your processes
> can advance.
> 
> That said, the pool_get that's failing in the re driver is set as
> non-blocking, so it should fail. However, it's hard to see how
> you're tickling this without seeing the source that you're running,
> since we don't know how you're cornholing the network stack.
> 

Oh, that pool_get succeeds since mbufs don't have a limit, but the
allocation of the cluster fails, so the driver will reuse the old buffer on
the queue. By breaking into ddb it just happened to be in there instead of
somewhere else.
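
The "reuse the old buffer" behaviour can be sketched in userland as follows.
This is an illustration only, not code from re(4): struct cl_pool,
cl_get_nowait(), cl_put(), struct rx_slot and rx_one() are hypothetical
names, and the 2048-byte cluster size is an assumption taken from the 0x800
argument visible in the traces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CLUSTER_SIZE	2048	/* assumed; 0x800 as seen in the traces */

/* Hypothetical cluster pool with a hard limit on outstanding buffers. */
struct cl_pool {
	int limit;
	int in_use;
};

/* One receive ring slot holding the buffer the NIC writes into. */
struct rx_slot {
	void *buf;
};

/* Non-blocking allocation: returns NULL at the limit instead of sleeping. */
void *
cl_get_nowait(struct cl_pool *p)
{
	void *buf;

	if (p->in_use >= p->limit)
		return (NULL);
	buf = malloc(CLUSTER_SIZE);
	if (buf != NULL)
		p->in_use++;
	return (buf);
}

/* Return a buffer to the pool. */
void
cl_put(struct cl_pool *p, void *buf)
{
	free(buf);
	p->in_use--;
}

/*
 * Handle one received frame.  If a replacement buffer is available, the
 * filled buffer is handed up and the slot gets the fresh one.  If not,
 * the frame is dropped and the old buffer stays in the slot for reuse.
 */
void *
rx_one(struct cl_pool *p, struct rx_slot *slot)
{
	void *fresh = cl_get_nowait(p);
	void *full;

	if (fresh == NULL)
		return (NULL);	/* drop: slot->buf is reused as-is */

	full = slot->buf;
	slot->buf = fresh;
	return (full);
}
```

In this sketch the receive path never blocks: once the pool is at its limit
every incoming frame is dropped until some buffer is returned with cl_put(),
which from the outside can look like a dead network.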

When running with that many sockets a prominent warning about increasing
kern.maxclusters shows up. This is not just dmesg spam, running
out of mbuf clusters will stop your network stack.
Most systems today come up with 6144 clusters. A TCP connection needs at
least 4 clusters plus the network stack itself needs a few clusters for
things like defragmenting long mbuf chains. So to be able to run with
1 sockets you should set kern.maxclusters to around 10. This will
allow the network stack to allocate around 200MB of memory in kva.
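
The sizing rule above can be put into a small calculation. This is a rough
sketch, not a tuning recommendation: clusters_needed() and cluster_bytes()
are hypothetical helpers, the 2048-byte cluster size is an assumption, and
the headroom value is whatever the caller guesses the stack needs for
itself. Only the "at least 4 clusters per connection" figure and the 6144
default come from the text.

```c
#include <assert.h>

#define CLUSTER_BYTES	2048UL	/* assumed size of one mbuf cluster */

/*
 * Estimate the clusters needed for a number of TCP connections, given a
 * per-connection cost and some headroom for the stack itself.
 */
unsigned long
clusters_needed(unsigned long sockets, unsigned long per_socket,
    unsigned long headroom)
{
	return (sockets * per_socket + headroom);
}

/* Memory consumed if that many clusters are all allocated. */
unsigned long
cluster_bytes(unsigned long clusters)
{
	return (clusters * CLUSTER_BYTES);
}

/* Connections that fit into a given cluster limit, ignoring headroom. */
unsigned long
sockets_supported(unsigned long clusters, unsigned long per_socket)
{
	assert(per_socket > 0);
	return (clusters / per_socket);
}
```

Under these assumptions, the default of 6144 clusters corresponds to 12 MB
of cluster memory and, at 4 clusters per connection, to roughly 1500
connections before the limit is reached.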

-- 
:wq Claudio



Re: kernel hangs by many connections (reproducable)

2010-09-12 Thread Anton Maksimenkov
2010/9/13 Bret S. Lambert :
> yes, because you've soaked up all the memory that's available for
> handling incoming/outgoing network traffic; you've got a bunch of
> processes that try to grab a limited number of resources, fail to
> get all they need, and sleep while holding already-allocated mbufs,
> meaning that nobody else can get them, and none of your processes
> can advance.

Yes, I understand, but the kernel must resolve this situation somehow, mustn't it?

> That said, the pool_get that's failing in the re driver is set as
> non-blocking, so it should fail. However, it's hard to see how
> you're tickling this without seeing the source that you're running,
> since we don't know how you're cornholing the network stack.

Oh, here is the source: http://gist.github.com/576851
It is rather ugly, assembled from pieces of other programs. My
English is bad, so I cut out my comments.
I need a tool to do a brief test of a server under load, so this
"penetrator" was planned as a simple (yes, ugly, but...) and quickly
created tool.
-- 
antonvm



Re: kernel hangs by many connections (reproducable)

2010-09-12 Thread Bret S. Lambert
On Mon, Sep 13, 2010 at 10:12:44AM +0600, Anton Maksimenkov wrote:
> 2010/9/13 Henning Brauer :
> >> hangs. 1-2 sec after start.
> >> --- interrupt ---
> >> pool_do_get(d0a10b60,0,0,0,60) at pool_do_get+0x2c2
> >> pool_get(d0a10b60,0,8000,0,0) at pool_get+0x54
> >> m_gethdr(1,1,8000,369e99,0) at m_gethdr+0x39
> > too me that simply looks like you are running out of memory in mbpl
> > and the pool_get is a M_WAITOK one
> 
> But it not unfreezed even after minute. SSH connections dropped, com
> console didn't response (but it can be dropped into ddb, of course).

yes, because you've soaked up all the memory that's available for
handling incoming/outgoing network traffic; you've got a bunch of
processes that try to grab a limited number of resources, fail to
get all they need, and sleep while holding already-allocated mbufs,
meaning that nobody else can get them, and none of your processes
can advance.

That said, the pool_get that's failing in the re driver is set as
non-blocking, so it should fail. However, it's hard to see how
you're tickling this without seeing the source that you're running,
since we don't know how you're cornholing the network stack.

> -- 
> antonvm



Re: kernel hangs by many connections (reproducable)

2010-09-12 Thread Anton Maksimenkov
2010/9/13 Henning Brauer :
>> hangs. 1-2 sec after start.
>> --- interrupt ---
>> pool_do_get(d0a10b60,0,0,0,60) at pool_do_get+0x2c2
>> pool_get(d0a10b60,0,8000,0,0) at pool_get+0x54
>> m_gethdr(1,1,8000,369e99,0) at m_gethdr+0x39
> too me that simply looks like you are running out of memory in mbpl
> and the pool_get is a M_WAITOK one

But it did not unfreeze even after a minute. SSH connections dropped, and the
com console didn't respond (but it can be dropped into ddb, of course).
-- 
antonvm



Re: kernel hangs by many connections (reproducable)

2010-09-12 Thread Henning Brauer
* Anton Maksimenkov  [2010-09-12 07:35]:
> I use my OBSD machine to test some server on another machine. They are
> connected by pathcord, 1Gbit network cards are used.
> Test program (uses kqueue) do many (I want thousands) connections to
> server. Write query, read answer.
> And it tries to keep that much connections by doing as much new
> connections as needed.
> 
> When number of connections kept below 100 - all ok. But if I raise
> them (upto about 500-1000) the program start these connections, do
> some write/read (show about 10-20 successful reads) and the kernel
> hangs. 1-2 sec after start.
> Tweaks - kern.maxfiles=16384 and openfiles-cur/max=8192 for my user.
> 
> Info from ddb (see dmesg below):
> 
> ddb> show panic
> the kernel did not panic
> 
> ddb> trace
> Debugger(0,3f8,0,0,1) at Debugger+0x4
> comintr(d1571000) at comintr+0x287
> Xrecurse_legacy4() at Xrecurse_legacy4+0xb3
> --- interrupt ---
> pool_do_get(d0a10b60,0,0,0,60) at pool_do_get+0x2c2
> pool_get(d0a10b60,0,8000,0,0) at pool_get+0x54
> m_gethdr(1,1,8000,369e99,0) at m_gethdr+0x39

to me that simply looks like you are running out of memory in mbpl
and the pool_get is an M_WAITOK one

-- 
Henning Brauer, h...@bsws.de, henn...@openbsd.org
BS Web Services, http://bsws.de
Full-Service ISP - Secure Hosting, Mail and DNS Services
Dedicated Servers, Rootservers, Application Hosting