Tony,

I'm not saying it's directly related to the maildir patch; I've had
very good performance until just two weeks ago. The number of active
users has increased a little, but the load increased dramatically,
out of the blue.

I'm really not able to test other cases as far as mailbox format and
authentication are concerned, and yes, this is very frustrating for me
too. The number of mailboxes is just too big to switch them to local
auth and/or mbox format. We have a test setup, but we're unable to
reproduce this kind of behavior there, even under a very high number
of connections; we only seem to hit this problem in the production
environment. I've already thrown in 2 extra boxes (temporarily) which
are also handling POP sessions. This buys us some time, but the real
problem is certainly not gone, and it also gives us the opportunity to
test and upgrade some things to see whether that affects the problem.

Now, during peaks, I often see processes waiting to run:

040121 15:16:44  procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
040121 15:16:44   r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
040121 15:16:44   1 18   1792  10608   9620 2002316    0    0    64   401 2605  8911  6 10 85  0
040121 15:16:49  27 15   1792  13892   9608 1998868    0    0    30   890 6057 22022 12 26 62  0
040121 15:16:54  53  9   1792  15940   9624 2001352    0    0    84   187 5711 30160 11 31 59  0
040121 15:16:59  34 10   1792  15604   9668 2014532    0    0    25   326 5919 26858  9 22 68  0
040121 15:17:04   0 17   1792  10544   9728 2016196    0    0    42   371 5158 15635  9 12 78  0
040121 15:17:09   1 15   1792  10488   9624 2005556    0    0    42   364 6380 23565 12 24 64  0
040121 15:17:14   1 21   1792  18812   9620 2007196    0    0    22   816 6423 24094 13 25 62  0
040121 15:17:19   0 22   1792  11908   9640 2014016    0    0   106   482 6697 27224 11 25 64  0
040121 15:17:24  50 13   1792  15136   9648 2014668    0    0    63   179 4901 23929  7 23 69  0
040121 15:17:29   1 17   1792  10652   9692 2019968    0    0    35   186 4472 20241  6 16 77  0
040121 15:17:34   0 18   1792  12152   9716 2012088    0    0    58   221 5410 27104  9 24 67  0
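
(The samples above are 5 seconds apart; for reference, this kind of
timestamped log can be captured with plain vmstat piped through a small
shell loop. The exact invocation here is just a sketch:)

# sample every 5 seconds and prefix each line with a yymmdd HH:MM:SS
# timestamp, matching the log above
vmstat 5 | while read -r line; do
    printf '%s %s\n' "$(date +'%y%m%d %H:%M:%S')" "$line"
done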

The mailspool is located on a NAS, so it's accessed over NFS. However,
the NAS seems to be doing fine; no performance issues have been found
there so far, but we're checking this, of course.
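
(For the client side of the NFS mount we're watching the standard RPC
counters; the interpretation is our assumption: a "retrans" count that
keeps climbing during the spikes would point at the network/NAS path
rather than at qpopper itself.)

# client-side NFS RPC statistics, including retransmissions
nfsstat -c

# the same counters, raw from the kernel
cat /proc/net/rpc/nfs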

The problem with running a popper under strace is that it outputs an
awful lot of data, and usually the load doesn't start increasing right
away. I did see a bunch (1021) of fd errors, but only at startup, like
this:

15:54:41.272417 open("/dev/null", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
15:54:41.272589 fork()                  = 25607
[pid 25607] 15:54:41.272788 setsid( <unfinished ...>
[pid 25605] 15:54:41.272876 semget(IPC_PRIVATE, 0, 0x1|0 <unfinished ...>
[pid 25607] 15:54:41.272924 <... setsid resumed> ) = 25607
[pid 25605] 15:54:41.272952 <... semget resumed> ) = -1 ENOSYS (Function not implemented)
[pid 25607] 15:54:41.272989 fork( <unfinished ...>
[pid 25605] 15:54:41.273028 _exit(0)    = ?
[pid 25607] 15:54:41.273101 <... fork resumed> ) = 25608
[pid 25608] 15:54:41.273314 chdir("/")  = 0
[pid 25607] 15:54:41.273428 semget(IPC_PRIVATE, 0, 0x1|0 <unfinished ...>
[pid 25608] 15:54:41.273479 getrlimit(0x7, 0xbffff7f8 <unfinished ...>
[pid 25607] 15:54:41.273518 <... semget resumed> ) = -1 ENOSYS (Function not implemented)
[pid 25608] 15:54:41.273546 <... getrlimit resumed> ) = 0
[pid 25607] 15:54:41.273579 _exit(0)    = ?
15:54:41.273610 close(1024)             = -1 EBADF (Bad file descriptor)
15:54:41.273736 close(1023)             = -1 EBADF (Bad file descriptor)
15:54:41.273828 close(1022)             = -1 EBADF (Bad file descriptor)
15:54:41.273901 close(1021)             = -1 EBADF (Bad file descriptor)
etc...
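
(Those EBADF results look harmless: the getrlimit(0x7, ...) just before
them is RLIMIT_NOFILE on Linux, and the run of close() calls counting
down from 1024 is the usual daemonize loop closing every descriptor up
to the open-file soft limit. Closing an fd that was never open simply
returns EBADF, and 1021 errors matches fds 4 through 1024, with 0-3 in
use. Easy to check the limit on the box, assuming the default of 1024:)

# per-process soft limit on open files; 1024 here would account for
# the close(1024) .. close(4) run in the trace
ulimit -n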

I have strace output generated during the peaks; if you're interested,
I can mail it to you (27MB), or a portion of it.
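
(Rather than the full dump, it may be easier to attach to one busy
popper during a peak and let strace aggregate per-syscall totals, which
is a screenful instead of 27MB. The PID below is just illustrative:)

# -c: print a summary of time/calls/errors per syscall on exit,
# -f: follow forked children, -p: attach to a running process
strace -c -f -p 13404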

I've even tried a 2.6.1 kernel to see if it had any effect, but it
didn't. I did notice that a Debian 3 box was having the same problem,
though less intensely than a Slackware 9 box. The major difference
between those distros is the gcc version (2.95.x vs. 3.2.2), but
compiling qpopper with gcc 2.95.x did not have any effect on the
Slackware boxes.


Bart

On Tue, Jan 20, 2004 at 03:45:49PM -0800, The Little Prince wrote:
> I haven't heard of any performance problems with my patch. People have
> reported really good perf. with thousands of users.
> Nobody has reported anything with radius auth. used at the same time.
> Not being able to test any other cases (e.g. local auth and maildir,
> radius and mbox, etc.) doesn't help you.
> Like Clifton said, check your stats. Watch vmstat statistics.
> Even strace some of the processes to see what calls they spend the most 
> time in.
> 
> --Tony
> .-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.
> Anthony J. Biacco                            Network Administrator/Engineer
> [EMAIL PROTECTED]              http://www.asteroid-b612.org
> 
>        "You find magic from your god, and I find magic everywhere" 
> .-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.
> 
> On Tue, 20 Jan 2004, Clifton Royston wrote:
> 
> > On Mon, Jan 19, 2004 at 03:47:38PM +0100, Bart Dumon wrote:
> > > I'm running qpopper 4.0.5 on Linux (2.4.x) with the maildir patch
> > > (0.12) and pam_radius for authentication.
> > > 
> > > Right now I'm suffering from high CPU load averages: once it gets
> > > too busy, the load skyrockets to abnormally high values and the
> > > service becomes unavailable until it's restarted. This typically
> > > happens during peak times, when we receive 15 POP sessions/sec.
> > > 
> > > At first I thought it was RADIUS-related, because I'm seeing the
> > > following error message during the peak times:
> > > 
> > > Jan 19 14:07:41 xxx popper[13404]: pam_radius_auth: RADIUS server x.x.x.x failed to respond
> > > 
> > > but even with a faster RADIUS server the problem persists; it
> > > looks like the RADIUS errors are a consequence of the problem and
> > > not the real cause.
> > > Everything points to the number of POP sessions: whenever we hit
> > > the 13-14 sessions/sec barrier, qpopper seems to give up. It's not
> > > traffic-related, because the amount of traffic is higher outside
> > > the peak hours.
> > 
> >   Usually this kind of overload is due to many users having large
> > mailboxes (e.g. 30MB and up) in the old UNIX mbox format.  In this
> > format, the file needs to be recopied to update the messages' status
> > when popped, which results in the POP sessions completely saturating
> > your disk I/O bandwidth.
> > 
> >   I have also seen some Radius daemons show a tendency to die under
> > this type of heavy load.
> > 
> >   I haven't seen reports of this with maildir format.  However, what
> > you're describing is consistent with I/O bandwidth saturation.
> > 
> >   If you are saturating your disk bandwidth, you'll see a large number
> > of concurrent tasks waiting to run ("load" as shown by the uptime
> > command or xload) but a high proportion of idle time shown by vmstat.
> > At that point you'll need to figure out why all this bandwidth is
> > still being consumed even with maildir format; I don't use that
> > patch, so I can't help with troubleshooting it.
> >   -- Clifton
