Tony, I'm not saying it's directly related to the maildir patch; I had very good performance until about two weeks ago. The number of active users has increased a little, but the load increased dramatically out of the blue.
I'm really not able to test other cases as far as mailbox format and authentication are concerned, and yes, this is very frustrating for me too. The number of mailboxes is simply too large to switch them to local auth and/or mbox format. We have a test setup, but we're unable to reproduce this kind of behaviour there, even under a very high number of connections; we only seem to hit the problem in the production environment. I've already thrown in two extra boxes (temporarily) which are also handling POP sessions. This buys us some time, but the real problem is certainly not gone; it also gives us the opportunity to test and upgrade some things to see whether that affects the problem.

Now, during peaks, I often see processes waiting to run:

040121 15:16:44 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
040121 15:16:44  r  b   swpd   free   buff   cache   si   so    bi    bo    in    cs us sy id wa
040121 15:16:44  1 18   1792  10608   9620 2002316    0    0    64   401  2605  8911  6 10 85  0
040121 15:16:49 27 15   1792  13892   9608 1998868    0    0    30   890  6057 22022 12 26 62  0
040121 15:16:54 53  9   1792  15940   9624 2001352    0    0    84   187  5711 30160 11 31 59  0
040121 15:16:59 34 10   1792  15604   9668 2014532    0    0    25   326  5919 26858  9 22 68  0
040121 15:17:04  0 17   1792  10544   9728 2016196    0    0    42   371  5158 15635  9 12 78  0
040121 15:17:09  1 15   1792  10488   9624 2005556    0    0    42   364  6380 23565 12 24 64  0
040121 15:17:14  1 21   1792  18812   9620 2007196    0    0    22   816  6423 24094 13 25 62  0
040121 15:17:19  0 22   1792  11908   9640 2014016    0    0   106   482  6697 27224 11 25 64  0
040121 15:17:24 50 13   1792  15136   9648 2014668    0    0    63   179  4901 23929  7 23 69  0
040121 15:17:29  1 17   1792  10652   9692 2019968    0    0    35   186  4472 20241  6 16 77  0
040121 15:17:34  0 18   1792  12152   9716 2012088    0    0    58   221  5410 27104  9 24 67  0

The mailspool is located on a NAS, so it's accessed over NFS. The NAS itself seems to be doing fine, though; no performance issues to be found there so far, but we're checking this of course.

The problem with running a popper under strace is that it outputs an awful lot of data, and the load usually doesn't start increasing right away. I did see a bunch (1021) of fd errors, but only at startup, like this:

15:54:41.272417 open("/dev/null", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
15:54:41.272589 fork() = 25607
[pid 25607] 15:54:41.272788 setsid( <unfinished ...>
[pid 25605] 15:54:41.272876 semget(IPC_PRIVATE, 0, 0x1|0 <unfinished ...>
[pid 25607] 15:54:41.272924 <... setsid resumed> ) = 25607
[pid 25605] 15:54:41.272952 <... semget resumed> ) = -1 ENOSYS (Function not implemented)
[pid 25607] 15:54:41.272989 fork( <unfinished ...>
[pid 25605] 15:54:41.273028 _exit(0) = ?
[pid 25607] 15:54:41.273101 <... fork resumed> ) = 25608
[pid 25608] 15:54:41.273314 chdir("/") = 0
[pid 25607] 15:54:41.273428 semget(IPC_PRIVATE, 0, 0x1|0 <unfinished ...>
[pid 25608] 15:54:41.273479 getrlimit(0x7, 0xbffff7f8 <unfinished ...>
[pid 25607] 15:54:41.273518 <... semget resumed> ) = -1 ENOSYS (Function not implemented)
[pid 25608] 15:54:41.273546 <... getrlimit resumed> ) = 0
[pid 25607] 15:54:41.273579 _exit(0) = ?
15:54:41.273610 close(1024) = -1 EBADF (Bad file descriptor)
15:54:41.273736 close(1023) = -1 EBADF (Bad file descriptor)
15:54:41.273828 close(1022) = -1 EBADF (Bad file descriptor)
15:54:41.273901 close(1021) = -1 EBADF (Bad file descriptor)
etc...

I have an strace output generated during the peaks; if you're interested I can mail it to you (27 MB) or a portion of it.
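For what it's worth, the close() storm at startup looks like ordinary daemon housekeeping rather than a problem: the process asks for its descriptor limit (the getrlimit(0x7, ...) call, i.e. RLIMIT_NOFILE) and then closes every possible fd, so anything that was never open returns EBADF. Here's a minimal sketch in C of that pattern, assuming that's what qpopper is doing at startup (my own illustration, not qpopper's actual code):

#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

/* Close every inherited descriptor above stderr, the way many daemons do
 * at startup.  Descriptors that were never open simply fail with EBADF,
 * which is harmless -- the same pattern as the strace excerpt above
 * (close(1024), close(1023), ... all returning -1 EBADF). */
static void close_inherited_fds(void)
{
    struct rlimit rl;
    int fd, maxfd = 1024;                 /* fallback if the limit is unknown */

    if (getrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY)
        maxfd = (int)rl.rlim_cur;

    for (fd = maxfd; fd > 2; fd--)
        (void)close(fd);                  /* EBADF here is expected, not an error */
}

int main(void)
{
    close_inherited_fds();
    puts("inherited descriptors closed");
    return 0;
}

Those failed close() calls are cheap, so I doubt they have anything to do with the load; it just explains the 1021 EBADF lines at startup.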
I've even tried a 2.6.1 kernel to see if it had any effect, but it didn't, until I noticed that a Debian 3 box was showing the same problem, though less intensively than a Slackware 9 box. The major difference between those distros is the gcc version: 2.95.x vs. 3.2.2. Compiling qpopper with gcc 2.95.x did not have any effect on the Slackware boxes.

bart

On Tue, Jan 20, 2004 at 03:45:49PM -0800, The Little Prince wrote:
> I haven't heard of any performance problems with my patch. People have
> reported really good perf. with thousands of users.
> Nobody has reported anything with radius auth. used at the same time.
> Not being able to test any other cases, e.g. local auth. and maildir,
> radius, and mbox, etc., doesn't help you.
> Like Clifton said, check your stats. Watch vmstat statistics.
> Even strace some of the processes to see what calls they spend the most
> time in.
>
> --Tony
> .-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.
>    Anthony J. Biacco                       Network Administrator/Engineer
>    [EMAIL PROTECTED]                       http://www.asteroid-b612.org
>
>    "You find magic from your god, and I find magic everywhere"
> .-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.
>
> On Tue, 20 Jan 2004, Clifton Royston wrote:
>
> > On Mon, Jan 19, 2004 at 03:47:38PM +0100, Bart Dumon wrote:
> > > i'm running qpopper 4.0.5 on linux (2.4.x) with maildir patch
> > > (0.12) and pam_radius for authentication.
> > >
> > > right now, i'm suffering from high cpu load averages: once it
> > > gets too busy the load will skyrocket to abnormally high values
> > > and the service will become unavailable until it's restarted.
> > > this typically happens during peak times when we receive 15 pop
> > > sessions/sec.
> > >
> > > at first i thought it was radius related because i'm seeing the
> > > following error message during the peak times:
> > >
> > > Jan 19 14:07:41 xxx popper[13404]: pam_radius_auth: RADIUS server x.x.x.x failed
> > > to respond
> > >
> > > but even with a more performant radius, the problem persists; it
> > > looks like the radius errors are a consequence of the problem and
> > > not the real cause.
> > > everything is pointing in the direction of the number of pop sessions:
> > > whenever you get to the 13-14 pops/sec barrier, qpopper seems to
> > > be giving up. it's not traffic related because the amount of traffic
> > > is higher outside the peak hours.
> >
> > Usually this kind of overload is due to many users having large
> > mailboxes (e.g. 30MB and up) in the old UNIX mbox format. In this
> > format, the file needs to be recopied to update the messages' status
> > when popped, which results in the POP sessions completely saturating
> > your disk I/O bandwidth.
> >
> > I have also seen some Radius daemons show a tendency to die under
> > this type of heavy load.
> >
> > I haven't seen reports of this with maildir format. However, what
> > you're describing is consistent with I/O bandwidth saturation.
> >
> > If you are saturating your disk bandwidth, you'll see a large number
> > of concurrent tasks waiting to run ("load" as shown by the uptime
> > command or xload) but a high proportion of idle time shown by vmstat.
> > At that point you'll need to try to figure out why all this bandwidth
> > use is still going on even with maildir format; I don't use that patch,
> > so I can't help with troubleshooting it.
> > -- Clifton
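To make sure we're talking about the same thing: my understanding of the mbox behaviour Clifton describes is the rewrite-the-whole-spool pattern sketched below. This is a simplified illustration with made-up file names and header handling, not qpopper's actual code; the point is only that updating per-message status costs a full copy of the mailbox, so a 30 MB mbox means roughly 30 MB of read plus 30 MB of write per POP session.

#include <stdio.h>
#include <string.h>

/* Simplified sketch of why large mbox spools hurt: to change a message's
 * status the whole file is copied to a new one and renamed back, so the
 * I/O cost scales with mailbox size, not with the number of new messages.
 * File names and the Status: handling are invented for the example. */
static int rewrite_mbox(const char *spool, const char *tmp)
{
    FILE *in = fopen(spool, "r");
    FILE *out = fopen(tmp, "w");
    char line[4096];

    if (!in || !out) {
        if (in) fclose(in);
        if (out) fclose(out);
        return -1;
    }

    while (fgets(line, sizeof line, in)) {
        if (strncmp(line, "Status:", 7) == 0)
            fputs("Status: RO\n", out);   /* the one small change we wanted... */
        else
            fputs(line, out);             /* ...but every other byte gets copied too */
    }

    fclose(in);
    fclose(out);
    return rename(tmp, spool);            /* swap the rewritten copy into place */
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <mbox> <tmpfile>\n", argv[0]);
        return 1;
    }
    return rewrite_mbox(argv[1], argv[2]) == 0 ? 0 : 1;
}

With maildir each message lives in its own file and flags are kept in the filename, so this full-copy step shouldn't be needed, which is exactly why the I/O pattern we're seeing on the NFS spool still puzzles me.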