On Wed, 21 Jan 2004, Bart Dumon wrote:

> Tony,
>
> i'm not saying it's directly related to the maildir patch, i've had
> very good performance up until 2 weeks ago. the amount of active
> users has increased a little, but the load increased dramatically out
> of the blue.
>
Forgive me, Bart, it just totally went over my head that you and I had talked
a while ago when you had the rename() problem. When you said NAS and nfs, the
little light bulb went on above my head :-)

Those context switch/cs numbers under the vmstat seem awfully high to me. I
wish I had something good to tell you. How many boxes do you have now? Didn't
you have like 4 last time we talked, with 300k users? Did you do any SW
upgrades between the time it was going fine and when it started crawling?
That's probably a stupid question. Might as well send the strace to me and
I'll check it out.. not sure what else to tell you.

Btw, did I ever send you anything regarding that unpopable zero-filesize
problem? I don't remember. If not, I'm sorry, it must have slipped my mind
and I'll revisit it.

--Tony
.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.
Anthony J. Biacco                         Network Administrator/Engineer
[EMAIL PROTECTED]                         http://www.asteroid-b612.org

"You find magic from your god, and I find magic everywhere"
.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.

> i'm really not able to test other cases as far as mailbox format and
> authentication are concerned. and yes, this is very frustrating for me
> too. the number of mailboxes is just too big to switch them over
> to local auth and/or mbox format. we've had a test setup, but we're
> unable to reproduce this kind of behavior even when subjecting it to a
> very high number of connections. we only seem able to get this problem in
> the production environment. i've already thrown in 2 extra boxes
> (temporary) which are also handling pop sessions (this buys us some time,
> but the real problem is certainly not gone). this also gives us the
> opportunity to test some stuff and upgrade some stuff to see if it affects
> the problem.
>
> now, during peaks, i often see processes waiting to run:
>
> 040121 15:16:44 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> 040121 15:16:44  r  b  swpd   free  buff   cache  si  so   bi   bo    in    cs  us sy id wa
> 040121 15:16:44  1 18  1792  10608  9620 2002316   0   0   64  401  2605  8911   6 10 85  0
> 040121 15:16:49 27 15  1792  13892  9608 1998868   0   0   30  890  6057 22022  12 26 62  0
> 040121 15:16:54 53  9  1792  15940  9624 2001352   0   0   84  187  5711 30160  11 31 59  0
> 040121 15:16:59 34 10  1792  15604  9668 2014532   0   0   25  326  5919 26858   9 22 68  0
> 040121 15:17:04  0 17  1792  10544  9728 2016196   0   0   42  371  5158 15635   9 12 78  0
> 040121 15:17:09  1 15  1792  10488  9624 2005556   0   0   42  364  6380 23565  12 24 64  0
> 040121 15:17:14  1 21  1792  18812  9620 2007196   0   0   22  816  6423 24094  13 25 62  0
> 040121 15:17:19  0 22  1792  11908  9640 2014016   0   0  106  482  6697 27224  11 25 64  0
> 040121 15:17:24 50 13  1792  15136  9648 2014668   0   0   63  179  4901 23929   7 23 69  0
> 040121 15:17:29  1 17  1792  10652  9692 2019968   0   0   35  186  4472 20241   6 16 77  0
> 040121 15:17:34  0 18  1792  12152  9716 2012088   0   0   58  221  5410 27104   9 24 67  0
>
> the mailspool is located on a NAS, so it's accessed using nfs. however,
> the NAS seems to be doing fine, no performance issues to be found there
> until now, but we're checking this of course.
>
> the problem with running a popper under strace is that it outputs an awful
> lot of data, and usually the load doesn't start increasing right away.
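As an aside on the burst of close() = -1 EBADF calls quoted just below: that
pattern is almost certainly the usual daemon cleanup idiom, where the freshly
forked process closes every descriptor up to the RLIMIT_NOFILE limit and the
closes on descriptors that were never open simply come back EBADF. Something
roughly like this (a sketch from memory, not qpopper's actual source):

#include <sys/resource.h>
#include <unistd.h>

static void close_inherited_fds(void)
{
    struct rlimit rl;
    int fd, maxfd = 1024;                 /* fallback if getrlimit() fails */

    if (getrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY)
        maxfd = (int) rl.rlim_cur;

    for (fd = maxfd; fd > 2; fd--)        /* keep stdin/stdout/stderr open */
        (void) close(fd);                 /* EBADF just means "was never open" */
}

So those startup EBADF lines should be harmless noise rather than anything to
do with the load.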
> i did see a bunch (1021) of fd errors, but only at startup, like this:
>
> 15:54:41.272417 open("/dev/null", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
> 15:54:41.272589 fork() = 25607
> [pid 25607] 15:54:41.272788 setsid( <unfinished ...>
> [pid 25605] 15:54:41.272876 semget(IPC_PRIVATE, 0, 0x1|0 <unfinished ...>
> [pid 25607] 15:54:41.272924 <... setsid resumed> ) = 25607
> [pid 25605] 15:54:41.272952 <... semget resumed> ) = -1 ENOSYS (Function not implemented)
> [pid 25607] 15:54:41.272989 fork( <unfinished ...>
> [pid 25605] 15:54:41.273028 _exit(0) = ?
> [pid 25607] 15:54:41.273101 <... fork resumed> ) = 25608
> [pid 25608] 15:54:41.273314 chdir("/") = 0
> [pid 25607] 15:54:41.273428 semget(IPC_PRIVATE, 0, 0x1|0 <unfinished ...>
> [pid 25608] 15:54:41.273479 getrlimit(0x7, 0xbffff7f8 <unfinished ...>
> [pid 25607] 15:54:41.273518 <... semget resumed> ) = -1 ENOSYS (Function not implemented)
> [pid 25608] 15:54:41.273546 <... getrlimit resumed> ) = 0
> [pid 25607] 15:54:41.273579 _exit(0) = ?
> 15:54:41.273610 close(1024) = -1 EBADF (Bad file descriptor)
> 15:54:41.273736 close(1023) = -1 EBADF (Bad file descriptor)
> 15:54:41.273828 close(1022) = -1 EBADF (Bad file descriptor)
> 15:54:41.273901 close(1021) = -1 EBADF (Bad file descriptor)
> etc...
>
> i have strace output generated during the peaks; if you're interested, i
> can mail it to you (27MB) or a portion of it.
>
> i've even tried a 2.6.1 kernel to see if it had any effect, but it didn't.
> i did notice that a debian3 box shows the same problem, but less severely
> than a slack9 box. the major difference between those distros is the gcc
> version (2.95.x vs 3.2.2); compiling qpopper with gcc 2.95.x did not have
> any effect on the slack boxes.
>
>
> bart
>
> On Tue, Jan 20, 2004 at 03:45:49PM -0800, The Little Prince wrote:
> > I haven't heard of any performance problems with my patch. People have
> > reported really good perf. with thousands of users.
> > Nobody has reported anything with radius auth. used at the same time.
> > Not being able to test any other cases (e.g. local auth. and maildir,
> > radius and mbox, etc.) doesn't help you.
> > Like Clifton said, check your stats. Watch vmstat statistics.
> > Even strace some of the processes to see what calls they spend the most
> > time in.
> >
> > --Tony
> > .-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.
> > Anthony J. Biacco                         Network Administrator/Engineer
> > [EMAIL PROTECTED]                         http://www.asteroid-b612.org
> >
> > "You find magic from your god, and I find magic everywhere"
> > .-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.
> >
> > On Tue, 20 Jan 2004, Clifton Royston wrote:
> >
> > > On Mon, Jan 19, 2004 at 03:47:38PM +0100, Bart Dumon wrote:
> > > > i'm running qpopper 4.0.5 on linux (2.4.x) with maildir patch
> > > > (0.12) and pam_radius for authentication.
> > > >
> > > > right now, i'm suffering from high cpu load averages. once it
> > > > gets too busy, the load will skyrocket to abnormally high values
> > > > and the service will become unavailable until it's restarted.
> > > > this typically happens during peak times when we receive 15 pop
> > > > sessions/sec.
> > > >
> > > > at first i thought it was radius related, because i'm seeing the
> > > > following error message during the peak times:
> > > >
> > > > Jan 19 14:07:41 xxx popper[13404]: pam_radius_auth: RADIUS server x.x.x.x failed to respond
> > > >
> > > > but even with a more performant radius server, the problem persists.
> > > > it looks like the radius errors are a consequence of the problem and
> > > > not the real cause.
> > > > everything is pointing in the direction of the number of pop
> > > > sessions: whenever you get to the 13-14 pops/sec barrier, qpopper
> > > > seems to give up. it's not traffic related, because the amount of
> > > > traffic is higher outside the peak hours.
> > >
> > > Usually this kind of overload is due to many users having large
> > > mailboxes (e.g. 30MB and up) in the old UNIX mbox format. In this
> > > format, the file needs to be recopied to update the messages' status
> > > when popped, which results in the POP sessions completely saturating
> > > your disk I/O bandwidth.
> > >
> > > I have also seen some Radius daemons show a tendency to die under
> > > this type of heavy load.
> > >
> > > I haven't seen reports of this with maildir format. However, what
> > > you're describing is consistent with I/O bandwidth saturation.
> > >
> > > If you are saturating your disk bandwidth, you'll see a large number
> > > of concurrent tasks waiting to run ("load" as shown by the uptime
> > > command or xload) but a high proportion of idle time shown by vmstat.
> > > At that point you'll need to try to figure out why all this bandwidth
> > > is still being consumed even with maildir format; I don't use that
> > > patch, so I can't help with troubleshooting it.
> > > -- Clifton
> >

--
.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.
Anthony J. Biacco                         Network Administrator/Engineer
[EMAIL PROTECTED]                         http://www.asteroid-b612.org

"You find magic from your god, and I find magic everywhere"
.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.
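P.S. On Clifton's mbox point above: marking a message's status in a classic
mbox spool means rewriting the whole spool file, while in maildir it is just
a rename of that one message file. A very rough sketch of the difference
(illustrative only, not qpopper's actual code; the function names
update_mbox_status and mark_seen_maildir are made up):

#include <stdio.h>
#include <string.h>

/* mbox: every message shares one big spool file, so changing a single
 * Status: header means copying the entire file to a temp copy and
 * renaming it back over the original (a 30MB mailbox costs roughly
 * 30MB of I/O per update). */
int update_mbox_status(const char *spool, const char *tmp)
{
    FILE *in = fopen(spool, "r");
    FILE *out = fopen(tmp, "w");
    char line[4096];

    if (!in || !out) {
        if (in) fclose(in);
        if (out) fclose(out);
        return -1;
    }
    while (fgets(line, sizeof line, in)) {
        if (strncmp(line, "Status: ", 8) == 0)
            fputs("Status: RO\n", out);   /* rewrite the header... */
        else
            fputs(line, out);             /* ...and copy every other byte */
    }
    fclose(in);
    fclose(out);
    return rename(tmp, spool);            /* swap in the rewritten spool */
}

/* maildir: one file per message, so "seen" is a single rename() from
 * new/ to cur/ with a flag suffix; no bulk copying at all. */
int mark_seen_maildir(const char *newpath, const char *curpath_with_flags)
{
    return rename(newpath, curpath_with_flags);
}

With maildir that bulk-rewrite cost should mostly disappear, which is why
Clifton says he hasn't seen this with maildir and why the vmstat and NFS
numbers are worth chasing instead.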