On Wed, 21 Jan 2004, Bart Dumon wrote:

> Tony,
> 
> I'm not saying it's directly related to the maildir patch; I've had
> very good performance until just 2 weeks ago. The number of active
> users has increased a little, but the load increased dramatically out
> of the blue.
> 

Forgive me, Bart, it just totally went over my head that you and I had 
talked a while ago when you had the rename() problem.
When you said NAS and NFS, the little light bulb went on above my head 
:-)
Those context switch/cs numbers under the vmstat seem awfully high to me. 
I wish I had something good to tell you.
How many boxes have you got now? Didn't you have like 4 last time we talked, 
with 300k users?
Did you do any SW upgrades between the time it was going fine and when it 
started crawling? That's probably a stupid question.
Might as well send the strace to me, and I'll check it out... not sure what 
else to tell you.
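One thing I can tell you: those 1021 close() = EBADF lines at startup are 
harmless. That's just the usual daemonize cleanup sweeping every descriptor 
up to RLIMIT_NOFILE (that's the getrlimit(0x7, ...) right before it). It's 
roughly the pattern below -- not necessarily qpopper's exact code, just the 
shape of it:

  #include <sys/resource.h>
  #include <unistd.h>

  /* Close every fd a forked daemon might have inherited.  Most were
   * never open, so close() just returns -1/EBADF -- that's the burst
   * of errors in the strace, not a real problem. */
  static void close_inherited_fds(void)
  {
      struct rlimit rl;
      int fd;

      if (getrlimit(RLIMIT_NOFILE, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY)
          rl.rlim_cur = 1024;              /* fall back to a sane default */

      for (fd = (int)rl.rlim_cur; fd > 2; fd--)  /* leave stdin/out/err alone */
          close(fd);
  }

With a 1024 limit and only fds 0-3 open, that works out to exactly 1021 
EBADFs, so the numbers line up. Nothing to chase there.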
Btw, did I ever send you anything regarding that unpopable zero-filesize 
problem? I don't remember. If not, I'm sorry, it must have slipped my mind 
and I'll revisit it.

--Tony
.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.
Anthony J. Biacco                            Network Administrator/Engineer
[EMAIL PROTECTED]              http://www.asteroid-b612.org

       "You find magic from your god, and I find magic everywhere" 
.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.


> I'm really not able to test other cases as far as mailbox format and
> authentication are concerned, and yes, this is very frustrating for me
> too. The number of mailboxes is just too big to switch them
> to local auth and/or mbox format. We've had a test setup, but we're
> unable to reproduce this kind of behavior even when subjected to a very
> high number of connections. We only seem able to get this problem in the
> production environment. I've already thrown in 2 extra boxes (temporarily)
> which are also handling POP sessions (this buys us some time, but the 
> real problem is certainly not gone). This also gives us the opportunity to 
> test some stuff and upgrade some stuff to see if it affects the problem. 
> 
> Now, during peaks, I often see processes waiting to run:
> 
> 040121 15:16:44  procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> 040121 15:16:44   r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
> 040121 15:16:44   1 18   1792  10608   9620 2002316    0    0    64   401 2605  8911  6 10 85  0
> 040121 15:16:49  27 15   1792  13892   9608 1998868    0    0    30   890 6057 22022 12 26 62  0
> 040121 15:16:54  53  9   1792  15940   9624 2001352    0    0    84   187 5711 30160 11 31 59  0
> 040121 15:16:59  34 10   1792  15604   9668 2014532    0    0    25   326 5919 26858  9 22 68  0
> 040121 15:17:04   0 17   1792  10544   9728 2016196    0    0    42   371 5158 15635  9 12 78  0
> 040121 15:17:09   1 15   1792  10488   9624 2005556    0    0    42   364 6380 23565 12 24 64  0
> 040121 15:17:14   1 21   1792  18812   9620 2007196    0    0    22   816 6423 24094 13 25 62  0
> 040121 15:17:19   0 22   1792  11908   9640 2014016    0    0   106   482 6697 27224 11 25 64  0
> 040121 15:17:24  50 13   1792  15136   9648 2014668    0    0    63   179 4901 23929  7 23 69  0
> 040121 15:17:29   1 17   1792  10652   9692 2019968    0    0    35   186 4472 20241  6 16 77  0
> 040121 15:17:34   0 18   1792  12152   9716 2012088    0    0    58   221 5410 27104  9 24 67  0
> 
> The mailspool is located on a NAS, so it's accessed using NFS. However,
> the NAS seems to be doing fine; no performance issues to be found there 
> until now, but we're checking this, of course.
> 
> The problem with running a popper under strace is that it outputs an awful
> lot of data, and usually the load doesn't start increasing right away. I did
> see a bunch (1021) of fd errors, only at startup, like this:
> 
> 15:54:41.272417 open("/dev/null", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
> 15:54:41.272589 fork()                  = 25607
> [pid 25607] 15:54:41.272788 setsid( <unfinished ...>
> [pid 25605] 15:54:41.272876 semget(IPC_PRIVATE, 0, 0x1|0 <unfinished ...>
> [pid 25607] 15:54:41.272924 <... setsid resumed> ) = 25607
> [pid 25605] 15:54:41.272952 <... semget resumed> ) = -1 ENOSYS (Function not implemented)
> [pid 25607] 15:54:41.272989 fork( <unfinished ...>
> [pid 25605] 15:54:41.273028 _exit(0)    = ?
> [pid 25607] 15:54:41.273101 <... fork resumed> ) = 25608
> [pid 25608] 15:54:41.273314 chdir("/")  = 0
> [pid 25607] 15:54:41.273428 semget(IPC_PRIVATE, 0, 0x1|0 <unfinished ...>
> [pid 25608] 15:54:41.273479 getrlimit(0x7, 0xbffff7f8 <unfinished ...>
> [pid 25607] 15:54:41.273518 <... semget resumed> ) = -1 ENOSYS (Function not implemented)
> [pid 25608] 15:54:41.273546 <... getrlimit resumed> ) = 0
> [pid 25607] 15:54:41.273579 _exit(0)    = ?
> 15:54:41.273610 close(1024)             = -1 EBADF (Bad file descriptor)
> 15:54:41.273736 close(1023)             = -1 EBADF (Bad file descriptor)
> 15:54:41.273828 close(1022)             = -1 EBADF (Bad file descriptor)
> 15:54:41.273901 close(1021)             = -1 EBADF (Bad file descriptor)
> etc...
> 
> I have strace output generated during the peaks; if you're interested I
> can mail it to you (27MB), or a portion of it.
> 
> I've even tried a 2.6.1 kernel to see if it had any effect, but it didn't.
> I did notice that a Debian 3 box has the same problem, but less
> intensely than a Slackware 9 box; the major difference between those distros is
> the gcc version, 2.95.x vs. 3.2.2. Compiling qpopper with gcc 2.95.x did not
> have any effect on the Slackware boxes.
> 
> 
> bart
> 
> On Tue, Jan 20, 2004 at 03:45:49PM -0800, The Little Prince wrote:
> > I haven't heard of any performance problems with my patch. People have 
> > reported really good perf. with thousands of users.
> > Nobody has reported anything with radius auth. used at the same time.
> > Not being able to test other cases (e.g. local auth and maildir, 
> > radius and mbox, etc.) doesn't help you.
> > Like Clifton said, check your stats. Watch vmstat statistics.
> > Even strace some of the processes to see what calls they spend the most 
> > time in.
> > 
> > --Tony
> > .-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.
> > Anthony J. Biacco                            Network Administrator/Engineer
> > [EMAIL PROTECTED]              http://www.asteroid-b612.org
> > 
> >        "You find magic from your god, and I find magic everywhere" 
> > .-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.
> > 
> > On Tue, 20 Jan 2004, Clifton Royston wrote:
> > 
> > > On Mon, Jan 19, 2004 at 03:47:38PM +0100, Bart Dumon wrote:
> > > > I'm running qpopper 4.0.5 on Linux (2.4.x) with the maildir patch 
> > > > (0.12) and pam_radius for authentication. 
> > > > 
> > > > Right now I'm suffering from high CPU load averages. Once it
> > > > gets too busy, the load will skyrocket to abnormally high values
> > > > and the service will become unavailable until it's restarted. 
> > > > This typically happens during peak times when we receive 15 POP 
> > > > sessions/sec.
> > > > 
> > > > At first I thought it was RADIUS-related, because I'm seeing the
> > > > following error message during the peak times:
> > > > 
> > > > Jan 19 14:07:41 xxx popper[13404]: pam_radius_auth: RADIUS server x.x.x.x failed to respond
> > > > 
> > > > But even with a faster RADIUS server the problem persists; it
> > > > looks like the RADIUS errors are a consequence of the problem and
> > > > not the real cause.
> > > > Everything is pointing in the direction of the number of POP sessions:
> > > > whenever you hit the 13-14 sessions/sec barrier, qpopper seems to
> > > > give up. It's not traffic-related, because the amount of traffic
> > > > is higher outside the peak hours.
> > > 
> > >   Usually this kind of overload is due to many users having large
> > > mailboxes (e.g. 30MB and up) in the old UNIX mbox format.  In this
> > > format, the file needs to be recopied to update the messages' status
> > > when popped, which results in the POP sessions completely saturating
> > > your disk I/O bandwidth.
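> > > 
> > >   Very roughly, that status update amounts to streaming the whole
> > > spool through a temporary file and renaming it back into place --
> > > not qpopper's actual code, and the names below are made up, but the
> > > I/O cost per POP session is the size of the mailbox either way:
> > > 
> > > #include <stdio.h>
> > > 
> > > /* rewrite an mbox spool to update message status -- O(mailbox size) */
> > > int rewrite_mbox(const char *spool, const char *tmp)
> > > {
> > >     FILE *in  = fopen(spool, "r");
> > >     FILE *out = fopen(tmp, "w");
> > >     char line[4096];
> > > 
> > >     if (!in || !out) {
> > >         if (in)  fclose(in);
> > >         if (out) fclose(out);
> > >         return -1;
> > >     }
> > >     while (fgets(line, sizeof(line), in)) {
> > >         /* ...fix up "Status:" headers, skip deleted messages... */
> > >         fputs(line, out);
> > >     }
> > >     fclose(in);
> > >     fclose(out);
> > >     return rename(tmp, spool);   /* the whole file gets copied either way */
> > > }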
> > > 
> > >   I have also seen some Radius daemons show a tendency to die under
> > > this type of heavy load.
> > > 
> > >   I haven't seen reports of this with maildir format.  However, what
> > > you're describing is consistent with I/O bandwidth saturation.
> > > 
> > >   If you are saturating your disk bandwidth, you'll see a large number
> > > of concurrent tasks waiting to run ("load" as shown by the uptime
> > > command or xload) but a high proportion of idle time shown by vmstat.
> > > At that point you'll need to try to figure out why all that bandwidth
> > > is still being consumed even with maildir format; I don't use that patch,
> > > so I can't help with troubleshooting it.
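> > > 
> > >   If you want to confirm that it's really processes blocked on I/O
> > > driving the load, one quick check (just a sketch, assuming the Linux
> > > /proc layout; not something qpopper ships) is to count tasks in
> > > uninterruptible "D" sleep -- those are what push the load average up
> > > while vmstat still shows the CPU mostly idle:
> > > 
> > > #include <dirent.h>
> > > #include <stdio.h>
> > > #include <string.h>
> > > 
> > > int main(void)
> > > {
> > >     DIR *proc = opendir("/proc");
> > >     struct dirent *de;
> > >     char path[300], buf[512];
> > >     int dstate = 0;
> > > 
> > >     if (!proc)
> > >         return 1;
> > >     while ((de = readdir(proc)) != NULL) {
> > >         FILE *fp;
> > > 
> > >         if (de->d_name[0] < '0' || de->d_name[0] > '9')
> > >             continue;                      /* not a PID directory */
> > >         snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
> > >         if ((fp = fopen(path, "r")) == NULL)
> > >             continue;                      /* process already gone */
> > >         if (fgets(buf, sizeof(buf), fp)) {
> > >             char *p = strrchr(buf, ')');   /* state field follows "(comm)" */
> > >             if (p && p[1] == ' ' && p[2] == 'D')
> > >                 dstate++;
> > >         }
> > >         fclose(fp);
> > >     }
> > >     closedir(proc);
> > >     printf("%d processes in uninterruptible (D) sleep\n", dstate);
> > >     return 0;
> > > }
> > > 
> > >   Run that during a peak; if the count roughly matches the "b" column
> > > in your vmstat output, the load is NFS/disk waits rather than CPU.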
> > >   -- Clifton
> 
> 

-- 
.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.
Anthony J. Biacco                            Network Administrator/Engineer
[EMAIL PROTECTED]              http://www.asteroid-b612.org

       "You find magic from your god, and I find magic everywhere" 
.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-.

