On Fri, Feb 03, 2006 at 11:53:01AM -0500, Nima wrote:

> What basically happens is that once I have more than 150 users
> logged in the response time for a page takes a minute or more which
> is very frustrating for users.  Today we received an email saying
> that the system is simply "shit". I don't know what to do next. All
> I know is that from semester to semester the load is getting higher
> but the frustration as well.

Nima, from your description this did NOT happen all at once recently;
it has been gradually getting worse as you add more users, right?

It sounds like you're fairly lost and that you have an urgent
performance problem affecting your users.  I don't know just what
dotLRN installation this is, but if you really have 150 "concurrent"
users (whatever that means precisely in this case), then it's probably
one of the big dotLRN installations at a large university somewhere
(ah, from the config file below, "uni-mannheim.de").

I hope that means you also have some sort of support contract with one
of the OpenACS / dotLRN gurus.  If it's not something you can
immediately (like, today) identify and fix, my advice is get them
involved ASAP.  It might be just a simple misconfiguration somewhere,
or it might be something deeper and trickier to fix.  Either way,
having all your users actively angry at you makes it plenty urgent
enough to call in the big guns...

> We have three linux boxes. One for an aolserver with database connection, 
> one for a static aolserver and one for the database.
> 
> The database box never goes above 5-10%. The static server is also not 

That's only CPU load.  You also want to check its I/O activity.
(Solaris top also shows "I/O wait" percentages but Linux unfortunately
does not.)  On older Linux boxes "iostat 5" was the way to do that.
Newer Linux systems may have different/better ways to do that.
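
As a rough illustration of checking I/O wait by hand, here is a minimal
Python sketch.  It assumes a 2.6-or-later kernel whose /proc/stat
aggregate "cpu" line carries an iowait counter as its fifth value; the
function names are made up for this example.  It compares two samples
and reports what share of CPU time went to waiting on I/O.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before_text, after_text):
    """Percentage of CPU time spent in iowait between two samples."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [a - b for a, b in zip(after, before)]
    if len(deltas) < 5:
        return 0.0  # this kernel does not report iowait (assumption)
    # Field order: user nice system idle iowait irq softirq steal ...
    # Only the first eight are summed; guest time, where present, is
    # already counted inside user.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, 'interval' seconds apart, and compare."""
    with open(path) as f:
        first = f.read()
    time.sleep(interval)
    with open(path) as f:
        second = f.read()
    return iowait_percent(first, second)
```

Calling sample_iowait() on the database box while users are complaining
would show whether it is waiting on disk even though its CPU looks
idle.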

> very busy but the dynamic server can go upt to 99% and a load of 10 and 
> more.

Well, that VERY strongly suggests that the bottleneck is simply
executing all that Tcl code in your AOLserver.  If so, add more
AOLserver boxes, and set up Pound or the like as a front-end server to
split the load between them.  And/or upgrade to a much faster server.

In addition, try to find out what pages are eating up most of the
processing time, and speed them up.  A lot of that processing may be
redundant and/or inefficient.  Some judicious caching and/or code
tuning could make a huge difference.
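
One way to find the expensive pages is to total up request times per
URL from the access log.  The Python sketch below is only an
illustration: it ASSUMES each log line contains the quoted request
and ends with the elapsed time in seconds, which may not match how
your logging is configured, and slowest_pages is a made-up name.

```python
import re
from collections import defaultdict

# Matches the quoted request ("GET /path?query HTTP/1.0") and a trailing
# elapsed-time field in seconds.  The trailing field is an ASSUMPTION
# about the log format; adjust the pattern to what your logs contain.
LINE_RE = re.compile(r'"(?:[A-Z]+) (\S+)[^"]*".*\s(\d+(?:\.\d+)?)\s*$')


def slowest_pages(lines, top=10):
    """Return (path, hits, total_seconds) tuples, most expensive first."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        # Drop the query string so variants of one page are grouped.
        path = match.group(1).split("?", 1)[0]
        entry = totals[path]
        entry[0] += 1
        entry[1] += float(match.group(2))
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [(path, hits, secs) for path, (hits, secs) in ranked[:top]]
```

Ranking by total seconds rather than by average favors pages that are
both slow and popular, which is usually where tuning pays off first.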

Oh, and a silly question:  What version of Tcl are you using, and did
you compile it with optimization?  You definitely want to be using the
latest Tcl 8.4.x version compiled with either "-g -O2" (my preference)
or "-O2".  I don't know how much slower Tcl is if you leave compiler
optimization turned off, but it's probably enough to be very noticeable
in your case.  (Make sure AOLserver was also compiled with optimization
of course; it re-uses the Tcl build flags.)

Finally, this is more of a research project, but your site is large
and busy enough to benefit from figuring out just what the current
status is of this patch:

  Cache compiled Tcl page bytecode
  
http://sourceforge.net/tracker/?func=detail&aid=689515&group_id=3152&atid=353152

> Currently:
> %MEM %CPU  SHR   PID USER      PR  NI  VIRT  RES S    TIME+  COMMAND
> 41.1  0.0 7448 27147 unima2    25   0 1770m 1.6g S   
> 0:36.66 /opt/aolserver4/bin/nsd -u unima2 -t /www/unima2/etc/config.tcl 
> with 44 users logged into the system.

That says that your AOLserver is using 1.7 GB total memory, almost all
of it resident.  Which is huge for most people, but probably quite
reasonable for you, since you have 4 GB in that box.  That at least
probably means that the box isn't thrashing between RAM and disk,
good.

> dotlrn (dynamic server)
> 
> AOLServer 4.0.10 (connected to the database)
> Pound 1.8.2 (as reverse proxy for ssl and load balancing)
> Apache 2.0.53 (only redirect from 80 to 443 where pound is)

Oh, you're already using Pound as the front-end.  So, shouldn't it be
easy to stick in additional AOLservers behind it for dynamic content?

The CATCH is, is your site and all its code, both stock and custom,
already set up to work nicely with multiple AOLservers?  Or does it
rashly ASSUME only 1 AOLserver process in some places, such that you
are going to see bugs or inconsistencies when using multiple
AOLservers?  I dunno.  For that, I definitely recommend talking to the
other folks running multiple AOLservers with OpenACS and dotLRN.
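
To make the single-process assumption concrete, here is a toy Python
model (not OpenACS code; the class and key names are invented for
illustration).  Two server processes share one database, but each
keeps its own in-memory cache.  A write through one server leaves the
other serving a stale value.

```python
class AppServer:
    """Toy stand-in for one server process with a private cache."""

    def __init__(self, database):
        self.database = database  # shared by all server processes
        self.cache = {}           # private to this process

    def read(self, key):
        # Serve from the local cache, filling it from the database
        # on the first request for this key.
        if key not in self.cache:
            self.cache[key] = self.database[key]
        return self.cache[key]

    def write(self, key, value):
        self.database[key] = value
        self.cache[key] = value   # only THIS process's cache is updated


database = {"course_title": "Intro to Statistics"}
server_a = AppServer(database)
server_b = AppServer(database)

# Both servers cache the original value.
server_a.read("course_title")
server_b.read("course_title")

# An update arrives at server A; server B is never told about it.
server_a.write("course_title", "Statistics I")
```

After the write, server_a.read("course_title") gives the new title
while server_b.read("course_title") still gives the old one.  Any code
that caches per process needs some way to flush the other servers'
caches before it is safe behind a load balancer.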

It sounds like you're running Pound on the same box as AOLserver.
You'll definitely need to change that in order to add dynamic content
servers.  (I don't understand why you're using Apache to redirect
client browsers from port 80 to 443 either; that seems odd.)

> SuSE 9.2
> Linux Linux version 2.6.8-24.18-smp (gcc version 3.3.4 (pre 3.3.5 
> 20040809)) #1 SMP Fri Aug 19 11:56:28 UTC 2005
> 4 CPU Intel(R) Xeon(TM) CPU 3.06GHz , L2 cache: 512K
> 4 GByte RAM - Memory: 4070968k/4111296k available (2339k kernel code, 
> 39528k reserved, 824k data, 252k init, 3193792k highmem)
> 2 Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet Cards

Hm, do you REALLY have 4 Xeon CPUs in that box, or is that the Intel
hyper-threading feature turned on?  I suspect you have 2 Xeon
single-core sockets, with hyper-threading turned on.  Back with Linux
2.4.x the rule of thumb was to always turn hyper-threading OFF, as Linux
didn't know how to use it properly and hyper-threading would slow
things down, not speed them up.  I don't know whether that has changed
with the newer 2.6.x kernels.
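
A quick way to answer the sockets-versus-hyper-threading question is
to compare the number of logical processors with the number of
distinct physical packages in /proc/cpuinfo.  The Python sketch below
assumes the kernel reports a "physical id" line for each logical
processor; if it does not, the socket count comes back as 0, meaning
"unknown".  The function name is made up for this example.

```python
def cpu_topology(cpuinfo_text):
    """Return (logical_cpus, physical_sockets) from /proc/cpuinfo text.

    physical_sockets is 0 when no 'physical id' lines are present.
    """
    logical = 0
    sockets = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip()
        value = value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            sockets.add(value)
    return logical, len(sockets)


def read_cpu_topology(path="/proc/cpuinfo"):
    """Convenience wrapper that reads the real file."""
    with open(path) as f:
        return cpu_topology(f.read())
```

With single-core chips, a result of 4 logical processors on 2 sockets
would mean hyper-threading is on; 4 on 4 would mean four real CPUs.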

I was going to suggest that if your single AOLserver box is a few
years old, then immediately replacing it with a (much faster) brand
spanking new one may be the easiest and most cost effective way to
alleviate the problem.  Then you can take more time to get multiple
boxes set up, make sure that your code works correctly in that
configuration, etc.  However, from the specs above you're already
using a fairly high end machine, so that might not make sense.
Keeping your existing box and adding more is probably the way to go.

For comparison though, for about $5900 US, right now you could order a
PowerEdge 1850 1U box from Dell with 2 sockets each with a dual-core
2.8 GHz Xeon (2x2 MB L2 cache), 8 GB of RAM (expandable to 16 GB),
RAID-1 w/ 2 15k rpm SCSI drives.  That should give you roughly 2x the
performance of your current box.  Or the same Dell box with 2
single-core 3.8 GHz Xeons for about $5600.

I wouldn't necessarily pick either that machine or Dell, but that's a
useful price point for comparison.  The Dell box is mildly gold-plated
anyway.  Since this is effectively a compute box, running only
AOLserver with no RDBMS or anything else disk intensive, I would
probably go with hardware RAID-1 but with SATA or even plain old IDE
drives; there's no need to pay extra for SCSI.  If I had many of these
identical boxes and was set up to easily automatically install them, I
might even skip the RAID card.

-- 
Andrew Piskorski <[EMAIL PROTECTED]>
http://www.piskorski.com/


