We noticed something pretty bothersome in our environment when we
went from SLES 11 SP2 to SLES 11 SP3.  The penalty for using more
virtual memory on a machine than you have "real" memory allocated
to it has gone up dramatically.  Have any of you seen anything
like this?

This has been seen with all three SP3 kernels: 3.0.82-0.7.9,
3.0.93-0.8.2, and 3.0.101-0.8.1.  The kswapd0 kernel thread seems to
start frantically going through the virtual memory space looking for
something it can free or swap out; the CPU use is very high and the
machine is close to non-responsive.
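
For anyone who wants to check whether kswapd is doing the same thing on their system, here is a minimal sketch (Python, not something we ran for this report) that samples /proc/vmstat twice and prints how much the kswapd-related counters grew in between.  The helper names are made up for illustration, and the exact counter names differ between kernel versions, so the code only filters on the substring "kswapd" rather than assuming specific names.

```python
import os
import time


def parse_vmstat(text):
    """Parse 'name value' lines (the /proc/vmstat layout) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def kswapd_delta(before, after):
    """Return the growth of every counter whose name mentions kswapd."""
    return {name: after[name] - before.get(name, 0)
            for name in after if "kswapd" in name}


def sample(path="/proc/vmstat"):
    """Read and parse one snapshot of the counters file."""
    with open(path) as f:
        return parse_vmstat(f.read())


# Demonstration: two snapshots one second apart (skipped where the file is absent).
if os.path.exists("/proc/vmstat"):
    first = sample()
    time.sleep(1)
    second = sample()
    print(kswapd_delta(first, second))
```

Running this in a loop while the slow job is active should show whether the scanning counters climb quickly while little else seems to happen.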

A test case we came up with was a simple Perl script that allocated 3.5 GB of
memory as one big array and stepped through it.  That took two hours 
on a server with 1 GB of memory total but a minute on a server with 3 GB free.
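
In case it helps someone reproduce this, below is a rough Python sketch of the same idea.  It is not our Perl script: the function name, the 4096-byte page size, and the small demo size are assumptions for illustration.  It allocates one big buffer and writes one byte per page so that every page actually has to be resident at some point.

```python
import time

PAGE = 4096  # assumed page size in bytes


def touch_memory(total_bytes, passes=1, page=PAGE):
    """Allocate one buffer, write one byte per page, return (seconds, pages touched)."""
    buf = bytearray(total_bytes)  # one contiguous zero-filled allocation
    touched = 0
    start = time.perf_counter()
    for p in range(passes):
        for off in range(0, total_bytes, page):
            buf[off] = (p + 1) & 0xFF  # the write forces the page to be backed
            touched += 1
    return time.perf_counter() - start, touched


# Small demonstration run (16 MB) so the sketch is safe to try anywhere.
elapsed, pages = touch_memory(16 * 1024 * 1024)
print("touched %d pages in %.3f s" % (pages, elapsed))
```

To approximate the test described above you would call something like touch_memory(3584 * 1024 * 1024) on a guest with less real memory than that, and compare the timing across kernels.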

Our first engineer said that this is normal behavior because you shouldn't dip
deeply into swap and expect the system to perform decently.  That argument works
in the Intel world because swap goes to disk and disks are very slow.
It's not nearly as true on mainframes because swap (at least our swap)
goes to extended storage and that's still memory.  Since then SUSE has
come around to our view that we shouldn't be seeing this.

Besides, we got away with it in SP2.  Something changed with SP3.  

So:  questions. 
  * Has anyone else seen this?
  * Does anyone know what changed in the kernel between SP2 and SP3?
    SP2 kernels of similar release number to SP3 don't show this.
  * Can we tune something to alleviate this?
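
On the tuning question, one thing that can at least be compared between an SP2 and an SP3 system is the set of VM sysctls under /proc/sys/vm.  The sketch below (Python; the helper names are made up, and the three settings listed are just common examples) reads a few of them and reports differences between two snapshots.  I am not claiming that any of these settings is the cause or the cure.

```python
import os

VM_DIR = "/proc/sys/vm"
# A few commonly inspected settings; extend the tuple as needed.
SETTINGS = ("swappiness", "min_free_kbytes", "vfs_cache_pressure")


def read_vm_settings(names=SETTINGS, vm_dir=VM_DIR):
    """Return {name: value-as-string}, with None for settings that cannot be read."""
    values = {}
    for name in names:
        try:
            with open(os.path.join(vm_dir, name)) as f:
                values[name] = f.read().strip()
        except OSError:
            values[name] = None
    return values


def diff_settings(a, b):
    """Return {name: (value_in_a, value_in_b)} for settings that differ."""
    return {k: (a.get(k), b.get(k))
            for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


# Print the current values on this machine.
print(read_vm_settings())
```

Saving the output on one system under each kernel and feeding both dicts to diff_settings would show whether any defaults changed between the service packs.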

For those who are interested, more detail follows:

The real-world applications that trigger this are Java applications
that use a lot of memory.  Some are WebSphere and some are
home-grown.  I'm convinced that it's the amount of memory used, not
the details of the application, that's the problem.

Once kswapd finishes looking around it really doesn't take that long
to go through the array.  Once the system gives up cleaning the
cupboards and actually starts going through them it's not too bad;
it's slower than with adequate memory but by a factor closer to 4
than to 60.  The CPU use is also what convinces me it's a kernel
problem instead of understandably poor hardware performance.

We could get our test systems to go back to the old behavior by
downgrading the kernel to SP2, even if the SP2 kernel had a higher
version than the SP3 one.  For example, we can run the 3.0.93-0.5.1
kernel from SP2 successfully on an otherwise-SP3 system; the 3.0.93-0.8.2
kernel from SP3 has the problem on an otherwise-SP2 system.

We asked about this here earlier; that thread starts at
http://www.mail-archive.com/linux-390@vm.marist.edu/msg64647.html

And if you got this far:  thank you!

Ted Rodriguez-Bell
Mainframe and Midrange Services, Wells Fargo
te...@wellsfargo.com


