I have two servers hosting vservers.  The first is a 1 GHz Duron with 1 GB memory and 
an IDE disk, ext3.  It's on 60+ days of uptime, and the last time it was down was to 
upgrade the memory.  It runs 9 vservers and some stuff in the root server also, 
without a complaint.  Many of the vserver clients are running mostly idle AOLserver 
instances, so I have about 500MB of swap in use (2GB swap available) pretty regularly. 
 Loads are reasonable (about .7 during the day, often less than .2 overnight), the 
server is peppy, and everyone is happy.  This is a Redhat 7.2 server, with the 
pre-built kernel (2.4.18-ctx12).  That kernel isn't set up for highmem, so actually 
I'm only using about 900MB of my 1GB.

Enter server #2.  Server #2 is a P4 1.6GHz, with 1GB memory and RAID1, ext3.  I 
wanted a highmem kernel, so I compiled this one.  This is a Redhat 7.3 server, with 
2.4.19ctx-13, patched and compiled by yours truly.  It has had 4-6 vservers running on 
it, loads in the .1-.5 range, and little if any swap in use.
(Currently:
Mem:  1033596K av, 1019324K used,   14272K free, 0K shrd, 272916K buff
Swap: 2048276K av, 796K used, 2047480K free 194720K cached)
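
To put those numbers in perspective, here is a rough sketch of the arithmetic. It assumes that top's "used" figure includes buffers and page cache (memory the kernel can reclaim under pressure); the variable names are just mine for illustration.

```python
# Figures copied from the top output above, in KB.
mem_total = 1033596
mem_used = 1019324
buffers = 272916
cached = 194720

# Assumption: "used" includes buffers and page cache, which the kernel
# can normally reclaim when processes need more memory.
app_used = mem_used - buffers - cached
reclaimable = buffers + cached

print(f"memory held by processes: {app_used} KB ({app_used / mem_total:.0%} of RAM)")
print(f"buffers + cache:          {reclaimable} KB")
```

If that assumption holds, only a little over half of RAM is really committed to processes, so the tiny "free" figure by itself probably doesn't mean the box is short of memory.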

This server is very responsive for a while after a reboot.  Days to maybe a week.  
Then it will appear to hang.  It doesn't respond to SSH or http requests (to either 
root server or vserver), although it doesn't actually drop the packets.  It remains 
pingable.  It doesn't run cron jobs.  At the point where the problem starts, all 
logging stops, but there's no indication of a problem on the horizon prior to the 
cessation of logging.  The server still responds at a console.  Two times I've had the 
data center tech run sar -u on it before rebooting.  Once showed complete cpu usage, 
once showed the cpu almost entirely idle.  The vps run by the data center tech also 
doesn't show anything unusual, although in both cases the server had been unresponsive 
for a while before the sar and vps commands were run.

Further weirdness: when the server is told to shut down at the console, it becomes 
ssh-able again for a few moments during the shutdown process.  This suggests to me 
that there's some process running that causes the server to be unresponsive, and when 
it's killed during the server shutdown, things revert to normal again.  (Of course, 
then the server reboots.)  I *really* wish this server wasn't in a data center 
half-way across the country!

The datacenter swapped out the network card, motherboard, and memory last week, but 
the server has hung again since.  

I'm stumped.  I think the next course of action is to try running the precompiled 
kernel on this server, but that'll lose me the highmem features.

I realize that there are probably waaaay too many variables different between these 
two servers for the source of the problem to be obvious, but I wonder if anyone has seen 
anything similar and might suggest a course of action.  Do these symptoms sound at all 
familiar?  Trying to solve this sort of problem by experiment is wretched with a 
server in a datacenter and a problem that isn't reliably reproducible!

Thanks in advance for any ideas, suggestions for further investigation, or 
encouragement!

Cathy Sarisky 



