Thank you, Herbert, for the suggestions.

---------- Original Message ----------------------------------
From: Herbert Poetzl <[EMAIL PROTECTED]>
Reply-to: [EMAIL PROTECTED]
Date: Thu, 7 Nov 2002 06:56:10 +0100

>On Wed, Nov 06, 2002 at 04:15:51PM -0800, Cathy Sarisky wrote:
>> I have two servers hosting vservers.  The first is a 1 GHz Duron with 1 GB memory
>> and an IDE disk, ext3.  It's on 60+ days of uptime, and the last time it was down was
>> to upgrade the memory.  It runs 9 vservers and some stuff in the root server also,
>> without a complaint.  Many of the vserver clients are running mostly idle AOLserver
>> instances, so I have about 500MB of swap in use (2GB swap available) pretty
>> regularly.  Loads are reasonable (about .7 during the day, often less than .2
>> overnight), the server is peppy, and everyone is happy.  This is a Redhat 7.2 server,
>> with the pre-built kernel (2.4.18-ctx12).  That kernel isn't set up for highmem, so
>> actually I'm only using about 900MB of my 1GB.
>> 
>> Enter server #2.  Server #2 is a P4 1.6GHz, with 1GB memory and RAID1, ext3.  I
>> wanted a highmem kernel, so I compiled this one.  This is a Redhat 7.3 server, with
>> 2.4.19ctx-13, patched and compiled by yours truly.  It has had 4-6 vservers running
>> on it, loads in the .1-.5 range, and little if any swap in use.
>> (Currently:
>> Mem:  1033596K av, 1019324K used,   14272K free, 0K shrd, 272916K buff
>> Swap: 2048276K av, 796K used, 2047480K free 194720K cached)
>> 
>> This server is very responsive for a while after a reboot.  Days to maybe a week.
>> Then it will appear to hang.  It doesn't respond to SSH or http requests (to either
>> root server or vserver), although it doesn't actually drop the packets.  It remains
>> pingable.  It doesn't run cron jobs.  At the point where the problem starts, all
>> logging stops, but there's no indication of a problem on the horizon prior to the
>> cessation of logging.  The server still responds at a console.  Two times I've had
>> the data center tech run sar -u on it before rebooting.  Once showed complete cpu
>> usage, once showed the cpu almost entirely idle.  The vps run by the data center tech
>> also doesn't show anything unusual, although in both cases the server had been
>> unresponsive for a while before the sar and vps commands were run.
>> 
>> Further weirdness: when the server is told to shut down at the console, it becomes
>> ssh-able again for a few moments during the shutdown process.  This suggests to me
>> that there's some process running that causes the server to be unresponsive, and when
>> it's killed during the server shutdown, things revert to normal again.  (Of course,
>> then the server reboots.)  I *really* wish this server wasn't in a data center
>> half-way across the country!
>
>let us assume there is such a process running ...
>- where should such a process come from?
>  - cron jobs? no you don't run cron jobs on the server!
>  - left over process from virtual server XY?
>- what would a single process do to stop the logging
>  and block remote logins (by accident)?
>  - temporarily replace the filesystem/network/etc?
>  - capture all tcp/udp packets?
>
>so I do not believe that a process could cause this kind
>of starvation, more likely some device i/o or system
>(read kernel) resource exhaustion will be the cause.
>
>I would check for the following:
>
>- file/inode maximum setting
>- virtual memory limitations
>- maximum process time (maybe log/ssh/etc gets killed?)
>
>- how much time is spent in system state
>- how many processes are there, and how long is
>  the oldest process running?
>
>- what is the last entry in the log?
>- what happened at this moment on other log files?
>
>- what about power management? ACPI/APM/SpeedStep
>- what about I/O errors on the harddisk?
>
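
A few of the checks in the list above (file handle limits, time spent in
system state, number of processes) can be read straight from /proc. The
following is only a minimal sketch, assuming a Linux /proc filesystem and
Python 3; the function names are made up for illustration. The system share
is cumulative since boot and approximate, and the meaning of the middle
file-nr field (unused handles) is taken from current kernel documentation.

```python
"""Snapshot a few kernel resource counters from /proc (Linux only)."""
import os


def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr: allocated, unused, and maximum handles."""
    allocated, unused, maximum = (int(x) for x in text.split()[:3])
    return {"allocated": allocated, "unused": unused, "max": maximum}


def system_share(stat_text):
    """Return the fraction of CPU ticks spent in system state since boot."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Aggregate line; field order is user, nice, system, idle, ...
            ticks = [int(x) for x in fields[1:]]
            total = sum(ticks)
            return ticks[2] / total if total else 0.0
    raise ValueError("no aggregate 'cpu' line found")


def count_processes(proc_dir="/proc"):
    """Count the numeric entries in /proc, one per process."""
    return sum(1 for name in os.listdir(proc_dir) if name.isdigit())


def read(path):
    """Return the full contents of a small text file."""
    with open(path) as handle:
        return handle.read()


def snapshot():
    """Collect all three readings from the live system."""
    return {
        "file_nr": parse_file_nr(read("/proc/sys/fs/file-nr")),
        "system_share": system_share(read("/proc/stat")),
        "processes": count_processes(),
    }
```

Calling snapshot() at intervals and comparing "allocated" against "max" would
show whether the file handle table fills up in the days before a hang.
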
>I also would not draw many conclusions from the ping-ability
>of the server, because the icmp echo reply is at such
>a low (kernel stack) level, that often even when the kernel
>is completely unresponsive, icmp echo replies come back.
>Second, do the replies come from your machine at all?
>Many firewalls nowadays send an icmp echo reply back, 
>without even asking the addressed machine ...
>
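
Since an ICMP echo reply says little about user space, a probe at the TCP
level can be more telling. The sketch below is an illustration only (the
function name and the result labels are invented here). It tries to connect
and then waits for the first bytes from the service. An SSH server normally
sends its version banner right after the connection is set up, so "silent"
on port 22 would mean the handshake completed but sshd never answered. An
HTTP server waits for a request, so "silent" is the normal result there.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Classify a TCP service.

    Returns one of 'refused', 'timeout', 'unreachable', 'silent',
    'closed', or 'banner'.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"        # host answered, nothing listens on the port
    except socket.timeout:
        return "timeout"        # no answer to the connection attempt
    except OSError:
        return "unreachable"    # routing, DNS, or other network error
    with conn:
        conn.settimeout(timeout)
        try:
            data = conn.recv(256)
        except socket.timeout:
            return "silent"     # connected, but the service sent nothing
        except OSError:
            return "closed"     # connection reset while waiting
        return "banner" if data else "closed"
```

For example, probe_tcp("server2.example.com", 22) run from a remote host
every minute (the host name is a placeholder) would record when the state
changes from "banner" to "silent" or "timeout".
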
>> The datacenter swapped out the network card, motherboard, and memory last week but
>> I've seen another server hang since.
>> 
>> I'm stumped.  I think the next course of action is to try running the precompiled
>> kernel on this server, but that'll lose me the highmem features.
>
>I would suggest, you install some tools, which
>monitor the system state/resource usage and send
>this data immediately to another host, and/or
>store it on a dedicated partition/disk ...
>
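
One way to get state off the machine before local logging stops is to push
a short status line to another host over UDP. This is a minimal sketch, not
a finished tool: the collector address 192.0.2.10:5140 is a placeholder,
the function names are invented, and the choice of fields (load averages,
MemFree, SwapFree) is only an assumption about what might be useful.

```python
import socket
import time

COLLECTOR = ("192.0.2.10", 5140)  # hypothetical log host and UDP port


def format_sample(now, loadavg_text, meminfo_text):
    """Fold load averages plus MemFree/SwapFree into one line of text."""
    load = " ".join(loadavg_text.split()[:3])
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in ("MemFree", "SwapFree") and rest.split():
            fields[key] = rest.split()[0]
    return "%d load=%s memfree_kb=%s swapfree_kb=%s" % (
        now, load, fields.get("MemFree", "?"), fields.get("SwapFree", "?"))


def read(path):
    """Return the full contents of a small text file."""
    with open(path) as handle:
        return handle.read()


def run(collector=COLLECTOR, interval=10.0, count=None):
    """Send one sample per interval; count=None means run until killed."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent = 0
    while count is None or sent < count:
        line = format_sample(time.time(), read("/proc/loadavg"),
                             read("/proc/meminfo"))
        # UDP is fire-and-forget: a dead collector never blocks the sender.
        sock.sendto(line.encode("ascii"), collector)
        sent += 1
        time.sleep(interval)
```

Starting run() at boot on the affected server and listening on the chosen
UDP port on the other host should leave the last samples before a hang on a
machine that is still reachable. Whether the sender itself keeps running
during the hang is of course part of what this would reveal.
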
>> I realize that there are probably waaaay too many variables different between these
>> two servers for the source of the problem to be obvious, but I wonder if anyone has seen
>> anything similar and might suggest a course of action.  Do these symptoms sound at
>> all familiar?  Trying to solve this sort of problem by experiment is wretched with a
>> server in a datacenter and a problem that isn't reliably reproducible!
>> 
>> Thanks in advance for any ideas, suggestions for further investigation, or
>> encouragement!
>
>try to narrow down the possibilities by proving
>that some or all of my suggestions/assumptions are
>wrong ...
>
>best,
>Herbert
>
>
>> 
>> Cathy Sarisky 
>> 
>> ________________________________________________________________
>> Sent via the WebMail system at webmail.pioneernet.net
>> 
>>                    
>
 






 
                   
