Thank you, Herbert, for the suggestions.
---------- Original Message ----------------------------------
From: Herbert Poetzl <[EMAIL PROTECTED]>
Reply-to: [EMAIL PROTECTED]
Date: Thu, 7 Nov 2002 06:56:10 +0100
>On Wed, Nov 06, 2002 at 04:15:51PM -0800, Cathy Sarisky wrote:
>> I have two servers hosting vservers. The first is a 1 GHz Duron with 1 GB memory
>> and an IDE disk, ext3. It's on 60+ days of uptime, and the last time it was down was
>> to upgrade the memory. It runs 9 vservers and some stuff in the root server also,
>> without a complaint. Many of the vserver clients are running mostly idle AOLserver
>> instances, so I have about 500MB of swap in use (2GB swap available) pretty
>> regularly. Loads are reasonable (about .7 during the day, often less than .2
>> overnight), the server is peppy, and everyone is happy. This is a Redhat 7.2 server,
>> with the pre-built kernel (2.4.18-ctx12). That kernel isn't set up for highmem, so
>> actually I'm only using about 900MB of my 1GB.
>>
>> Enter server #2. Server #2 is a P4 1.6GHz, with 1GB memory and RAID1, ext3. I
>> wanted a highmem kernel, so I compiled this one. This is a Redhat 7.3 server, with
>> 2.4.19ctx-13, patched and compiled by yours truly. It has had 4-6 vservers running
>> on it, loads in the .1-.5 range, and little if any swap in use.
>> (Currently:
>> Mem: 1033596K av, 1019324K used, 14272K free, 0K shrd, 272916K buff
>> Swap: 2048276K av, 796K used, 2047480K free 194720K cached)
>>
>> This server is very responsive for a while after a reboot. Days to maybe a week.
>> Then it will appear to hang. It doesn't respond to SSH or http requests (to either
>> root server or vserver), although it doesn't actually drop the packets. It remains
>> pingable. It doesn't run cron jobs. At the point where the problem starts, all
>> logging stops, but there's no indication of a problem on the horizon prior to the
>> cessation of logging. The server still responds at a console. Two times I've had
>> the data center tech run sar -u on it before rebooting. Once showed complete cpu
>> usage, once showed the cpu almost entirely idle. The vps run by the data center tech
>> also doesn't show anything unusual, although in both cases the server had been
>> unresponsive for a while before the sar and vps commands were run.
>>
>> Further weirdness: when the server is told to shut down at the console, it becomes
>> ssh-able again for a few moments during the shutdown process. This suggests to me
>> that there's some process running that causes the server to be unresponsive, and when
>> it's killed during the server shutdown, things revert to normal again. (Of course,
>> then the server reboots.) I *really* wish this server weren't in a data center
>> half-way across the country!
>
>let us assume there is such a process running ...
>- where should such a process come from?
> - cron jobs? no you don't run cron jobs on the server!
> - left over process from virtual server XY?
>- what would a single process do to stop the logging
> and block remote logins (by accident)?
> - temporarily replace the filesystem/network/etc?
> - capture all tcp/udp packets?
>
>so I do not believe that a process could cause this kind
>of starvation, more likely some device i/o or system
>(read kernel) resource exhaustion will be the cause.
>
>I would check for the following:
>
>- file/inode maximum setting
>- virtual memory limitations
>- maximum process time (maybe log/ssh/etc gets killed?)
>
>- how much time is spent in system state
>- how many processes are there, and how long is
> the oldest process running?
>
>- what is the last entry in the log?
>- what happened at this moment on other log files?
>
>- what about power management? ACPI/APM/SpeedStep
>- what about I/O errors on the harddisk?
>
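A few of the items on this checklist (file handle limits, process count, time spent in system state) can be read straight from /proc. The following is only a rough sketch of that idea, not something Herbert posted: the function names are made up for illustration, and it assumes the usual Linux layout of /proc/sys/fs/file-nr (three numbers) and of the aggregate "cpu" line in /proc/stat (user, nice, system, idle, possibly followed by newer fields).

```python
import os


def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr: allocated, unused and maximum file handles."""
    allocated, unused, maximum = (int(x) for x in text.split()[:3])
    return {"allocated": allocated, "unused": unused, "max": maximum}


def system_time_fraction(stat_text):
    """Approximate share of CPU time spent in system state since boot."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            ticks = [int(x) for x in fields[1:]]
            # Field order is user, nice, system, idle, then any newer extras.
            return ticks[2] / sum(ticks)
    raise ValueError("no aggregate cpu line found")


def count_processes(entries):
    """Count numeric directory names, which are the PIDs listed under /proc."""
    return sum(1 for name in entries if name.isdigit())


def snapshot(proc="/proc"):
    """Collect the three values from a live /proc (hypothetical helper)."""
    with open(os.path.join(proc, "sys/fs/file-nr")) as f:
        files = parse_file_nr(f.read())
    with open(os.path.join(proc, "stat")) as f:
        system = system_time_fraction(f.read())
    return {
        "files": files,
        "system_fraction": system,
        "processes": count_processes(os.listdir(proc)),
    }
```

Calling snapshot() at regular intervals and comparing "allocated" against "max" would show whether the file table fills up in the days before a hang. The system-state figure is an average since boot, so two samples would have to be subtracted to get a recent value.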
>I also would not draw many conclusions from the ping-ability
>of the server, because the icmp echo reply is at such
>a low (kernel stack) level, that often even when the kernel
>is completely unresponsive, icmp echo replies come back.
>Second, do the replies come from your machine at all?
>Many firewalls nowadays send an icmp echo reply back,
>without even asking the addressed machine ...
>
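Since a ping reply says little about whether userspace is still alive, a probe from a second machine could look at the service level instead. The sketch below is an assumption-laden illustration, not a tested tool: it relies on the fact that an SSH server normally sends its identification banner as soon as a connection is established, and the function name, the state labels and the timeout are all invented for this example.

```python
import socket


def probe(host, port=22, timeout=5.0):
    """Classify a TCP service as 'unreachable', 'silent' or 'alive'.

    'silent' means the kernel completed the TCP handshake but no banner
    arrived within the timeout, which would fit a host whose network
    stack still answers while its daemons are stuck.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out during connect, or no route to the host.
        return ("unreachable", b"")
    with conn:
        try:
            banner = conn.recv(256)
        except socket.timeout:
            return ("silent", b"")
        except OSError:
            return ("unreachable", b"")
    if not banner:
        # The peer closed the connection without sending anything.
        return ("unreachable", b"")
    return ("alive", banner)
```

Run from another host every minute or so, a change from 'alive' to 'silent' would give a timestamp for the start of the hang that does not depend on the hung server's own logs.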
>> The datacenter swapped out the network card, motherboard, and memory last week, but
>> I've seen another server hang since.
>>
>> I'm stumped. I think the next course of action is to try running the precompiled
>> kernel on this server, but that'll lose me the highmem features.
>
>I would suggest that you install some tools which
>monitor the system state/resource usage and send
>this data immediately to another host, and/or
>store it on a dedicated partition/disk ...
>
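One way to read this suggestion is a small loop that samples a couple of /proc files and pushes each sample to a second machine over UDP, so the last samples before a hang survive even if local logging stops. This is a minimal sketch under stated assumptions: the collector address 192.0.2.10:5140 is a placeholder from the documentation address range, the JSON layout and function names are invented here, and something on the other host is assumed to be listening and writing the datagrams to disk.

```python
import json
import socket
import time


def read_first_line(path):
    """Return the first line of a file, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.readline().strip()
    except OSError:
        return None


def build_report(hostname, now, loadavg, file_nr):
    """Serialize one sample as a compact JSON datagram payload."""
    sample = {"host": hostname, "time": now, "loadavg": loadavg, "file_nr": file_nr}
    return json.dumps(sample, sort_keys=True).encode("utf-8")


def send_report(payload, collector):
    """Send one datagram; UDP is fire-and-forget, so nothing blocks here."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, collector)
    finally:
        sock.close()


def run(collector=("192.0.2.10", 5140), interval=10):
    """Sample and send forever; the collector address is a placeholder."""
    hostname = socket.gethostname()
    while True:
        payload = build_report(
            hostname,
            time.time(),
            read_first_line("/proc/loadavg"),
            read_first_line("/proc/sys/fs/file-nr"),
        )
        send_report(payload, collector)
        time.sleep(interval)
```

UDP is used here only because a sender stuck on a dead TCP connection would defeat the purpose. The cost is that individual samples can be lost without notice.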
>> I realize that there are probably waaaay too many variables different between these
>> two servers for the source of the problem to be obvious, but I wonder if anyone has seen
>> anything similar and might suggest a course of action. Do these symptoms sound at
>> all familiar? Trying to solve this sort of problem by experiment is wretched with a
>> server in a datacenter and a problem that isn't reliably reproducible!
>>
>> Thanks in advance for any ideas, suggestions for further investigation, or
>> encouragement!
>
>try to narrow down the possibilities by proving
>that some or all of my suggestions/assumptions are
>wrong ...
>
>best,
>Herbert
>
>
>>
>> Cathy Sarisky
>>
>> ________________________________________________________________
>> Sent via the WebMail system at webmail.pioneernet.net
>>
>>
>