Yesterday there turned out to be two unresponsive servers at work.  I wasn't on 
call, so I didn't immediately know about the first one, but Nagios had 
complained about the second one.

So I connected to work and tried to ssh into our loghost.  No go: not enough 
disk space left to do that.  No worries, it's a zone and its ZFS dataset has a 
quota, so I just removed the quota temporarily so that I could get on and clean 
up.  That is why we keep some of the zpool back from all the zones on the 
system; otherwise I would've had to steal space from somewhere.  It's also why 
some of us still prefer /var to be separate from root....though that's not the 
way zones are being done.

I then found out that loghost had run out of space because another server was 
spewing messages that /tmp was full and that there was no more swap.  Well, 
that happens on Solaris, where /tmp is a tmpfs backed by swap.  I tried to ssh 
to it; of course that failed.  So I tracked down how to get to its console, 
logged in, cleaned up /tmp and killed off the backlog of cron jobs that were 
filling up /tmp.

Later the on-call person called me; he was on his way in to reboot the server 
when he saw that I had fixed the problem.  He wanted to know how I had gotten 
on.

I suppose I could have given the simple answer, that he should've tried console 
when ssh didn't work, and left it at that.  But he asked the all-important 
question that so rarely comes up: "Why does that work?"

So, the longer explanation is that sshd handles each incoming connection by 
forking itself first, which is hard when there's practically no free memory on 
the system.

getty, OTOH, handles the user authentication and then execs the login shell 
(exec replaces the memory the process is already using with the new program, 
rather than asking for more).  And, luckily, there was still enough available 
for that to work.  It's also necessary to use root instead of our individual 
admin accounts, because root's shell is /bin/sh, which is only about 20% bigger 
than getty, while shells like bash/tcsh/zsh are more than 5 times bigger than 
/bin/sh.  Most of us use bash for our admin accounts, one person uses tcsh 
(though he never complained that his account had stopped working because tcsh 
wasn't being made available anymore...he'd just use root directly), and another 
has been playing around with zsh.
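
To illustrate the fork-versus-exec difference, here is a minimal Python sketch 
(not how sshd or getty are actually written, and the function name is made up 
for this example).  It forks a child, has the child exec /bin/sh, and has that 
shell print its own PID.  The PID the shell reports is the child's PID: exec 
reuses the existing process, while fork is the step that has to create a second 
one.

```python
import os

def fork_then_exec():
    """Fork a child, exec /bin/sh in it, and return (parent, child, reported) PIDs."""
    read_fd, write_fd = os.pipe()
    child_pid = os.fork()  # this is the step that needs room for a second process
    if child_pid == 0:
        # Child: send stdout down the pipe, then replace this process image.
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)
            # On success execv never returns; the shell now runs under our PID.
            os.execv("/bin/sh", ["sh", "-c", "echo $$"])
        finally:
            os._exit(127)  # only reached if the setup or the exec failed
    # Parent: read the PID that the exec'ed shell printed.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        reported_pid = int(pipe.read().strip())
    os.waitpid(child_pid, 0)
    return os.getpid(), child_pid, reported_pid

if __name__ == "__main__":
    parent, child, reported = fork_then_exec()
    print("parent:", parent, "forked child:", child, "shell says:", reported)
```

The sshd pattern corresponds to the os.fork() call, the getty pattern to the 
os.execv() call on its own.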

Though I'm not sure how this translates to a system with ttymon and more than 
one tty port....

Once on, it was then a challenge figuring out where the problem files were and 
how to deal with them...by first going after older files, since removing a file 
that is still open won't free its space :)
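
The "older files first" idea can be sketched in Python as below.  This is only 
an illustration: oldest_files is a hypothetical helper, not what I ran, and on a 
box that short of memory an interpreter may well not start at all.  It walks a 
directory and returns the files with the oldest modification times, which are 
the ones least likely to still be held open by a running job.

```python
import os

def oldest_files(root, limit=20):
    """Return up to `limit` (mtime, size, path) tuples under root, oldest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat so symlinks are not followed
            except OSError:
                continue  # file vanished or is unreadable; skip it
            found.append((st.st_mtime, st.st_size, path))
    found.sort()  # sorts by mtime first, so the oldest files come first
    return found[:limit]

if __name__ == "__main__":
    for mtime, size, path in oldest_files("/tmp"):
        print(int(mtime), size, path)
```

It only lists candidates; deciding what is safe to remove is still up to the 
admin.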

I recall that there are other tricks in this area, but ls worked often enough, 
if only intermittently, to get things working again.

Good thing...I'd hate to end the uptime streak that this server has....it had 
been up 2525 days.  (That means it was somewhere that didn't lose power the 
times we had transfer switch problems and a significant number of UPSs had run 
down before something was done....in one case we stayed on generator power for 
7 days....)  2525 days would mean it's been up since the PDU problems that were 
going on when I first started (it has been 2540 days since I started).  The PDU 
problem was that it wasn't correctly configured for use with a generator.  We 
could go on to generator, but it would shut off when switching back to utility 
power.  Eventually somebody read the documentation and saw that a switch needs 
to be changed for use with a generator.....

Wonder if we have any other servers with that kind of uptime?  I know there has 
been a request that we provide continuous reporting of system uptimes.  Though 
I suspect it's to make sure that we don't have high uptimes on any of our 
systems.....

-- 
Who: Lawrence K. Chen, P.Eng. - W0LKC - Senior Unix Systems Administrator
For: Enterprise Server Technologies (EST) -- & SafeZone Ally
Snail: Computing and Telecommunications Services (CTS)
Kansas State University, 109 East Stadium, Manhattan, KS 66506-3102
Phone: (785) 532-4916 - Fax: (785) 532-3515 - Email: [email protected]
Web: http://www-personal.ksu.edu/~lkchen - Where: 11 Hale Library
