On 1/18/2013 7:53 PM, dormitionsk...@hotmail.com wrote:
On Jan 17, 2013, at 8:47 PM, Reginald Beardsley wrote:
As far as I'm concerned, problems like this are a bottomless abyss. Which is
why I'm still putting up w/ my OI box hanging. It's annoying, but not
critical. It's also why critical stuff still runs on Solaris 10.
Intermittent failures are the worst time sink there is. There is no assurance
that devoting all your time to the problem will fix it even at very high skill
levels w/ a full complement of the very best tools.
If you're getting crash dumps there is hope of finding the cause, so that's a
big improvement.
Good luck,
Reg
BTW Back in the 80's there was a VAX operator in Texas who went out to his
truck, got a .357 and shot the computer. His employer was not happy. But I
can certainly understand how the operator felt.
From 1992 to 1998, I worked at the Denver Museum of Natural
History -- now the Denver Museum of Nature and Science. We had two or three
DEC VAXes and an AIX machine there. It was their policy that once a week we
had to power each of the servers all the way down to clear out any memory
problems -- or whatever -- as preventive maintenance.
Since then, I've always had the habit of setting up a cron job to reboot my
servers once a week. It's not as good as a full power down, but it's better
than nothing. And in all these years, I've never had to deal with intermittent
problems like this, except for a few brief times when I used Red Hat Linux ten
plus years ago. (I've tried most of Red Hat's versions since 6.2, and RHEL 6
is the first version I've found that runs decent enough on our hardware, and
that I'm happy enough with, for us to use.)
So, if you can do it, you might want to try setting up a cron job to reboot your
server once a week -- or every night. I reboot our LTSP thin client server
every night just because it gets hit with running lots of desktop applications
that I think give it a greater potential for these kinds of memory problems.
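
For anyone who wants to try it, a weekly reboot entry might look roughly like
the following. This is only a sketch: it assumes root's crontab (edited with
"crontab -e") and the Solaris-style shutdown command, and the day and hour are
arbitrary picks, not anything from the setup described above. On Linux the
command would be something like "/sbin/shutdown -r now" instead.

```
# minute hour day-of-month month day-of-week command
# Reboot every Sunday at 03:00 (assumed schedule; adjust to a quiet hour).
# -y: answer yes to the confirmation, -g 0: no grace period, -i 6: init state 6 (reboot)
0 3 * * 0 /usr/sbin/shutdown -y -g 0 -i 6
```

For a nightly reboot, the day-of-week field (the last "0") would become "*".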
On the other hand, we have all of our websites hosted on one of our
parishioner's servers -- and he doesn't reboot his machines periodically like I
do -- and about every two months, I have to call him up and tell him something
is wrong. And he goes and powers down his system -- sometimes he has to even
unplug it -- and then turn it back on, and everything works again.
I know there are system admins that just love to brag about how great their
up-times are on their machines -- but this might just save you a lot of time
and grief.
Of course, if you're running a real high-volume server, this might not be
workable for you; but it only takes 2-5 minutes or so to reboot... Perhaps in
the middle of the night you might be able to spare it being down that short
time?
Just a friendly suggestion.
Shared experience.
I know others may tell you that this is no longer necessary in these
more modern times; but my experience has been otherwise.
I hope it helps.
+Peter, hieromonk
Haven't we passed the days of mystical sysadmin without understanding
and characterization? Keeping up tradition for tradition's sake without
understanding the underlying reasons really doesn't do anybody a favor.
If there are memory leaks, we possess the technology to find them. My
organization has thousands of machines that run jobs sometimes for
months at a time. If I had to reboot servers once a week, my users would
be at the doors with pitchforks. The only time we take downtime is when
there are reasons to do so, including OS updates, hardware failures, and
user software run amok. They can run a very long time like this.
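
As a rough illustration of characterizing a suspected leak instead of rebooting
around it, here is a minimal Python sketch. It assumes you have already
collected (seconds, resident-set-size-in-KB) samples for one process, for
example by logging the output of "ps -o rss= -p PID" at intervals. The function
names and the 1 MB/hour threshold are hypothetical choices for this example,
not any standard tool. It only fits a straight line to the samples and reports
the growth rate; it does not find the leak itself.

```python
from statistics import mean


def rss_growth_rate(samples):
    """Return the least-squares slope of RSS (KB) against time (seconds).

    samples: list of (timestamp_seconds, rss_kb) pairs for one process.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t_bar = mean(t for t, _ in samples)
    r_bar = mean(r for _, r in samples)
    denom = sum((t - t_bar) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("all samples have the same timestamp")
    return sum((t - t_bar) * (r - r_bar) for t, r in samples) / denom


def looks_like_leak(samples, min_kb_per_hour=1024.0):
    """Flag steady growth above an assumed threshold (default 1 MB/hour)."""
    return rss_growth_rate(samples) * 3600 >= min_kb_per_hour


if __name__ == "__main__":
    # Hypothetical hourly samples: RSS climbing about 2000 KB per hour.
    growing = [(0, 1000), (3600, 3000), (7200, 5000)]
    print(looks_like_leak(growing))
```

A process whose memory use levels off after startup gives a slope near zero
over a long enough window, while a real leak keeps a positive slope, which
tells you which process to look at with a proper debugger or leak checker.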
Not that memory leaks never happen. Of course they do, but they
eventually get found and fixed, or the program causing them passes into
obsolescence. Always.
I encourage discovery rather than superstition, and diagnosis rather
than repetition.
Be a knight, not a victim!
_______________________________________________
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss