Mario Witte wrote:
> On Fri, Aug 03, 2001 at 03:42:28AM -0700, Van wrote:
> > Mario Witte wrote:
> > Memory or motherboard. How many sticks of RAM do you have in the machine?  Are
> > they the same speed (p100/p133, etc.)?  What kind of motherboard? Is updatedb
> > running at this time?  (might be a hard-drive croaking while trying to update
> > the locate database).....
> There are 2x256 MB and 2x128MB sticks in there, all at a speed of 133.
> Please don't ask me what kind of motherboard we're running in there, but
> that shouldn't be a problem.

A couple of months ago I had 256 MBytes in my Slackware Athlon workstation, with
both chips labeled 100MHz-128MBytes.  Turns out one of the chips was 133MHz and
the other was 100MHz (Fry's Electronics labeling dep't).  Sadly, it took the
bucks I spent on a new PIII true Intel board to qualify the chips; I then put
the 2 133MHz chips into the Athlon and the 2 qualified P100 chips into the PIII
Intel board.  Honestly, my Athlon was crashing on Slackware and the Intel was
crashing on Advanced Server regularly.  Slackware with the 2.4.1 kernel crashed
sometimes twice, sometimes 3 times in a day, and sometimes would go for a few
days.  The Advanced Server couldn't stay up long enough to show a login without
a BSOD, except randomly sometimes.  I switched the chips after verifying them,
and now the Athlon can run 3 weeks at a time

vanboers@sedona:~$ w
  4:06am  up 22 days, 13:27,  0 users,  load average: 1.10, 1.24, 1.14
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU  WHAT

(developer machine, what can I say?  1+ load average due to SetiAtHome, BTW) and
I've seen the Advanced Server run close to a month (between virus patches and
IIS security updates).  On average, the developer Athlon machine still beats the
Advanced Server machine on uptime, but the point is that the memory was killing
both of them.  Don't take the label for granted.  I can't tell you how much time
I wasted determining this.  I can tell you I lost over $5k US in billing, though,
because I assumed the label was correct.  I lost the time because I assumed I
was doing something wrong and couldn't bill my client for development during the
month it took me to find out what the problem was: disparate memory on the same
motherboard.  If the labeling had been correct, I would have just swapped the
chips.  Hope that makes sense.

> Updatedb is running around midnight, but I just found out that
> cron.hourly could be a problem in there as we experienced another crash
> tonight which was at 1:59, the crash yesterday occurred at 4:59. Always
> around the full hour. I've tried and disabled cron.hourly for now,
> hoping it will help. Seems like it wasn't a problem of mysql, it was
> just mysql which was killed and thus appeared in the kernel trace or
> something.

You're probably onto something here.  Great forensics work! I have experience
with the cron.hourly/cron.daily/etc. processes that fire up when you have the
logrotate packages installed.  It's been a while, but while I was using Red Hat
at Intel I convinced them these crons and the logrotate packages should either
be audited thoroughly, or pitched because of the second-guessing they do to the
admin of the machine/network.  
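
To check the "always around the full hour" pattern against more crashes than
the two quoted above, here is a minimal Python sketch.  The times 1:59 and 4:59
come from this thread; the calendar dates, the 5-minute window and the function
names are my own assumptions for illustration.  It only shows that crashes
cluster near the hour; it does not prove an hourly cron job is the cause.

```python
from datetime import datetime


def minutes_from_hour(ts):
    """Distance in minutes from ts to the nearest full hour (0..30)."""
    return min(ts.minute, 60 - ts.minute)


def near_full_hour(crashes, window=5):
    """Return the crash times within `window` minutes of a full hour."""
    return [t for t in crashes if minutes_from_hour(t) <= window]


# Times are from the thread; the dates are assumed for illustration.
crashes = [
    datetime(2001, 8, 2, 4, 59),
    datetime(2001, 8, 3, 1, 59),
]

suspects = near_full_hour(crashes)
print(f"{len(suspects)} of {len(crashes)} crashes within 5 minutes of the hour")
```

If most crashes land inside the window, the hourly jobs are worth disabling one
at a time, rather than all at once, to see which one triggers the panic.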

Intel opted to replace logrotate with msgarch, an implementation I've had
running on all of my production machines for several years, on some of the
monitoring servers in the division in which I was working.  (Intel applied my
implementation on modified Red Hat and Slackware monitoring servers at that
time.)  I haven't been at Intel for over 4 months, so I have no idea what
they've done with their Red Hat implementations since, or whether that division
still deploys Slackware servers.

If logrotate is the cause, I'll send you msgarch and the cron entries for
msgarch.  Sorry I didn't OSS msgarch before, but I hadn't heard many complaints
about logrotate.  The BSD people use it too, most with a certain level of
satisfaction.  msgarch is my own recipe, but has been implemented by many of my
affiliates for many years.  I just didn't OSS it because most people have been
using logrotate and I thought it redundant to toss msgarch to the community.
If that assumption was wrong, let me know.  I'll pitch msgarch into the
community.
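
In the meantime, for anyone who wants to see what a bare-bones rotation looks
like, here is a minimal Python sketch.  To be clear, this is NOT msgarch and
not how logrotate works internally; the function name `rotate`, the `keep`
parameter and the numbered-suffix naming are assumptions made up for this
example.

```python
import os


def rotate(path, keep=4):
    """Move path to path.1, path.1 to path.2, etc., keeping `keep` old copies."""
    oldest = f"{path}.{keep}"
    if os.path.exists(oldest):
        os.remove(oldest)  # drop the copy that would fall off the end
    # Shift the remaining copies up by one, highest number first.
    for i in range(keep - 1, 0, -1):
        src = f"{path}.{i}"
        if os.path.exists(src):
            os.rename(src, f"{path}.{i + 1}")
    if os.path.exists(path):
        os.rename(path, f"{path}.1")
        open(path, "a").close()  # recreate an empty log file
```

Note that a daemon holding the old file open keeps writing to the renamed
copy until it reopens its log, so a real setup has to signal or restart the
writer after rotating.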

> > Seems your machine might have a wrong hardware component somewhere.  I'd check
> > it out if it's a production machine.
> We sure will, but the system is located about 500 kilometers from where
> our bureau is, so I hope it will stay alive at least over the weekend
> :-)

This is a problematic situation.  Hardware is SO important in remote
deployments.  I hate to say this, but my most important machine is 2000 miles
away from me.  I did test it locally for 2 months, on the hardware it runs on,
before I deployed it.  That might be the lesson here: the hardware didn't fail
in 2 months of testing.  Not comprehensive, but it might be enough for most
purposes.  Who knows?

You can't win them all.  I'm lucky that I tested it on a really stable kernel,
and on hardware that ran physically next to me for 2 months before I deployed
it.  I'm also very lucky that it's been running for 294 days 2000 miles away. 
Anyone could take it down, if they knew how; and I wouldn't be able to do squat
about it but divert the traffic to a backup server.  Point is, the kernel is
solid.  That's where you need to get with your machine.  The kernel messages are
telling you something about the hardware; not MySQL.

> > mysqld can't run as the only service.  You can't run anything without initd.
> Ok, you won! ;)

No one won.  We need to find out which service caused the panic and fix it.

> Thanks for your fast help,
> With regards,
> --
> Mario Witte <[EMAIL PROTECTED]>
Any time.

Best Regards,
Linux rocks!!!
