On Monday 16 July 2001 21:34, D. R. Evans wrote:
> On 16 Jul 01, at 22:09, civileme wrote:
> > On Monday 16 July 2001 17:05, D. R. Evans wrote:
> > > I had thought that I had pinpointed the process that was causing my
> > > disk occasionally to lock on solid (to the point where I can do nothing
> > > except power down my machine) but it seems, after more than a week
> > > without problem, that I was mistaken.
> > >
> > > I am running LM 7.2 without modification on a stock 700MHz Athlon. (LM
> > > 7.2 was pre-installed.)
> > >
> > > At 09:48 this morning, something happened to lock the disk activity
> > > light on solid. When I detected the problem fifty minutes later, I
> > > could do absolutely nothing to investigate the state of the machine. I
> > > know that the time at which the failure occurred is 09:48 because the
> > > clock on my KDE desktop stuck at that time.
> > >
> > > After a reboot, I looked in the syslog file, and the only thing at that
> > > time is the following:
> > >
> > > Jul 16 09:48:26 localhost kernel: VM: killing process multiload_apple
> > >
> > > whatever that means.
> > >
> > > There are a few subsequent entries before I came into the room and
> > > noticed the failure about 50 minutes later. There are several CROND
> > > entries that say things like:
> > >
> > > CROND[11016]: (root) CMD ( /sbin/rmmod -as)
> > >
> > > and another couple of instances of VM killing processes.
> > >
> > > Does anyone have any suggestions how to figure out:
> > >
> > > 1. what's going on
> > > 2. what's causing it
> > >
> > >   Doc Evans
> >
> > Well a little more information on what the usual state of affairs is
> > would be helpful, like are you using VMWare?  what does ps aux say?  What
> > does lsmod say?  How about dmesg and rpm -qa ?
>
> Well, those are exactly the sorts of questions that I wanted to know to
> ask, since I have never seen anything like this and haven't got a clue
> as to what the problem might be.
>
> Here are the answers:
>
> The usual state of affairs is that the box sits there running 24/7 as a
> firewall/router to the internet. Other than that, there are a few
> processes running but typically not actually doing anything (e.g.,
> apache is running). The machine stays up fine for days and then
> suddenly goes into this full-disk-activity mode, and when it does so
> the only recourse is to reboot, since I can't get enough CPU cycles to
> investigate what's happening.
>
> No VMware.
>
> Do you really want a copy of the output from "ps aux"? It's 113 lines
> long and looks perfectly normal. Of course, if I could take a snapshot
> when it was in disk-active mode, I expect it would tell me what the
> problem is. But I can't.
>
> lsmod:
>
> Module                  Size  Used by
> soundcore               2800   0  (autoclean) (unused)
> smbfs                  25664   1  (autoclean)
> vfat                    9408   0  (autoclean) (unused)
> fat                    30432   0  (autoclean) [vfat]
> 8139too                12064   2  (autoclean)
> ip_masq_vdolive         1440   0  (unused)
> ip_masq_cuseeme         1184   0  (unused)
> ip_masq_quake           1456   0  (unused)
> ip_masq_irc             1664   0  (unused)
> ip_masq_raudio          3072   0  (unused)
> ip_masq_ftp             4032   0  (unused)
> supermount             14224   2  (autoclean)
>
> dmesg just contains the usual stuff from the reboot after I was forced to
> power down the machine. Nothing abnormal there.
>
> rpm -qa gives 890 lines of stuff. Is there anything in particular I
> should be looking for? It all looks reasonable to me; basically a bunch
> of mdk packages.
>
> -----
>
> The best thing I could think of is to run "top" in batch mode, writing
> output every 60 seconds -- since then I would have a record of at least
> some of the machine state that would survive the forced reboot. I did
> this for five days and decided that the problem had gone away. So, of
> course, two days later, the disk went into its ultra-active mode again.
>
> Every log I have looked at seems to be reasonable, except for the
> messages about VM killing processes at about the time that the problem
> occurs.
>
> The most informative one is the "messages" file. The period around the
> time that the problem occurred this morning looks like this:
>
> Jul 16 09:00:00 localhost CROND[10337]: (root) CMD (   /sbin/rmmod -as)
> Jul 16 09:01:00 localhost CROND[10339]: (root) CMD (run-parts
> /etc/cron.hourly)
> Jul 16 09:10:00 localhost CROND[10341]: (root) CMD (   /sbin/rmmod -as)
> Jul 16 09:15:00 localhost CROND[10343]: (root) CMD (/usr/sbin/ntpdate
> -B -s -t 4 -u 63.173.194.12)
> Jul 16 09:15:02 localhost ntpdate[10343]: adjust time server
> 63.173.194.12 offset 0.263972 sec
> Jul 16 09:20:00 localhost CROND[10345]: (root) CMD (   /sbin/rmmod -as)
> Jul 16 09:30:00 localhost CROND[10347]: (root) CMD (   /sbin/rmmod -as)
> Jul 16 09:31:10 localhost kernel: cdrom: open failed.
> Jul 16 09:31:10 localhost kernel: end_request: I/O error, dev 02:00
> (floppy), sector 0
> Jul 16 09:31:10 localhost kernel: smb_retry: signal failed, error=-3
> Jul 16 09:31:10 localhost last message repeated 3 times
> Jul 16 09:40:00 localhost CROND[11012]: (root) CMD (   /sbin/rmmod -as)
> Jul 16 09:41:06 localhost named[485]: Cleaned cache of 23 RRsets
> Jul 16 09:41:07 localhost named[485]: USAGE 995298066 994567228
> CPU=2.42u/1.38s CHILDCPU=0u/0s
> Jul 16 09:41:07 localhost named[485]: NSTATS 995298066 994567228 A=3183
> PTR=353 TXT=1
> Jul 16 09:41:07 localhost named[485]: XSTATS 995298066 994567228
> RR=4333 RNXD=87 RFwdR=2781 RDupR=16 RFail=5 RFErr=0 RErr=0 RAXFR=0
> RLame=9 ROpts=0 SSysQ=868 SAns=1518 SFwdQ=2280 SDupQ=469 SErr=1 RQ=3537
> RIQ=0 RFwdQ=0 RDupQ=221 RTCP=0 SFwdR=2781 SFail=0 SFErr=0 SNaAns=1496
> SNXD=398
> Jul 16 09:48:26 localhost kernel: VM: killing process multiload_apple
> Jul 16 09:50:56 localhost CROND[11014]: (root) CMD (   /sbin/rmmod -as)
>
> I don't know what those messages around 09:31 mean, but they seem to
> appear sporadically throughout the "messages" file, usually a few times
> per day.
>
> So, any clues there? Or any other places I might look? I'm completely
> stymied.
>
> If only I could figure out which process was causing the
> aberrant behaviour, I could start to do something about it.
>
>   Doc
>
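
A minimal sketch, in Python, of pulling the "VM: killing process" entries
out of a syslog-style file so that their times can be compared with the
last entries logged before each lockup. The function name find_vm_kills
and the two sample lines are illustrative; the line format is inferred
from the log excerpt quoted above, not from kernel documentation.

```python
import re

# Matches lines such as:
#   Jul 16 09:48:26 localhost kernel: VM: killing process multiload_apple
# The pattern is an assumption based on the quoted excerpt.
VM_KILL = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"VM: killing process\s+(?P<proc>\S+)"
)

def find_vm_kills(lines):
    """Return (timestamp, process_name) for each VM kill message in lines."""
    hits = []
    for line in lines:
        m = VM_KILL.match(line)
        if m:
            hits.append((m.group("stamp"), m.group("proc")))
    return hits

# Two lines copied from the excerpt above, used as sample input.
sample = [
    "Jul 16 09:40:00 localhost CROND[11012]: (root) CMD (   /sbin/rmmod -as)",
    "Jul 16 09:48:26 localhost kernel: VM: killing process multiload_apple",
]
print(find_vm_kills(sample))
```

To run it over the real log, pass an open file instead of the sample list,
for example find_vm_kills(open("/var/log/messages")); that path is an
assumption about where the "messages" file lives on this box.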

I still need to see dmesg.  If the cause were software, the lockups would be
regular or have an identifiable trigger.  The software is locking up for some
specific reason.

Obviously, you could run supermount -i disable since the machine is serving 
and not automounting cds and floppies, but this intermittent behavior could 
be anything from a cooling fan to a motherboard glitch.  It is very unlikely 
that it is software since that same kernel was beaten to death by IBM Labs in 
22 hours of torture testing that they normally stop at 8 hours.

But I know several key disk/chipset interaction combos that I could either 
identify or eliminate with a look at dmesg.  For example, it could be a WD 
and a Maxtor on the same channel which would do this and possibly worse.
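
As a rough illustration of that kind of check, the Python sketch below
groups the disks reported in dmesg text by IDE channel and flags any channel
carrying drives from more than one vendor. The "hdX: model, ATA DISK drive"
line shape, the vendor prefixes, and the sample model strings are all
assumptions made for illustration, not taken from this machine, and the
sketch does not encode which combinations actually misbehave.

```python
import re
from collections import defaultdict

# Assumed dmesg line shape: "hda: <model string>, ATA DISK drive"
DRIVE = re.compile(r"^(hd[a-d]): (.+?), ATA DISK drive")

# Conventional mapping: hda/hdb share the primary channel, hdc/hdd the secondary.
CHANNEL = {"hda": "ide0", "hdb": "ide0", "hdc": "ide1", "hdd": "ide1"}

def vendor(model):
    """Crude vendor guess from the model string (hypothetical prefixes)."""
    if model.startswith("WDC"):
        return "WD"
    if model.startswith("Maxtor"):
        return "Maxtor"
    return "other"

def mixed_channels(dmesg_text):
    """Return {channel: sorted vendor labels} for channels with more than one label."""
    by_channel = defaultdict(set)
    for line in dmesg_text.splitlines():
        m = DRIVE.match(line.strip())
        if m:
            by_channel[CHANNEL[m.group(1)]].add(vendor(m.group(2)))
    return {ch: sorted(v) for ch, v in by_channel.items() if len(v) > 1}

# Hypothetical dmesg excerpt, for illustration only.
sample = """\
hda: WDC EXAMPLE-1, ATA DISK drive
hdb: Maxtor EXAMPLE-2, ATA DISK drive
hdc: EXAMPLE CD-ROM, ATAPI CDROM drive
"""
print(mixed_channels(sample))
```

An empty result only means no mixed-vendor channel was found in the text
supplied; it says nothing about fans, chipsets, or the other possibilities.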

Civileme
