DJA wrote:
Carl Lowenstein wrote:
On 5/10/05, Ralph Shumaker wrote:
I told mozilla to do a search of my email messages and it seemed to
freeze and the entire system was *very* sluggish. After trying several
things, I finally did [Ctrl][L-Alt][1] where I saw a login prompt. But
before I even had a chance to begin to log in, I started getting an
error message that kept repeating about once per second. [L-Alt][2]
showed the same thing as did 3 and 4 (where I gave up). I had managed
to get the detailed system monitor running in the GUI, so I switched
back over to that and killed mozilla.
By "detailed system monitor" you mean /usr/bin/top, I presume.
Responsiveness returned and the error message was no longer repeating. Here's the error (assuming that I copied it correctly since I could not figure out how to cut and paste from console 1 to the GUI):
hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=22206449, high=1, low=5429333, sector=84880
end_request: I/O error, dev 21:06 (hde), sector 84880
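For anyone who wants to pull the numbers out of a message like this (for example, to compare many repeats of it), here is a minimal Python sketch. It assumes the message has exactly the layout quoted above; the function name parse_ide_error and the dictionary keys are made up for illustration and are not part of any kernel tool.

```python
import re

# The message as quoted in this thread (copied by hand from the console,
# so individual digits may not be exact).
LOG = (
    "hde: dma_intr: status=0x51 { DriveReady SeekComplete Error } "
    "hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=22206449, "
    "high=1, low=5429333, sector=84880 "
    "end_request: I/O error, dev 21:06 (hde), sector 84880"
)


def parse_ide_error(text):
    """Extract the fields of an old-style IDE dma_intr error message.

    Hypothetical helper: returns a dict containing only the fields
    that were actually found in the text.
    """
    fields = {}

    # status=0x51 { ... } and error=0x40 { ... }: hex value plus flag names.
    for key in ("status", "error"):
        m = re.search(r"\b" + key + r"=(0x[0-9a-fA-F]+)\s*\{([^}]*)\}", text)
        if m:
            fields[key] = int(m.group(1), 16)
            fields[key + "_flags"] = m.group(2).split()

    # Plain decimal fields.
    for key in ("LBAsect", "high", "low", "sector"):
        m = re.search(r"\b" + key + r"=(\d+)", text)
        if m:
            fields[key] = int(m.group(1))

    # "dev 21:06" is kept as a string; it is not interpreted here.
    m = re.search(r"\bdev ([0-9a-fA-F]{2}:[0-9a-fA-F]{2})", text)
    if m:
        fields["dev"] = m.group(1)

    return fields


if __name__ == "__main__":
    for name, value in parse_ide_error(LOG).items():
        print(name, "=", value)
```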
Does this suggest that there is something wrong with my HDD? Should I just disable dma (or just cripple it)?
Suggests to me that you have a bad (uncorrectable) error in at least
one sector on your HDD. Back up anything valuable (not replaceable by
reinstalling the system) to some other place not on that drive. Consider replacing the drive. As a stopgap, you could try "mke2fs -c"
which will wipe out everything but also check for bad blocks. But
consider that the drive is on its way out.
carl
Not so quick. Lots of motherboards (in fact most) have bugs in their HDD controllers and chipsets affecting DMA.
I'm using a Promise controller (which is why it is hde instead of hda through hdd). I originally purchased the Promise controller for an 80G HDD which I purchased for a different motherboard (one with known BIOS limitations above 33G or so). Because of a stupid mistake, I fried the controller on that drive as well as the motherboard. The Promise card was in between them but appeared to be the only thing still working. I figured that the hardware on the Promise card probably could give faster performance than the controller built into the new motherboard (newer than the fried one), so I slapped it in between the newer motherboard (P2 267) and the new 160G HDD (which I set up as hde). The whole system has been sailing smoothly ever since, until now, about 6 or 7 months later.
Recently, I tried switching the HDD system over to a yet newer motherboard (P2 300, still ancient, but faster) given to me by Josh. I had really weird lockups, first with Open Office Writer (IIRC), then with Mozilla, as well as others. Since I did not have memtest86, and could not run long enough to download it, I switched back to the 267. Then I had to figure out why I was still unable to launch Mozilla (some lock file that had to be hunted down and destroyed). Then, things were back to normal, for a while.
Later, about a month ago, I moved the system to a different location. When I booted it up, it claimed to have been shut down uncleanly. It had been several days between shutting it off and booting back up so I couldn't be sure. It would not boot up because of some error. I didn't log what I did, but I recall doing something with mke2fs (or maybe it was e2fsck or something), crossing my fingers, and waiting for it to do its thing. (It took a while.) After it was done, it seemed that all was well, until this dma error. That error only happened when I told Mozilla to search messages for something, which suggests to me that the error (if it truly is an HDD error) is either on /dev/hde9 (swap) or on /dev/hde6, which has everything except swap and /boot.
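The "dev 21:06" field in the error message may help narrow down which partition is involved. Below is a minimal Python sketch, assuming that field is the hexadecimal major:minor pair that kernels of that era printed, and using the standard Linux IDE block majors (3, 22, 33, 34) with 64 minor numbers per drive. The function name decode_ide_dev is made up for illustration.

```python
# Standard Linux block majors for the first four IDE interfaces,
# each mapped to the letters of its master and slave drives.
IDE_MAJORS = {3: "ab", 22: "cd", 33: "ef", 34: "gh"}


def decode_ide_dev(dev):
    """Turn a hex 'major:minor' string such as '21:06' into a name like 'hde6'.

    Hypothetical helper; assumes both numbers are hexadecimal.
    """
    major_hex, minor_hex = dev.split(":")
    major = int(major_hex, 16)
    minor = int(minor_hex, 16)

    if major not in IDE_MAJORS:
        raise ValueError("not one of the IDE majors covered here: %d" % major)
    if not 0 <= minor <= 127:
        raise ValueError("minor out of range for an IDE drive pair: %d" % minor)

    # Minors 0-63 belong to the first drive on the interface, 64-127 to the
    # second. Within a drive, 0 is the whole disk and N is partition N.
    unit, partition = divmod(minor, 64)
    name = "hd" + IDE_MAJORS[major][unit]
    return name if partition == 0 else name + str(partition)


if __name__ == "__main__":
    print(decode_ide_dev("21:06"))
```

For 21:06 this gives hde6, which would point at the data partition rather than swap, provided the message was copied correctly from the console.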
Like Carl said, first back up your data. But before you go wiping the drive, here are a few other things I would do (especially important if you are running an AMD CPU):
o Run Memtest86. Bad memory can precipitate all kinds of errors, including drive errors. I had a bad memory controller go south on one of my boxes, which masqueraded as both a bad hard drive and bad RAM.
I'm trying to get that set up right now. The instructions explain how to do it with lilo, but I cannot figure out how to adapt it to grub. For now, I will dig for a floppy and get it going that way if I can. But, I /would/ like to set it up in grub. Any ideas?
o Make sure there is not some other problem causing the behavior: All cables tight? Box interior not overheating? CPU not overheating? PSU not flaking out (still running one of the marginal units which came in the case you bought for the PII-400 before you upgraded to the XP3500+ and FX6800)?
Cables checked. Box is open. PSU and motherboard (P2 267) came together.
o Download and use the manufacturer's diagnostic software for your drive. AFAIK, all of the major brands (IBM, Maxtor, WD, etc.) have such a program available on their Websites.
I'll check on this after memtest has its run.
o Check the mobo vendor's website for BIOS updates which may address the problem.
I'll check this before going with the drive diagnostic program.
o Do some research (Google) on the specific chipset on your motherboard. For instance, VIA chipsets are notorious for having DMA problems on Athlon motherboards. Also research similar problems for your specific hard drive (i.e. for your make and model).
I'll check this one also before the drive diagnostics.
If using DMA with the hard drive is a known problem for your motherboard, there are boot options which can be used in Grub or LILO to mitigate the problem (such as disabling DMA altogether).
I discovered that root has 187 mail messages that I never checked. Some of them give some details about dma errors and the like.
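Sifting through that many messages by hand is tedious, so here is a rough Python sketch that lists the subjects of the messages mentioning DMA or I/O errors. It assumes root's mail is a plain mbox file; the path /var/mail/root, the keyword list, and the function name are all assumptions and may need adjusting for a particular system.

```python
import mailbox


def dma_related_subjects(mbox_path, keywords=("dma", "i/o error")):
    """Return the Subject of every message whose text mentions a keyword.

    Hypothetical helper: matching is a simple case-insensitive substring
    search over the whole message, headers included.
    """
    hits = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            text = msg.as_string().lower()
            if any(word in text for word in keywords):
                hits.append(msg["Subject"] or "(no subject)")
    finally:
        box.close()
    return hits


if __name__ == "__main__":
    # Assumed location of root's mailbox; reading it normally requires root.
    for subject in dma_related_subjects("/var/mail/root"):
        print(subject)
```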
That'll get you closer to the truth before just blindly assuming that the hard drive is going bad, and it might save you both time and money in replacing the wrong parts for the wrong reasons.
Thank you very much for this valuable checklist!!!
