DJA wrote:
Ralph Shumaker wrote:
DJA wrote:
On 5/10/05, Ralph Shumaker <[EMAIL PROTECTED]> wrote:
Responsiveness returned and the error message was no longer repeating.
Here's the error (assuming that I copied it correctly, since I could
not figure out how to cut and paste from console 1 to the GUI):
hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=22206449,
high=1, low=5429333, sector=84880
end_request: I/O error, dev 21:06 (hde), sector 84880
Does this suggest that there is something wrong with my HDD? Should I
just disable dma (or just cripple it)?
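
For reference, the names in braces are just the set bits of the drive's status and error registers. Here is a minimal sketch of that decoding, assuming the bit labels the 2.4-era Linux IDE driver prints. The tables are typed in from memory, the error table is deliberately partial, and `decode` is a made-up helper name.

```python
# Status register bits, labelled as the Linux IDE driver prints them.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

# Error register bits (partial: only the bits the driver gives a name to).
ERROR_BITS = {
    0x80: "BadSector/BadCRC",
    0x40: "UncorrectableError",
    0x10: "SectorIdNotFound",
    0x04: "DriveStatusError",
    0x02: "TrackZeroNotFound",
    0x01: "AddrMarkNotFound",
}


def decode(value, table):
    """Return the names of all bits in `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


print(decode(0x51, STATUS_BITS))
print(decode(0x40, ERROR_BITS))
```

With the values from the message above, this gives DriveReady, SeekComplete, Error for status=0x51 and UncorrectableError for error=0x40, i.e. the drive reported a read it could not correct. That alone does not say whether the platter, the cable, or the controller is at fault.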
Not so quick. Lots of motherboards (in fact most) have bugs in their
HDD controllers and chipsets affecting DMA.
I'm using a Promise controller (which is why it is hde instead of hda
through hdd). I originally purchased the Promise controller for an 80G
HDD which I purchased for a different motherboard (one with known BIOS
limitations above 33G or so). Because of a stupid mistake, I fried the
controller on that drive as well as the motherboard. The Promise card
was in between them but appeared to be the only thing still working.
In this context, "appeared to be ... still working" is pretty weak
evidence that you didn't damage it as well.
[Note: this assumes I understand the scenario: you had the Promise
controller in a box which you subsequently "fried". You determined
that both the motherboard and hard drive were dead. But the
Promise controller was not. Have I got that right?]
Correct, hence "*appeared* to be ...".
I figured that the hardware on the Promise card probably could give
faster performance than the controller built into the new motherboard
(newer than the fried one), so I slapped it in between the newer
motherboard (P2 267) and the new 160G HDD (which I set up as hde). The
whole system has been sailing smoothly ever since, until now, about 6
or 7 months later.
Which is a pretty pitiful lifetime for a hard drive. If you want to
cut to the chase, assuming that the drive is bad, take it back. Of
course, if the replacement shows the same symptoms, the merchant is
not going to like you very much when you bring it back too.
I misplaced my receipt. I'm still searching.
I am also assuming that you are using the same RAM which was installed
in the "fried" motherboard. And what about the CPU? I generally don't
consider CPUs because, in my experience, they're either on or off.
But RAM is a bit more forgiving; that is, it is more prone to dying a
slow death.
I think I transferred one or two of the RAM sticks, though memtest86 has
not detected any errors. I am only aware of the standard 8 tests and
the 9th (bit fade) test.
Recently, I tried switching the HDD system...
By this, you mean the Promise controller and hard drive?
Oops. Incomplete info, my bad.
3G HDD (primary (only) drive on primary IDE channel on both systems)
DVD+-R/W (primary drive on secondary IDE channel on both systems)
CDROM (secondary drive on secondary IDE channel on both systems)
Promise card (providing a third IDE channel on both systems)
160G HDD (primary (only) drive on Promise card on both systems)
Also migrated are a VGA card (PCI), a modem (PCI), a NIC (ISA), and a
sound card (ISA).
CPUs and motherboards remain sets, the old set and the new (-er) set.
...over to a yet newer
motherboard (P2 300, still ancient, but faster) given to me by Josh.
And the reason for this was?
Hoping for a sliver of performance boost. Personally, I didn't care.
But my brother was frustrated with certain games. He is no longer
around, so I no longer have urgency to gain said sliver. (And the
sliver *was* noticed.)
I had really weird lockups, first with Open Office Writer (IIRC), then
with Mozilla, as well as others.
Well, see now you've thrown in yet another unknown factor: another
motherboard. Best to make one change at a time if you really want to
narrow things down.
Hold on there. At that point, I was unaware of anything wrong with this
HDD. At that point, it seemed that the HDD (and its Promise card) were
fine on the second system.
First system fried.
Second system seemed fine.
Third system had strange problems.
Second system seemed fine, again, for a while.
Since I did not have memtest86, and
could not run long enough to download it, I switched back to the 267.
So now we're back to the original system.
Well, the second system, but yes.
Then I had to figure out why I was still unable to launch Mozilla (some
lock file that had to be hunted down and destroyed). Then, things were
back to normal, for a while.
Mozilla can and does do this all on its own without any help from
errant hardware. OO.o puking starts to look suspicious. Suspiciously
like RAM. Both programs like lots of healthy RAM.
Perhaps the RAM was flaky on system #3, but it passes memtest86 tests
1-9 flawlessly.
Later, about a month ago, I moved the system to a different location.
When I booted it up, it claimed to have been shut down uncleanly. It
had been several days between shutting it off and booting back up so I
couldn't be sure. It would not boot up because of some error. I didn't
log what I did, but I recall doing something with mke2fs (or maybe it
was e2fsck or something), crossing my fingers, and waiting for it to do
its thing. (It took a while.) After it was done, it seemed that all
was well, until this dma error (which only happened when I told Mozilla
to search messages for something, which would suggest to me that the
error (if truly a HDD error) is either on /dev/hde9 (swap) or on
/dev/hde6 which has everything except swap and /boot).
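
The `dev 21:06` field in the error quoted earlier may already settle which partition is involved. Below is a minimal sketch of that decoding, assuming the field is a hexadecimal major:minor pair (as 2.4-era kernels printed it) and that the usual IDE major numbers apply. `ide_device_name` is a hypothetical helper, not a kernel or library function.

```python
# Standard IDE block-device majors and the two drives each one covers.
IDE_MAJORS = {
    3: ("hda", "hdb"),
    22: ("hdc", "hdd"),
    33: ("hde", "hdf"),
    34: ("hdg", "hdh"),
}


def ide_device_name(dev):
    """Turn a hex 'major:minor' string such as '21:06' into a device name."""
    major_hex, minor_hex = dev.split(":")
    major = int(major_hex, 16)
    minor = int(minor_hex, 16)
    master, slave = IDE_MAJORS[major]
    # Each drive gets 64 minors: 0 is the whole disk, 1-63 are partitions.
    disk = master if minor < 64 else slave
    partition = minor % 64
    return disk if partition == 0 else disk + str(partition)


print(ide_device_name("21:06"))
```

Under those assumptions, 0x21 is major 33 (the hde/hdf channel) and minor 6 is the sixth partition, so the failed read was on hde6 rather than on the swap partition hde9.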
It tells me that the one thing in common with all your problems is the
Promise card (and maybe the RAM). From my end, that's the guy at the
top of my suspect list. It's been in every system you've described,
and each one has given you trouble. If you are using the same RAM, the
RAM moves to the top of the list with the Promise controller second.
I don't think Josh would knowingly give you a bad mobo, you have a
relatively new hard drive, and you seem to have reasons to believe the
267 mobo is fine. The Promise controller is the only part to have been
in a wreck. And I am still not clear about the RAM.
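
The elimination argument above amounts to intersecting the parts lists of every box that misbehaved. Here is a rough sketch, assuming a simplified inventory read off this thread. The parts lists are an interpretation, not a confirmed bill of materials, and the RAM entries in particular are guesses.

```python
# Simplified, assumed inventory: only parts named in the thread are listed.
systems = {
    "system 1 (fried)": {"Promise card", "80G HDD", "RAM"},
    "system 2 (P2 267)": {"Promise card", "160G HDD", "RAM"},
    "system 3 (P2 300)": {"Promise card", "160G HDD"},
}


def common_parts(inventory):
    """Return the parts present in every system, sorted by name."""
    return sorted(set.intersection(*inventory.values()))


print(common_parts(systems))

# If the same RAM sticks also went into system 3, RAM becomes a suspect too.
systems["system 3 (P2 300)"].add("RAM")
print(common_parts(systems))
```

With this inventory the first result is only the Promise card, and the second adds the RAM, which is why the answer to the RAM question changes the order of the suspect list.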
Test everything. Pull it. Then test everything. In fact, I'd pull first.
1) Test the RAM (Full 11-test suite, ~24 hours for 512 MB)
2) Test HDD w/o Promise
3) Test HDD w Promise.
I will have to figure out how to get Linux working as hdb instead of
hde. System #1 *needed* the promise controller to see past 32G. So the
HDD was set up on the Promise controller as hde.
Steps two and three assume previous tests passed.
Like Carl said, first back up your data. But before you go wiping
the drive, here are a few other things I would do (especially
important if you are running an AMD CPU):
o Run Memtest86. Bad memory can precipitate all kinds of errors,
including drive errors. I had a bad memory controller go south
on one of my boxes, which masqueraded itself as both a bad hard
drive and bad RAM.
I'm trying to get that set up right now. The instructions explain how
to do it with lilo, but I cannot figure out how to adapt it to grub.
For now, I will dig for a floppy and get it going that way if I can.
But, I /would/ like to set it up in grub. Any ideas?
Yes. Forget putting it in the boot manager menu. At least for now. Run
it properly from a dedicated floppy. Having Memtest86 on your menu is
fine for something portable like Knoppix - it's a good utility to have
around when you're trying to diagnose someone else's box and you don't
have your box of utilities at hand.
But if you have a desktop box at home, which is not likely going
anywhere, then you should also have a floppy close by with Memtest86
on it. Having the memory tester on a desktop box sounds cool and geeky
and all, but in practice is not all that useful.
(And don't anyone go off on me about "but floppies get lost or go
bad", cuz if that happens to you, you've got more than just hardware
problems! :^) ).
o Make sure there is not some other problem causing the behavior:
All cables tight? Box interior not overheating? CPU not
overheating? PSU not flaking out (still running one of the marginal
units which came in the case you bought for the PII-400 before you
upgraded to the XP3500+ and FX6800)?
Cables checked. Box is open. PSU and motherboard (P2 267) came
together.
I don't think "PSU" means what you think it means. PSU = Power Supply
Unit. The big hunk of iron that came with the case, not the CPU. The
last component that anyone ever suspects and the one component that
can literally burn up everything in your box (or even house, if put to
it).
The case, PSU, motherboard, and CPU (P2 267) came together. The context
made it clear about the PSU.
If the PSU came with the case (with few brand name exceptions), I can
almost guarantee that it's a piece of crap. But then you'll have to go
to my brother for that lecture. ;^)
Then perhaps I shall have to buy a new one.
Maybe you're talking about the heatsink-fan which came with the CPU?
If the CPU's still running, then the heatsink-fan is fine.
o Download and use the manufacturer's diagnostic software for your
drive. AFAIK, all of the major brands (IBM, Maxtor, WD, etc.)
have such a program available on their Websites.
I'll check on this after memtest has its run.
Good.
I think it's Samsung, but I guess I'll have to power down to check.
"dmesg" only tells me
hde: HDS722516VLAT80, ATA DISK drive
o Check the mobo vendor's website for BIOS updates which may address
the problem.
I'll check this before going with the drive diagnostic program.
I'd check the HDD first.
Neither BIOS nor motherboard problems (assuming they have problems)
would interfere?
o Do some research (Google) on the specific chipset on your
motherboard. For instance, VIA chipsets are notorious for having DMA
problems on Athlon motherboards. Also research similar problems for
your specific hard drive (i.e. for your make and model).
I'll check this one also before the drive diagnostics.
Really, it's okay, you can run the drive diags any time. No need to
wait. ;^)
Check it first without the Promise controller in the system. Don't
worry about booting problems: the diagnostic software runs only off of
floppy, or in some cases CD-ROM.
If the drive passes w/o the Promise card, run it again with the card.
Well, in that case, I guess I don't need to switch linux from hde to hdb.
If using DMA with the hard drive is a known problem for your
motherboard, there are boot options which can be used in Grub
or LILO to mitigate the problem (such as disabling DMA altogether).
I discovered that root has 187 mail messages that I never checked. Some
of them give some details about dma errors and the like.
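
One way to get something useful out of a pile of messages like that is to tally which sectors the kernel complained about. Below is a minimal sketch, assuming the messages contain `end_request` lines in the same format as the one quoted earlier. The sample text is made up for illustration (it reuses that one line) and `failing_sectors` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches lines like:
#   end_request: I/O error, dev 21:06 (hde), sector 84880
END_REQUEST = re.compile(
    r"end_request: I/O error, dev ([0-9a-f]{2}:[0-9a-f]{2}) "
    r"\((hd[a-z])\), sector (\d+)"
)


def failing_sectors(lines):
    """Count I/O errors per (disk, dev, sector) across the given lines."""
    counts = Counter()
    for line in lines:
        match = END_REQUEST.search(line)
        if match:
            dev, disk, sector = match.groups()
            counts[(disk, dev, int(sector))] += 1
    return counts


# Illustrative input only; real input would be the saved mail or log text.
sample = [
    "hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }",
    "end_request: I/O error, dev 21:06 (hde), sector 84880",
    "end_request: I/O error, dev 21:06 (hde), sector 84880",
]

for (disk, dev, sector), n in failing_sectors(sample).most_common():
    print(disk, dev, sector, n)
```

If the same sector keeps coming up, that might point toward one bad spot on the drive. Errors scattered all over the disk would fit the controller, cable, or RAM theories better. Either way it is only a hint, not a diagnosis.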
I haven't used a Promise controller since, ohh, back when VLB was
popular, so I can't give any reasons why DMA might or might not be
problematic on those cards. Paul? Where are you when I need you?
That'll get you closer to the truth before just blindly assuming
that the hard drive is going bad, and it might save you both time
and money in replacing the wrong parts for the wrong reasons.
Thank you very much for this valuable checklist!!!
Anytime. Just don't ask. ;^)
I'll try. :)
--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-list