Re: ATA_DMA errors

2005-06-25 Thread Martin
Johny Mattsson wrote:

> Basically, the problem seems to be related to using more than one
> channel on the IDE controller.

This isn't a solution to my problem. I only have one hard drive,
a 120GB Seagate. We seem to have different problems, btw.

I also don't think my problem is ATA-related. It shows up on ATA,
but I don't see any changes made to ATA on -STABLE between May 26
and May 30. Something else is going on there.

Overnight the system was up and the security scan showed
bad descriptors and bad block errors. This was the effect
of my last experiment with the latest -STABLE. (I previously thought
that the file system was intact, but that's not true.)

I don't know why this is called "bad block". It confuses users
(at least me) into thinking they have physically damaged
hard disk areas, but that is not the case, as a simple dd shows.
Real bad blocks also would not appear after booting a new kernel and
disappear once I reinstall an older one and run fsck.

Martin




Here is my dmesg (kernel date: May 26th 00:00:00):

Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.4-STABLE #0: Mon Jun 20 21:44:05 CEST 2005
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/KLOTZ
ACPI APIC Table: AMIINT VIA_K7  
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: AMD Sempron(tm) 2200+ (1499.52-MHz 686-class CPU)
  Origin = AuthenticAMD  Id = 0x681  Stepping = 1

  Features=0x383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
  AMD Features=0xc048<MP,AMIE,DSP,3DNow!>
real memory  = 536805376 (511 MB)
avail memory = 511455232 (487 MB)
ioapic0 Version 0.3 irqs 0-23 on motherboard
netsmb_dev: loaded
npx0: math processor on motherboard
npx0: INT 16 interface
acpi0: AMIINT VIA_K7 on motherboard
acpi0: Power Button (fixed)
Timecounter ACPI-fast frequency 3579545 Hz quality 1000
acpi_timer0: 24-bit timer at 3.579545MHz port 0x808-0x80b on acpi0
cpu0: ACPI CPU on acpi0
acpi_button0: Power Button on acpi0
pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0
pci0: ACPI PCI bus on pcib0
agp0: VIA 8377 (Apollo KT400/KT400A/KT600) host to PCI bridge mem
0xe000-0xe3ff at device 0.0 on pci0
pcib1: PCI-PCI bridge at device 1.0 on pci0
pci1: PCI bus on pcib1
nvidia0: GeForce4 Ti 4200 mem
0xddc8-0xddcf,0xd000-0xd7ff,0xde00-0xdeff irq 16
at device 0.0 on pci1
xl0: 3Com 3c905-TX Fast Etherlink XL port 0xec00-0xec3f irq 17 at
device 9.0 on pci0
miibus0: MII bus on xl0
nsphy0: DP83840 10/100 media interface on miibus0
nsphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
xl0: Ethernet address: 00:60:08:4e:42:3b
ath0: Atheros 5212 mem 0xdffd-0xdffd irq 18 at device 10.0 on pci0
ath0: mac 5.9 phy 4.3 5ghz radio 4.6
ath0: Ethernet address: 00:0f:b5:28:de:4b
ath0: 11b rates: 1Mbps 2Mbps 5.5Mbps 11Mbps
ath0: 11g rates: 1Mbps 2Mbps 5.5Mbps 11Mbps 6Mbps 9Mbps 12Mbps 18Mbps
24Mbps 36Mbps 48Mbps 54Mbps
bktr0: BrookTree 878 mem 0xdddfe000-0xdddfefff irq 19 at device 11.0
on pci0
bktr0: Hauppauge Model 44804 C108
bktr0: Detected a MSP34255?-?31 at 0x80
bktr0: Hauppauge WinCast/TV, Philips PAL I tuner, msp3400c stereo.
pci0: multimedia at device 11.1 (no driver attached)
sym0: 875 port 0xe800-0xe8ff mem
0xdfffe000-0xdfffefff,0xdf00-0xdfff irq 17 at device 13.0 on pci0
sym0: Tekram NVRAM, ID 7, Fast-20, SE, parity checking
uhci0: VIA 83C572 USB controller port 0xdc00-0xdc1f irq 21 at device
16.0 on pci0
usb0: VIA 83C572 USB controller on uhci0
usb0: USB revision 1.0
uhub0: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: VIA 83C572 USB controller port 0xe000-0xe01f irq 21 at device
16.1 on pci0
usb1: VIA 83C572 USB controller on uhci1
usb1: USB revision 1.0
uhub1: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: VIA 83C572 USB controller port 0xe400-0xe41f irq 21 at device
16.2 on pci0
usb2: VIA 83C572 USB controller on uhci2
usb2: USB revision 1.0
uhub2: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
pci0: serial bus, USB at device 16.3 (no driver attached)
isab0: PCI-ISA bridge at device 17.0 on pci0
isa0: ISA bus on isab0
atapci0: VIA 8235 UDMA133 controller port
0xfc00-0xfc0f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 17.1 on pci0
ata0: channel #0 on atapci0
ata1: channel #1 on atapci0
pcm0: VIA VT8235 port 0xd800-0xd8ff irq 22 at device 17.5 on pci0
pcm0: Unknown AC97 Codec (id = 0x434d4983)
vr0: VIA VT6102 Rhine II 10/100BaseTX port 0xd400-0xd4ff mem
0xdd00-0xddff irq 23 at device 18.0 on pci0
miibus1: MII bus on vr0
ukphy0: Generic IEEE 802.3u media interface on miibus1
ukphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
vr0: Ethernet address: 

Re: ATA_DMA errors

2005-06-25 Thread Rich Wales
I had a READ_DMA timeout situation which I'm pretty sure was
related to a drive problem.

I'm running 5.3-RELEASE-p5 on an older machine (333 MHz AMD
K6).  The 20 GB hard drive in this system periodically, but
only occasionally, gave READ_DMA timeout errors.  These errors
sometimes cited identical block (LBA) numbers from one time to
the next.

I tried running the system with the case open, in case it was
an overheating problem, but this had no effect.  I considered
replacing the power supply, but I never got around to doing
this.

Finally, about a week ago, I copied the entire system to a
new hard drive.  So far, I haven't had even one READ_DMA error
since going to the new hard drive.

At least in this one case, it seems fairly certain that the
problem has something to do with a particular hard drive.

Curiously, I did =not= get any READ_DMA errors while I was
making a full backup of the old drive in preparation for
copying the data onto the new drive.

Rich Wales    [EMAIL PROTECTED]    http://www.richw.org

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: ATA_DMA errors - [ workaround for me ]

2005-06-25 Thread Johny Mattsson

Hi all,

Today I've taken a fresh stab at the problem (I'm never at my best at 
5am in the morning having worked through the night), and I have managed 
to come up with what appears to amount to a successful workaround. It 
would be good if my observations could be confirmed by someone else.


Basically, the problem seems to be related to using more than one 
channel on the IDE controller. Data points for this are:


[ SiI 0680 ]
 Channel 1: 40 GB Seagate
 Channel 2: 60 GB Seagate + 160 GB Western Digital
Result: 200k worth of "READ_DMA timed out" and "WRITE_DMA UDMA ICRC 
error" messages, inability to obtain SMART info from the WD drive, WD 
drive info garbled, and the WD drive being removed/detached from the config. 
The errors only appeared after a few hours of operation, but once they were 
there, no number of reboots would get rid of them or improve the situation.


To attempt to save the data on the WD disk before the FS got completely 
hammered, I pulled it out, and observed the following:


[ SiI 0680 ]
 Channel 1: 40 GB Seagate
 Channel 2: 60 GB Seagate
Result: "READ_DMA timed out" errors for both drives, and "WRITE_DMA UDMA 
ICRC error" messages for the 60 GB Seagate.



Since I had an older ATA-100 controller available, I tried with it (it 
can't handle 120GB drives though, so I couldn't try as many combinations as 
I would have liked):


[ CMD 649 ]
 Channel 1: 40 GB Seagate
 Channel 2: 60 GB Seagate
Result: "READ_DMA timed out" errors, but only when both drives are in use 
at the same time. Running fsck on a slice of each drive in parallel 
reliably reproduced the READ_DMA errors. Whenever an error was reported 
for one drive, another error for the other drive always followed right 
after.


[ CMD 649 ]
 Channel 1:
 Channel 2: 40 GB Seagate + 60 GB Seagate
Result: No error messages.


[ CMD 649 ]
 Channel 1: 40 GB Seagate + 60 GB Seagate
 Channel 2:
Result: No error messages.


Encouraged by these findings, I swapped back to the SiI controller to 
test the 160 GB drive:


[ SiI 0680 ]
 Channel 1:
 Channel 2: 160 GB WD
Result: No error messages

[ SiI 0680 ]
 Channel 1: 160 GB WD
 Channel 2:
Result: No error messages


Finally, I tried everything together:

[ SiI 0680 ]
 Channel 1: 160 GB WD
 Channel 2:
[ CMD 649 ]
 Channel 1: 40 GB Seagate + 60 GB Seagate
 Channel 2:
Result: No error messages.


What I haven't mentioned in the above is that I also tried some 
combinations with different cables, and also at reduced speed (UDMA66 vs 
UDMA100). Neither change had any effect on the behaviour.


With the WD drive alone on the SiI 0680, I was also able to retrieve 
SMART information from it, and it's showing no errors for the drive at 
all. Likewise so for the 60 GB Seagate drive. All drives pass their 
self-tests without any errors.


As mentioned in my previous email, my system drive is hanging off the 
built-in PIIX4 controller, as a single drive and only one channel on the 
controller used. I never saw any errors for that drive throughout my 
testing.



My conclusion is thus that something has crept in that's 
affecting stability when multiple channels are used on the same 
controller. I'm not versed enough in driver internals to know if it's 
IRQ, DMA, ISR or anything-else related though. Below are my latest dmesg 
and pciconf listings - hopefully this will help someone locate the 
culprit. (Soren?)
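
The data points listed above can be written down as a small table and checked against the hypothesis mechanically. The following is only a sketch: the encoding (number of drives per channel) and the names `OBSERVATIONS`, `predicts_errors` and `hypothesis_holds` are made up for illustration, and it tests nothing beyond the eight configurations reported in this message.

```python
# Each observation: ({controller: (drives on channel 1, drives on channel 2)},
#                    errors_seen)
# Transcribed from the test results in this message, in the order reported.
OBSERVATIONS = [
    ({"SiI 0680": (1, 2)}, True),    # 40 GB | 60 GB + 160 GB WD
    ({"SiI 0680": (1, 1)}, True),    # 40 GB | 60 GB
    ({"CMD 649": (1, 1)}, True),     # 40 GB | 60 GB (errors when both in use)
    ({"CMD 649": (0, 2)}, False),    # -     | 40 GB + 60 GB
    ({"CMD 649": (2, 0)}, False),    # 40 GB + 60 GB | -
    ({"SiI 0680": (0, 1)}, False),   # -     | 160 GB WD
    ({"SiI 0680": (1, 0)}, False),   # 160 GB WD | -
    ({"SiI 0680": (1, 0), "CMD 649": (2, 0)}, False),  # everything together
]


def predicts_errors(config):
    """Hypothesis: errors appear when a single controller has drives
    attached to both of its channels."""
    return any(ch1 > 0 and ch2 > 0 for ch1, ch2 in config.values())


def hypothesis_holds(observations):
    """True if the hypothesis matches every recorded observation."""
    return all(predicts_errors(cfg) == seen for cfg, seen in observations)


if __name__ == "__main__":
    print("hypothesis consistent with all data points:",
          hypothesis_holds(OBSERVATIONS))
```

Consistency with eight data points does not prove the cause, of course; it only shows that none of the reported configurations contradicts the multi-channel explanation.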


So, now I'm stuck with a system with three IDE controllers and one SCSI 
controller, and a motherboard that is utterly confused when I ask it 
to boot off an external controller... (i.e. I can only boot off the 
built-in controller now).



Please let me know if there's some other info I can get for you; I'll 
have limited ability to move drives around since this is the file server 
and people get annoyed when it's unavailable, but do ask if you think it 
will help you! :)


Cheers,
/Johny


=== dmesg 
Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.4-RELEASE #0: Sun May  8 10:21:06 UTC 2005
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Pentium II/Pentium II Xeon/Celeron (467.73-MHz 686-class CPU)
  Origin = GenuineIntel  Id = 0x665  Stepping = 5

  Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
real memory  = 805240832 (767 MB)
avail memory = 778231808 (742 MB)
npx0: math processor on motherboard
npx0: INT 16 interface
acpi0: AWARD AWRDACPI on motherboard
acpi0: Power Button (fixed)
Timecounter ACPI-safe frequency 3579545 Hz quality 1000
acpi_timer0: 24-bit timer at 3.579545MHz port 0x4008-0x400b on acpi0
cpu0: ACPI CPU (3 Cx states) on acpi0
acpi_throttle0: ACPI CPU Throttling on cpu0
acpi_button0: Power Button on acpi0
pcib0: ACPI Host-PCI bridge port 
0x5000-0x500f,0x4000-0x4041,0xcf8-0xcff on acpi0

pci0: ACPI PCI bus on pcib0
agp0: 

Re: ATA_DMA errors

2005-06-24 Thread Johny Mattsson

twesky wrote:

> I am having ATA_DMA errors on 5.4R and 5 STABLE up to June 16 (haven't
> done a cvsup again).  It doesn't happen on 5.3R or lower.


I've just upgraded my fileserver from 5.1-R to 5.4-R, and I'm seeing 
this problem too now on 3 out of 4 drives.




> The exact error message is below:
>
> It happens within a few hours of use.  The laptop will then reboot,
> and fsck must be ran.  After fsck the timeouts happen within a few
> seconds of booting.


My system uses a SiI 0680 UDMA133 controller in addition to the old 
built-in Intel PIIX4 UDMA33 controller. My system drive hangs off the 
PIIX4 controller and I see no issues with it, only with drives off the SiI:


ad0: 8207MB ST38641A/3.29 [16676/16/63] at ata0-master UDMA33
ad4: 57241MB ST360021A/3.05 [116301/16/63] at ata2-master UDMA100
ad6: 76319MB ST380021A/3.19 [155061/16/63] at ata3-master UDMA100
ad7: 152627MB WDC WD1600JB-00DUA3/75.13B75 [310101/16/63] at 
ata3-slave UDMA100
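
To see at a glance which drives share a channel, probe lines like the ones above can be grouped by their `ataN` channel. This is a minimal sketch that assumes the lines look exactly as they do in this message (with the wrapped ad7 line rejoined); the names `PROBE_RE` and `drives_by_channel` are hypothetical, and real dmesg output may differ in detail.

```python
import re
from collections import defaultdict

# Probe lines copied from this message; the wrapped ad7 line is rejoined.
PROBE_LINES = """\
ad0: 8207MB ST38641A/3.29 [16676/16/63] at ata0-master UDMA33
ad4: 57241MB ST360021A/3.05 [116301/16/63] at ata2-master UDMA100
ad6: 76319MB ST380021A/3.19 [155061/16/63] at ata3-master UDMA100
ad7: 152627MB WDC WD1600JB-00DUA3/75.13B75 [310101/16/63] at ata3-slave UDMA100
"""

# Assumed line layout: device, size, model/firmware, geometry, channel-role, mode.
PROBE_RE = re.compile(
    r"^(?P<dev>ad\d+): (?P<size_mb>\d+)MB (?P<model>.+?) \[\d+/\d+/\d+\] "
    r"at (?P<channel>ata\d+)-(?P<role>master|slave) (?P<mode>\S+)$"
)


def drives_by_channel(text):
    """Map each ATA channel name to the (device, role, mode) tuples on it."""
    channels = defaultdict(list)
    for line in text.splitlines():
        match = PROBE_RE.match(line.strip())
        if match:
            channels[match["channel"]].append(
                (match["dev"], match["role"], match["mode"])
            )
    return dict(channels)


if __name__ == "__main__":
    for channel, drives in sorted(drives_by_channel(PROBE_LINES).items()):
        print(channel, drives)
```

With the four lines above, ad6 and ad7 end up together on ata3 while ad0 and ad4 each have a channel to themselves.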



Right after the upgrade things worked well for a couple of hours, and 
then I got a reboot all of a sudden. Upon inspection I found tons of 
both "READ_DMA timed out" and "WRITE_DMA UDMA ICRC error" 
messages in the log prior to the reboot. After the reboot it went to do the 
fsck and made it perhaps halfway through before it started churning 
out "READ_DMA timed out" messages again, followed by the "ad7: warning - 
removed from configuration" message.


Things did not get better from there; with each successive reboot 
more and more started going wrong. In order to get the system 
to boot at all in the end I had to physically disconnect the ad7 drive, 
but even so I'm getting READ_DMA timed out messages for ad4 and ad6.


Since I'm getting WRITE_DMA errors on both ad6 and ad7 now (I haven't 
written anything to ad4 yet, so I don't know if I'll get errors on that 
one too), and I wasn't a few hours ago when I was running 5.1-R, I 
refuse to believe that two disks have gone bad in that timespan!


I'm not sure what I should do at this point - theoretically I could 
proceed to roll back to 5.1 to prevent further data loss, but I'm 
guessing it'd be good if I kept it for a little while so that I could 
run tests for patches :-/



Seeing the comments about possible failing controller hardware, I might 
see if I can find a replacement controller tomorrow... any ideas in the 
meantime will be appreciated though!


Still feels very iffy that this started happening right after the 
upgrade... I was expecting to get rid of some of the quirks from the 
early preview, not get far worse ones! :-(



Oh, btw, using smartmontools' smartctl, I've gotten the information that 
ad4 has had 32 write errors in total, ad6 has had 0 (despite seeing the 
WRITE_DMA errors in the system log), and ad7 refuses to even talk SMART.



###

Here's the contents of the dmesg from before I pulled ad7 out:

Jun 24 18:22:19 kernel: FreeBSD 5.4-RELEASE #0: Sun May 8 10:21:06 UTC 2005
Jun 24 18:22:19 kernel:
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC
Jun 24 18:22:19 kernel: Timecounter i8254 frequency 1193182 Hz quality 0
Jun 24 18:22:19 kernel: CPU: Pentium II/Pentium II Xeon/Celeron
(467.73-MHz 686-class CPU)
Jun 24 18:22:19 kernel: Origin = GenuineIntel Id = 0x665 Stepping = 5
Jun 24 18:22:19 kernel:
Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
Jun 24 18:22:19 kernel: real memory = 805240832 (767 MB)
Jun 24 18:22:19 kernel: avail memory = 778231808 (742 MB)
Jun 24 18:22:19 kernel: npx0: math processor on motherboard
Jun 24 18:22:19 kernel: npx0: INT 16 interface
Jun 24 18:22:19 kernel: acpi0: AWARD AWRDACPI on motherboard
Jun 24 18:22:19 kernel: acpi0: Power Button (fixed)
Jun 24 18:22:19 kernel: Timecounter ACPI-safe frequency 3579545 Hz
quality 1000
Jun 24 18:22:19 kernel: acpi_timer0: 24-bit timer at 3.579545MHz port
0x4008-0x400b on acpi0
Jun 24 18:22:19 kernel: cpu0: ACPI CPU (3 Cx states) on acpi0
Jun 24 18:22:19 kernel: acpi_throttle0: ACPI CPU Throttling on cpu0
Jun 24 18:22:19 kernel: acpi_button0: Power Button on acpi0
Jun 24 18:22:19 kernel: pcib0: ACPI Host-PCI bridge port
0x5000-0x500f,0x4000-0x4041,0xcf8-0xcff on acpi0
Jun 24 18:22:19 kernel: pci0: ACPI PCI bus on pcib0
Jun 24 18:22:19 kernel: agp0: Intel 82443BX (440 BX) host to PCI
bridge mem 0xe000-0xe3ff at device 0.0 on pci0
Jun 24 18:22:19 kernel: pcib1: PCI-PCI bridge at device 1.0 on pci0
Jun 24 18:22:19 kernel: pci1: PCI bus on pcib1
Jun 24 18:22:19 kernel: isab0: PCI-ISA bridge at device 7.0 on pci0
Jun 24 18:22:19 kernel: isa0: ISA bus on isab0
Jun 24 18:22:19 kernel: atapci0: Intel PIIX4 UDMA33 controller port
0xf000-0xf00f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 7.1 on pci0
Jun 24 18:22:19 kernel: ata0: channel #0 on atapci0
Jun 24 18:22:19 kernel: ata1: channel #1 on atapci0
Jun 24 18:22:19 kernel: uhci0: Intel 82371AB/EB (PIIX4) USB controller
port 0x9000-0x901f irq 11 at device 7.2 on pci0
Jun 24 18:22:19 kernel: usb0: Intel 82371AB/EB (PIIX4) USB

Re: ATA_DMA errors

2005-06-24 Thread twesky
I don't think it is a hardware problem.  Unless you replace it with
the exact same hardware, it'll be difficult to determine if it was the
hardware.

I haven't had any issues with 5.3R or any stable version before April
15.  I am going to do some checking this weekend to see whether it is
hardware or software that is causing my timeouts.


Re[2]: ATA_DMA errors (and fs corruption!) (JM)

2005-06-21 Thread Tony Byrne
Hello Martin,

Monday, June 20, 2005, 9:09:15 PM, you wrote:

M> I just compiled the kernel from May 26th. Works fine. It looks like
M> for me it's broken between May 26th and May 30th.

M> I tried these kernels:
M> 2005-06-16 broken
M> 2005-05-31 broken
M> 2005-05-30 (00:00:00) broken
M> 2005-05-26 (00:00:00) ok
M> 2005-05-22 ok
M> 2005-05-15 ok
M> 2005-05-09 ok

M> The problem appears under heavy disk load.

There's definitely something up with the driver for the Intel ICH5
controller. I have a second machine with the same chipset, this time a
desktop, which is exhibiting the same DMA timeout problem with its
SATA disk. It has a RELENG_5 kernel which was built from sources
updated yesterday.

Regards,

Tony.

-- 
Tony Byrne




Re[2]: ATA_DMA errors (and fs corruption!)

2005-06-20 Thread Tony Byrne
Hello twesky,

t> atapci0: Intel ICH4 UDMA100 controller port
t> 0x1860-0x186f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 31.1 on
t> pci0
t> ata0: channel #0 on atapci0
t> ata1: channel #1 on atapci0

t> The last known good stable version for me was aprox April 25, my next
t> cvsup was May 17, but I have problems with 5.4 Release so I assume
t> (probably incorrectly) that something changed between April 25 and
t> 5.4R.

t> I don't exactly recall my shutdown errors, but I did have to restore
t> my file systems to get my laptop back to a functioning state.

We've been seeing the same problem in a server equipped with an Intel
ICH5 controller and SATA Hard Disk. The problems seemed to start after
an update in mid-May. We noticed that processes such as our imap
server would stall for a few seconds and the console would indicate
either a READ_DMA or WRITE_DMA timeout.  On two occasions the disk
became detached, requiring a reboot.  The frequency of these timeouts
was such that we couldn't do any work with the server.

We didn't have this problem prior to the update. We are tracking
RELENG_5, but have now reverted to a May 9th kernel, which doesn't
seem to be quite so fussy and has reduced the problem to a handful of
timeouts every day.

What's bugging me is that this list has been very quiet about this
problem. The Intel ICH* controllers must be common in the field and
I'm surprised that this problem has gone unnoticed. Of course, there
can be hardware reasons for timeouts such as a dying disk or cable,
but I think we've eliminated these in our case. The disk works fine
when transferred to another machine and the SATA cable works fine when
used with another disk (albeit one of smaller capacity) in the server.
So we've come to the conclusion that it's the combination of
controller, disk and FreeBSD version that holds the key to this.

Jun 20 10:20:04 roo kernel: atapci0: Intel ICH5 SATA150 controller port 
0xffa0-0xffaf,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 31.2 on pci0
Jun 20 10:20:04 roo kernel: ata0: channel #0 on atapci0
Jun 20 10:20:04 roo kernel: ata1: channel #1 on atapci0

...

Jun 20 10:20:04 roo kernel: ad0: 190782MB WDC WD2000JD-00FYB0/02.05D02 
[387621/16/63] at ata0-master SATA150
Jun 20 10:20:04 roo kernel: acd0: CDROM SAMSUNG CD-ROM SC-152G/C400 at 
ata1-master PIO4

...



Regards,

Tony.

-- 
Tony Byrne




Re[2]: ATA_DMA errors (and fs corruption!)

2005-06-20 Thread Bob Bishop

At 11:09 20/06/2005, Tony Byrne wrote:

> [...]
> Of course, there
> can be hardware reasons for timeouts such as a dying disk or cable,
> but I think we've eliminated these in our case. [etc]


Don't ignore the possibility of failing controller hardware. We had 
comparable mysterious problems on a client system, causing a lot of 
head-scratching. Eventually the failure went hard and we had to replace the 
motherboard.



--
Bob Bishop  +44 (0)118 940 1243
[EMAIL PROTECTED]   fax +44 (0)118 940 1295



Re[3]: ATA_DMA errors (and fs corruption!)

2005-06-20 Thread Tony Byrne
Hello Bob,


>> can be hardware reasons for timeouts such as a dying disk or cable,
>> but I think we've eliminated these in our case. [etc]

BB> Don't ignore the possibility of failing controller hardware. We had
BB> comparable mysterious problems on a client system, causing a lot of
BB> head-scratching. Eventually the failure went hard and we had to replace the
BB> motherboard.

I hear ya!  However, moving back to an older kernel changes the
severity of the problem from a timeout every 2 to three minutes during
heavy activity to about 4 or 5 in a 24 hour period.  That doesn't
sound like hardware to me.

Regards,

Tony.

-- 
Tony Byrne




Re: ATA_DMA errors (and fs corruption!)

2005-06-20 Thread JM

Tony Byrne wrote:

> Hello Bob,
>
> >> can be hardware reasons for timeouts such as a dying disk or cable,
> >> but I think we've eliminated these in our case. [etc]
>
> BB> Don't ignore the possibility of failing controller hardware. We had
> BB> comparable mysterious problems on a client system, causing a lot of
> BB> head-scratching. Eventually the failure went hard and we had to replace the
> BB> motherboard.
>
> I hear ya!  However, moving back to an older kernel changes the
> severity of the problem from a timeout every 2 to three minutes during
> heavy activity to about 4 or 5 in a 24 hour period.  That doesn't
> sound like hardware to me.
>
> Regards,
>
> Tony.

i have these same errors on my VIA 823x series chipset. however, the 
problem is only with the secondary device (acd0 in this case), and might 
be stemming from some other problem.



Re[3]: ATA_DMA errors (and fs corruption!)

2005-06-20 Thread Bob Bishop

At 12:12 20/06/2005, Tony Byrne wrote:

> Hello Bob,
>
> >> can be hardware reasons for timeouts such as a dying disk or cable,
> >> but I think we've eliminated these in our case. [etc]
>
> BB> Don't ignore the possibility of failing controller hardware. We had
> BB> comparable mysterious problems on a client system, causing a lot of
> BB> head-scratching. Eventually the failure went hard and we had to replace the
> BB> motherboard.
>
> I hear ya!  However, moving back to an older kernel changes the
> severity of the problem from a timeout every 2 to three minutes during
> heavy activity to about 4 or 5 in a 24 hour period.  That doesn't
> sound like hardware to me.


It didn't to me either. Note the use of 'mysterious' :-)
I'd eliminated drives and cables, and then did it all over again when the 
failure went hard, leaving the controller (or something else on the mobo). 
With a new mobo all the annoying timeouts which I'd put down to driver 
misbehaviour just went away.


--
Bob Bishop  +44 (0)118 940 1243
[EMAIL PROTECTED]   fax +44 (0)118 940 1295



Re[4]: ATA_DMA errors (and fs corruption!)

2005-06-20 Thread Tony Byrne
Hello Bob,

BB> It didn't to me either. Note the use of 'mysterious' :-)
BB> I'd eliminated drives and cables, and then did it all over again when the
BB> failure went hard, leaving the controller (or something else on the mobo).
BB> With a new mobo all the annoying timeouts which I'd put down to driver
BB> misbehaviour just went away.

Did you replace the motherboard with one of the same brand and model?

Regards,

Tony.

-- 
Tony Byrne




Re[4]: ATA_DMA errors (and fs corruption!)

2005-06-20 Thread Bob Bishop

At 13:19 20/06/2005, Tony Byrne wrote:

> Hello Bob,
>
> BB> It didn't to me either. Note the use of 'mysterious' :-)
> BB> I'd eliminated drives and cables, and then did it all over again when the
> BB> failure went hard, leaving the controller (or something else on the mobo).
> BB> With a new mobo all the annoying timeouts which I'd put down to driver
> BB> misbehaviour just went away.
>
> Did you replace the motherboard with one of the same brand and model?


No, but as it happened they both have the same SATA controller chip.

--
Bob Bishop  +44 (0)118 940 1243
[EMAIL PROTECTED]   fax +44 (0)118 940 1295



Re: ATA_DMA errors (and fs corruption!) (JM)

2005-06-20 Thread Jayton Garnett
I had a similar problem after I changed system cases: I was getting an 
ICRC error and FreeBSD refused to load or even mount the root fs. It was 
also giving errors to do with the ATA something or other. It turned out 
to be the cable I used after rebuilding the system in the new case: I had 
used a normal EIDE cable instead of an ATA cable :-/


Hope that helps (probably not)

Jay





Re[2]: ATA_DMA errors (and fs corruption!) (JM)

2005-06-20 Thread Tony Byrne
Hello Jayton,

Monday, June 20, 2005, 3:46:20 PM, you wrote:

JG> I had a similar problem and i changed system cases where i was getting a
JG> ICRC error and FreeBSD refused to load or even mount the root fs, it was
JG> also giving errors with something to do with the ATA something or other,
JG> it turned out to be the cable i used after rebuilding the system in the
JG> new case, i used a normal EIDE cable instead of a ATA cable :-/

In our case it's a SATA drive.

Regards,

Tony.

-- 
Tony Byrne




Re: ATA_DMA errors (and fs corruption!) (JM)

2005-06-20 Thread JM

Jayton Garnett wrote:

> I had a similar problem and i changed system cases where i was getting 
> a ICRC error and FreeBSD refused to load or even mount the root fs, it 
> was also giving errors with something to do with the ATA something or 
> other, it turned out to be the cable i used after rebuilding the 
> system in the new case, i used a normal EIDE cable instead of a ATA 
> cable :-/
>
> hope that helps(probably not)


actually, that makes a lot of sense.  my computer running FreeBSD is 
actually just an Eden 5000 V-series.  the cable is trimmed to fit the 
2.5" hard drive in the tiny case and i'm sure this is having something 
to do with the timeouts... however the harddrive is recognized as UDMA33 
but the cdrom still times out.  thanks for the input.  note: this setup 
works fine in windows... maybe someone should take a look at this 
issue?  i'm running the old 5.3-RELEASE (too lazy to update) with a VIA 
VT8231 SouthBridge (82 ata controller).






ATA_DMA errors (and fs corruption!) (JM)

2005-06-20 Thread twesky
My laptop works fine with Fedora Core 4.  I'm not sure it's a hardware
issue, and I don't have an identical laptop to test.  Do we know the
last working stable version?

-
> actually, that makes a lot of sense.  my computer running FreeBSD is 
> actually just an Eden 5000 V-series.  the cable is trimmed to fit the 
> 2.5" hard drive in the tiny case and i'm sure this is having something 
> to do with the timeouts... however the harddrive is recognized as UDMA33 
> but the cdrom still times out.  thanks for the input.  note: this setup 
> works fine in windows... maybe someone should take a look at this 
> issue?  i'm running the old 5.3-RELEASE (too lazy to update) with a VIA 
> VT8231 SouthBridge (82 ata controller).


Re: ATA_DMA errors (and fs corruption!) (JM)

2005-06-20 Thread Martin
twesky wrote:
> My laptop works fine with Fedora Core 4.  I'm not sure it's a hardware
> issue, and I don't have an identical laptop to test.  Do we know the
> last working stable version?

I just compiled the kernel from May 26th. Works fine. It looks like
for me it's broken between May 26th and May 30th.

I tried these kernels:
2005-06-16 broken
2005-05-31 broken
2005-05-30 (00:00:00) broken
2005-05-26 (00:00:00) ok
2005-05-22 ok
2005-05-15 ok
2005-05-09 ok

The problem appears under heavy disk load.
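
The window in which the breakage was introduced can be read off the list above mechanically. This is just a sketch of that bookkeeping, with the dates and results copied from the list; `RESULTS` and `regression_window` are names invented for the example, and it assumes every kernel older than the last good one is also good.

```python
from datetime import date

# Kernel source date -> True if that kernel worked ("ok"), False if "broken".
RESULTS = {
    date(2005, 6, 16): False,
    date(2005, 5, 31): False,
    date(2005, 5, 30): False,
    date(2005, 5, 26): True,
    date(2005, 5, 22): True,
    date(2005, 5, 15): True,
    date(2005, 5, 9): True,
}


def regression_window(results):
    """Return (last good date, first broken date).

    Raises ValueError if a broken kernel is not newer than the newest good
    one, because then the results do not point at a single window.
    """
    last_ok = max(d for d, ok in results.items() if ok)
    first_broken = min(d for d, ok in results.items() if not ok)
    if first_broken <= last_ok:
        raise ValueError("results are not monotonic; no single window")
    return last_ok, first_broken


if __name__ == "__main__":
    good, bad = regression_window(RESULTS)
    print(f"change landed after {good} and no later than {bad} "
          f"({(bad - good).days} days)")
```

Building a kernel from a date roughly in the middle of that window and testing it under heavy disk load would halve the range each time.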

Martin


Re: ATA_DMA errors (and fs corruption!) (JM)

2005-06-20 Thread Matthias Buelow
Jayton Garnett [EMAIL PROTECTED] writes:

> I had a similar problem and i changed system cases where i was getting a 
> ICRC error and FreeBSD refused to load or even mount the root fs, it was 
> also giving errors with something to do with the ATA something or other, 
> it turned out to be the cable i used after rebuilding the system in the 
> new case, i used a normal EIDE cable instead of a ATA cable :-/

I've just encountered the same problem on 5.4-STABLE/i386. I rebuilt my
kernel with SMP and enabled hyperthreading in loader.conf, because the
security weakness doesn't really apply to my desktop machine.

So, kernel got the DMA error at boot and couldn't mount the root fs.
When I switched off HT in the BIOS, the system came up ok.

I've cvsupped 5.4-STABLE just a few hours ago.

mkb.


Re: ATA_DMA errors (and fs corruption!) (JM)

2005-06-20 Thread Matthias Buelow
I wrote:

> So, kernel got the DMA error at boot and couldn't mount the root fs.

Ah, btw.. it's a SATA disk, on an ICH6 SATA150 controller.

mkb.


Re: ATA_DMA errors (and fs corruption!)

2005-06-19 Thread Martin
twesky wrote:
> I am having ATA_DMA errors on 5.4R and 5 STABLE up to June 16 (haven't
> done a cvsup again).  It doesn't happen on 5.3R or lower.

I have got the same problem. I tried yesterday's kernel and I got lots
of ATA DMA errors. A question: do you have a VIA IDE controller like mine?

atapci0: VIA 8235 UDMA133 controller port
0xfc00-0xfc0f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 17.1 on pci0

[EMAIL PROTECTED]:17:1:  class=0x01018a card=0x05711849 chip=0x05711106 rev=0x06 hdr=0x00
    vendor   = 'VIA Technologies Inc'
    device   = 'VT82 EIDE Controller (All VIA Chipsets)'
    class    = mass storage
    subclass = ATA
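
For comparing controllers between machines, the packed hex fields in
the line above can be split into their IDs. A minimal sketch, assuming
the usual pciconf layout (device ID in the upper 16 bits of chip=,
vendor ID in the lower 16 bits, the same split for card=, and
class/subclass/interface bytes in class=); the function name is made up.

```python
def decode_pciconf(chip, card, pci_class):
    """Split the packed hex fields printed by `pciconf -l`."""
    return {
        "vendor_id": chip & 0xFFFF,           # low half of chip=
        "device_id": chip >> 16,              # high half of chip=
        "subvendor_id": card & 0xFFFF,        # low half of card=
        "subdevice_id": card >> 16,           # high half of card=
        "base_class": (pci_class >> 16) & 0xFF,
        "subclass": (pci_class >> 8) & 0xFF,
        "prog_if": pci_class & 0xFF,
    }


# Values copied from the pciconf line above.
info = decode_pciconf(chip=0x05711106, card=0x05711849, pci_class=0x01018A)
for name, value in info.items():
    print(f"{name:13s} 0x{value:04x}")
```

The vendor ID comes out as 0x1106, which matches the VIA vendor string
printed by pciconf, so the split looks right for this output.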

Today I noticed that the short experiment with the latest -STABLE
destroyed part of my /usr partition. It looked like this (running the
May 9th kernel today):

kernel: handle_workitem_freeblocks: block count
kernel: bad block 50333952, ino 1743780
kernel: pid 56 (syncer), uid 0 inumber 1743780 on /usr: bad block
kernel: bad block 3221252091, ino 1743780
klotz kernel: pid 56 (syncer), uid 0 inumber 1743780 on /usr: bad block
kernel: bad block 144119931884736777, ino 1743780
kernel: pid 56 (syncer), uid 0 inumber 1743780 on /usr: bad block
kernel: bad block 72340173158093844, ino 1743780
kernel: pid 56 (syncer), uid 0 inumber 1743780 on /usr: bad block
kernel: bad block 1104111992832, ino 1743780
kernel: pid 56 (syncer), uid 0 inumber 1743780 on /usr: bad block
kernel: handle_workitem_freeblocks: block count
kernel: handle_workitem_freeblocks: block count
kernel: bad block 1865342872522620032, ino 1743783
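
A rough sanity check on these numbers, as a sketch only. It assumes a
120 GB disk (decimal gigabytes, the size mentioned elsewhere in the
thread) and that filesystem block numbers count units of at least 512
bytes, so no valid block number can exceed the disk's sector count.
That is a loose upper bound, not the real size of /usr.

```python
import re

LOG = """\
kernel: bad block 50333952, ino 1743780
kernel: bad block 3221252091, ino 1743780
kernel: bad block 144119931884736777, ino 1743780
kernel: bad block 72340173158093844, ino 1743780
kernel: bad block 1104111992832, ino 1743780
kernel: bad block 1865342872522620032, ino 1743783
"""

# Assumption: 120 GB disk, 512-byte sectors (loose upper bound only).
MAX_SECTORS = 120 * 10**9 // 512

BAD_BLOCK = re.compile(r"bad block (\d+), ino (\d+)")


def parse_bad_blocks(log):
    """Return (block, inode) pairs found in the log text."""
    return [(int(b), int(i)) for b, i in BAD_BLOCK.findall(log)]


def impossible(block, limit=MAX_SECTORS):
    """True if the block number cannot exist on a disk of that size."""
    return block >= limit


entries = parse_bad_blocks(LOG)
for block, inode in entries:
    verdict = "impossible" if impossible(block) else "within bound"
    print(f"ino {inode}: block {block} ({block:#x}) {verdict}")
```

All but the first value are far beyond anything the disk could hold,
which fits corrupted block pointers in the inode better than damaged
media. The first value passes only because the bound is so loose.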

While shutting down I got this:

Jun 19 22:04:21 klotz kernel: /usr: unmount pending error: blocks
-3561100369582 68157 files 0

I restored the fs in single-user mode, and now it runs fine with
the May 9th kernel.

See also my earlier post.

Martin


Re: ATA_DMA errors (and fs corruption!)

2005-06-19 Thread twesky
Here is my controller:

atapci0: Intel ICH4 UDMA100 controller port
0x1860-0x186f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 31.1 on
pci0
ata0: channel #0 on atapci0
ata1: channel #1 on atapci0

The last known good stable version for me was approx. April 25; my next
cvsup was May 17, but I have problems with 5.4-RELEASE, so I assume
(probably incorrectly) that something changed between April 25 and
5.4R.

I don't exactly recall my shutdown errors, but I did have to restore
my file systems to get my laptop back to a functioning state.



ATA_DMA errors

2005-06-18 Thread twesky
I am having ATA_DMA errors on 5.4R and 5 STABLE up to June 16 (haven't
done a cvsup again).  It doesn't happen on 5.3R or lower.

The exact error message is below:

It happens within a few hours of use.  The laptop will then reboot,
and fsck must be run.  After fsck the timeouts happen within a few
seconds of booting.  Is this a known issue?

ERROR MSG
---
ad0: timeout - READ_DMA retrying (2 retries left) LBA=24531835
ad0: warning - removed from configuration
ata0-master: failure - READ_DMA timed out
---

The laptop is a SONY VAIO PCG-Z1WA

dmesg info
ad0: 57231MB TOSHIBA MK6021GAS/GA024A [116280/16/63] at ata0-master UDMA100

fdisk info
# fdisk
*** Working on device /dev/ad0 ***
parameters extracted from in-core disklabel are:
cylinders=116280 heads=16 sectors/track=63 (1008 blks/cyl)

Figures below won't work with BIOS for partitions not in cyl 1
parameters to be used for BIOS calculations are:
cylinders=116280 heads=16 sectors/track=63 (1008 blks/cyl)

Media sector size is 512
Warning: BIOS sector numbering starts with sector 1
Information from DOS bootblock is:
The data for partition 1 is:
sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
start 63, size 117210177 (57231 Meg), flag 80 (active)
beg: cyl 0/ head 1/ sector 1;
end: cyl 1023/ head 254/ sector 63
The data for partition 2 is:
UNUSED
The data for partition 3 is:
UNUSED
The data for partition 4 is:
UNUSED
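
As a cross-check, the figures above can be run through a few lines of
arithmetic. This is only a sketch using the numbers quoted in this
message; it shows whether the geometry is self-consistent and where the
failing LBA sits, and says nothing about the cause of the timeouts.

```python
SECTOR_SIZE = 512                      # "Media sector size is 512"
CYLINDERS, HEADS, SECTORS_PER_TRACK = 116280, 16, 63
PART_START, PART_SIZE = 63, 117210177  # partition 1, in sectors
FAILED_LBA = 24531835                  # from the READ_DMA timeout message

# Disk size implied by the cylinder/head/sector geometry.
total_sectors = CYLINDERS * HEADS * SECTORS_PER_TRACK
total_mib = total_sectors * SECTOR_SIZE // 2**20

# First sector past the end of partition 1.
part_end = PART_START + PART_SIZE
in_partition = PART_START <= FAILED_LBA < part_end
offset_gib = FAILED_LBA * SECTOR_SIZE / 2**30

print(f"disk: {total_sectors} sectors, {total_mib} MiB")
print(f"partition 1 ends at sector {part_end}")
print(f"failed LBA inside partition 1: {in_partition}")
print(f"failed LBA is about {offset_gib:.1f} GiB into the disk")
```

The geometry works out to the 57231 MB that dmesg reports, partition 1
ends exactly at the last sector, and the failing LBA lies well inside
the disk, so the timeout is not an out-of-range access.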