Re: ATA_DMA errors
Johny Mattsson wrote:
> Basically, the problem seems to be related to using more than one channel on the IDE controller.

This isn't a solution to my problem. I only have one hard drive. It's a 120 GB Seagate.

We seem to have different problems, btw. I also don't think my problem is ATA-related. It shows up in ATA, but I don't see any changes made to ATA on -STABLE between May 26 and May 30. Something else is going on there.

Overnight the system was up and the security scan showed bad descriptors and bad block errors. This was the effect of my last experiment with the latest -STABLE. (I previously thought that the file system was intact, but that is not true.)

I don't know why this is called "bad block". It confuses users (at least me), making them think areas of the hard disk are physically destroyed, but this is not the case, as a simple dd shows. Physically bad blocks would not appear after booting a new kernel and then disappear when I reinstall an older one and run fsck.

Martin

Here is my dmesg (kernel date: May 26th 00:00:00):

Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.4-STABLE #0: Mon Jun 20 21:44:05 CEST 2005
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/KLOTZ
ACPI APIC Table: AMIINT VIA_K7
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: AMD Sempron(tm) 2200+ (1499.52-MHz 686-class CPU)
Origin = AuthenticAMD Id = 0x681 Stepping = 1
Features=0x383fbffFPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE
AMD Features=0xc048MP,AMIE,DSP,3DNow!
real memory = 536805376 (511 MB) avail memory = 511455232 (487 MB) ioapic0 Version 0.3 irqs 0-23 on motherboard netsmb_dev: loaded npx0: math processor on motherboard npx0: INT 16 interface acpi0: AMIINT VIA_K7 on motherboard acpi0: Power Button (fixed) Timecounter ACPI-fast frequency 3579545 Hz quality 1000 acpi_timer0: 24-bit timer at 3.579545MHz port 0x808-0x80b on acpi0 cpu0: ACPI CPU on acpi0 acpi_button0: Power Button on acpi0 pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0 pci0: ACPI PCI bus on pcib0 agp0: VIA 8377 (Apollo KT400/KT400A/KT600) host to PCI bridge mem 0xe000-0xe3ff at device 0.0 on pci0 pcib1: PCI-PCI bridge at device 1.0 on pci0 pci1: PCI bus on pcib1 nvidia0: GeForce4 Ti 4200 mem 0xddc8-0xddcf,0xd000-0xd7ff,0xde00-0xdeff irq 16 at device 0.0 on pci1 xl0: 3Com 3c905-TX Fast Etherlink XL port 0xec00-0xec3f irq 17 at device 9.0 on pci0 miibus0: MII bus on xl0 nsphy0: DP83840 10/100 media interface on miibus0 nsphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto xl0: Ethernet address: 00:60:08:4e:42:3b ath0: Atheros 5212 mem 0xdffd-0xdffd irq 18 at device 10.0 on pci0 ath0: mac 5.9 phy 4.3 5ghz radio 4.6 ath0: Ethernet address: 00:0f:b5:28:de:4b ath0: 11b rates: 1Mbps 2Mbps 5.5Mbps 11Mbps ath0: 11g rates: 1Mbps 2Mbps 5.5Mbps 11Mbps 6Mbps 9Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps 54Mbps bktr0: BrookTree 878 mem 0xdddfe000-0xdddfefff irq 19 at device 11.0 on pci0 bktr0: Hauppauge Model 44804 C108 bktr0: Detected a MSP34255?-?31 at 0x80 bktr0: Hauppauge WinCast/TV, Philips PAL I tuner, msp3400c stereo. 
pci0: multimedia at device 11.1 (no driver attached) sym0: 875 port 0xe800-0xe8ff mem 0xdfffe000-0xdfffefff,0xdf00-0xdfff irq 17 at device 13.0 on pci0 sym0: Tekram NVRAM, ID 7, Fast-20, SE, parity checking uhci0: VIA 83C572 USB controller port 0xdc00-0xdc1f irq 21 at device 16.0 on pci0 usb0: VIA 83C572 USB controller on uhci0 usb0: USB revision 1.0 uhub0: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1 uhub0: 2 ports with 2 removable, self powered uhci1: VIA 83C572 USB controller port 0xe000-0xe01f irq 21 at device 16.1 on pci0 usb1: VIA 83C572 USB controller on uhci1 usb1: USB revision 1.0 uhub1: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1 uhub1: 2 ports with 2 removable, self powered uhci2: VIA 83C572 USB controller port 0xe400-0xe41f irq 21 at device 16.2 on pci0 usb2: VIA 83C572 USB controller on uhci2 usb2: USB revision 1.0 uhub2: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1 uhub2: 2 ports with 2 removable, self powered pci0: serial bus, USB at device 16.3 (no driver attached) isab0: PCI-ISA bridge at device 17.0 on pci0 isa0: ISA bus on isab0 atapci0: VIA 8235 UDMA133 controller port 0xfc00-0xfc0f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 17.1 on pci0 ata0: channel #0 on atapci0 ata1: channel #1 on atapci0 pcm0: VIA VT8235 port 0xd800-0xd8ff irq 22 at device 17.5 on pci0 pcm0: Unknown AC97 Codec (id = 0x434d4983) vr0: VIA VT6102 Rhine II 10/100BaseTX port 0xd400-0xd4ff mem 0xdd00-0xddff irq 23 at device 18.0 on pci0 miibus1: MII bus on vr0 ukphy0: Generic IEEE 802.3u media interface on miibus1 ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto vr0: Ethernet address:
Re: ATA_DMA errors
I had a READ_DMA timeout situation which I'm pretty sure was related to a drive problem.

I'm running 5.3-RELEASE-p5 on an older machine (333 MHz AMD K6). The 20 GB hard drive in this system gave occasional READ_DMA timeout errors. These errors sometimes cited identical block (LBA) numbers from one time to the next.

I tried running the system with the case open, in case it was an overheating problem, but this had no effect. I considered replacing the power supply, but I never got around to doing this.

Finally, about a week ago, I copied the entire system to a new hard drive. So far, I haven't had even one READ_DMA error since going to the new hard drive. At least in this one case, it seems fairly certain that the problem has something to do with a particular hard drive.

Curiously, I did =not= get any READ_DMA errors while I was making a full backup of the old drive in preparation for copying the data onto the new drive.

Rich Wales [EMAIL PROTECTED] http://www.richw.org

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: ATA_DMA errors - [ workaround for me ]
Hi all,

Today I've taken a fresh stab at the problem (I'm never at my best at 5am, having worked through the night), and I have managed to come up with what appears to be a successful workaround. It would be good if my observations could be confirmed by someone else.

Basically, the problem seems to be related to using more than one channel on the IDE controller. Data points for this are:

[ SiI 0680 ]
  Channel 1: 40 GB Seagate
  Channel 2: 60 GB Seagate + 160 GB Western Digital
Result: 200k worth of READ_DMA timed out and WRITE_DMA UDMA ICRC error messages, inability to obtain SMART info from the WD drive, WD drive info garbled, and WD drive being removed/detached from the config. The errors only appeared after a few hours of operation, but once they were there, no number of reboots would get rid of them or improve the situation.

To attempt to save the data on the WD disk before the FS got completely hammered, I pulled it out, and observed the following:

[ SiI 0680 ]
  Channel 1: 40 GB Seagate
  Channel 2: 60 GB Seagate
Result: READ_DMA timed out errors for both drives, and WRITE_DMA UDMA ICRC error messages for the 60 GB Seagate.

Since I had an older ATA-100 controller available, I tried with it (it can't handle 120 GB drives though, so I couldn't try as many combinations as I would have liked):

[ CMD 649 ]
  Channel 1: 40 GB Seagate
  Channel 2: 60 GB Seagate
Result: READ_DMA timed out errors, but only when both drives are in use at the same time. Running fsck on a slice on each drive in parallel reliably reproduced the READ_DMA errors. Whenever an error was reported for one drive, another error for the other drive always followed right after.

[ CMD 649 ]
  Channel 1: (empty)
  Channel 2: 40 GB Seagate + 60 GB Seagate
Result: No error messages.

[ CMD 649 ]
  Channel 1: 40 GB Seagate + 60 GB Seagate
  Channel 2: (empty)
Result: No error messages.
Encouraged by these findings, I swapped back to the SiI controller to test the 160 GB drive:

[ SiI 0680 ]
  Channel 1: (empty)
  Channel 2: 160 GB WD
Result: No error messages.

[ SiI 0680 ]
  Channel 1: 160 GB WD
  Channel 2: (empty)
Result: No error messages.

Finally, I tried everything together:

[ SiI 0680 ]
  Channel 1: 160 GB WD
  Channel 2: (empty)
[ CMD 649 ]
  Channel 1: 40 GB Seagate + 60 GB Seagate
  Channel 2: (empty)
Result: No error messages.

What I haven't mentioned in the above is that I also tried some combinations with different cables, and also at reduced speed (UDMA66 vs UDMA100). Neither change had any effect on the behaviour.

With the WD drive alone on the SiI 0680, I was also able to retrieve SMART information from it, and it's showing no errors for the drive at all. Likewise for the 60 GB Seagate drive. All drives pass their self-tests without any errors.

As mentioned in my previous email, my system drive hangs off the built-in PIIX4 controller, as a single drive with only one channel on the controller in use. I never saw any errors for that drive throughout my testing.

My conclusion is thus that something has crept in that affects stability when multiple channels are used on the same controller. I'm not versed enough in driver internals to know if it's IRQ, DMA, ISR or anything-else related though. Below are my latest dmesg and pciconf listings - hopefully this will help someone locate the culprit. (Soren?)

So, now I'm stuck with a system with three IDE controllers and one SCSI controller, and a motherboard that is utterly confused when I ask it to boot off an external controller... (i.e. I can only boot off the built-in controller now).

Please let me know if there's some other info I can get for you; I'll have limited ability to move drives around since this is the file server and people get annoyed when it's unavailable, but do ask if you think it will help you! :)

Cheers,
/Johny

=== dmesg
Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD 5.4-RELEASE #0: Sun May 8 10:21:06 UTC 2005 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC Timecounter i8254 frequency 1193182 Hz quality 0 CPU: Pentium II/Pentium II Xeon/Celeron (467.73-MHz 686-class CPU) Origin = GenuineIntel Id = 0x665 Stepping = 5 Features=0x183f9ffFPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PA T,PSE36,MMX,FXSR real memory = 805240832 (767 MB) avail memory = 778231808 (742 MB) npx0: math processor on motherboard npx0: INT 16 interface acpi0: AWARD AWRDACPI on motherboard acpi0: Power Button (fixed) Timecounter ACPI-safe frequency 3579545 Hz quality 1000 acpi_timer0: 24-bit timer at 3.579545MHz port 0x4008-0x400b on acpi0 cpu0: ACPI CPU (3 Cx states) on acpi0 acpi_throttle0: ACPI CPU Throttling on cpu0 acpi_button0: Power Button on acpi0 pcib0: ACPI Host-PCI bridge port 0x5000-0x500f,0x4000-0x4041,0xcf8-0xcff on acpi0 pci0: ACPI PCI bus on pcib0 agp0:
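Johny's data points can be tabulated and checked mechanically against his conclusion. The following is only a sketch: the `Trial` record and the helper names are invented for illustration, and the rows are transcribed from the per-controller results in his message (the final combined run is omitted because it repeats two single-channel rows).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    controller: str
    channel1: tuple  # drives attached to channel 1
    channel2: tuple  # drives attached to channel 2
    errors: bool     # were DMA errors reported?


# Transcribed from the message above; the record layout is hypothetical.
TRIALS = [
    Trial("SiI 0680", ("40GB Seagate",), ("60GB Seagate", "160GB WD"), True),
    Trial("SiI 0680", ("40GB Seagate",), ("60GB Seagate",), True),
    Trial("CMD 649", ("40GB Seagate",), ("60GB Seagate",), True),
    Trial("CMD 649", (), ("40GB Seagate", "60GB Seagate"), False),
    Trial("CMD 649", ("40GB Seagate", "60GB Seagate"), (), False),
    Trial("SiI 0680", (), ("160GB WD",), False),
    Trial("SiI 0680", ("160GB WD",), (), False),
]


def channels_in_use(trial):
    """Number of channels on the controller with at least one drive."""
    return sum(1 for drives in (trial.channel1, trial.channel2) if drives)


def hypothesis_holds(trials):
    """True if errors were seen exactly when more than one channel was used."""
    return all((channels_in_use(t) > 1) == t.errors for t in trials)


for t in TRIALS:
    print(f"{t.controller:9} channels used: {channels_in_use(t)}  errors: {t.errors}")
print("hypothesis consistent with all trials:", hypothesis_holds(TRIALS))
```

Seven rows on one machine agree with the "more than one channel" explanation, but they do not rule out other causes, which is why confirmation from other systems was requested.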
Re: ATA_DMA errors
twesky wrote:
> I am having ATA_DMA errors on 5.4R and 5 STABLE up to June 16 (haven't done a cvsup again). It doesn't happen on 5.3R or lower.

I've just upgraded my fileserver from 5.1-R to 5.4-R, and I'm seeing this problem too now on 3 out of 4 drives.

> The exact error message is below: It happens within a few hours of use. The laptop will then reboot, and fsck must be ran. After fsck the timeouts happen within a few seconds of booting.

My system uses a SiI 0680 UDMA133 controller in addition to the old built-in Intel PIIX4 UDMA33 controller. My system drive hangs off the PIIX4 controller and I see no issues with it, only with drives off the SiI:

ad0: 8207MB ST38641A/3.29 [16676/16/63] at ata0-master UDMA33
ad4: 57241MB ST360021A/3.05 [116301/16/63] at ata2-master UDMA100
ad6: 76319MB ST380021A/3.19 [155061/16/63] at ata3-master UDMA100
ad7: 152627MB WDC WD1600JB-00DUA3/75.13B75 [310101/16/63] at ata3-slave UDMA100

Right after the upgrade things worked well for a couple of hours, and then I got a reboot all of a sudden. Upon inspection I found tons of both "READ_DMA timed out" and "WRITE_DMA UDMA ICRC error" messages in the log prior to the reboot. After the reboot it went to do the fsck and made it perhaps halfway through before it started churning out "READ_DMA timed out" messages again, followed by the "ad7: warning - removed from configuration" message.

Things did not get better from there; with each successive reboot more and more started going wrong. In order to get the system to boot at all in the end I had to physically disconnect the ad7 drive, but even so I'm getting "READ_DMA timed out" messages for ad4 and ad6.

Since I'm getting WRITE_DMA errors on both ad6 and ad7 now (I haven't written anything to ad4 yet, so I don't know if I'll get errors on that one too), and I wasn't a few hours ago when I was running 5.1-R, I refuse to believe that two disks have gone bad in that timespan!
I'm not sure what I should do at this point - theoretically I could roll back to 5.1 to prevent further data loss, but I'm guessing it'd be good if I kept 5.4 for a little while so that I could test patches :-/

Seeing the comments about possible failing controller hardware, I might see if I can find a replacement controller tomorrow... any ideas in the meantime will be appreciated though! It still feels very iffy that this started happening right after the upgrade... I was expecting to get rid of some of the quirks from the early preview, not get far worse ones! :-(

Oh, btw, using smartmontools' smartctl, I've learned that ad4 has had 32 write errors in total, ad6 has had 0 (despite the WRITE_DMA errors in the system log), and ad7 refuses to even talk SMART.

### Here are the contents of the dmesg from before I pulled ad7 out:

Jun 24 18:22:19 kernel: FreeBSD 5.4-RELEASE #0: Sun May 8 10:21:06 UTC 2005
Jun 24 18:22:19 kernel: [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC
Jun 24 18:22:19 kernel: Timecounter i8254 frequency 1193182 Hz quality 0
Jun 24 18:22:19 kernel: CPU: Pentium II/Pentium II Xeon/Celeron (467.73-MHz 686-class CPU)
Jun 24 18:22:19 kernel: Origin = GenuineIntel Id = 0x665 Stepping = 5
Jun 24 18:22:19 kernel: Features=0x183f9ffFPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR
Jun 24 18:22:19 kernel: real memory = 805240832 (767 MB)
Jun 24 18:22:19 kernel: avail memory = 778231808 (742 MB)
Jun 24 18:22:19 kernel: npx0: math processor on motherboard
Jun 24 18:22:19 kernel: npx0: INT 16 interface
Jun 24 18:22:19 kernel: acpi0: AWARD AWRDACPI on motherboard
Jun 24 18:22:19 kernel: acpi0: Power Button (fixed)
Jun 24 18:22:19 kernel: Timecounter ACPI-safe frequency 3579545 Hz quality 1000
Jun 24 18:22:19 kernel: acpi_timer0: 24-bit timer at 3.579545MHz port 0x4008-0x400b on acpi0
Jun 24 18:22:19 kernel: cpu0: ACPI CPU (3 Cx states) on acpi0
Jun 24 18:22:19 kernel: acpi_throttle0: ACPI CPU
Throttling on cpu0 Jun 24 18:22:19 kernel: acpi_button0: Power Button on acpi0 Jun 24 18:22:19 kernel: pcib0: ACPI Host-PCI bridge port 0x5000-0x500f,0x4000-0x4041,0xcf8-0xcff on acpi0 Jun 24 18:22:19 kernel: pci0: ACPI PCI bus on pcib0 Jun 24 18:22:19 kernel: agp0: Intel 82443BX (440 BX) host to PCI bridge mem 0xe000-0xe3ff at device 0.0 on pci0 Jun 24 18:22:19 kernel: pcib1: PCI-PCI bridge at device 1.0 on pci0 Jun 24 18:22:19 kernel: pci1: PCI bus on pcib1 Jun 24 18:22:19 kernel: isab0: PCI-ISA bridge at device 7.0 on pci0 Jun 24 18:22:19 kernel: isa0: ISA bus on isab0 Jun 24 18:22:19 kernel: atapci0: Intel PIIX4 UDMA33 controller port 0xf000-0xf00f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 7.1 on pci0 Jun 24 18:22:19 kernel: ata0: channel #0 on atapci0 Jun 24 18:22:19 kernel: ata1: channel #1 on atapci0 Jun 24 18:22:19 kernel: uhci0: Intel 82371AB/EB (PIIX4) USB controller port 0x9000-0x901f irq 11 at device 7.2 on pci0 Jun 24 18:22:19 kernel: usb0: Intel 82371AB/EB (PIIX4) USB
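The drive probe lines in this message already show which drives share a channel, and the bracketed C/H/S geometry can be cross-checked against the reported sizes. Below is a minimal sketch that parses those four lines. The regular expression and helper names are illustrative only, and it assumes 512-byte sectors and that "MB" here means 2^20 bytes.

```python
import re
from collections import defaultdict

# The four probe lines quoted in the message above.
PROBE_LINES = """\
ad0: 8207MB ST38641A/3.29 [16676/16/63] at ata0-master UDMA33
ad4: 57241MB ST360021A/3.05 [116301/16/63] at ata2-master UDMA100
ad6: 76319MB ST380021A/3.19 [155061/16/63] at ata3-master UDMA100
ad7: 152627MB WDC WD1600JB-00DUA3/75.13B75 [310101/16/63] at ata3-slave UDMA100
""".splitlines()

# Hypothetical pattern written for exactly this line shape.
PROBE_RE = re.compile(
    r"^(?P<dev>ad\d+): (?P<mb>\d+)MB (?P<model>.+) "
    r"\[(?P<c>\d+)/(?P<h>\d+)/(?P<s>\d+)\] "
    r"at (?P<chan>ata\d+)-(?P<role>master|slave) (?P<mode>\S+)$"
)

SECTOR_BYTES = 512  # assumption: classic 512-byte ATA sectors


def parse_probe(line):
    """Return a dict describing one probe line, or None if it does not match."""
    m = PROBE_RE.match(line)
    if m is None:
        return None
    sectors = int(m["c"]) * int(m["h"]) * int(m["s"])
    return {
        "dev": m["dev"],
        "reported_mb": int(m["mb"]),
        "chs_mb": sectors * SECTOR_BYTES // 2**20,  # size implied by C/H/S
        "channel": m["chan"],
        "role": m["role"],
    }


DRIVES = [d for d in map(parse_probe, PROBE_LINES) if d]

# Group device names by the ATA channel they hang off.
BY_CHANNEL = defaultdict(list)
for d in DRIVES:
    BY_CHANNEL[d["channel"]].append(d["dev"])

for d in DRIVES:
    print(d["dev"], d["channel"], d["role"], d["reported_mb"], d["chs_mb"])
print(dict(BY_CHANNEL))
```

Under those assumptions the geometry and the reported sizes agree for all four drives, and the grouping shows ad6 and ad7 sharing ata3 while ad4 sits alone on ata2.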
Re: ATA_DMA errors
I don't think it is a hardware problem. Unless you replace it with the exact same hardware, it'll be difficult to determine whether it was the hardware.

I haven't had any issues with 5.3R or any stable version before April 15. I am going to do some checking this weekend and see whether it is hardware or software that is causing my timeouts.
Re[2]: ATA_DMA errors (and fs corruption!) (JM)
Hello Martin,

Monday, June 20, 2005, 9:09:15 PM, you wrote:

M> I just compiled the kernel from May 26th. Works fine. It looks like
M> for me it's broken between May 26th and May 30th.

M> I tried these kernels:
M> 2005-06-16 broken
M> 2005-05-31 broken
M> 2005-05-30 (00:00:00) broken
M> 2005-05-26 (00:00:00) ok
M> 2005-05-22 ok
M> 2005-05-15 ok
M> 2005-05-09 ok

M> The problem appears under heavy disk load.

There's definitely something up with the driver for the Intel ICH5 controller. I have a second machine with the same chipset, this time a desktop, which is exhibiting the same DMA timeout problem with its SATA disk. It has a RELENG_5 kernel which was built from sources updated yesterday.

Regards,
Tony.

--
Tony Byrne
Re[2]: ATA_DMA errors (and fs corruption!)
Hello twesky,

t> atapci0: Intel ICH4 UDMA100 controller port
t> 0x1860-0x186f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 31.1 on
t> pci0
t> ata0: channel #0 on atapci0
t> ata1: channel #1 on atapci0

t> The last known good stable version for me was aprox April 25, my next
t> cvsup was May 17, but I have problems with 5.4 Release so I assume
t> (probably incorrectly) that something changed between April 25 and
t> 5.4R.

t> I don't exactly recall my shutdown errors, but I did have to restore
t> my file systems to get my laptop back to a functioning state.

We've been seeing the same problem in a server equipped with an Intel ICH5 controller and a SATA hard disk. The problems seemed to start after an update in mid-May. We noticed that processes such as our imap server would stall for a few seconds and the console would indicate either a READ_DMA or WRITE_DMA timeout. On two occasions the disk became detached, requiring a reboot. The timeouts were so frequent that we couldn't do any work with the server. We didn't have this problem prior to the update.

We are tracking RELENG_5, but have now reverted to a May 9th kernel, which doesn't seem to be quite so fussy and has reduced the problem to a handful of timeouts every day.

What's bugging me is that this list has been very quiet about this problem. The Intel ICH* controllers must be common in the field and I'm surprised that this problem has gone unnoticed.

Of course, there can be hardware reasons for timeouts such as a dying disk or cable, but I think we've eliminated these in our case. The disk works fine when transferred to another machine and the SATA cable works fine when used with another disk (albeit one of smaller capacity) in the server. So we've come to the conclusion that it's the combination of controller, disk and FreeBSD version that holds the key to this.
Jun 20 10:20:04 roo kernel: atapci0: Intel ICH5 SATA150 controller port 0xffa0-0xffaf,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 31.2 on pci0
Jun 20 10:20:04 roo kernel: ata0: channel #0 on atapci0
Jun 20 10:20:04 roo kernel: ata1: channel #1 on atapci0
...
Jun 20 10:20:04 roo kernel: ad0: 190782MB WDC WD2000JD-00FYB0/02.05D02 [387621/16/63] at ata0-master SATA150
Jun 20 10:20:04 roo kernel: acd0: CDROM SAMSUNG CD-ROM SC-152G/C400 at ata1-master PIO4
...

Regards,
Tony.

--
Tony Byrne
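Since several posters compare how often the timeouts occur under different kernels, it may help to tally them from the system log rather than by eye. The sketch below counts READ_DMA and WRITE_DMA messages per day and device. The syslog prefix follows the lines quoted above, but the timeout lines in `SAMPLE_LOG` are invented for illustration and their exact wording is an assumption, so the pattern only relies on the device name and the `READ_DMA`/`WRITE_DMA` token.

```python
import re
from collections import Counter

# First line is from the message above; the TIMEOUT lines are hypothetical
# examples, not real log output from this thread.
SAMPLE_LOG = [
    "Jun 20 10:20:04 roo kernel: ad0: 190782MB WDC WD2000JD-00FYB0/02.05D02 [387621/16/63] at ata0-master SATA150",
    "Jun 20 11:02:17 roo kernel: ad0: TIMEOUT - READ_DMA retrying (2 retries left) LBA=1234567",
    "Jun 20 11:05:40 roo kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=7654321",
    "Jun 20 13:31:09 roo kernel: ad0: TIMEOUT - READ_DMA retrying (2 retries left) LBA=1234567",
    "Jun 21 02:14:55 roo kernel: ad0: TIMEOUT - READ_DMA retrying (2 retries left) LBA=2345678",
]

# Month and day, time, host, "kernel:", device, then any text naming the op.
LINE_RE = re.compile(
    r"^(?P<day>[A-Z][a-z]{2} +\d+) \d\d:\d\d:\d\d \S+ kernel: "
    r"(?P<dev>ad\d+): .*\b(?P<op>READ_DMA|WRITE_DMA)\b"
)


def tally(lines):
    """Count DMA messages keyed by (day, device, operation)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            day = " ".join(m["day"].split())  # normalise "Jun  1" to "Jun 1"
            counts[(day, m["dev"], m["op"])] += 1
    return counts


COUNTS = tally(SAMPLE_LOG)
for key in sorted(COUNTS):
    print(key, COUNTS[key])
```

Run against the real /var/log/messages before and after a kernel change, a tally like this gives a per-day figure that can be compared directly.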
Re[2]: ATA_DMA errors (and fs corruption!)
At 11:09 20/06/2005, Tony Byrne wrote:
> [...] Of course, there can be hardware reasons for timeouts such as a dying disk or cable, but I think we've eliminated these in our case. [etc]

Don't ignore the possibility of failing controller hardware. We had comparable mysterious problems on a client system, causing a lot of head-scratching. Eventually the failure went hard and we had to replace the motherboard.

--
Bob Bishop +44 (0)118 940 1243
[EMAIL PROTECTED] fax +44 (0)118 940 1295
Re[3]: ATA_DMA errors (and fs corruption!)
Hello Bob,

> > can be hardware reasons for timeouts such as a dying disk or cable, but I think we've eliminated these in our case. [etc]

BB> Don't ignore the possibility of failing controller hardware. We had
BB> comparable mysterious problems on a client system, causing a lot of
BB> head-scratching. Eventually the failure went hard and we had to replace the
BB> motherboard.

I hear ya! However, moving back to an older kernel changes the severity of the problem from a timeout every two to three minutes during heavy activity to about 4 or 5 in a 24-hour period. That doesn't sound like hardware to me.

Regards,
Tony.

--
Tony Byrne
Re: ATA_DMA errors (and fs corruption!)
Tony Byrne wrote:
> Hello Bob,
>
> > > can be hardware reasons for timeouts such as a dying disk or cable, but I think we've eliminated these in our case. [etc]
>
> BB> Don't ignore the possibility of failing controller hardware. We had
> BB> comparable mysterious problems on a client system, causing a lot of
> BB> head-scratching. Eventually the failure went hard and we had to replace the
> BB> motherboard.
>
> I hear ya! However, moving back to an older kernel changes the severity of the problem from a timeout every 2 to three minutes during heavy activity to about 4 or 5 in a 24 hour period. That doesn't sound like hardware to me.
>
> Regards,
> Tony.

I have these same errors on my VIA 823x series chipset. However, the problem is only with the secondary device (acd0 in this case), and might be stemming from some other problem.
Re[3]: ATA_DMA errors (and fs corruption!)
At 12:12 20/06/2005, Tony Byrne wrote:
> Hello Bob,
>
> > > can be hardware reasons for timeouts such as a dying disk or cable, but I think we've eliminated these in our case. [etc]
>
> BB> Don't ignore the possibility of failing controller hardware. We had
> BB> comparable mysterious problems on a client system, causing a lot of
> BB> head-scratching. Eventually the failure went hard and we had to replace the
> BB> motherboard.
>
> I hear ya! However, moving back to an older kernel changes the severity of the problem from a timeout every 2 to three minutes during heavy activity to about 4 or 5 in a 24 hour period. That doesn't sound like hardware to me.

It didn't to me either. Note the use of 'mysterious' :-)

I'd eliminated drives and cables, and then did it all over again when the failure went hard, leaving the controller (or something else on the mobo). With a new mobo all the annoying timeouts which I'd put down to driver misbehaviour just went away.

--
Bob Bishop +44 (0)118 940 1243
[EMAIL PROTECTED] fax +44 (0)118 940 1295
Re[4]: ATA_DMA errors (and fs corruption!)
Hello Bob,

BB> It didn't to me either. Note the use of 'mysterious' :-)

BB> I'd eliminated drives and cables, and then did it all over again when the
BB> failure went hard, leaving the controller (or something else on the mobo).
BB> With a new mobo all the annoying timeouts which I'd put down to driver
BB> misbehaviour just went away.

Did you replace the motherboard with one of the same brand and model?

Regards,
Tony.

--
Tony Byrne
Re[4]: ATA_DMA errors (and fs corruption!)
At 13:19 20/06/2005, Tony Byrne wrote:
> Hello Bob,
>
> BB> It didn't to me either. Note the use of 'mysterious' :-)
>
> BB> I'd eliminated drives and cables, and then did it all over again when the
> BB> failure went hard, leaving the controller (or something else on the mobo).
> BB> With a new mobo all the annoying timeouts which I'd put down to driver
> BB> misbehaviour just went away.
>
> Did you replace the motherboard with one of the same brand and model?

No, but as it happened they both have the same SATA controller chip.

--
Bob Bishop +44 (0)118 940 1243
[EMAIL PROTECTED] fax +44 (0)118 940 1295
Re: ATA_DMA errors (and fs corruption!) (JM)
I had a similar problem when I changed system cases: I was getting an ICRC error and FreeBSD refused to load or even mount the root fs. It was also giving errors to do with the ATA something or other. It turned out to be the cable I used after rebuilding the system in the new case: I had used a normal EIDE cable instead of an ATA cable :-/

Hope that helps (probably not)

Jay

> Tony Byrne wrote:
> > Hello Bob,
> >
> > > > can be hardware reasons for timeouts such as a dying disk or cable, but I think we've eliminated these in our case. [etc]
> >
> > BB> Don't ignore the possibility of failing controller hardware. We had
> > BB> comparable mysterious problems on a client system, causing a lot of
> > BB> head-scratching. Eventually the failure went hard and we had to replace the
> > BB> motherboard.
> >
> > I hear ya! However, moving back to an older kernel changes the severity of the problem from a timeout every 2 to three minutes during heavy activity to about 4 or 5 in a 24 hour period. That doesn't sound like hardware to me.
> >
> > Regards,
> > Tony.
>
> i have these same errors on my VIA 823x series chipset. however, the problem is only with the secondary device (acd0 in this case), and might be stemming from some other problem.
Re[2]: ATA_DMA errors (and fs corruption!) (JM)
Hello Jayton,

Monday, June 20, 2005, 3:46:20 PM, you wrote:

JG> I had a similar problem and i changed system cases where i was getting a
JG> ICRC error and FreeBSD refused to load or even mount the root fs, it was
JG> also giving errors with something to do with the ATA something or other,
JG> it turned out to be the cable i used after rebuilding the system in the
JG> new case, i used a normal EIDE cable instead of a ATA cable :-/

In our case it's a SATA drive.

Regards,
Tony.

--
Tony Byrne
Re: ATA_DMA errors (and fs corruption!) (JM)
Jayton Garnett wrote:
> I had a similar problem and i changed system cases where i was getting a ICRC error and FreeBSD refused to load or even mount the root fs, it was also giving errors with something to do with the ATA something or other, it turned out to be the cable i used after rebuilding the system in the new case, i used a normal EIDE cable instead of a ATA cable :-/ hope that helps(probably not)

Actually, that makes a lot of sense. My computer running FreeBSD is actually just an Eden 5000 V-series. The cable is trimmed to fit the 2.5" hard drive in the tiny case and I'm sure this has something to do with the timeouts... However, the hard drive is recognized as UDMA33 but the cdrom still times out. Thanks for the input.

Note: this setup works fine in Windows... maybe someone should take a look at this issue? I'm running the old 5.3-RELEASE (too lazy to update) with a VIA VT8231 SouthBridge (82 ata controller).

> Jay
>
> Tony Byrne wrote:
> > Hello Bob,
> >
> > > > can be hardware reasons for timeouts such as a dying disk or cable, but I think we've eliminated these in our case. [etc]
> >
> > BB> Don't ignore the possibility of failing controller hardware. We had
> > BB> comparable mysterious problems on a client system, causing a lot of
> > BB> head-scratching. Eventually the failure went hard and we had to replace the
> > BB> motherboard.
> >
> > I hear ya! However, moving back to an older kernel changes the severity of the problem from a timeout every 2 to three minutes during heavy activity to about 4 or 5 in a 24 hour period. That doesn't sound like hardware to me.
> >
> > Regards,
> > Tony.
>
> i have these same errors on my VIA 823x series chipset. however, the problem is only with the secondary device (acd0 in this case), and might be stemming from some other problem.
ATA_DMA errors (and fs corruption!) (JM)
My laptop works fine with Fedora Core 4. I'm not sure it's a hardware issue, and I don't have an identical laptop to test. Do we know the last working stable version?

> actually, that makes a lot of sense. my computer running FreeBSD is actually just an Eden 5000 V-series. the cable is trimmed to fit the 2.5 hard drive in the tiny case and i'm sure this is having something to do with the timeouts... however the harddrive is recognized as UDMA33 but the cdrom still times out. thanks for the input.
>
> note: this setup works fine in windows... maybe someone should take a look at this issue? i'm running the old 5.3-RELEASE (too lazy to update) with a VIA VT8231 SouthBridge (82 ata controller).
Re: ATA_DMA errors (and fs corruption!) (JM)
twesky wrote:
> My laptop works fine with Fedora Core 4. I'm not sure it's a hardware issue, and I don't have an identical laptop to test. Do we know the last working stable version?

I just compiled the kernel from May 26th. Works fine. It looks like for me it's broken between May 26th and May 30th.

I tried these kernels:

2005-06-16 broken
2005-05-31 broken
2005-05-30 (00:00:00) broken
2005-05-26 (00:00:00) ok
2005-05-22 ok
2005-05-15 ok
2005-05-09 ok

The problem appears under heavy disk load.

Martin
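Martin's list of tested kernels amounts to a manual bisection. A small sketch of that reasoning, using only the dates and results he reports, is shown below. The function name and the "test the midpoint next" suggestion are illustrative and were not proposed in the thread.

```python
from datetime import date

# (kernel source date, worked?) as reported in the message above.
RESULTS = [
    (date(2005, 6, 16), False),
    (date(2005, 5, 31), False),
    (date(2005, 5, 30), False),
    (date(2005, 5, 26), True),
    (date(2005, 5, 22), True),
    (date(2005, 5, 15), True),
    (date(2005, 5, 9), True),
]


def regression_window(results):
    """Return (last good date, first bad date) from dated test results.

    Raises ValueError if a good result is newer than a bad one, because the
    simple "it broke once and stayed broken" model would not apply.
    """
    last_ok = first_bad = None
    for day, ok in sorted(results):
        if ok:
            if first_bad is not None:
                raise ValueError("good kernel found after a broken one")
            last_ok = day
        elif first_bad is None:
            first_bad = day
    return last_ok, first_bad


LAST_OK, FIRST_BAD = regression_window(RESULTS)
# Halving the window picks the next source date worth building and testing.
NEXT_TO_TEST = LAST_OK + (FIRST_BAD - LAST_OK) // 2

print("last good:", LAST_OK)
print("first broken:", FIRST_BAD)
print("next candidate:", NEXT_TO_TEST)
```

With the dates above the window is May 26 to May 30, so one or two further builds would narrow the change down to a single day of commits.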
Re: ATA_DMA errors (and fs corruption!) (JM)
Jayton Garnett [EMAIL PROTECTED] writes:
> I had a similar problem and i changed system cases where i was getting a ICRC error and FreeBSD refused to load or even mount the root fs, it was also giving errors with something to do with the ATA something or other, it turned out to be the cable i used after rebuilding the system in the new case, i used a normal EIDE cable instead of a ATA cable :-/

I've just encountered the same problem on 5.4-STABLE/i386. I rebuilt my kernel with SMP and enabled hyperthreading in loader.conf, because the security weakness doesn't really apply to my desktop machine.

So, the kernel got the DMA error at boot and couldn't mount the root fs. When I switched off HT in the BIOS, the system came up ok.

I cvsupped 5.4-STABLE just a few hours ago.

mkb.
Re: ATA_DMA errors (and fs corruption!) (JM)
I wrote:
> So, kernel got the DMA error at boot and couldn't mount the root fs.

Ah, btw.. it's a SATA disk, on an ICH6 SATA150 controller.

mkb.
Re: ATA_DMA errors (and fs corruption!)
twesky wrote:
> I am having ATA_DMA errors on 5.4R and 5 STABLE up to June 16 (haven't done a cvsup again). It doesn't happen on 5.3R or lower.

I have the same problem. I tried yesterday's kernel and got lots of ATA DMA errors. A question: do you have a VIA IDE controller like mine?

atapci0: VIA 8235 UDMA133 controller port 0xfc00-0xfc0f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 17.1 on pci0

[EMAIL PROTECTED]:17:1: class=0x01018a card=0x05711849 chip=0x05711106 rev=0x06 hdr=0x00
    vendor   = 'VIA Technologies Inc'
    device   = 'VT82 EIDE Controller (All VIA Chipsets)'
    class    = mass storage
    subclass = ATA

Today I noticed that the short experiment with the latest -STABLE destroyed part of my /usr partition. It looked like this (running the May 9th kernel today):

kernel: handle_workitem_freeblocks: block count
kernel: bad block 50333952, ino 1743780
kernel: pid 56 (syncer), uid 0 inumber 1743780 on /usr: bad block
kernel: bad block 3221252091, ino 1743780
klotz kernel: pid 56 (syncer), uid 0 inumber 1743780 on /usr: bad block
kernel: bad block 144119931884736777, ino 1743780
kernel: pid 56 (syncer), uid 0 inumber 1743780 on /usr: bad block
kernel: bad block 72340173158093844, ino 1743780
kernel: pid 56 (syncer), uid 0 inumber 1743780 on /usr: bad block
kernel: bad block 1104111992832, ino 1743780
kernel: pid 56 (syncer), uid 0 inumber 1743780 on /usr: bad block
kernel: handle_workitem_freeblocks: block count
kernel: handle_workitem_freeblocks: block count
kernel: bad block 1865342872522620032, ino 1743783

While shutting down I got this:

Jun 19 22:04:21 klotz kernel: /usr: unmount pending error: blocks -3561100369582 68157 files 0

I restored the fs in single-user mode, and now it runs fine with the May 9th kernel. See also my earlier post.

Martin
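A quick sanity check on the block numbers in the log above, offered only as a sketch: it compares each reported number against the largest sector count a 120 GB disk could have. The disk size comes from an earlier message in this thread; treating it as 120 * 10^9 bytes with 512-byte sectors is an assumption, and the helper name `out_of_range` is made up for this example. Under those assumptions, five of the six numbers are far too large to address any location on the disk, which fits the observation that these are not physically damaged areas.

```python
# Block numbers copied from the "bad block" kernel messages above.
reported_blocks = [
    50333952,
    3221252091,
    144119931884736777,
    72340173158093844,
    1104111992832,
    1865342872522620032,
]

# Assumptions (not stated in the log): the "120GB" drive holds
# 120 * 10**9 bytes and uses 512-byte sectors.
DISK_BYTES = 120 * 10**9
SECTOR_SIZE = 512
max_sectors = DISK_BYTES // SECTOR_SIZE  # loosest possible upper bound

def out_of_range(blocks, limit):
    """Return the block numbers that cannot address anything below limit."""
    return [b for b in blocks if b >= limit]

impossible = out_of_range(reported_blocks, max_sectors)

for block in reported_blocks:
    status = "out of range" if block in impossible else "within bound"
    print(f"{block:>20}  {status}")
```

The bound is deliberately generous: filesystem blocks are at least as large as a sector, so any number above the sector count is out of range whatever unit the kernel message uses.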
Re: ATA_DMA errors (and fs corruption!)
Here is my controller:

atapci0: Intel ICH4 UDMA100 controller port 0x1860-0x186f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 31.1 on pci0
ata0: channel #0 on atapci0
ata1: channel #1 on atapci0

The last known good -STABLE version for me was from approximately April 25, and my next cvsup was on May 17. However, I also have problems with 5.4 Release, so I assume (probably incorrectly) that something changed between April 25 and 5.4R.

I don't exactly recall my shutdown errors, but I did have to restore my file systems to get my laptop back to a functioning state.
ATA_DMA errors
I am having ATA_DMA errors on 5.4R and 5 STABLE up to June 16 (haven't done a cvsup again). It doesn't happen on 5.3R or lower. The exact error message is below. It happens within a few hours of use. The laptop will then reboot, and fsck must be run. After fsck, the timeouts happen within a few seconds of booting. Is this a known issue?

ERROR MSG
---
ad0: timeout - READ_DMA retrying (2 retries left) LBA=24531835
ad0: warning - removed from configuration
ata0-master: failure - READ_DMA timed out
---

The laptop is a SONY VAIO PCG-Z1WA.

dmesg info

ad0: 57231MB TOSHIBA MK6021GAS/GA024A [116280/16/63] at ata0-master UDMA100

fdisk info

# fdisk
*** Working on device /dev/ad0 ***
parameters extracted from in-core disklabel are:
cylinders=116280 heads=16 sectors/track=63 (1008 blks/cyl)

Figures below won't work with BIOS for partitions not in cyl 1
parameters to be used for BIOS calculations are:
cylinders=116280 heads=16 sectors/track=63 (1008 blks/cyl)

Media sector size is 512
Warning: BIOS sector numbering starts with sector 1
Information from DOS bootblock is:
The data for partition 1 is:
sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
    start 63, size 117210177 (57231 Meg), flag 80 (active)
        beg: cyl 0/ head 1/ sector 1;
        end: cyl 1023/ head 254/ sector 63
The data for partition 2 is:
UNUSED
The data for partition 3 is:
UNUSED
The data for partition 4 is:
UNUSED
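The figures in the dmesg and fdisk output above can be cross-checked with a little arithmetic. The sketch below is only an illustration: every number is copied from the message, and the variable names are made up for this example. It recomputes the blocks per cylinder, the total sector count, and the size in megabytes, and checks that the partition ends exactly where the disk geometry ends and that the LBA from the timeout message lies inside the disk.

```python
# Values copied from the dmesg and fdisk output above.
CYLINDERS = 116280
HEADS = 16
SECTORS_PER_TRACK = 63
SECTOR_SIZE = 512          # "Media sector size is 512"
PARTITION_START = 63
PARTITION_SIZE = 117210177
FAILING_LBA = 24531835     # from the READ_DMA timeout message

MIB = 1024 * 1024

blocks_per_cylinder = HEADS * SECTORS_PER_TRACK
total_sectors = CYLINDERS * blocks_per_cylinder
disk_mb = total_sectors * SECTOR_SIZE // MIB
partition_mb = PARTITION_SIZE * SECTOR_SIZE // MIB
partition_end = PARTITION_START + PARTITION_SIZE

print("blocks per cylinder:", blocks_per_cylinder)
print("total sectors:      ", total_sectors)
print("disk size (MB):     ", disk_mb)
print("partition size (MB):", partition_mb)
print("partition fills the disk:", partition_end == total_sectors)
print("failing LBA is on the disk:", 0 <= FAILING_LBA < total_sectors)
```

Since the geometry, the partition table, and the failing LBA are all consistent with one another, nothing in this output by itself points to a mis-sized or mis-partitioned disk.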