Re: [BUG 2.6.21-rc3-git9] SATA NCQ failure with Samsung HD401LJ
Christian wrote:
> On Sunday 18 March 2007 06:43:09 you wrote:
>> This does indeed look like a drive-side issue to me (the controller is
>> reporting CPBs with response flags 2, which as far as I can tell
>> indicates it is still waiting for the drive to complete the request).
> I have been using this hardware configuration (SATA II, NCQ) since the
> nvidia ADMA support made it into the -mm kernel (maybe around
> 2.6.19-mm, or even earlier). I have been seeing this problem constantly
> since I upgraded to 2.6.21-rc3-mm1. I think something got broken
> recently...
> Can you post the result of "hdparm -I /dev/sdX"?

Output generated on 2.6.21-rc3-mm1 #3 SMP PREEMPT

[EMAIL PROTECTED]:~$ sudo hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
	Model Number:       SAMSUNG HD401LJ
	Serial Number:      S0HVJ1FL900207
	Firmware Revision:  ZZ100-15
Standards:
	Used: ATA/ATAPI-7 T13 1532D revision 4a
	Supported: 7 6 5 4
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  781422768
	device size with M = 1024*1024:      381554 MBytes
	device size with M = 1000*1000:      400088 MBytes (400 GB)
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	Recommended acoustic management value: 254, current value: 0
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 udma7
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	    	SET_MAX security extension
	    	Automatic Acoustic Management feature set
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	SATA-I signaling speed (1.5Gb/s)
	   *	SATA-II signaling speed (3.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Host-initiated interface power management
	   *	Phy event counters
	    	DMA Setup Auto-Activate optimization
	    	Device-initiated interface power management
	   *	Software settings preservation
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Long Sector Access (AC1)
	   *	SCT LBA Segment Access (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
Security:
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
		frozen
	not	expired: security count
		supported: enhanced erase
	228min for SECURITY ERASE UNIT. 228min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct

[EMAIL PROTECTED]:~$ sudo hdparm -I /dev/sdb

/dev/sdb:

ATA device, with non-removable media
	Model Number:       SAMSUNG SP2504C
	Serial Number:      S09QJ1LYC06381
	Firmware Revision:  VT100-33
Standards:
	Used: ATA/ATAPI-7 T13 1532D revision 4a
	Supported: 7 6 5 4
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  488397168
	device size with M = 1024*1024:      238475 MBytes
	device size with M = 1000*1000:      250059 MBytes (250 GB)
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	Recommended acoustic management value: 254, current value: 254
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 udma7
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
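Whether a drive advertises NCQ, and its queue depth, can be read off this output mechanically. A sketch, assuming only POSIX awk and the "hdparm -I" layout shown above (the helper name is mine):

```shell
# ncq_supported: read `hdparm -I` output on stdin and report whether the
# drive advertises Native Command Queueing, plus its queue depth.
ncq_supported() {
    awk '/Queue depth:/            { depth = $NF }
         /Native Command Queueing/ { ncq = 1 }
         END {
             if (ncq) printf "NCQ supported, queue depth %s\n", depth
             else     print  "NCQ not reported"
         }'
}
# Typical use (needs root):  hdparm -I /dev/sda | ncq_supported
```

Both Samsung drives above report a queue depth of 32 and the NCQ feature bit, so the kernel will try to queue commands unless the model is blacklisted.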
Re: SATA problems
Tejun Heo wrote:
> Pablo Sebastian Greco wrote:
>> Tejun Heo wrote:
>>> * Pablo, the bug you saw was a bad interaction between a blacklisted
>>> NCQ device and dynamic queue depth adjustment. Patches have been
>>> submitted to fix the problem. Just drop the blacklist patch; your
>>> drives should work fine in NCQ mode. My gut feeling is that your
>>> problem was power related from the beginning.
>> I had the same problems with a new power supply. Now everything is OK
>> with the old power supply and the new drives.
> So, was it bad drives? Are you using the same model or different ones?
> NCQ works okay now?

All I can say is that it is working now. Other things changed with the
new drives: 1.5Gbps instead of 3Gbps, and the new drives don't use NCQ
(I'm reattaching a full dmesg). I've also found this firmware upgrade
(http://www.samsung.com/Products/HardDiskDrive/support/faqs/faqs_20060414_246673.htm)
for the old drives, but I couldn't confirm whether it should be applied,
because the server is in Brazil and I live in Argentina. I won't be
there until April to test.

Thanks.
Pablo.
Linux version 2.6.19-1.2895.fc6 ([EMAIL PROTECTED]) (gcc version 4.1.1 20070105 (Red Hat 4.1.1-51)) #1 SMP Wed Jan 10 18:50:56 EST 2007
Command line: ro root=LABEL=/
BIOS-provided physical RAM map:
 BIOS-e820: - 0009ec00 (usable)
 BIOS-e820: 0009ec00 - 0010 (reserved)
 BIOS-e820: 0010 - df938000 (usable)
 BIOS-e820: df938000 - df9d2000 (ACPI NVS)
 BIOS-e820: df9d2000 - dfa42000 (usable)
 BIOS-e820: dfa42000 - dfa9a000 (reserved)
 BIOS-e820: dfa9a000 - dfab8000 (usable)
 BIOS-e820: dfab8000 - dfb1a000 (ACPI NVS)
 BIOS-e820: dfb1a000 - dfb2c000 (usable)
 BIOS-e820: dfb2c000 - dfb3a000 (ACPI data)
 BIOS-e820: dfb3a000 - dfc0 (usable)
 BIOS-e820: ffc0 - ffc0c000 (reserved)
 BIOS-e820: 0001 - 00012000 (usable)
Entering add_active_range(0, 0, 158) 0 entries of 3200 used
Entering add_active_range(0, 256, 915768) 1 entries of 3200 used
Entering add_active_range(0, 915922, 916034) 2 entries of 3200 used
Entering add_active_range(0, 916122, 916152) 3 entries of 3200 used
Entering add_active_range(0, 916250, 916268) 4 entries of 3200 used
Entering add_active_range(0, 916282, 916480) 5 entries of 3200 used
Entering add_active_range(0, 1048576, 1179648) 6 entries of 3200 used
end_pfn_map = 1179648
DMI 2.4 present.
ACPI: RSDP (v002 INTEL ) @ 0x000f0350
ACPI: XSDT (v001 INTEL S5000VSA 0x INTL 0x0113) @ 0xdfb39120
ACPI: FADT (v003 INTEL S5000VSA 0x INTL 0x0113) @ 0xdfb36000
ACPI: MADT (v001 INTEL S5000VSA 0x INTL 0x0113) @ 0xdfb35000
ACPI: SPCR (v001 INTEL S5000VSA 0x INTL 0x0113) @ 0xdfb2f000
ACPI: HPET (v001 INTEL S5000VSA 0x0001 INTL 0x0113) @ 0xdfb2e000
ACPI: MCFG (v001 INTEL S5000VSA 0x0001 INTL 0x0113) @ 0xdfb2d000
ACPI: SSDT (v002 INTEL S5000VSA 0x4000 INTL 0x0113) @ 0xdfb2c000
ACPI: DSDT (v002 INTEL S5000VSA 0x0008 INTL 0x0113) @ 0x
No NUMA configuration found
Faking a node at -00012000
Entering add_active_range(0, 0, 158) 0 entries of 3200 used
Entering add_active_range(0, 256, 915768) 1 entries of 3200 used
Entering add_active_range(0, 915922, 916034) 2 entries of 3200 used
Entering add_active_range(0, 916122, 916152) 3 entries of 3200 used
Entering add_active_range(0, 916250, 916268) 4 entries of 3200 used
Entering add_active_range(0, 916282, 916480) 5 entries of 3200 used
Entering add_active_range(0, 1048576, 1179648) 6 entries of 3200 used
Bootmem setup node 0 -00012000
Zone PFN ranges:
  DMA             0 ->     4096
  DMA32        4096 ->  1048576
  Normal    1048576 ->  1179648
early_node_map[7] active PFN ranges
    0:        0 ->      158
    0:      256 ->   915768
    0:   915922 ->   916034
    0:   916122 ->   916152
    0:   916250 ->   916268
    0:   916282 ->   916480
    0:  1048576 ->  1179648
On node 0 totalpages: 1047100
  DMA zone: 64 pages used for memmap
  DMA zone: 1450 pages reserved
  DMA zone: 2484 pages, LIFO batch:0
  DMA32 zone: 16320 pages used for memmap
  DMA32 zone: 895710 pages, LIFO batch:31
  Normal zone: 2048 pages used for memmap
  Normal zone: 129024 pages, LIFO batch:31
ACPI: PM-Timer IO Port: 0x408
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x02] enabled)
Processor #2
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
Processor #3
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x84] disabled)
ACPI: LA
Re: SATA problems
Tejun Heo wrote:
> * Pablo, the bug you saw was a bad interaction between a blacklisted
> NCQ device and dynamic queue depth adjustment. Patches are submitted
> to fix the problem. Just drop the blacklist patch. Your drives should
> work fine in NCQ mode. My gut feeling is that your problem is power
> related from the beginning.
> * Marcus, you're on VIA's AHCI controller, right? The problem you saw
> was a bad interaction between a blacklisted NCQ _controller_ and
> dynamic queue depth adjustment. Patches submitted. Thanks.

I had the same problems with a new power supply. Now everything is OK
with the old power supply and the new drives.

Pablo.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: SATA problems
Marcus Haebler wrote:
> I opened a bug report (228979) on bugzilla.redhat.com on this one
> because I have the same issue under FC6 2.6.19-1.2895. Here is the
> link: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=228979
> Do you have any more updates on this problem? Is there a way I can
> help by providing debug data?
> Thanks, Marcus
>
> On 1/23/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
>> Pablo Sebastian Greco wrote:
>>> Well, it took me a few days, but I think I'm ready to report back.
>>> One of the drives was failing, and it stopped after rewiring the
>>> power supply, so the last problem seems to be corrected.
>>> OTOH, your blacklist seems to be needed too. Now I'm running FC6
>>> distribution kernel 2.6.19-1.2895.fc6 (2.6.19.2 + some patches by
>>> Fedora) and setting
>>>   echo 1 > /sys/block/sdX/device/queue_depth
>>> on all the SAMSUNG drives (sdb, sdc and sdd). The second I type
>>>   echo 31 > /sys/block/sdX/device/queue_depth
>>> on any of the drives I get these messages:
>>>
>>> Jan 23 12:36:30 squid kernel: BUG: warning: (ap->ops->error_handler &&
>>> ata_tag_valid(ap->active_tag)) at
>>> drivers/ata/libata-core.c:4602/ata_qc_issue() (Not tainted)
>> This is a kernel bug that needs fixing. I'll investigate.
>> --
>> tejun

On my side, all the problems disappeared on all the kernels after
changing all 3 drives to non-NCQ drives; I was going crazy. New dmesg
attached.

Pablo.
Re: cpu load balancing problem on smp
Arjan van de Ven wrote:
> Pablo Sebastian Greco wrote:
>> 2296:427426436 134563009 PCI-MSI-edge eth1
>> 2297:252252 135926471257 PCI-MSI-edge eth0
> this suggests that cores would be busy rather than only one

Yes, but you are looking at mm kernel statistics; if you look at the
standard kernel, you'll see that the eth interrupts are on the same
core. The /proc/cpuinfo is attached below; OTOH, take a look at the
timer interrupt distribution.

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 15
model		: 6
model name	: Intel(R) Xeon(TM) CPU 2.66GHz
stepping	: 4
cpu MHz		: 2656.000
cache size	: 2048 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 6
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est cid cx16 xtpr lahf_lm
bogomips	: 5324.82
clflush size	: 64
cache_alignment	: 128
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 15
model		: 6
model name	: Intel(R) Xeon(TM) CPU 2.66GHz
stepping	: 4
cpu MHz		: 2656.000
cache size	: 2048 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 6
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est cid cx16 xtpr lahf_lm
bogomips	: 5320.06
clflush size	: 64
cache_alignment	: 128
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 15
model		: 6
model name	: Intel(R) Xeon(TM) CPU 2.66GHz
stepping	: 4
cpu MHz		: 2656.000
cache size	: 2048 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 6
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est cid cx16 xtpr lahf_lm
bogomips	: 5320.20
clflush size	: 64
cache_alignment	: 128
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 15
model		: 6
model name	: Intel(R) Xeon(TM) CPU 2.66GHz
stepping	: 4
cpu MHz		: 2656.000
cache size	: 2048 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 6
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est cid cx16 xtpr lahf_lm
bogomips	: 5320.16
clflush size	: 64
cache_alignment	: 128
address sizes	: 36 bits physical, 48 bits virtual
power management:
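When the distribution stays this lopsided, an interrupt can also be pinned by hand through /proc/irq/<n>/smp_affinity, the same knob irqbalance drives. A sketch (the function name and the PROC override, which lets it be dry-run against a mock tree instead of the live /proc, are mine):

```shell
# set_irq_affinity: write a hex CPU mask to an IRQ's smp_affinity file.
# On a real kernel the file echoes back a zero-padded mask; a mock tree
# just returns whatever was written.
set_irq_affinity() {  # set_irq_affinity <irq> <hexmask>
    f=${PROC:-/proc}/irq/$1/smp_affinity
    if [ -w "$f" ]; then
        echo "$2" > "$f" && echo "irq $1 -> mask $(cat "$f")"
    else
        echo "$f not writable" >&2
        return 1
    fi
}
# e.g. (as root):  set_irq_affinity 2297 2    # steer eth0's IRQ to CPU1
```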
Re: cpu load balancing problem on smp
Arjan van de Ven wrote:
> Marc Donner wrote:
>>> see http://www.irqbalance.org to get irqbalance
>> I have now tried irqbalance, but the problem is the same.
> can you send me the output of
>   cat /proc/interrupts
> (taken when you are or have been loading the network)? maybe there's
> something fishy going on

Please take a look at this, taken from the same machine running
different vanilla kernels on FC6. The current 2.6.19 Fedora kernel
looks like the 2.6.20-rc3 (non-mm) output in the attachment.

2.6.20-rc3
[EMAIL PROTECTED] ~]# rpm -q irqbalance
irqbalance-0.55-2.fc6
[EMAIL PROTECTED] ~]# uptime
 11:51:50 up 6 days, 30 min, 3 users, load average: 5.31, 5.08, 4.02
[EMAIL PROTECTED] ~]# service irqbalance status
irqbalance (pid 2310) is running...
[EMAIL PROTECTED] ~]# cat /proc/interrupts
            CPU0        CPU1        CPU2        CPU3
   0:  520209517           0           0           0   IO-APIC-edge      timer
   1:         12           0           0           0   IO-APIC-edge      i8042
   8:          1           0           0           0   IO-APIC-edge      rtc
   9:          0           0           0           0   IO-APIC-fasteoi   acpi
  12:        103           0           0           0   IO-APIC-edge      i8042
  14:          0           0           0           0   IO-APIC-edge      libata
  15:          0           0           0           0   IO-APIC-edge      libata
  20:     138736   188194096           0     6797630   IO-APIC-fasteoi   libata
  22:          0           0           0           0   IO-APIC-fasteoi   uhci_hcd:usb2, uhci_hcd:usb4
  23:          0           0           0           0   IO-APIC-fasteoi   uhci_hcd:usb1, uhci_hcd:usb3, ehci_hcd:usb5
2296:       1367           0           0   849270653   PCI-MSI-edge      eth1
2297:       1022   835083968           0           0   PCI-MSI-edge      eth0
 NMI:      47756      146249       47617      146186
 LOC:  516828752   517331906   516828611   517331771
 ERR:          0

2.6.20-rc3-mm1
[EMAIL PROTECTED] kernel]# uptime
 12:17:54 up 1 day, 21:58, 2 users, load average: 9.47, 9.79, 10.28
[EMAIL PROTECTED] kernel]# cat /proc/interrupts
            CPU0        CPU1        CPU2        CPU3
   0:   60031592    61350247    22273772    21780215   IO-APIC-edge      timer
   1:          0           6           1           1   IO-APIC-edge      i8042
   8:          0           0           1           0   IO-APIC-edge      rtc
   9:          0           0           0           0   IO-APIC-fasteoi   acpi
  12:  148283104136   IO-APIC-edge      i8042
  14:          0           0           0           0   IO-APIC-edge      libata
  15:          0           0           0           0   IO-APIC-edge      libata
  20:  104827951477821       93306      641628   IO-APIC-fasteoi   libata
  22:          0           0           0           0   IO-APIC-fasteoi   uhci_hcd:usb2, uhci_hcd:usb4
  23:          0           0           0           0   IO-APIC-fasteoi   uhci_hcd:usb1, uhci_hcd:usb3, ehci_hcd:usb5
2296:  427426436   134563009   PCI-MSI-edge      eth1
2297:     252252   135926471257   PCI-MSI-edge      eth0
 NMI:          0           0           0           0
 LOC:  164661140   165163503   164660992   165163305
 ERR:          0
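The imbalance on one interrupt line is easier to see as per-CPU percentages. A sketch, assuming the usual "IRQ: count count ... name" row layout of /proc/interrupts (the helper name is mine):

```shell
# irq_spread: read /proc/interrupts-style text on stdin and print what
# share of the given IRQ's events each CPU handled.
irq_spread() {
    awk -v irq="$1:" '$1 == irq {
        total = 0
        # Sum the leading numeric columns (one per CPU); stop at the name.
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
            printf "CPU%d %.1f%%\n", i - 2, total ? 100 * $i / total : 0
    }'
}
# Typical use:  irq_spread 2297 < /proc/interrupts
```

Run against the 2.6.20-rc3 table above, IRQ 2297 (eth0) lands almost entirely on one CPU, which is exactly the lopsidedness under discussion.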
Re: SATA problems
Tejun Heo wrote:
> Hello, Pablo.
> Please apply common hardware debugging methods. You know, swap drives,
> use a separate power supply for the disks, swap cables, etc. It seems
> more like a hardware problem at this point. Thanks.

Well, it took me a few days, but I think I'm ready to report back. One
of the drives was failing, and it stopped after rewiring the power
supply, so the last problem seems to be corrected.

OTOH, your blacklist seems to be needed too. Now I'm running FC6
distribution kernel 2.6.19-1.2895.fc6 (2.6.19.2 + some patches by
Fedora) and setting

  echo 1 > /sys/block/sdX/device/queue_depth

on all the SAMSUNG drives (sdb, sdc and sdd). The second I type

  echo 31 > /sys/block/sdX/device/queue_depth

on any of the drives I get these messages:

Jan 23 12:36:30 squid kernel: BUG: warning: (ap->ops->error_handler && ata_tag_valid(ap->active_tag)) at drivers/ata/libata-core.c:4602/ata_qc_issue() (Not tainted)
Jan 23 12:36:30 squid kernel:
Jan 23 12:36:30 squid kernel: Call Trace:
Jan 23 12:36:30 squid kernel:  [] show_trace+0x34/0x47
Jan 23 12:36:30 squid kernel:  [] dump_stack+0x12/0x17
Jan 23 12:36:30 squid kernel:  [] :libata:ata_qc_issue+0x61/0x551
Jan 23 12:36:30 squid kernel:  [] :libata:ata_scsi_translate+0xd1/0x11a
Jan 23 12:36:30 squid kernel:  [] :libata:ata_scsi_queuecmd+0x103/0x122
Jan 23 12:36:30 squid kernel:  [] :scsi_mod:scsi_dispatch_cmd+0x27c/0x30d
Jan 23 12:36:30 squid kernel:  [] :scsi_mod:scsi_request_fn+0x2ca/0x395
Jan 23 12:36:30 squid kernel:  [] elv_insert+0x15a/0x226
Jan 23 12:36:30 squid kernel:  [] __make_request+0x439/0x487
Jan 23 12:36:30 squid kernel:  [] generic_make_request+0x207/0x21e
Jan 23 12:36:30 squid kernel:  [] submit_bio+0xee/0xf7
Jan 23 12:36:30 squid kernel:  [] submit_bh+0x130/0x150
Jan 23 12:36:30 squid kernel:  [] ll_rw_block+0x9d/0xc0
Jan 23 12:36:30 squid kernel:  [] :reiserfs:search_by_key+0x13d/0xce7
Jan 23 12:36:30 squid kernel:  [] :reiserfs:search_for_position_by_key+0x34/0x2ad
Jan 23 12:36:30 squid kernel:  [] :reiserfs:_get_block_create_0+0x86/0x544
Jan 23 12:36:30 squid kernel:  [] :reiserfs:reiserfs_get_block+0xcd/0xfdd
Jan 23 12:36:30 squid kernel:  [] do_mpage_readpage+0x16d/0x4b0
Jan 23 12:36:30 squid kernel:  [] mpage_readpages+0xb3/0x146
Jan 23 12:36:30 squid kernel:  [] __do_page_cache_readahead+0x119/0x209
Jan 23 12:36:30 squid kernel:  [] blockable_page_cache_readahead+0x56/0xb5
Jan 23 12:36:30 squid kernel:  [] page_cache_readahead+0xd6/0x1af
Jan 23 12:36:30 squid kernel:  [] do_generic_mapping_read+0x129/0x40b
Jan 23 12:36:30 squid kernel:  [] generic_file_aio_read+0x15f/0x1b1
Jan 23 12:36:30 squid kernel:  [] do_sync_read+0xc9/0x10c
Jan 23 12:36:30 squid kernel:  [] vfs_read+0xcb/0x170
Jan 23 12:36:30 squid kernel:  [] sys_read+0x45/0x6e
Jan 23 12:36:30 squid kernel:  [] system_call+0x7e/0x83
Jan 23 12:36:30 squid kernel:  [<00359ccbfb80>]
Jan 23 12:36:30 squid kernel:

Thanks for everything.
Pablo.
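The queue_depth toggling described above is easy to script across several drives. A minimal sketch (the function name and the SYSFS override, which lets it be dry-run against a mock tree instead of the live /sys, are mine):

```shell
# set_queue_depth: write a queue depth to each named drive's sysfs knob.
# Depth 1 effectively disables NCQ; up to 31 tags re-enables queueing.
set_queue_depth() {  # set_queue_depth <depth> <drive>...
    depth=$1; shift
    for d in "$@"; do
        qd=${SYSFS:-/sys}/block/$d/device/queue_depth
        if [ -w "$qd" ]; then
            echo "$depth" > "$qd"
            echo "$d: queue_depth now $(cat "$qd")"
        else
            echo "$d: $qd not writable, skipping"
        fi
    done
}
# e.g. (as root):  set_queue_depth 1 sdb sdc sdd
```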
Re: SATA problems
Pablo Sebastian Greco wrote:
> Tejun Heo wrote:
>> Pablo Sebastian Greco wrote:
>>> After an uptime of 13:34 under heavy load and no errors, I'm pretty
>>> sure your patch is correct. Is there a way to backport this to
>>> 2.6.18.x?
>> I forgot this (even though I implemented it), but you can turn off
>> NCQ by doing the following.
>>   # echo 1 > /sys/block/sdX/device/queue_depth
>> Can you put the Seagate drive under load to verify that it's the
>> Samsung drive's problem, not the controller's?
>>> Just an off-topic question: does anyone know why I get so uneven IRQ
>>> handling on 2.6.19-20 and almost perfect on 2.6.20-rc2-mm1?
>> I dunno. You have a much better chance of getting a useful answer by
>> asking on a separate thread with a proper subject line. People
>> usually screen threads by subject; there are just too many messages
>> on LKML for anyone to follow them all. Thanks.
> Guess I spoke too soon :( Today I found this:
>
> Jan  8 04:01:40 squid kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> Jan  8 04:01:40 squid kernel: ata2.00: cmd 25/00:08:49:ee:e8/00:00:16:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Jan  8 04:01:40 squid kernel:          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> Jan  8 04:01:40 squid kernel: ata2: soft resetting port
> Jan  8 04:01:40 squid kernel: ata2: softreset failed (port busy but CLO unavailable)
> Jan  8 04:01:40 squid kernel: ata2: softreset failed, retrying in 5 secs
> Jan  8 04:01:45 squid kernel: ata2: hard resetting port
> Jan  8 04:01:53 squid kernel: ata2: port is slow to respond, please be patient (Status 0x80)
> Jan  8 04:02:16 squid kernel: ata2: port failed to respond (30 secs, Status 0x80)
> Jan  8 04:02:16 squid kernel: ata2: COMRESET failed (device not ready)
> Jan  8 04:02:16 squid kernel: ata2: hardreset failed, retrying in 5 secs
> Jan  8 04:02:21 squid kernel: ata2: hard resetting port
> Jan  8 04:02:21 squid kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> Jan  8 04:02:21 squid kernel: ata2.00: configured for UDMA/133
> Jan  8 04:02:21 squid kernel: ata2: EH complete
> Jan  8 04:02:21 squid kernel: SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
> Jan  8 04:02:21 squid kernel: sdb: Write Protect is off
> Jan  8 04:02:21 squid kernel: SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
> #uptime
>  10:10:12 up 3 days, 22:48, 1 user, load average: 0.22, 0.19, 0.18
>
> 4 am is the lowest load ever, so I don't get it. I've found two
> differences with the older errors: SAct is now 0x0 when before it was
> 0x7fff, and the cmd/res used to be really long, while now it's just
> one command. About heavy-loading the Seagate: I've tested, as
> suggested on another thread, dd if= of=/dev/null for all 4 drives
> simultaneously, on top of the usual load, and all was perfect with the
> current kernel (2.6.20-rc3 + blacklist). I don't know what else to do
> to help. Thanks. Pablo.

And now this :( , still running rc3+blacklist without rebooting:

Jan  9 05:30:36 squid kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan  9 05:30:36 squid kernel: ata2.00: cmd c8/00:08:87:83:85/00:00:00:00:00/e2 tag 0 cdb 0x0 data 4096 in
Jan  9 05:30:36 squid kernel:          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  9 05:30:36 squid kernel: ata2: soft resetting port
Jan  9 05:30:36 squid kernel: ata2: softreset failed (port busy but CLO unavailable)
Jan  9 05:30:36 squid kernel: ata2: softreset failed, retrying in 5 secs
Jan  9 05:30:41 squid kernel: ata2: hard resetting port
Jan  9 05:30:49 squid kernel: ata2: port is slow to respond, please be patient (Status 0x80)
Jan  9 05:31:12 squid kernel: ata2: port failed to respond (30 secs, Status 0x80)
Jan  9 05:31:12 squid kernel: ata2: COMRESET failed (device not ready)
Jan  9 05:31:12 squid kernel: ata2: hardreset failed, retrying in 5 secs
Jan  9 05:31:17 squid kernel: ata2: hard resetting port
Jan  9 05:31:17 squid kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  9 05:31:17 squid kernel: ata2.00: configured for UDMA/133
Jan  9 05:31:17 squid kernel: ata2: EH complete
Jan  9 05:31:17 squid kernel: SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
Jan  9 05:31:17 squid kernel: sdb: Write Protect is off
Jan  9 05:31:17 squid kernel: SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan  9 05:32:17 squid kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan  9 05:32:17 squid kernel: ata2.00: cmd c8/00:08:37:ac:04/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
Jan  9 05:32:17 squid kernel:          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan
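The four-way concurrent dd read test mentioned above can be wrapped in a few lines. A sketch (the function name is mine; it only reads from the arguments, but it will saturate the disks, so run it when the box can take the I/O load):

```shell
# load_test: sequentially read every argument to /dev/null in parallel,
# mimicking the concurrent "dd if=<dev> of=/dev/null" stress test from
# the thread.
load_test() {
    for dev in "$@"; do
        dd if="$dev" of=/dev/null bs=1M 2>/dev/null &
    done
    wait  # block until every background reader finishes
}
# e.g.:  load_test /dev/sda /dev/sdb /dev/sdc /dev/sdd
```

Running it on top of normal production load, while watching the syslog for ata exceptions, reproduces the conditions under which the timeouts were reported.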
Re: SATA problems
Pablo Sebastian Greco wrote: Tejun Heo wrote: Pablo Sebastian Greco wrote: By crash I mean the whole system going down, having to reset the entire machine. I'm sending you 4 files: dmesg: current boot dmesg, just a boot, because no errors appeared after last crash, since the server is out of production right now (errors usually appear under heavy load, and this primarily a transparent proxy for about 1000 simultaneous users) lspci: the way you asked for it messages and messages.1: files where you can see old boots and crashes (even a soft lockup). If there is anything else I can do, let me know. If you need direct access to the server, I can arrange that too. Can you try 2.6.20-rc3 and see if 'CLO not available' message goes away (please post boot dmesg)? The crash/lock is because filesystem code does not cope with IO errors very well. I can't tell why timeouts are occurring in the first place. It seems that only samsung drives are affected (sda2, 3, 4). Hmmm... Please apply the attached patch to 2.6.20-rc3 and test it. Thanks. Here's boot dmesg with 2.6.20-rc3 + blacklist. And you are right about only affecting samsung drives, but since only those drives get all the heavy load, couldn't tell exactly. I'm putting the server in production right now, so I think in a few hours I'll have more info. Thanks. Pablo. After an uptime of 13:34 under heavy load and no errors, I'm pretty sure your patch is correct. Is there a way to backport this to 2.6.18.x? Just an off topic question, does anyone know why I get so uneven IRQ handling on 2.6.19-20 and almost perfect on 2.6.20-rc2-mm1? Thanks for everything. Pablo. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
SATA problems
First of all, thanks for everything, and my excuses if I'm doing
anything wrong; this is my first LKML mail, but I've read the whole
FAQ, so it should be OK.

This is the machine with the problem:
Intel ServerBoard S5000VSA
Dual Core Xeon 2.66 (Intel(R) Xeon(TM) CPU 2.66GHz stepping 04)
4G Kingston
1 Seagate 80G SATA (ST380211AS) (sda)
3 Samsung 250G SATA (SAMSUNG SP2504C) (sdb,c,d)
The installed distribution is FC6 x86_64.

I've been getting these messages with both distribution and vanilla
kernels:

Jan  1 16:29:08 squid kernel: ata4.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 0x2 frozen
Jan  1 16:29:08 squid kernel: ata4.00: cmd 61/60:00:c9:6d:8e/00:00:0e:00:00/40 tag 0 cdb 0x0 data 49152 out
Jan  1 16:29:08 squid kernel:          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  1 16:29:08 squid kernel: ata4.00: cmd 60/08:08:f7:7d:56/00:00:0e:00:00/40 tag 1 cdb 0x0 data 4096 in
Jan  1 16:29:08 squid kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  1 16:29:08 squid kernel: ata4: soft resetting port
Jan  1 16:29:08 squid kernel: ata4: softreset failed (port busy but CLO unavailable)
Jan  1 16:29:08 squid kernel: ata4: softreset failed, retrying in 5 secs
Jan  1 16:29:13 squid kernel: ata4: hard resetting port
Jan  1 16:29:21 squid kernel: ata4: port is slow to respond, please be patient (Status 0x80)
Jan  1 16:29:43 squid kernel: ata4: port failed to respond (30 secs, Status 0x80)
Jan  1 16:29:43 squid kernel: ata4: COMRESET failed (device not ready)
Jan  1 16:29:43 squid kernel: ata4: hardreset failed, retrying in 5 secs
Jan  1 16:29:48 squid kernel: ata4: hard resetting port
Jan  1 16:29:49 squid kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  1 16:29:49 squid kernel: ata4.00: configured for UDMA/133
Jan  1 16:29:49 squid kernel: ata4: EH complete
Jan  1 16:29:49 squid kernel: SCSI device sdd: 488397168 512-byte hdwr sectors (250059 MB)
Jan  1 16:29:49 squid kernel: sdd: Write Protect is off
Jan  1 16:29:49 squid kernel: SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

There are lots of them, and they eventually crash the system. Tested
from the FC6 2.6.18 kernel to vanilla 2.6.20-rc2-mm1. Old kernels just
crash; newer ones log these things and then crash. I don't want to
flood this mail with useless info, so please tell me what to send and
I'll do it (dmesg, smartctl... you name it).

BTW, memtest was running for about 2 days without errors, and badblocks
on all 4 drives returned nothing. Reallocated_Sector_Ct raw_value was 0
on all 4 drives.

Thanks in advance.
Pablo.
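The leading byte of the "cmd" field in these error lines is the ATA opcode, which tells you whether a queued (NCQ) or plain command timed out. A small lookup sketch (the function name is mine; the table covers only the opcodes that actually appear in the logs in this thread, per the ATA command set):

```shell
# ata_opname: map the first byte of a libata "cmd xx/..." line to a name.
# 0x60/0x61 are the NCQ read/write commands; 0x25/0xc8 are non-NCQ DMA reads.
ata_opname() {
    case "$1" in
        60) echo "READ FPDMA QUEUED (NCQ)" ;;
        61) echo "WRITE FPDMA QUEUED (NCQ)" ;;
        25) echo "READ DMA EXT" ;;
        c8) echo "READ DMA" ;;
        *)  echo "opcode $1 (not in this table)" ;;
    esac
}
# e.g.:  ata_opname 61   (decodes the "cmd 61/60:..." line above)
```

This is why the later reports with "SAct 0x0" and cmd bytes 25/c8 matter: those timeouts happened on non-queued commands, so they cannot be blamed on NCQ alone.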