Re: [BUG 2.6.21-rc3-git9] SATA NCQ failure with Samsung HD401LJ

2007-03-19 Thread Pablo Sebastian Greco

Christian wrote:

On Sunday 18 March 2007 06:43:09 you wrote:
  

Christian wrote:


This does indeed look like a drive side issue to me (the controller is
reporting CPBs with response flags 2 which as far as I can tell
indicates it's still waiting for the drive to complete the request).


I have been using this hw-config (SATA II, NCQ) since the nvidia ADMA
support made it into the -mm kernel (maybe around 2.6.19-mm, or even
earlier). I have been seeing this problem far more often since I upgraded to
2.6.21-rc3-mm1. I think something got broken recently...
  

Can you post the result of "hdparm -I /dev/sdX"?



Output generated on 2.6.21-rc3-mm1 #3 SMP PREEMPT

[EMAIL PROTECTED]:~$ sudo hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
Model Number:   SAMSUNG HD401LJ
Serial Number:  S0HVJ1FL900207
Firmware Revision:  ZZ100-15
Standards:
Used: ATA/ATAPI-7 T13 1532D revision 4a
Supported: 7 6 5 4
Configuration:
Logical max current
cylinders   16383   16383
heads   16  16
sectors/track   63  63
--
CHS current addressable sectors:   16514064
LBA    user addressable sectors:  268435455
LBA48  user addressable sectors:  781422768
device size with M = 1024*1024:  381554 MBytes
device size with M = 1000*1000:  400088 MBytes (400 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16  Current = 16
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 udma7

 Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
 Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
Enabled Supported:
   *SMART feature set
Security Mode feature set
   *Power Management feature set
   *Write cache
   *Look-ahead
   *Host Protected Area feature set
   *WRITE_BUFFER command
   *READ_BUFFER command
   *NOP cmd
   *DOWNLOAD_MICROCODE
SET_MAX security extension
Automatic Acoustic Management feature set
   *48-bit Address feature set
   *Device Configuration Overlay feature set
   *Mandatory FLUSH_CACHE
   *FLUSH_CACHE_EXT
   *SMART error logging
   *SMART self-test
   *General Purpose Logging feature set
   *SATA-I signaling speed (1.5Gb/s)
   *SATA-II signaling speed (3.0Gb/s)
   *Native Command Queueing (NCQ)
   *Host-initiated interface power management
   *Phy event counters
DMA Setup Auto-Activate optimization
Device-initiated interface power management
   *Software settings preservation
   *SMART Command Transport (SCT) feature set
   *SCT Long Sector Access (AC1)
   *SCT LBA Segment Access (AC2)
   *SCT Error Recovery Control (AC3)
   *SCT Features Control (AC4)
   *SCT Data Tables (AC5)
Security:
Master password revision code = 65534
supported
not enabled
not locked
frozen
not expired: security count
supported: enhanced erase
228min for SECURITY ERASE UNIT. 228min for ENHANCED SECURITY ERASE UNIT.

Checksum: correct
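
The two "device size" lines in the hdparm output follow directly from the LBA48 sector count. The sketch below is only an illustration of that arithmetic, assuming 512-byte logical sectors (the kernel log elsewhere in this thread reports "512-byte hdwr sectors" for sdb); the function name is made up for the example, and the SP2504C figure is the one reported for /dev/sdb further down.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def device_size_mbytes(lba48_sectors):
    """Return (MBytes with M = 1024*1024, MBytes with M = 1000*1000)."""
    total_bytes = lba48_sectors * SECTOR_BYTES
    # Integer division reproduces the values printed in this thread.
    return total_bytes // (1024 * 1024), total_bytes // (1000 * 1000)


if __name__ == "__main__":
    for model, sectors in (("HD401LJ", 781422768), ("SP2504C", 488397168)):
        binary_mb, decimal_mb = device_size_mbytes(sectors)
        print(f"{model}: {binary_mb} MBytes (1024*1024), {decimal_mb} MBytes (1000*1000)")
```

For both drives the results match the sizes hdparm prints, which is a quick sanity check that the sector counts were not garbled in transit.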


[EMAIL PROTECTED]:~$ sudo hdparm -I /dev/sdb

/dev/sdb:

ATA device, with non-removable media
Model Number:   SAMSUNG SP2504C
Serial Number:  S09QJ1LYC06381
Firmware Revision:  VT100-33
Standards:
Used: ATA/ATAPI-7 T13 1532D revision 4a
Supported: 7 6 5 4
Configuration:
Logical max current
cylinders   16383   16383
heads   16  16
sectors/track   63  63
--
CHS current addressable sectors:   16514064
LBA    user addressable sectors:  268435455
LBA48  user addressable sectors:  488397168
device size with M = 1024*1024:  238475 MBytes
device size with M = 1000*1000:  250059 MBytes (250 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16  Current = 16
Recommended acoustic management value: 254, current value: 254
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 udma7

 Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
 

Re: SATA problems

2007-02-21 Thread Pablo Sebastian Greco

Tejun Heo wrote:

Pablo Sebastian Greco wrote:
  

Tejun Heo wrote:


* Pablo, the bug you saw was bad interaction between blacklisted NCQ
device and dynamic queue depth adjustment.  Patches are submitted to fix
the problem.  Just drop the blacklist patch.  Your drives should work
fine in NCQ mode.  My gut feeling is that your problem is power related
from the beginning.
  
  

I had the same problems with a new power supply. Now everything is OK
with the old power supply and the new drives.



So, it was bad drives?  Are you using the same model or different ones?
 NCQ works okay now?

  
All I can say is that it is working now. Other things changed with the new 
drives: they run at 1.5Gbps instead of 3Gbps, and they don't use NCQ (I'm 
reattaching a full dmesg).
I also found this firmware upgrade 
(http://www.samsung.com/Products/HardDiskDrive/support/faqs/faqs_20060414_246673.htm) 
for the old drives, but I couldn't confirm whether it should be applied because 
the server is in Brazil and I live in Argentina. I won't be there until 
April to test.


Thanks.
Pablo.
Linux version 2.6.19-1.2895.fc6 ([EMAIL PROTECTED]) (gcc version 4.1.1 20070105 
(Red Hat 4.1.1-51)) #1 SMP Wed Jan 10 18:50:56 EST 2007
Command line: ro root=LABEL=/
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009ec00 (usable)
 BIOS-e820: 0009ec00 - 0010 (reserved)
 BIOS-e820: 0010 - df938000 (usable)
 BIOS-e820: df938000 - df9d2000 (ACPI NVS)
 BIOS-e820: df9d2000 - dfa42000 (usable)
 BIOS-e820: dfa42000 - dfa9a000 (reserved)
 BIOS-e820: dfa9a000 - dfab8000 (usable)
 BIOS-e820: dfab8000 - dfb1a000 (ACPI NVS)
 BIOS-e820: dfb1a000 - dfb2c000 (usable)
 BIOS-e820: dfb2c000 - dfb3a000 (ACPI data)
 BIOS-e820: dfb3a000 - dfc0 (usable)
 BIOS-e820: ffc0 - ffc0c000 (reserved)
 BIOS-e820: 0001 - 00012000 (usable)
Entering add_active_range(0, 0, 158) 0 entries of 3200 used
Entering add_active_range(0, 256, 915768) 1 entries of 3200 used
Entering add_active_range(0, 915922, 916034) 2 entries of 3200 used
Entering add_active_range(0, 916122, 916152) 3 entries of 3200 used
Entering add_active_range(0, 916250, 916268) 4 entries of 3200 used
Entering add_active_range(0, 916282, 916480) 5 entries of 3200 used
Entering add_active_range(0, 1048576, 1179648) 6 entries of 3200 used
end_pfn_map = 1179648
DMI 2.4 present.
ACPI: RSDP (v002 INTEL ) @ 0x000f0350
ACPI: XSDT (v001 INTEL  S5000VSA 0x INTL 0x0113) @ 
0xdfb39120
ACPI: FADT (v003 INTEL  S5000VSA 0x INTL 0x0113) @ 
0xdfb36000
ACPI: MADT (v001 INTEL  S5000VSA 0x INTL 0x0113) @ 
0xdfb35000
ACPI: SPCR (v001 INTEL  S5000VSA 0x INTL 0x0113) @ 
0xdfb2f000
ACPI: HPET (v001 INTEL  S5000VSA 0x0001 INTL 0x0113) @ 
0xdfb2e000
ACPI: MCFG (v001 INTEL  S5000VSA 0x0001 INTL 0x0113) @ 
0xdfb2d000
ACPI: SSDT (v002 INTEL  S5000VSA 0x4000 INTL 0x0113) @ 
0xdfb2c000
ACPI: DSDT (v002 INTEL  S5000VSA 0x0008 INTL 0x0113) @ 
0x
No NUMA configuration found
Faking a node at -00012000
Entering add_active_range(0, 0, 158) 0 entries of 3200 used
Entering add_active_range(0, 256, 915768) 1 entries of 3200 used
Entering add_active_range(0, 915922, 916034) 2 entries of 3200 used
Entering add_active_range(0, 916122, 916152) 3 entries of 3200 used
Entering add_active_range(0, 916250, 916268) 4 entries of 3200 used
Entering add_active_range(0, 916282, 916480) 5 entries of 3200 used
Entering add_active_range(0, 1048576, 1179648) 6 entries of 3200 used
Bootmem setup node 0 -00012000
Zone PFN ranges:
  DMA 0 -> 4096
  DMA324096 ->  1048576
  Normal1048576 ->  1179648
early_node_map[7] active PFN ranges
0:0 ->  158
0:  256 ->   915768
0:   915922 ->   916034
0:   916122 ->   916152
0:   916250 ->   916268
0:   916282 ->   916480
0:  1048576 ->  1179648
On node 0 totalpages: 1047100
  DMA zone: 64 pages used for memmap
  DMA zone: 1450 pages reserved
  DMA zone: 2484 pages, LIFO batch:0
  DMA32 zone: 16320 pages used for memmap
  DMA32 zone: 895710 pages, LIFO batch:31
  Normal zone: 2048 pages used for memmap
  Normal zone: 129024 pages, LIFO batch:31
ACPI: PM-Timer IO Port: 0x408
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x02] enabled)
Processor #2
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
Processor #3
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x84] disabled)
ACPI: LA

Re: SATA problems

2007-02-20 Thread Pablo Sebastian Greco

Tejun Heo wrote:

* Pablo, the bug you saw was bad interaction between blacklisted NCQ
device and dynamic queue depth adjustment.  Patches are submitted to fix
the problem.  Just drop the blacklist patch.  Your drives should work
fine in NCQ mode.  My gut feeling is that your problem is power related
from the beginning.

* Marcus, you're on via's ahci controller, right?  The problem you saw
was bad interaction between blacklisted NCQ _controller_ and dynamic
queue depth adjustment.  Patches submitted.

Thanks.

  
I had the same problems with a new power supply. Now everything is OK 
with the old power supply and the new drives.


Pablo.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA problems

2007-02-17 Thread Pablo Sebastian Greco

Marcus Haebler wrote:

I opened a bug report (228979) on bugzilla.redhat.com on this one because
I have the same issue under FC6 2.6.19-1.2895. Here is the link:

   https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=228979

Do you have any more updates on this problem? Is there a way I can help
by providing debug data?

Thanks,

Marcus

On 1/23/07, Tejun Heo <[EMAIL PROTECTED]> wrote:

Pablo Sebastian Greco wrote:
> Well, it took me a few days, but I think I'm ready to report back. One
> of the drives was failing, and it stopped after rewiring power supply so
> the last problem seems to be corrected.
> OTOH, your blacklist seems to be needed too, now I'm running FC6
> distribution kernel 2.6.19-1.2895.fc6 (2.6.19.2 + some patches by
> fedora) and setting
> echo 1 >/sys/block/sdX/device/queue_depth
> on all the SAMSUNG drives (sdb, sdc and sdd)
> The second I type
> echo 31 >/sys/block/sdX/device/queue_depth
> on any of the drives I get these messages
>
> Jan 23 12:36:30 squid kernel: BUG: warning: (ap->ops->error_handler &&
> ata_tag_valid(ap->active_tag)) at
> drivers/ata/libata-core.c:4602/ata_qc_issue() (Not tainted)

This is kernel bug that needs fixing.  I'll investigate.

--
tejun





On my side, all the problems disappeared on all the kernels after 
changing all 3 drives to non-NCQ drives. I was going crazy.


New dmesg attached

Pablo.
Linux version 2.6.19-1.2895.fc6 ([EMAIL PROTECTED]) (gcc version 4.1.1 20070105 
(Red Hat 4.1.1-51)) #1 SMP Wed Jan 10 18:50:56 EST 2007
Command line: ro root=LABEL=/
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009ec00 (usable)
 BIOS-e820: 0009ec00 - 0010 (reserved)
 BIOS-e820: 0010 - df938000 (usable)
 BIOS-e820: df938000 - df9d2000 (ACPI NVS)
 BIOS-e820: df9d2000 - dfa42000 (usable)
 BIOS-e820: dfa42000 - dfa9a000 (reserved)
 BIOS-e820: dfa9a000 - dfab8000 (usable)
 BIOS-e820: dfab8000 - dfb1a000 (ACPI NVS)
 BIOS-e820: dfb1a000 - dfb2c000 (usable)
 BIOS-e820: dfb2c000 - dfb3a000 (ACPI data)
 BIOS-e820: dfb3a000 - dfc0 (usable)
 BIOS-e820: ffc0 - ffc0c000 (reserved)
 BIOS-e820: 0001 - 00012000 (usable)
Entering add_active_range(0, 0, 158) 0 entries of 3200 used
Entering add_active_range(0, 256, 915768) 1 entries of 3200 used
Entering add_active_range(0, 915922, 916034) 2 entries of 3200 used
Entering add_active_range(0, 916122, 916152) 3 entries of 3200 used
Entering add_active_range(0, 916250, 916268) 4 entries of 3200 used
Entering add_active_range(0, 916282, 916480) 5 entries of 3200 used
Entering add_active_range(0, 1048576, 1179648) 6 entries of 3200 used
end_pfn_map = 1179648
DMI 2.4 present.
ACPI: RSDP (v002 INTEL ) @ 0x000f0350
ACPI: XSDT (v001 INTEL  S5000VSA 0x INTL 0x0113) @ 
0xdfb39120
ACPI: FADT (v003 INTEL  S5000VSA 0x INTL 0x0113) @ 
0xdfb36000
ACPI: MADT (v001 INTEL  S5000VSA 0x INTL 0x0113) @ 
0xdfb35000
ACPI: SPCR (v001 INTEL  S5000VSA 0x INTL 0x0113) @ 
0xdfb2f000
ACPI: HPET (v001 INTEL  S5000VSA 0x0001 INTL 0x0113) @ 
0xdfb2e000
ACPI: MCFG (v001 INTEL  S5000VSA 0x0001 INTL 0x0113) @ 
0xdfb2d000
ACPI: SSDT (v002 INTEL  S5000VSA 0x4000 INTL 0x0113) @ 
0xdfb2c000
ACPI: DSDT (v002 INTEL  S5000VSA 0x0008 INTL 0x0113) @ 
0x
No NUMA configuration found
Faking a node at -00012000
Entering add_active_range(0, 0, 158) 0 entries of 3200 used
Entering add_active_range(0, 256, 915768) 1 entries of 3200 used
Entering add_active_range(0, 915922, 916034) 2 entries of 3200 used
Entering add_active_range(0, 916122, 916152) 3 entries of 3200 used
Entering add_active_range(0, 916250, 916268) 4 entries of 3200 used
Entering add_active_range(0, 916282, 916480) 5 entries of 3200 used
Entering add_active_range(0, 1048576, 1179648) 6 entries of 3200 used
Bootmem setup node 0 -00012000
Zone PFN ranges:
  DMA 0 -> 4096
  DMA324096 ->  1048576
  Normal1048576 ->  1179648
early_node_map[7] active PFN ranges
0:0 ->  158
0:  256 ->   915768
0:   915922 ->   916034
0:   916122 ->   916152
0:   916250 ->   916268
0:   916282 ->   916480
0:  1048576 ->  1179648
On node 0 totalpages: 1047100
  DMA zone: 64 pages used for memmap
  DMA zone: 1450 pages reserved
  

Re: cpu load balancing problem on smp

2007-02-06 Thread Pablo Sebastian Greco

Arjan van de Ven wrote:

Pablo Sebastian Greco wrote:

2296:427426436  134563009   PCI-MSI-edge  eth1
2297:252252  135926471257   PCI-MSI-edge  eth0


this suggests that  cores would be busy rather than only one
-
Yes, but you are looking at the -mm kernel statistics. If you look at the 
standard kernel, you'll see that both eth interrupts land on the same core, 
according to the attached /proc/cpuinfo.

OTOH, take a look at timer interrupt distribution
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model   : 6
model name  :   Intel(R) Xeon(TM) CPU 2.66GHz
stepping: 4
cpu MHz : 2656.000
cache size  : 2048 KB
physical id : 0
siblings: 4
core id : 0
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 6
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm 
constant_tsc pni monitor ds_cpl vmx est cid cx16 xtpr lahf_lm
bogomips: 5324.82
clflush size: 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 15
model   : 6
model name  :   Intel(R) Xeon(TM) CPU 2.66GHz
stepping: 4
cpu MHz : 2656.000
cache size  : 2048 KB
physical id : 0
siblings: 4
core id : 1
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 6
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm 
constant_tsc pni monitor ds_cpl vmx est cid cx16 xtpr lahf_lm
bogomips: 5320.06
clflush size: 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 2
vendor_id   : GenuineIntel
cpu family  : 15
model   : 6
model name  :   Intel(R) Xeon(TM) CPU 2.66GHz
stepping: 4
cpu MHz : 2656.000
cache size  : 2048 KB
physical id : 0
siblings: 4
core id : 0
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 6
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm 
constant_tsc pni monitor ds_cpl vmx est cid cx16 xtpr lahf_lm
bogomips: 5320.20
clflush size: 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 15
model   : 6
model name  :   Intel(R) Xeon(TM) CPU 2.66GHz
stepping: 4
cpu MHz : 2656.000
cache size  : 2048 KB
physical id : 0
siblings: 4
core id : 1
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 6
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm 
constant_tsc pni monitor ds_cpl vmx est cid cx16 xtpr lahf_lm
bogomips: 5320.16
clflush size: 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:
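
To check the "same core" claim against a cpuinfo dump like the one above, the logical processors can be grouped by their "physical id" and "core id" fields. This is a minimal sketch, not a tool used in the thread: the function name and the trimmed SAMPLE text are illustrative, and only the three fields needed for the grouping are read.

```python
from collections import defaultdict

# Trimmed copy of the stanzas above, keeping only the fields used here.
SAMPLE = """\
processor   : 0
physical id : 0
core id     : 0

processor   : 1
physical id : 0
core id     : 1

processor   : 2
physical id : 0
core id     : 0

processor   : 3
physical id : 0
core id     : 1
"""


def cores_from_cpuinfo(text):
    """Map (physical id, core id) -> list of logical processor numbers."""
    cores = defaultdict(list)
    current = {}
    # A trailing empty line is appended so the last stanza is flushed too.
    for line in text.splitlines() + [""]:
        if not line.strip():
            if "processor" in current:
                key = (current.get("physical id"), current.get("core id"))
                cores[key].append(current["processor"])
            current = {}
            continue
        name, _, value = line.partition(":")
        name = name.strip()
        if name in ("processor", "physical id", "core id"):
            current[name] = int(value)
    return dict(cores)


if __name__ == "__main__":
    for (package, core), cpus in sorted(cores_from_cpuinfo(SAMPLE).items()):
        print(f"package {package} core {core}: logical CPUs {cpus}")
```

With the values from this message, logical CPUs 0 and 2 share core 0 and logical CPUs 1 and 3 share core 1, so two interrupts pinned to CPU1 and CPU3 end up on one physical core.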


Re: cpu load balancing problem on smp

2007-02-06 Thread Pablo Sebastian Greco

Arjan van de Ven wrote:

Marc Donner wrote:


see http://www.irqbalance.org to get irqbalance


I now have tried irqloadbalance, but the same problem.



can you send me the output of

cat /proc/interrupts

(taken when you are or have been loading the network)

maybe there's something fishy going on

Please take a look at this, taken from the same machine running 
different vanilla kernels on FC6.
The current 2.6.19 Fedora kernel behaves like 2.6.20-rc3 (non-mm) in the 
attachment.


2.6.20-rc3
[EMAIL PROTECTED] ~]# rpm -q irqbalance
irqbalance-0.55-2.fc6
[EMAIL PROTECTED] ~]# uptime
 11:51:50 up 6 days, 30 min,  3 users,  load average: 5.31, 5.08, 4.02
[EMAIL PROTECTED] ~]# service irqbalance status
irqbalance (pid 2310) is running...
[EMAIL PROTECTED] ~]# cat /proc/interrupts
   CPU0   CPU1   CPU2   CPU3
  0:  520209517  0  0  0   IO-APIC-edge  timer
  1: 12  0  0  0   IO-APIC-edge  i8042
  8:  1  0  0  0   IO-APIC-edge  rtc
  9:  0  0  0  0   IO-APIC-fasteoi   acpi
 12:103  0  0  0   IO-APIC-edge  i8042
 14:  0  0  0  0   IO-APIC-edge  libata
 15:  0  0  0  0   IO-APIC-edge  libata
 20: 138736  188194096  06797630   IO-APIC-fasteoi   libata
 22:  0  0  0  0   IO-APIC-fasteoi   uhci_hcd:usb2, uhci_hcd:usb4
 23:  0  0  0  0   IO-APIC-fasteoi   uhci_hcd:usb1, uhci_hcd:usb3, ehci_hcd:usb5
2296:   1367  0  0  849270653   PCI-MSI-edge  eth1
2297:   1022  835083968  0  0   PCI-MSI-edge  eth0
NMI:  47756 146249  47617 146186
LOC:  516828752  517331906  516828611  517331771
ERR:  0
2.6.20-rc3-mm1
[EMAIL PROTECTED] kernel]# uptime
 12:17:54 up 1 day, 21:58,  2 users,  load average: 9.47, 9.79, 10.28
[EMAIL PROTECTED] kernel]# cat /proc/interrupts
   CPU0   CPU1   CPU2   CPU3
  0:   60031592   61350247   22273772   21780215   IO-APIC-edge  timer
  1:  0  6  1  1   IO-APIC-edge  i8042
  8:  0  0  1  0   IO-APIC-edge  rtc
  9:  0  0  0  0   IO-APIC-fasteoi   acpi
 12:148283104136   IO-APIC-edge  i8042
 14:  0  0  0  0   IO-APIC-edge  libata
 15:  0  0  0  0   IO-APIC-edge  libata
 20:   104827951477821  93306 641628   IO-APIC-fasteoi   libata
 22:  0  0  0  0   IO-APIC-fasteoi   uhci_hcd:usb2, uhci_hcd:usb4
 23:  0  0  0  0   IO-APIC-fasteoi   uhci_hcd:usb1, uhci_hcd:usb3, ehci_hcd:usb5
2296:427426436  134563009   PCI-MSI-edge  eth1
2297:252252  135926471257   PCI-MSI-edge  eth0
NMI:  0  0  0  0
LOC:  164661140  165163503  164660992  165163305
ERR:  0
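
The difference between the two kernels is easier to see as per-CPU shares than as raw counters. The following is a rough sketch for one row of /proc/interrupts; the function names are invented for the example, and it assumes the per-CPU columns are separated by whitespace (some rows in this archived copy have lost their column spacing and would need fixing by hand first).

```python
def parse_irq_line(line, ncpus):
    """Split one /proc/interrupts row into (label, per-CPU counts, description)."""
    label, _, rest = line.partition(":")
    fields = rest.split()
    counts = [int(field) for field in fields[:ncpus]]
    return label.strip(), counts, " ".join(fields[ncpus:])


def cpu_shares(counts):
    """Fraction of the interrupts handled by each CPU (all zeros if none fired)."""
    total = sum(counts)
    return [count / total if total else 0.0 for count in counts]


if __name__ == "__main__":
    rows = {
        "2.6.20-rc3": "  0:  520209517  0  0  0   IO-APIC-edge  timer",
        "2.6.20-rc3-mm1": "  0:   60031592   61350247   22273772   21780215   IO-APIC-edge  timer",
    }
    for kernel, row in rows.items():
        label, counts, description = parse_irq_line(row, ncpus=4)
        shares = ", ".join(f"{share:.1%}" for share in cpu_shares(counts))
        print(f"{kernel}: IRQ {label} ({description}): {shares}")
```

On the timer rows quoted above, the non-mm kernel delivers every timer interrupt to CPU0, while the -mm kernel spreads them over all four CPUs.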


Re: SATA problems

2007-01-23 Thread Pablo Sebastian Greco

Tejun Heo wrote:

Hello, Pablo.

Please apply common hardware debugging method.  You know, swap drives.
Use separate power supply for disks, swap cables, etc...

It seems more like a hardware problem at this point.

Thanks.

  
Well, it took me a few days, but I think I'm ready to report back. One 
of the drives was failing, and it stopped failing after I rewired the power 
supply, so the last problem seems to be corrected.
OTOH, your blacklist seems to be needed too. I'm now running the FC6 
distribution kernel 2.6.19-1.2895.fc6 (2.6.19.2 + some patches by 
Fedora) and setting

echo 1 >/sys/block/sdX/device/queue_depth
on all the SAMSUNG drives (sdb, sdc and sdd).
The second I type
echo 31 >/sys/block/sdX/device/queue_depth
on any of the drives, I get these messages:

Jan 23 12:36:30 squid kernel: BUG: warning: (ap->ops->error_handler && 
ata_tag_valid(ap->active_tag)) at 
drivers/ata/libata-core.c:4602/ata_qc_issue() (Not tainted)
Jan 23 12:36:30 squid kernel:
Jan 23 12:36:30 squid kernel: Call Trace:
Jan 23 12:36:30 squid kernel:  [] show_trace+0x34/0x47
Jan 23 12:36:30 squid kernel:  [] dump_stack+0x12/0x17
Jan 23 12:36:30 squid kernel:  [] :libata:ata_qc_issue+0x61/0x551
Jan 23 12:36:30 squid kernel:  [] :libata:ata_scsi_translate+0xd1/0x11a
Jan 23 12:36:30 squid kernel:  [] :libata:ata_scsi_queuecmd+0x103/0x122
Jan 23 12:36:30 squid kernel:  [] :scsi_mod:scsi_dispatch_cmd+0x27c/0x30d
Jan 23 12:36:30 squid kernel:  [] :scsi_mod:scsi_request_fn+0x2ca/0x395
Jan 23 12:36:30 squid kernel:  [] elv_insert+0x15a/0x226
Jan 23 12:36:30 squid kernel:  [] __make_request+0x439/0x487
Jan 23 12:36:30 squid kernel:  [] generic_make_request+0x207/0x21e
Jan 23 12:36:30 squid kernel:  [] submit_bio+0xee/0xf7
Jan 23 12:36:30 squid kernel:  [] submit_bh+0x130/0x150
Jan 23 12:36:30 squid kernel:  [] ll_rw_block+0x9d/0xc0
Jan 23 12:36:30 squid kernel:  [] :reiserfs:search_by_key+0x13d/0xce7
Jan 23 12:36:30 squid kernel:  [] :reiserfs:search_for_position_by_key+0x34/0x2ad
Jan 23 12:36:30 squid kernel:  [] :reiserfs:_get_block_create_0+0x86/0x544
Jan 23 12:36:30 squid kernel:  [] :reiserfs:reiserfs_get_block+0xcd/0xfdd
Jan 23 12:36:30 squid kernel:  [] do_mpage_readpage+0x16d/0x4b0
Jan 23 12:36:30 squid kernel:  [] mpage_readpages+0xb3/0x146
Jan 23 12:36:30 squid kernel:  [] __do_page_cache_readahead+0x119/0x209
Jan 23 12:36:30 squid kernel:  [] blockable_page_cache_readahead+0x56/0xb5
Jan 23 12:36:30 squid kernel:  [] page_cache_readahead+0xd6/0x1af
Jan 23 12:36:30 squid kernel:  [] do_generic_mapping_read+0x129/0x40b
Jan 23 12:36:30 squid kernel:  [] generic_file_aio_read+0x15f/0x1b1
Jan 23 12:36:30 squid kernel:  [] do_sync_read+0xc9/0x10c
Jan 23 12:36:30 squid kernel:  [] vfs_read+0xcb/0x170
Jan 23 12:36:30 squid kernel:  [] sys_read+0x45/0x6e
Jan 23 12:36:30 squid kernel:  [] system_call+0x7e/0x83
Jan 23 12:36:30 squid kernel:  [<00359ccbfb80>]
Jan 23 12:36:30 squid kernel:
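
The queue_depth toggling described in this message (1 to turn NCQ off, 31 to turn it back on) could be scripted roughly as below. This is only a sketch: `set_queue_depth`, its `sysfs_root` parameter and `demo` are hypothetical names, and the only thing taken from the thread is the /sys/block/sdX/device/queue_depth path. Writing to the real sysfs file needs root, so the demo runs against a throwaway directory instead.

```python
import tempfile
from pathlib import Path


def set_queue_depth(devices, depth, sysfs_root="/sys/block"):
    """Write `depth` to each device's queue_depth file and return the values read back."""
    result = {}
    for dev in devices:
        path = Path(sysfs_root) / dev / "device" / "queue_depth"
        path.write_text(f"{depth}\n")
        result[dev] = int(path.read_text())
    return result


def demo():
    """Exercise set_queue_depth on a fake sysfs tree, leaving the real /sys alone."""
    with tempfile.TemporaryDirectory() as root:
        for dev in ("sdb", "sdc", "sdd"):
            device_dir = Path(root) / dev / "device"
            device_dir.mkdir(parents=True)
            (device_dir / "queue_depth").write_text("31\n")
        return set_queue_depth(["sdb", "sdc", "sdd"], 1, sysfs_root=root)


if __name__ == "__main__":
    print(demo())
```

Reading the value back after the write is a deliberate choice: it shows what the file actually holds afterwards rather than what was requested.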

Thanks for everything.
Pablo.


Re: SATA problems

2007-01-09 Thread Pablo Sebastian Greco

Pablo Sebastian Greco wrote:

Tejun Heo wrote:

Pablo Sebastian Greco wrote:
 

After an uptime of 13:34 under heavy load and no errors, I'm pretty
sure your patch is correct. Is there a way to backport this to 2.6.18.x?



I forgot this (even though I implemented it) but you can turn off NCQ by
doing the following.

# echo 1 > /sys/block/sdX/device/queue_depth

Can you put the seagate drive under load to verify that it's the samsung
drive's problem not the controller's?

 

Just an off topic question, does anyone know why I get so uneven IRQ
handling on 2.6.19-20 and almost perfect on 2.6.20-rc2-mm1?



I dunno.  You have a much better chance of getting a useful answer by
asking it on a separate thread with a proper subject line.  People usually
screen threads by subject.  There are just too many messages on LKML for
anyone to follow all of them.

Thanks.

  

Guess I spoke too soon :(
Today I found this
Jan  8 04:01:40 squid kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan  8 04:01:40 squid kernel: ata2.00: cmd 25/00:08:49:ee:e8/00:00:16:00:00/e0 tag 0 cdb 0x0 data 4096 in
Jan  8 04:01:40 squid kernel:  res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  8 04:01:40 squid kernel: ata2: soft resetting port
Jan  8 04:01:40 squid kernel: ata2: softreset failed (port busy but CLO unavailable)
Jan  8 04:01:40 squid kernel: ata2: softreset failed, retrying in 5 secs
Jan  8 04:01:45 squid kernel: ata2: hard resetting port
Jan  8 04:01:53 squid kernel: ata2: port is slow to respond, please be patient (Status 0x80)
Jan  8 04:02:16 squid kernel: ata2: port failed to respond (30 secs, Status 0x80)
Jan  8 04:02:16 squid kernel: ata2: COMRESET failed (device not ready)
Jan  8 04:02:16 squid kernel: ata2: hardreset failed, retrying in 5 secs
Jan  8 04:02:21 squid kernel: ata2: hard resetting port
Jan  8 04:02:21 squid kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  8 04:02:21 squid kernel: ata2.00: configured for UDMA/133
Jan  8 04:02:21 squid kernel: ata2: EH complete
Jan  8 04:02:21 squid kernel: SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
Jan  8 04:02:21 squid kernel: sdb: Write Protect is off
Jan  8 04:02:21 squid kernel: SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

#uptime
10:10:12 up 3 days, 22:48,  1 user,  load average: 0.22, 0.19, 0.18
4 am is the lowest load ever, so I don't get it.
I've found two differences from the older errors:
   SAct is now 0x0, where before it was 0x7fff
   The cmd/res list used to be really long; now it's just one command
About loading the seagate heavily: as suggested on the other thread, I ran 
dd if= of=/dev/null
for all 4 drives simultaneously, on top of the usual load, and all was 
perfect with the current kernel (2.6.20-rc3 + blacklist).

Don't know what to do to help

Thanks.
Pablo.


And now this :( Still running rc3 + blacklist, without rebooting.

Jan  9 05:30:36 squid kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan  9 05:30:36 squid kernel: ata2.00: cmd c8/00:08:87:83:85/00:00:00:00:00/e2 tag 0 cdb 0x0 data 4096 in
Jan  9 05:30:36 squid kernel:  res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  9 05:30:36 squid kernel: ata2: soft resetting port
Jan  9 05:30:36 squid kernel: ata2: softreset failed (port busy but CLO unavailable)
Jan  9 05:30:36 squid kernel: ata2: softreset failed, retrying in 5 secs
Jan  9 05:30:41 squid kernel: ata2: hard resetting port
Jan  9 05:30:49 squid kernel: ata2: port is slow to respond, please be patient (Status 0x80)
Jan  9 05:31:12 squid kernel: ata2: port failed to respond (30 secs, Status 0x80)
Jan  9 05:31:12 squid kernel: ata2: COMRESET failed (device not ready)
Jan  9 05:31:12 squid kernel: ata2: hardreset failed, retrying in 5 secs
Jan  9 05:31:17 squid kernel: ata2: hard resetting port
Jan  9 05:31:17 squid kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  9 05:31:17 squid kernel: ata2.00: configured for UDMA/133
Jan  9 05:31:17 squid kernel: ata2: EH complete
Jan  9 05:31:17 squid kernel: SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
Jan  9 05:31:17 squid kernel: sdb: Write Protect is off
Jan  9 05:31:17 squid kernel: SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan  9 05:32:17 squid kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan  9 05:32:17 squid kernel: ata2.00: cmd c8/00:08:37:ac:04/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
Jan  9 05:32:17 squid kernel:  res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

Jan

Re: SATA problems

2007-01-08 Thread Pablo Sebastian Greco

Tejun Heo wrote:

Pablo Sebastian Greco wrote:
  

After an uptime of  13:34 under heavy load and no errors, I'm pretty
sure your patch is correct. Is there a way to backport this to 2.6.18.x?



I forgot this (even though I implemented it) but you can turn off NCQ by
doing the following.

# echo 1 > /sys/block/sdX/device/queue_depth

Can you put the seagate drive under load to verify that it's the samsung
drive's problem not the controller's?

  

Just an off topic question, does anyone know why I get so uneven IRQ
handling on 2.6.19-20 and almost perfect on 2.6.20-rc2-mm1?



I dunno.  You have a much better chance of getting a useful answer by
asking it on a separate thread with a proper subject line.  People usually
screen threads by subject.  There are just too many messages on LKML for
anyone to follow all of them.

Thanks.

  

Guess I spoke too soon :(
Today I found this
Jan  8 04:01:40 squid kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan  8 04:01:40 squid kernel: ata2.00: cmd 25/00:08:49:ee:e8/00:00:16:00:00/e0 tag 0 cdb 0x0 data 4096 in
Jan  8 04:01:40 squid kernel:  res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  8 04:01:40 squid kernel: ata2: soft resetting port
Jan  8 04:01:40 squid kernel: ata2: softreset failed (port busy but CLO unavailable)
Jan  8 04:01:40 squid kernel: ata2: softreset failed, retrying in 5 secs
Jan  8 04:01:45 squid kernel: ata2: hard resetting port
Jan  8 04:01:53 squid kernel: ata2: port is slow to respond, please be patient (Status 0x80)
Jan  8 04:02:16 squid kernel: ata2: port failed to respond (30 secs, Status 0x80)
Jan  8 04:02:16 squid kernel: ata2: COMRESET failed (device not ready)
Jan  8 04:02:16 squid kernel: ata2: hardreset failed, retrying in 5 secs
Jan  8 04:02:21 squid kernel: ata2: hard resetting port
Jan  8 04:02:21 squid kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  8 04:02:21 squid kernel: ata2.00: configured for UDMA/133
Jan  8 04:02:21 squid kernel: ata2: EH complete
Jan  8 04:02:21 squid kernel: SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
Jan  8 04:02:21 squid kernel: sdb: Write Protect is off
Jan  8 04:02:21 squid kernel: SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
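
The "cmd" string in the timeout report can be unpacked to see which sectors the stalled read was after. The sketch below assumes the byte order used by libata's error report is command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device; treat that layout as an assumption rather than something stated in this thread. Only the two opcodes that appear in these logs (0x25 READ DMA EXT and 0xc8 READ DMA) are handled, and the function name is illustrative.

```python
# opcode -> (name, uses 48-bit addressing); only the opcodes seen in this thread
OPCODES = {0x25: ("READ DMA EXT", True), 0xC8: ("READ DMA", False)}


def decode_taskfile(cmd):
    """Decode a libata 'cmd' string into opcode name, start LBA and transfer size."""
    command, low, high, device = cmd.split("/")
    _feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in high.split(":")
    )
    opcode = int(command, 16)
    if opcode not in OPCODES:
        raise ValueError(f"opcode {opcode:#04x} is not covered by this sketch")
    name, is_lba48 = OPCODES[opcode]
    if is_lba48:
        # 48-bit commands: the high-order bytes extend both the LBA and the count.
        lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
               | (lbah << 16) | (lbam << 8) | lbal)
        sectors = ((hob_nsect << 8) | nsect) or 65536
    else:
        # 28-bit commands: the top four LBA bits sit in the device register.
        lba = ((int(device, 16) & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
        sectors = nsect or 256
    return {"name": name, "lba": lba, "sectors": sectors, "bytes": sectors * 512}


if __name__ == "__main__":
    print(decode_taskfile("25/00:08:49:ee:e8/00:00:16:00:00/e0"))
    print(decode_taskfile("c8/00:08:87:83:85/00:00:00:00:00/e2"))
```

Under that assumed layout, each request decodes to 8 sectors, i.e. 4096 bytes, which agrees with the "data 4096 in" at the end of the same log lines.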

#uptime
10:10:12 up 3 days, 22:48,  1 user,  load average: 0.22, 0.19, 0.18
4 am is the lowest load ever, so I don't get it.
I've found two differences from the older errors:
   SAct is now 0x0, where before it was 0x7fff
   The cmd/res list used to be really long; now it's just one command
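
For context on the first difference: SAct is the SATA SActive register, a bitmask with one bit per outstanding NCQ tag, so the two values can be compared by listing the set bits. A minimal sketch, taking both values exactly as quoted here (the helper name is made up):

```python
def active_tags(sact):
    """Return the NCQ tag numbers whose bits are set in an SActive value."""
    return [tag for tag in range(32) if sact & (1 << tag)]


if __name__ == "__main__":
    for value in (0x7FFF, 0x0):
        tags = active_tags(value)
        print(f"SAct {value:#x}: {len(tags)} queued command(s), tags {tags}")
```

A value of 0x0 means no NCQ commands were in flight when the timeout hit, which fits both the drive no longer being driven in NCQ mode and the report now listing a single command.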
About loading the seagate heavily: as suggested on the other thread, I ran 
dd if= of=/dev/null
for all 4 drives simultaneously, on top of the usual load, and all was 
perfect with the current kernel (2.6.20-rc3 + blacklist).
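
The simultaneous dd read test can be approximated with one reader thread per device. This is a sketch only: the device paths are left to the caller (the dd command above does not show them), all function names are invented, and the demo reads temporary files rather than real disks.

```python
import tempfile
import threading
from pathlib import Path


def read_to_null(path, chunk_bytes=1 << 20):
    """Read `path` sequentially, discard the data, and return the bytes read."""
    total = 0
    with open(path, "rb") as src:
        while True:
            block = src.read(chunk_bytes)
            if not block:
                break
            total += len(block)
    return total


def read_all_concurrently(paths):
    """Read every path in its own thread, like several dd processes at once."""
    totals = {}

    def worker(path):
        totals[path] = read_to_null(path)

    threads = [threading.Thread(target=worker, args=(p,)) for p in paths]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return totals


def demo():
    """Run the concurrent read against four 3 MiB scratch files."""
    with tempfile.TemporaryDirectory() as root:
        paths = []
        for index in range(4):
            image = Path(root) / f"disk{index}.img"
            image.write_bytes(b"\0" * (3 << 20))
            paths.append(str(image))
        return sorted(read_all_concurrently(paths).values())


if __name__ == "__main__":
    print(demo())
```

Pointing `read_all_concurrently` at real block devices would need read permission on them and keeps every disk busy until the end of each device is reached.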

Don't know what to do to help

Thanks.
Pablo.


Re: SATA problems

2007-01-04 Thread Pablo Sebastian Greco

Pablo Sebastian Greco wrote:

Tejun Heo wrote:

Pablo Sebastian Greco wrote:
 

By crash I mean the whole system going down, having to reset the entire
machine.
I'm sending you 4 files:
dmesg: current boot dmesg, just a boot, because no errors appeared after
last crash, since the server is out of production right now (errors
usually appear under heavy load, and this is primarily a transparent proxy
for about 1000 simultaneous users)
lspci: the way you asked for it
messages and messages.1: files where you can see old boots and crashes
(even a soft lockup).
If there is anything else I can do, let me know. If you need direct
access to the server, I can arrange that too.



Can you try 2.6.20-rc3 and see if 'CLO not available' message goes away
(please post boot dmesg)?

The crash/lock is because filesystem code does not cope with IO errors
very well.  I can't tell why timeouts are occurring in the first place.
 It seems that only samsung drives are affected (sda2, 3, 4).  Hmmm...
Please apply the attached patch to 2.6.20-rc3 and test it.

Thanks.

  
Here's the boot dmesg with 2.6.20-rc3 + blacklist. You are right that only 
the samsung drives are affected, but since those are the only drives that get 
the heavy load, I couldn't tell for sure.
I'm putting the server in production right now, so I think in a few 
hours I'll have more info.


Thanks.
Pablo.
After an uptime of 13:34 under heavy load with no errors, I'm pretty 
sure your patch is correct. Is there a way to backport this to 2.6.18.x?
One off-topic question: does anyone know why I get such uneven IRQ 
handling on 2.6.19-20 and almost perfect handling on 2.6.20-rc2-mm1?


Thanks for everything.
Pablo.


Re: SATA problems

2007-01-04 Thread Pablo Sebastian Greco

Tejun Heo wrote:

Pablo Sebastian Greco wrote:
  

By crash I mean the whole system going down, having to reset the entire
machine.
I'm sending you 4 files:
dmesg: current boot dmesg, just a boot, because no errors appeared after
last crash, since the server is out of production right now (errors
usually appear under heavy load, and this is primarily a transparent proxy
for about 1000 simultaneous users)
lspci: the way you asked for it
messages and messages.1: files where you can see old boots and crashes
(even a soft lockup).
If there is anything else I can do, let me know. If you need direct
access to the server, I can arrange that too.



Can you try 2.6.20-rc3 and see if 'CLO not available' message goes away
(please post boot dmesg)?

The crash/lock is because filesystem code does not cope with IO errors
very well.  I can't tell why timeouts are occurring in the first place.
It seems that only Samsung drives are affected (sda2, 3, 4).  Hmmm...
Please apply the attached patch to 2.6.20-rc3 and test it.

Thanks.

  
Here's the boot dmesg with 2.6.20-rc3 + blacklist. You are right that 
only the Samsung drives are affected, but since those drives take all the 
heavy load, I couldn't tell for sure.
I'm putting the server back in production right now, so I think in a few 
hours I'll have more info.


Thanks.
Pablo.


dmesg.bz2
Description: Binary data


SATA problems

2007-01-02 Thread Pablo Sebastian Greco
First of all, thanks for everything, and my apologies if I'm doing 
anything wrong; this is my first lkml mail, but I've read the whole FAQ, 
so it should be OK.

This is the machine with the problem:

Intel ServerBoard S5000VSA
Dual Core Xeon 2.66 (Intel(R) Xeon(TM) CPU 2.66GHz stepping 04)
4G Kingston
1 Seagate 80G sata (ST380211AS) (sda)
3 Samsung 250G sata (SAMSUNG SP2504C) (sdb,c,d)

Installed distribution is FC6 x86_64

I've been getting these messages with both distribution and vanilla kernels:

Jan  1 16:29:08 squid kernel: ata4.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 0x2 frozen
Jan  1 16:29:08 squid kernel: ata4.00: cmd 61/60:00:c9:6d:8e/00:00:0e:00:00/40 tag 0 cdb 0x0 data 49152 out
Jan  1 16:29:08 squid kernel:  res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  1 16:29:08 squid kernel: ata4.00: cmd 60/08:08:f7:7d:56/00:00:0e:00:00/40 tag 1 cdb 0x0 data 4096 in
Jan  1 16:29:08 squid kernel:  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Jan  1 16:29:08 squid kernel: ata4: soft resetting port
Jan  1 16:29:08 squid kernel: ata4: softreset failed (port busy but CLO unavailable)
Jan  1 16:29:08 squid kernel: ata4: softreset failed, retrying in 5 secs
Jan  1 16:29:13 squid kernel: ata4: hard resetting port
Jan  1 16:29:21 squid kernel: ata4: port is slow to respond, please be patient (Status 0x80)
Jan  1 16:29:43 squid kernel: ata4: port failed to respond (30 secs, Status 0x80)
Jan  1 16:29:43 squid kernel: ata4: COMRESET failed (device not ready)
Jan  1 16:29:43 squid kernel: ata4: hardreset failed, retrying in 5 secs
Jan  1 16:29:48 squid kernel: ata4: hard resetting port
Jan  1 16:29:49 squid kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  1 16:29:49 squid kernel: ata4.00: configured for UDMA/133
Jan  1 16:29:49 squid kernel: ata4: EH complete
Jan  1 16:29:49 squid kernel: SCSI device sdd: 488397168 512-byte hdwr sectors (250059 MB)
Jan  1 16:29:49 squid kernel: sdd: Write Protect is off
Jan  1 16:29:49 squid kernel: SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA


There are lots of them, and eventually the system crashes.
I've tested kernels from the FC6 2.6.18 kernel up to vanilla 2.6.20-rc2-mm1. 
Old kernels just crash; newer ones log these messages and then crash.
I don't want to flood this mail with useless info, so please tell 
me what to send and I'll send it (dmesg, smartctl... you name it).
BTW, memtest ran for about 2 days without errors, and 
badblocks on all 4 drives returned nothing. Reallocated_Sector_Ct 
raw_value was 0 on all 4 drives.
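
In case it helps to read the exception lines quoted earlier in this mail, here is a rough Python sketch that decodes the SAct bitmask and the cmd taskfile dump. The helper names (active_tags, decode_cmd) are made up for this example. The field order (command/feature:nsect:lbal:lbam:lbah/hob fields/device) is my understanding of the libata error dump and should be treated as an assumption, as should the rule that NCQ commands carry the sector count in the feature fields.

```python
import re

# Assumed taskfile layout of the libata "cmd" dump:
#   command/feature:nsect:lbal:lbam:lbah/
#   hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
CMD_RE = re.compile(
    r"cmd\s+([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/"
    r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})\s+tag\s+(\d+)",
    re.IGNORECASE,
)
SACT_RE = re.compile(r"SAct\s+0x([0-9a-f]+)", re.IGNORECASE)

NCQ_COMMANDS = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}


def active_tags(line):
    """Return the NCQ tags set in the SAct mask, or None if no SAct field."""
    m = SACT_RE.search(line)
    if m is None:
        return None
    mask = int(m.group(1), 16)
    return [bit for bit in range(32) if (mask >> bit) & 1]


def decode_cmd(line):
    """Decode one 'cmd' line into command name, tag, LBA and sector count."""
    m = CMD_RE.search(line)
    if m is None:
        return None
    command = int(m.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (
        int(x, 16) for x in m.group(2).split(":")
    )
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in m.group(3).split(":")
    )
    # 48-bit LBA: high-order bytes come from the hob_* fields.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    info = {
        "command": command,
        "name": NCQ_COMMANDS.get(command, "unknown"),
        "tag": int(m.group(5)),
        "lba": lba,
    }
    if command in NCQ_COMMANDS:
        # Assumption: NCQ commands carry the sector count in feature fields.
        info["sectors"] = (hob_feature << 8) | feature
    return info


if __name__ == "__main__":
    print(active_tags("exception Emask 0x0 SAct 0x7fff SErr 0x0"))
    print(decode_cmd(
        "ata4.00: cmd 61/60:00:c9:6d:8e/00:00:0e:00:00/40 tag 0 "
        "cdb 0x0 data 49152 out"
    ))
```

With the first cmd line above, the decoded sector count times 512 matches the "data 49152 out" figure in the same log line, which is a quick sanity check on the assumed field order.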


Thanks in advance.
Pablo.