Re: ata command timeout

2007-02-19 Thread Marc Marais
On Tue, 20 Feb 2007 13:07:50 +0900, Tejun Heo wrote
> [EMAIL PROTECTED] wrote:
> > Hello,
> > 
> > I have been running 2.6.18 for two months and the last couple of days these 
> > error messages have appeared in my logs 
> > (sata_promise kernel module, sda:SATA sdb:PATA disks):
> > 
> >ata1: command timeout
> > Feb 17 22:23:14 linux kernel: ata1: no sense translation for status: 0x40
> > Feb 17 22:23:14 linux kernel: ata1: translated ATA stat/err 0x40/00 to SCSI 
> > SK/ASC/ASCQ 0xb/00/00
> > Feb 17 22:23:14 linux kernel: ata1: status=0x40 { DriveReady }
> > Feb 17 22:23:14 linux kernel: sd 0:0:0:0: SCSI error: return code = 
> > 0x0802
> > Feb 17 22:23:14 linux kernel: sda: Current: sense key: Aborted Command
> > Feb 17 22:23:14 linux kernel: Additional sense: No additional sense 
> > information
> > Feb 17 22:23:14 linux kernel: end_request: I/O error, dev sda, sector 
> > 145179585
> > Feb 17 22:23:14 linux kernel: Buffer I/O error on device sda2, logical 
> > block 
> > 2787300
> > Feb 17 22:23:14 linux kernel: lost page write due to I/O error on sda2
> > 
> > and 
> > 
> >ata1: command timeout
> > Feb 19 20:39:31 linux kernel: ata1: no sense translation for status: 0x40
> > Feb 19 20:39:31 linux kernel: ata1: translated ATA stat/err 0x40/00 to SCSI 
> > SK/ASC/ASCQ 0xb/00/00
> > Feb 19 20:39:31 linux kernel: ata1: status=0x40 { DriveReady }
> > Feb 19 20:39:31 linux kernel: sd 0:0:0:0: SCSI error: return code = 
> > 0x0802
> > Feb 19 20:39:31 linux kernel: sda: Current: sense key: Aborted Command
> > Feb 19 20:39:31 linux kernel: Additional sense: No additional sense 
> > information
> > Feb 19 20:39:31 linux kernel: end_request: I/O error, dev sda, sector 
> > 89553479
> > 
> > without any other ill effects that I know of (I did SMART tests on the
> > drive; all passed successfully).
> > I have read that hddtemp may be the cause of this (I am running version
> > 0.3), so is there any reason to worry and prepare for an HDD replacement?
> 
> Not really.  If the problem occurs very infrequently, you don't need 
> to worry about it too much.  Command timeouts do occur on otherwise healthy
> systems from time to time.
> 
> -- 
> tejun

I'm experiencing exactly the same problem with my setup, also with
sata_promise. I posted to the linux-ide list, but it wasn't really
acknowledged as a problem in the driver. Are these command timeouts? The log
entries don't seem to say so - just an error with 'DriveReady' and an aborted
command. I would think some kind of retry should be performed (and, if it is,
logged as well).
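
For reference, I believe the "no sense translation" line comes from libata's
ATA-to-SCSI translation: a failed command's status/error bits are looked up
in a table, and status 0x40 (just DRDY, no error bit set) has no entry, so
the code falls back to a generic ABORTED COMMAND. A simplified sketch of that
fallback as I understand it (loosely modeled on ata_to_sense_error(); the
details here are my assumption, not the driver source):

typedef unsigned char u8;

/* Fallback translation: reached when a command failed but the ATA
 * status/error registers match nothing in the translation tables. */
static void ata_status_to_sense_fallback(u8 *sk, u8 *asc, u8 *ascq)
{
	/* Generic ABORTED COMMAND - exactly the SK/ASC/ASCQ 0xb/00/00
	 * and "Aborted Command" sense key seen in the logs above. */
	*sk = 0x0b;
	*asc = 0x00;
	*ascq = 0x00;
}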

The errors may be benign, but the problem when using software RAID (the md
driver) is that an error like this can leave the array degraded - and, worse,
damage it should such a read error occur while the array is already degraded.

The question is what happens after the error is reported - is the operation
retried? In my situation the md layer receives the error and recovers by
taking the data from another drive in the array.
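
Something like this toy sketch is what I mean by that recovery (illustrative
only, not the md code itself; raid5 actually reconstructs from parity, but
the effect is the same):

typedef int (*read_fn)(int disk, long long sector, void *buf);

/* On a read error, try every other in-sync member for the data. */
static int recover_read(int nr_disks, int bad_disk, long long sector,
			void *buf, read_fn read_disk)
{
	for (int d = 0; d < nr_disks; d++) {
		if (d == bad_disk)
			continue;
		if (read_disk(d, sector, buf) == 0)
			return 0;	/* recovered from redundancy */
	}
	/* No spare copy left (array already degraded): the read fails
	 * for real - exactly the risk described above. */
	return -1;
}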

The fact that you are also experiencing this suggests, in my opinion, an
issue that needs further investigation.



Regards,
Marc


Re: md: md6_raid5 crash 2.6.20

2007-02-11 Thread Marc Marais
On Mon, 12 Feb 2007 09:02:33 +1100, Neil Brown wrote
> On Sunday February 11, [EMAIL PROTECTED] wrote:
> > Greetings,
> > 
> > I've been running md on my server for some time now, and a few days ago
> > one of the (3) drives in the raid5 array started giving read errors. The
> > result was usually a system hang; this was with kernel 2.6.17.13. I
> > upgraded to the latest production 2.6.20 kernel and experienced the same
> > behaviour.
> 
> System hangs suggest a problem with the drive controller.  However
> this "kernel BUG" is something newly introduced in 2.6.20 which 
> should be fixed in 2.6.20.1.  Patch is below.
> 
> If you still get hangs with this patch installed, then please report
> detail, and probably copy to [EMAIL PROTECTED]
> 
> NeilBrown
> 
> Fix various bugs with aligned reads in RAID5.
> 
> It is possible for raid5 to be sent a bio that is too big
> for an underlying device.  So if it is a READ that we
> pass straight down to a device, it will fail and confuse
> RAID5.
> 
> So in 'chunk_aligned_read' we check that the bio fits within the
> parameters for the target device and if it doesn't fit, fall back
> on reading through the stripe cache and making lots of one-page
> requests.
> 
> Note that this is the earliest time we can check against the device
> because earlier we don't have a lock on the device, so it could 
> change underneath us.
> 
> Also, the code for handling a retry through the cache when a read
> fails has not been tested and was badly broken.  This patch fixes 
> that code.
> 
> Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
> 
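
If I follow that description, the new check amounts to a pre-flight test of
the bio against the member device's request-queue limits before it is sent
straight to the disk. Roughly like this (the field names are my guess at the
2.6.20 block layer, not a quote of the actual patch):

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Does this bio fit the member device's queue limits?  If not, the
 * read has to go through the stripe cache instead. */
static int bio_fits_device(struct request_queue *q, struct bio *bi)
{
	if ((bi->bi_size >> 9) > q->max_sectors)
		return 0;			/* too many sectors */
	blk_recount_segments(q, bi);
	if (bi->bi_phys_segments > q->max_phys_segments ||
	    bi->bi_hw_segments > q->max_hw_segments)
		return 0;			/* too many segments */
	return 1;	/* safe to send straight to the device */
}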

Thanks for the quick response, Neil. Unfortunately the kernel doesn't build
with this patch due to a missing symbol:

WARNING: "blk_recount_segments" [drivers/md/raid456.ko] undefined!

Is that in another file that needs patching or within raid5.c?
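
My guess is that the function exists in the core kernel but simply isn't
exported to modules. Assuming blk_recount_segments is defined in
block/ll_rw_blk.c in this tree, the companion change would be something like
the following - unverified, just my assumption:

/* block/ll_rw_blk.c: export the symbol so raid456.ko can link. */
EXPORT_SYMBOL(blk_recount_segments);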

Marc


md: md6_raid5 crash 2.6.20

2007-02-10 Thread Marc Marais
Greetings,

I've been running md on my server for some time now, and a few days ago one of
the (3) drives in the raid5 array started giving read errors. The result was
usually a system hang; this was with kernel 2.6.17.13. I upgraded to the
latest production 2.6.20 kernel and experienced the same behaviour.

I've tried to recondition the drive with dd writes to force reallocation, but
there are still read errors (appearing at other addresses). Anyway, the drive
is bad; the point is that there is a problem with how this is handled. I was
starting to suspect my hardware, but then I got the oops below while trying to
back up the array (which was, eventually, fully synchronised) - it appears it
hit another read error.

Note I'm using PATA drives on a Promise IDE controller (driven by the
sata_promise module).

Please investigate / let me know if you need more info and can help me out
here. Thanks.

Syslog extract:
--
Feb 11 15:08:59 xerces kernel: ata4: command timeout
Feb 11 15:08:59 xerces kernel: ata4: no sense translation for status: 0x40
Feb 11 15:08:59 xerces kernel: ata4: translated ATA stat/err 0x40/00 to SCSI
SK/ASC/ASCQ 0xb/00/00
Feb 11 15:08:59 xerces kernel: ata4: status=0x40 { DriveReady }
Feb 11 15:08:59 xerces kernel: sd 4:0:0:0: SCSI error: return code = 0x0802
Feb 11 15:08:59 xerces kernel: sdc: Current [descriptor]: sense key: Aborted
Command
Feb 11 15:08:59 xerces kernel: Additional sense: No additional sense
information
Feb 11 15:08:59 xerces kernel: Descriptor sense data with sense descriptors
(in hex):
Feb 11 15:08:59 xerces kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00
00 00 00 
Feb 11 15:08:59 xerces kernel: 00 00 00 00 
Feb 11 15:08:59 xerces kernel: end_request: I/O error, dev sdc, sector 54744127
Feb 11 15:08:59 xerces kernel: [ cut here ]
Feb 11 15:08:59 xerces kernel: kernel BUG at mm/filemap.c:537!
Feb 11 15:08:59 xerces kernel: invalid opcode:  [#1]
Feb 11 15:08:59 xerces kernel: SMP 
Feb 11 15:08:59 xerces kernel: Modules linked in: aes cbc blkcipher dm_crypt
sg sr_mod vmnet(P) parport_pc parport vmmon(P) st binfmt_misc raid456 xor
sd_mod dm_mod snd_cmipci gameport snd_pcm_oss snd_mixer_oss snd_pcm
snd_page_alloc snd_opl3_lib snd_timer snd_hwdep snd_mpu401_uart snd_rawmidi
snd_seq_device snd soundcore w83781d hwmon_vid i2c_isa i2c_amd756 i2c_core
sata_promise aic7xxx scsi_transport_spi e1000 usbhid ohci_hcd usbcore amd_rng
rng_core video1394 ohci1394 sbp2 raw1394 ieee1394 3c59x mii psmouse ide_cd
cdrom rtc ext3 jbd mbcache ide_disk ide_generic via82cxxx siimage pdc202xx_old
generic cmd64x amd74xx pdc202xx_new ide_core raid1 md_mod unix
Feb 11 15:08:59 xerces kernel: CPU:1
Feb 11 15:08:59 xerces kernel: EIP:0060:[unlock_page+14/48]Tainted: P
 VLI
Feb 11 15:08:59 xerces kernel: EFLAGS: 00010246   (2.6.20 #1)
Feb 11 15:08:59 xerces kernel: EIP is at unlock_page+0xe/0x30
Feb 11 15:08:59 xerces kernel: eax:    ebx: c10d   ecx: c218af20 
 edx: c10d
Feb 11 15:08:59 xerces kernel: esi: dfd6db00   edi: 0001   ebp: dfd6db00 
 esp: f755ff00
Feb 11 15:08:59 xerces kernel: ds: 007b   es: 007b   ss: 0068
Feb 11 15:08:59 xerces kernel: Process md6_raid5 (pid: 1402, ti=f755e000
task=c218f070 task.ti=f755e000)
Feb 11 15:08:59 xerces kernel: Stack: c218af2c c0180d08 0686a800 
f7854334 f8bfd51d 0002 f755ff44 
Feb 11 15:08:59 xerces kernel:f755ff48 f7854260  
0686a800  03435400  
Feb 11 15:08:59 xerces kernel:f7854260 0002 0001 0003
f7854260 f7854334 f7d6c200 f8bfd677 
Feb 11 15:08:59 xerces kernel: Call Trace:
Feb 11 15:08:59 xerces kernel:  [mpage_end_io_read+72/112]
mpage_end_io_read+0x48/0x70
Feb 11 15:08:59 xerces kernel:  [pg0+948458781/1070097408]
retry_aligned_read+0xfd/0x1d0 [raid456]
Feb 11 15:08:59 xerces kernel:  [pg0+948459127/1070097408] raid5d+0x87/0x130
[raid456]
Feb 11 15:08:59 xerces kernel:  [pg0+944523815/1070097408]
md_thread+0x57/0x110 [md_mod]
Feb 11 15:08:59 xerces kernel:  [autoremove_wake_function+0/80]
autoremove_wake_function+0x0/0x50
Feb 11 15:08:59 xerces kernel:  [autoremove_wake_function+0/80]
autoremove_wake_function+0x0/0x50
Feb 11 15:08:59 xerces kernel:  [pg0+944523728/1070097408] md_thread+0x0/0x110
[md_mod]
Feb 11 15:08:59 xerces kernel:  [kthread+154/160] kthread+0x9a/0xa0
Feb 11 15:08:59 xerces kernel:  [kthread+0/160] kthread+0x0/0xa0
Feb 11 15:08:59 xerces kernel:  [kernel_thread_helper+7/20]
kernel_thread_helper+0x7/0x14
Feb 11 15:08:59 xerces kernel:  ===
Feb 11 15:08:59 xerces kernel: Code: 29 ff ff ff b9 c0 d4 13 c0 8d 54 24 24 c7
04 24 02 00 00 00 e8 a4 15 14 00 eb cd 89 f6 53 89 c3 f0 0f ba 30 00 19 c0 85
c0 75 04 <0f> 0b eb fe 89 d8 e8 f7 fe ff ff 89 da 31 c9 5b e9 7d 12 ff ff 
Feb 11 15:08:59 xerces kernel: EIP: [unlock_page+14/48] unlock_page+0xe/0x30
SS:ESP 0068:f755ff00
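
In case it helps whoever picks this up: as far as I can tell, mm/filemap.c:537
in 2.6.20 is the BUG() in unlock_page() that fires when the page being
unlocked was not actually locked - i.e. a double unlock, which would fit a
read retry completing twice. Reconstructed from memory, so treat it as an
approximation rather than a verbatim quote:

/* mm/filemap.c (2.6.20, approximate): unlock_page() atomically clears
 * PG_locked and wakes waiters; if the bit was already clear, the page
 * is being unlocked twice and BUG() triggers - seen in the oops above
 * as "kernel BUG at mm/filemap.c:537" / invalid opcode. */
void fastcall unlock_page(struct page *page)
{
	smp_mb__before_clear_bit();
	if (!TestClearPageLocked(page))
		BUG();
	smp_mb__after_clear_bit();
	wake_up_page(page, PG_locked);
}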

-
DMESG:
