Bug#625922: SATA devices get reset without real hardware failure

Natalia Portillo Sat, 26 Nov 2011 06:21:43 -0800

On 26/11/2011, at 07:49, Jonathan Nieder wrote:

> Hi,
> 
> Natalia Portillo wrote:
> 
>> While running stock Debian's sid linux 2.6.38-8-amd64 kernel I'm
>> getting random fails on SATA devices.
>> 
>> I have a RAID5 system with 5 disks and 3 of them showed the same
>> exact failure, one each 48 hours.
>> 
>> On reboot, the devices work perfectly, and badblocks runs through
>> them without a single failure.
>> 
>> Kernel exact failure is:
>> 
>> [255352.928063] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>> [255352.928071] ata4.00: failed command: FLUSH CACHE EXT
> [...]
>> Devices are in different SATA ports (first failed ata2, then ata5,
>> then ata4) and are all Seagate ST2000DL003-9VT166.
>> 
>> Same exact hardware has been running on Linux 2.6.32-gentoo for
>> weeks without a single failure.
> 
> Thanks for reporting it, and sorry for the slow response.
> 
> Some questions:
> 
> - what kernel are you using now?


claunia@hades:~$ uname -a
Linux hades 3.0.0-1-amd64 #1 SMP Sat Aug 27 16:21:11 UTC 2011 x86_64 GNU/Linux

wheezy

> - can you still reproduce this?

I have only been running this kernel for two weeks, and there is a bug, but a different one.

> - can you reproduce it with a squeeze kernel, too?

Yes, with all the squeeze kernels I ran up until two weeks ago.

> - do you know what exact version the working 2.6.32-gentoo kernel
>   was?

r6, I think.

> - please attach a log of the initialization of the kernel, either by
>   saving full "dmesg" output right after booting or by gathering it
>   from /var/log/dmesg*

I will have to dig through the rotated logs; stay tuned.
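
For anyone wanting to run the same search, here is a minimal sketch of scanning rotated logs for libata error lines. The function names, the regular expression, and the default /var/log/dmesg* glob are illustrative assumptions, not part of any Debian tool. It only recognises the "ataN.NN: exception" and "ataN.NN: failed command" lines in the format quoted above.

```python
import glob
import gzip
import re

# Matches lines such as:
#   [255352.928063] ata4.00: exception Emask 0x0 ...
#   [255352.928071] ata4.00: failed command: FLUSH CACHE EXT
ATA_EVENT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<dev>ata\d+(?:\.\d+)?): "
    r"(?P<msg>(?:exception|failed command)\b.*)"
)


def find_ata_events(lines):
    """Yield (timestamp, device, message) for each libata error line."""
    for line in lines:
        m = ATA_EVENT.search(line)
        if m:
            yield float(m.group("ts")), m.group("dev"), m.group("msg").strip()


def open_log(path):
    """Open a plain or gzip-compressed (rotated) log file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def scan_logs(pattern="/var/log/dmesg*"):
    """Print every libata error found in the files matching the glob."""
    for path in sorted(glob.glob(pattern)):
        with open_log(path) as fh:
            for ts, dev, msg in find_ata_events(fh):
                print(f"{path}: [{ts:.6f}] {dev}: {msg}")
```

Calling scan_logs() with no arguments walks /var/log/dmesg*; passing another glob such as "/var/log/kern.log*" (an assumption about where the messages ended up) covers the syslog copies. Reading those files may require root.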

> - any workarounds or other weird symptoms?

Curiously, no workarounds, but there are other weird symptoms in this and other kernels.

On both the squeeze and wheezy kernels the following happens almost once a day
(always during heavy network transfers):

[118801.372070] INFO: task bacula-sd:27996 blocked for more than 120 seconds.
[118801.372091] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[118801.372113] bacula-sd       D ffff88009f63a2c0     0 27996      1 0x00000000
[118801.372122]  ffff88009f63a2c0 0000000000000082 0000000000000000 ffff880000000000
[118801.372130]  ffff8800bc3780c0 0000000000012800 ffff88008954dfd8 ffff88008954dfd8
[118801.372138]  0000000000012800 ffff88009f63a2c0 0000000000012800 0000000000012800
[118801.372146] Call Trace:
[118801.372161]  [<ffffffff81335bb1>] ? schedule_timeout+0x2d/0xd7
[118801.372170]  [<ffffffff8119287b>] ? blk_peek_request+0x1a7/0x1bc
[118801.372176]  [<ffffffff8133598b>] ? wait_for_common+0x9d/0x116
[118801.372184]  [<ffffffff8103f0a4>] ? try_to_wake_up+0x199/0x199
[118801.372190]  [<ffffffff81336a8d>] ? _raw_spin_lock_irq+0xd/0x1a
[118801.372218]  [<ffffffffa00fdab9>] ? st_do_scsi.clone.10+0x2d9/0x309 [st]
[118801.372228]  [<ffffffffa00fe384>] ? st_int_ioctl+0x673/0xad5 [st]
[118801.372234]  [<ffffffff8103aec8>] ? mmdrop+0xd/0x1c
[118801.372241]  [<ffffffff8103840a>] ? should_resched+0x5/0x24
[118801.372250]  [<ffffffffa0100689>] ? st_ioctl+0xb5e/0xedf [st]
[118801.372259]  [<ffffffff81062efc>] ? hrtimer_try_to_cancel+0x3c/0x46
[118801.372265]  [<ffffffff81062f12>] ? hrtimer_cancel+0xc/0x16
[118801.372272]  [<ffffffff8110905d>] ? do_vfs_ioctl+0x45b/0x49c
[118801.372278]  [<ffffffff81062b83>] ? update_rmtp+0x62/0x62
[118801.372284]  [<ffffffff81063279>] ? hrtimer_start_expires+0x16/0x1b
[118801.372290]  [<ffffffff811090e9>] ? sys_ioctl+0x4b/0x72
[118801.372297]  [<ffffffff8133bd12>] ? system_call_fastpath+0x16/0x1b

This repeats many times. The stack trace is always different, but it always
belongs to the process doing the transfer (such as bacula-sd or netatalk) or to
the XFS or MD RAID processes.

On the squeeze kernel, when this happens nothing works: new processes do not
start, and killed processes do not exit. A hard reboot is the only way out.
On wheezy, the system continues working.

Curiously, yesterday I received an Efika MX Smartbook that exhibits a different
but very similar bug.

With kernel Linux 2.6.31.14.26-efikamx, the internal SSD suffers a lost
interrupt and resets when CPU usage is high. Sorry, I have to dig up those logs as well.

> 
> If you can reproduce this reliably with a 3.1.y kernel, we should
> take this upstream (looks like that's [email protected]
> plus [email protected]; please cc me or this bug log if
> writing there so we can track it).
> 
> Hope that helps,
> Jonathan



