On Fri, Dec 12, 2008 at 02:58:55PM +1100, Alex Samad wrote:
> On Thu, Dec 11, 2008 at 08:06:12PM -0600, lee wrote:

> > One option I haven't tried yet is to plug both SATA disks into the
> > same channel (i.e. use adjacent plugs). I didn't do that because they
> > might be blocking each other --- this isn't SCSI :( It shouldn't make
> > a difference, but then, who knows? Maybe both disks go offline if I do
> > that ...
> maybe it's a port (on the motherboard)

Well, I had the same problem with my old motherboard.

> The other thought that came to mind, maybe a bit far-fetched, is the
> drive going into power-saving mode?

I tried that by putting it to sleep with hdparm. It woke up just fine.

> What shows up in dmesg when the drive dies? Do you see any SATA resets
> or ???

[...]
Dec 11 02:39:02 cat /USR/SBIN/CRON[19809]: (root) CMD (  [ -x 
/usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ 
-type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0 rm)
Dec 11 02:49:29 cat kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 
action 0x6 frozen
Dec 11 02:49:29 cat kernel: ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 
tag 0
Dec 11 02:49:29 cat kernel:          res 40/00:00:00:4f:c2/00:00:00:c2:00/00 
Emask 0x4 (timeout)
Dec 11 02:49:29 cat kernel: ata5.00: status: { DRDY }
Dec 11 02:49:29 cat kernel: ata5: hard resetting link
Dec 11 02:49:30 cat kernel: ata5: SATA link down (SStatus 0 SControl 300)
Dec 11 02:49:35 cat kernel: ata5: hard resetting link
Dec 11 02:49:35 cat kernel: ata5: SATA link down (SStatus 0 SControl 300)
Dec 11 02:49:40 cat kernel: ata5: hard resetting link
Dec 11 02:49:40 cat kernel: ata5: SATA link down (SStatus 0 SControl 300)
Dec 11 02:49:40 cat kernel: ata5.00: disabled
Dec 11 02:49:40 cat kernel: sd 4:0:0:0: rejecting I/O to offline device
Dec 11 02:49:40 cat kernel: sd 4:0:0:0: rejecting I/O to offline device
Dec 11 02:49:40 cat kernel: end_request: I/O error, dev sdb, sector 478543967
Dec 11 02:49:40 cat kernel: md: super_written gets error=-5, uptodate=0
Dec 11 02:49:40 cat kernel: raid1: Disk failure on sdb2, disabling device.
Dec 11 02:49:40 cat kernel: raid1: Operation continuing on 1 devices.
Dec 11 02:49:40 cat kernel: ata5: EH complete
Dec 11 02:49:40 cat kernel: ata5.00: detaching (SCSI 4:0:0:0)
Dec 11 02:49:40 cat kernel: sd 4:0:0:0: [sdb] Synchronizing SCSI cache
Dec 11 02:49:40 cat kernel: sd 4:0:0:0: [sdb] Result: hostbyte=0x04 
driverbyte=0x00
Dec 11 02:49:40 cat kernel: sd 4:0:0:0: [sdb] Stopping disk
Dec 11 02:49:40 cat kernel: sd 4:0:0:0: [sdb] START_STOP FAILED
Dec 11 02:49:40 cat kernel: sd 4:0:0:0: [sdb] Result: hostbyte=0x04 
driverbyte=0x00
Dec 11 02:49:40 cat kernel: RAID1 conf printout:
Dec 11 02:49:40 cat kernel:  --- wd:1 rd:2
Dec 11 02:49:40 cat kernel:  disk 0, wo:0, o:1, dev:sda2
Dec 11 02:49:40 cat kernel:  disk 1, wo:1, o:0, dev:sdb2
Dec 11 02:49:40 cat kernel: RAID1 conf printout:
Dec 11 02:49:40 cat kernel:  --- wd:1 rd:2
Dec 11 02:49:40 cat kernel:  disk 0, wo:0, o:1, dev:sda2
Dec 11 02:49:40 cat mdadm[1891]: Fail event detected on md device /dev/md1, 
component device /dev/sdb2
Dec 11 03:06:38 cat kernel: scsi 4:0:0:0: rejecting I/O to dead device
Dec 11 03:06:38 cat kernel: scsi 4:0:0:0: rejecting I/O to dead device
Dec 11 03:06:38 cat kernel: end_request: I/O error, dev sdb, sector 146496512
Dec 11 03:06:38 cat kernel: md: super_written gets error=-5, uptodate=0
Dec 11 03:06:38 cat kernel: raid1: Disk failure on sdb1, disabling device.
Dec 11 03:06:38 cat kernel: raid1: Operation continuing on 1 devices.
Dec 11 03:06:38 cat kernel: RAID1 conf printout:
Dec 11 03:06:38 cat kernel:  --- wd:1 rd:2
Dec 11 03:06:38 cat kernel:  disk 0, wo:0, o:1, dev:sda1
Dec 11 03:06:38 cat kernel:  disk 1, wo:1, o:0, dev:sdb1
Dec 11 03:06:38 cat kernel: RAID1 conf printout:
Dec 11 03:06:38 cat kernel:  --- wd:1 rd:2
Dec 11 03:06:38 cat kernel:  disk 0, wo:0, o:1, dev:sda1
Dec 11 03:06:38 cat mdadm[1891]: Fail event detected on md device /dev/md0, 
component device /dev/sdb1
Dec 11 03:06:40 cat smartd[1797]: Device: /dev/sdb, No such device, open() 
failed 
Dec 11 03:06:40 cat smartd[1797]: Sending warning via 
/usr/share/smartmontools/smartd-runner to root ... 
Dec 11 03:06:41 cat smartd[1797]: Warning via 
/usr/share/smartmontools/smartd-runner to root: successful 
[...]
Dec 11 03:36:40 cat smartd[1797]: Device: /dev/hdb, SMART Usage Attribute: 194 
Temperature_Celsius changed from 113 to 112 
Dec 11 03:36:40 cat smartd[1797]: Device: /dev/sda, SMART Usage Attribute: 190 
Airflow_Temperature_Cel changed from 72 to 71 
Dec 11 03:36:40 cat smartd[1797]: Device: /dev/sdb, No such device, open() 
failed 
[...]
Dec 11 04:06:40 cat smartd[1797]: Device: /dev/sdb, No such device, open() 
failed 
[...]
Dec 11 04:36:40 cat smartd[1797]: Device: /dev/hdb, SMART Usage Attribute: 194 
Temperature_Celsius changed from 112 to 113 
Dec 11 04:36:41 cat smartd[1797]: Device: /dev/sdb, No such device, open() 
failed 
[...]
Dec 11 05:06:40 cat smartd[1797]: Device: /dev/hdb, SMART Usage Attribute: 194 
Temperature_Celsius changed from 113 to 112 
Dec 11 05:06:40 cat smartd[1797]: Device: /dev/sdb, No such device, open() 
failed 
[...]
Dec 11 05:36:40 cat smartd[1797]: Device: /dev/hdb, SMART Usage Attribute: 194 
Temperature_Celsius changed from 112 to 113 
Dec 11 05:36:40 cat smartd[1797]: Device: /dev/sda, SMART Prefailure Attribute: 
8 Seek_Time_Performance changed from 247 to 246 
Dec 11 05:36:40 cat smartd[1797]: Device: /dev/sdb, No such device, open() 
failed 
[...]
Dec 11 06:06:40 cat smartd[1797]: Device: /dev/hdb, SMART Usage Attribute: 194 
Temperature_Celsius changed from 113 to 112 
Dec 11 06:06:41 cat smartd[1797]: Device: /dev/sdb, No such device, open() 
failed 
[...]
Dec 11 06:25:12 cat kernel: raid1: sdb3: rescheduling sector 12320
Dec 11 06:25:12 cat kernel: raid1: Disk failure on sdb3, disabling device.
Dec 11 06:25:12 cat kernel: raid1: Operation continuing on 1 devices.
Dec 11 06:25:12 cat kernel: raid1: sda3: redirecting sector 12320 to another 
mirror
Dec 11 06:25:12 cat kernel: RAID1 conf printout:
Dec 11 06:25:12 cat kernel:  --- wd:1 rd:2
Dec 11 06:25:12 cat kernel:  disk 0, wo:0, o:1, dev:sda3
Dec 11 06:25:12 cat kernel:  disk 1, wo:1, o:0, dev:sdb3
Dec 11 06:25:12 cat kernel: RAID1 conf printout:
Dec 11 06:25:12 cat kernel:  --- wd:1 rd:2
Dec 11 06:25:12 cat kernel:  disk 0, wo:0, o:1, dev:sda3
Dec 11 06:25:12 cat mdadm[1891]: Fail event detected on md device /dev/md2, 
component device /dev/sdb3
Dec 11 06:25:13 cat syslogd 1.5.0#5: restart.
Dec 11 06:26:12 cat mdadm[1891]: SpareActive event detected on md device 
/dev/md2, component device /dev/sdb3


There are no spares in the RAID, so the SpareActive event that mdadm
reports for /dev/sdb3 at the end of the log is odd.

As you can see, it comes out of nowhere: there are about 10 minutes
between the last entry in the log before it (the cron job at 02:39:02)
and the exception (02:49:29).
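The gap can be checked directly from the two timestamps. This is only a minimal sketch: the helper name `syslog_time` is made up for illustration, the year 2008 is an assumption (classic syslog lines do not record a year), and the two sample lines are truncated copies of the entries from the log above.

```python
from datetime import datetime

def syslog_time(line, year=2008):
    # Classic syslog lines start with "Mon DD HH:MM:SS" and carry no year,
    # so the year is an assumption supplied by the caller.
    # Note: %b matches English month names only in a C/English locale.
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

# Truncated copies of the last entry before the failure and the first
# line of the exception.
last_entry = "Dec 11 02:39:02 cat /USR/SBIN/CRON[19809]: (root) CMD"
exception = "Dec 11 02:49:29 cat kernel: ata5.00: exception Emask 0x0"

gap = syslog_time(exception) - syslog_time(last_entry)
print(gap)  # 0:10:27
```

So the drive had been idle, as far as the log shows, for a bit over ten minutes when the command timed out.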

Hm, I could swap the drives to see if the problem is with a particular
disk or if it's with the second disk position. If it always hits
whichever disk is second, it must be a software problem (the old
motherboard showed the same failure, so the port is unlikely). If it
follows the disk, the disk was probably broken from the factory.
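To compare the logs before and after such a swap, the failure events can be pulled out of the syslog. The following is a rough sketch, not a finished tool: the function name `failure_summary` and the two regular expressions are assumptions based only on the message formats seen in the log above. Keep in mind that a name like sdb follows probe order rather than the physical disk, so the ATA port number together with the drive's serial number (for example from `smartctl -i`) is what actually tells the two cases apart.

```python
import re

# Patterns modeled on the kernel messages in the log above (assumed format).
ATA_DISABLED = re.compile(r"kernel: (ata\d+\.\d+): disabled")
MD_FAILURE = re.compile(r"raid1: Disk failure on (\w+), disabling device")

def failure_summary(lines):
    """Return (disabled ATA devices, failed md components), in log order."""
    ata, md = [], []
    for line in lines:
        m = ATA_DISABLED.search(line)
        if m:
            ata.append(m.group(1))
        m = MD_FAILURE.search(line)
        if m:
            md.append(m.group(1))
    return ata, md

# Sample lines copied from the log above.
sample = [
    "Dec 11 02:49:40 cat kernel: ata5.00: disabled",
    "Dec 11 02:49:40 cat kernel: raid1: Disk failure on sdb2, disabling device.",
    "Dec 11 03:06:38 cat kernel: raid1: Disk failure on sdb1, disabling device.",
    "Dec 11 06:25:12 cat kernel: raid1: Disk failure on sdb3, disabling device.",
]
print(failure_summary(sample))
# (['ata5.00'], ['sdb2', 'sdb1', 'sdb3'])
```

If the same ATA port shows up after the swap but with the other drive's serial number attached, the position is suspect; if the failure moves to the other port along with the drive, the drive is.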


-- 
"Don't let them, daddy. Don't let the stars run down."
http://adin.dyndns.org/adin/TheLastQ.htm

