Re: Hard disk woes

2005-09-06 Thread Norberto Meijome

Michael Abbott wrote:
> I still think the question: why does FreeBSD hang? is interesting.

Indeed. I have no idea how Linux handles this; win32 would probably BSOD (I
have had W2K servers BSOD because someone accidentally powered down an
external drive they were writing to. Nasty).

Anyway, I had a weird problem too: ad4 (a SATA drive) got detached
overnight. More details at:


http://lists.freebsd.org/pipermail/freebsd-questions/2005-September/097607.html

When I got to the machine in the morning, the console was completely frozen,
though I could still access the box just fine via ssh.


Would anyone care to provide some explanation about this?

(After a couple of full scans with mhdd and no problems detected, I put 
the drive back into the server and it's been running ok since then. 
bloody weird.)


thanks in advance,
beto
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Hard disk woes

2005-09-05 Thread Michael Abbott

I'm having some very odd behaviour from one of my hard disks and I wonder
what anybody makes of it.

In brief, the hard disk in question works just fine much of the time, but
when high volume data transfers are requested I get the following in
/var/log/messages:

Sep  3 15:21:02 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
Sep  3 15:21:02 saturn /kernel: ata3: resetting devices .. done
Sep  3 15:21:12 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
Sep  3 15:21:12 saturn /kernel: ata3: resetting devices .. done
Sep  3 15:21:23 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
Sep  3 15:21:23 saturn /kernel: ata3: resetting devices .. done
Sep  3 15:21:33 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
Sep  3 15:21:33 saturn /kernel: ad6: trying fallback to PIO mode
Sep  3 15:21:33 saturn /kernel: ata3: resetting devices .. done
Sep  3 15:21:43 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
Sep  3 15:21:43 saturn /kernel: ata3: resetting devices .. ata3-slave: ATA identify retries exceeded
Sep  3 15:21:43 saturn /kernel: done

After this point the hard disk in question is frozen until I reboot, and
any process that tries to touch it is similarly frozen (doesn't even
respond to kill -9).  `shutdown -r` is enough to restore operation, and
the rest of the system seemed happy enough.
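
To see at a glance which device is timing out and how often, the log lines quoted above can be tallied. The following is only a minimal sketch: it assumes the syslog layout shown in the excerpt (each message on one line in the log file, with the device name as a field ending in a colon), and the function name count_ata_timeouts is made up for illustration.

```sh
#!/bin/sh
# count_ata_timeouts FILE
# Print "<device> <count>" for every adN device that logged a
# "command timeout" line. Assumes the layout shown in the excerpt above,
# with each message on a single line in the log file.
count_ata_timeouts() {
    awk '/command timeout/ {
             for (i = 1; i <= NF; i++) {
                 if ($i ~ /^ad[0-9]+:$/) {
                     dev = $i
                     sub(/:$/, "", dev)   # strip the trailing colon
                     count[dev]++
                     break
                 }
             }
         }
         END { for (dev in count) print dev, count[dev] }' "$1" | sort
}

# Example: count_ata_timeouts /var/log/messages
```

Run against the excerpt above, this would report ad6 with five timeouts; a second device showing up in the tally would point away from a single bad disk.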

Another interesting effect.  I placed a replacement hard disk on the same
ATA bus (as a slave, device ad7) and tried copying files from ad6 to ad7.
This time when ad6 froze and the kernel decided to give up on ata3 (and so
decided to disable ad7 at the same time, naturally enough) the entire
system froze!  No response from the console, stone cold dead, hard reset
needed.


So some questions seem to me to arise from this.

1.  Why does FreeBSD handle this so ungracefully?  If restarting is
sufficient to bring ata3 back then can't the ata driver do a proper
restart?

2.  Goodness me, FreeBSD froze!  I know it's a hardware failure, but
still: it's on an auxiliary ATA controller with no system files attached.
Is this problem of general interest?  It's certainly a massive hint to me
not to consider (parallel) ATA for RAID!

3.  Any thoughts on what is wrong with the hard disk in question?  I've
changed ATA controllers, so it seems to be the disk, not the controller.
The behaviour is very odd.  If I copy files off one at a time, e.g. using:
  find . -type f -exec cp {} $TARGET/{} \; -exec echo -n '.' \;
the disk seems to hang in there, but if I just do
  cp -R . $TARGET
then it freezes!  (This statement may not have been thoroughly tested:
having to restart each time gets old quite quickly.)
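
The file-at-a-time approach can be made a little more robust. Note that the find one-liner above only works when every subdirectory already exists under $TARGET, since cp does not create intermediate directories. The sketch below creates the directories first and then copies files one by one with an optional pause. The name copy_slowly and its pause argument are hypothetical, and whether pacing the reads actually keeps a marginal disk alive is an untested assumption.

```sh
#!/bin/sh
# copy_slowly SRC DST [PAUSE_SECONDS]
# Copy a tree one file at a time, creating target directories first and
# sleeping between files. File names containing newlines are not handled.
copy_slowly() {
    src=$1 dst=$2 pause=${3:-1}

    # Recreate the directory layout under the target.
    ( cd "$src" && find . -type d ) | while IFS= read -r d; do
        mkdir -p "$dst/$d"
    done

    # Copy files individually; report failures and keep going.
    ( cd "$src" && find . -type f ) | while IFS= read -r f; do
        cp -p "$src/$f" "$dst/$f" || echo "failed: $f" >&2
        sleep "$pause"
    done
}

# Example: copy_slowly /mnt/ad6 "$TARGET" 2
```

Failures are reported on stderr rather than aborting the run, so a single unreadable file does not stop the rest of the rescue.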


Ok, now for the boring bits.

$ uname -a
FreeBSD saturn.araneidae.co.uk 4.11-RELEASE-p11 FreeBSD 4.11-RELEASE-p11 #6: Sat Aug 27 16:33:58 GMT 2005 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC  i386
$ dmesg | grep ata
atapci0: HighPoint HPT370 ATA100 controller port 0xa000-0xa0ff,0x9c00-0x9c03,0x9800-0x9807,0x9400-0x9403,0x9000-0x9007 irq 12 at device 11.0 on pci0
ata2: at 0x9000 on atapci0
ata3: at 0x9800 on atapci0
atapci1: VIA 8233 ATA133 controller port 0xa800-0xa80f at device 17.1 on pci0
ata0: at 0x1f0 irq 14 on atapci1
ata1: at 0x170 irq 15 on atapci1
atapci2: HighPoint HPT372 ATA133 controller port 0xc400-0xc4ff,0xc000-0xc003,0xbc00-0xbc07,0xb800-0xb803,0xb400-0xb407 irq 10 at device 19.0 on pci0
ata4: at 0xb400 on atapci2
ata5: at 0xbc00 on atapci2
ad0: 39083MB Maxtor 4D040H2 [79408/16/63] at ata0-master UDMA100
ad1: 190782MB SAMSUNG SP2014N [387621/16/63] at ata0-slave UDMA133
ad4: 76319MB ST380021A [155061/16/63] at ata2-master UDMA100
ad6: 76319MB ST380021A [155061/16/63] at ata3-master UDMA100
acd0: DVD-ROM CREATIVEDVD-ROM DVD2240E 12/24/97 at ata1-master PIO4
$ sudo atacontrol cap ata3 0
ATA channel 3, Master, device ad6:

ATA/ATAPI revision    5
device model          ST380021A
serial number         3HV0MYL9
firmware revision     3.10
cylinders             16383
heads                 16
sectors/track         63
lba supported         156301488 sectors
lba48 not supported
dma supported
overlap not supported

Feature                        Support  Enable  Value       Vendor
write cache                    yes      yes
read ahead                     yes      yes
dma queued                     no       no      0/00
SMART                          yes      no
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no      65278/FEFE
automatic acoustic management  yes      yes     128/80      128/80
$

That's everything I can think of.
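
Because a reset of ata3 takes out both the master and the slave on that channel, it helps to know which devices share a channel. The following is a minimal sketch that filters dmesg probe lines like the ones above. devices_on_channel is a hypothetical helper, and it assumes the "adN: ... at ataX-master ..." layout shown in the listing.

```sh
#!/bin/sh
# devices_on_channel CHANNEL
# Read dmesg-style text on stdin and print "<device> <position>" for each
# ad/acd device attached to CHANNEL (for example ata3).
devices_on_channel() {
    awk -v ch="$1" '$1 ~ /^(ad|acd)[0-9]+:$/ {
             for (i = 1; i < NF; i++) {
                 if ($i == "at") {
                     split($(i + 1), part, "-")   # "ata3-master" -> ata3, master
                     if (part[1] == ch) {
                         dev = $1
                         sub(/:$/, "", dev)
                         print dev, part[2]
                     }
                     break
                 }
             }
         }'
}

# Example: dmesg | devices_on_channel ata3
```

For the listing above, asking for ata3 would print only ad6 as master, since the replacement disk ad7 had not yet been attached when that dmesg output was captured.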



Re: Hard disk woes

2005-09-05 Thread Jason Morgan
On Mon, Sep 05, 2005 at 03:16:13PM +, Michael Abbott wrote:
> I'm having some very odd behaviour from one of my hard disks and I wonder
> what anybody makes of it.
 

Just a general comment:

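I had a very similar problem a while back. After replacing the drive in
question, then replacing the motherboard, I discovered it was a power
issue. The power supply was freaking out at medium to high loads, which
was causing the device to continually reset.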

Re: Hard disk woes

2005-09-05 Thread Michael Abbott

On Mon, 5 Sep 2005, Jason Morgan wrote:
> On Mon, Sep 05, 2005 at 03:16:13PM +, Michael Abbott wrote:
>> I'm having some very odd behaviour from one of my hard disks and I wonder
>> what anybody makes of it.
>>
>> In brief, the hard disk in question works just fine much of the time, but
>> when high volume data transfers are requested I get the following in
>> /var/log/messages:
>>
>> Sep  3 15:21:02 saturn /kernel: ad6: READ command timeout tag=0 serv=0 - resetting
>
> I had a very similar problem a while back. After replacing the drive in
> question, then replacing the motherboard, I discovered it was a power
> issue. The power supply was freaking out at medium to high loads, which
> was causing the device to continually reset.


Well, I hope that's not it.  I'm encouraged to think not:
- the problem seems to be tied to one particular hard disk and I
  presently run with four hard disks
- the system has operated trouble free for three years
- my memory is that it was a good quality power supply.
I don't really see how I'd diagnose a power supply problem, but as I say, 
the hard disk in question is the only part with problems.



Re: Hard disk woes

2005-09-05 Thread David Kelly


On Sep 5, 2005, at 1:56 PM, Michael Abbott wrote:
>> I had a very similar problem a while back. After replacing the
>> drive in question, then replacing the motherboard, I discovered it
>> was a power issue. The power supply was freaking out at medium to
>> high loads, which was causing the device to continually reset.
>
> Well, I hope that's not it.  I'm encouraged to think not:
> - the problem seems to be tied to one particular hard disk and I
>   presently run with four hard disks
> - the system has operated trouble free for three years
> - my memory is that it was a good quality power supply.
> I don't really see how I'd diagnose a power supply problem, but as
> I say, the hard disk in question is the only part with problems.


Yeah, but... power supplies wear out, particularly the capacitors.

I have seen every single component replaced in denial that the problem
could be related to the power supply. Then the PS was finally replaced
because it was the only thing which had not been. And the problem was the
PS all along.


--
David Kelly N4HHE, [EMAIL PROTECTED]

Whom computers would destroy, they must first drive mad.



Re: Hard disk woes

2005-09-05 Thread Michael Abbott

On Mon, 5 Sep 2005, David Kelly wrote:
>>> I had a very similar problem a while back. After replacing the drive in
>>> question, then replacing the motherboard, I discovered it was a power
>>> issue. The power supply was freaking out at medium to high loads, which
>>> was causing the device to continually reset.
>
> On Sep 5, 2005, at 1:56 PM, Michael Abbott wrote:
>> Well, I hope that's not it.  I'm encouraged to think not:
>
> Yeah But... Power supplies wear out. Particularly the capacitors.
>
> I have seen every single component replaced in denial that the problem could
> be related to the power supply. Then the PS was finally replaced because it
> was the only thing which had not. And the problem was the PS all along.


Well, I do have another reason for thinking that it's nothing to do with 
the power supply: a bit of history I didn't mention (because it's long 
and not particularly interesting).


When I first installed this machine (a bit over three years ago) I used
the offending disk together with another disk of the same model.  I first
used the motherboard's hardware RAID (striping for speed, more fool me)
and installed FreeBSD.  It broke really quite quickly (within a week or
so).


I blamed the RAID controller and tried again, this time using vinum.  The 
system survived quite a bit longer (can't remember how long, a month or so 
maybe), but suddenly failed quite horribly: I lost all data.  I retired 
the two disks and started again, and the resulting system has run sweetly 
for three years.


Recently I brought the two disks out of retirement, and one of them seems 
most unhappy (as described).  I'm strongly persuaded (convinced, even) 
that that one disk is dodgy.  I think I'm going to have to bin it, unless 
somebody can come up with a way to reliably mollycoddle it.


I still think the question: why does FreeBSD hang? is interesting.