Re: Followup to 'fallback to PIO mode' on dual processor AMD systems

2003-01-02 Thread nate
Bruce Campbell said:


> - try UDMA100 with the drives directly attached (ie. no removable tray) -
>   maybe try a non onboard IDE controller

yes I would recommend a PCI IDE controller, such as the Promise ATA/100 or
Promise ATA/66. Also be sure your IDE cables are 18" and not 24" or 32"; some
people like to go crazy with overly long IDE cables. Sometimes, for me, anything
longer than 18" gets CRC errors (but nothing fatal).


> - shuffle the disks to see if the problems follow the disks or not
>
> At present, I don't suspect bad media because the error message is "WRITE
> command timeout tag=0 serv=0" which doesn't suggest a specific
> sector/track etc, and running with UDMA33 instead of UDMA100 makes the
> problem appear to vanish.

I read your burn-in procedures; a couple of additions I'd recommend
throwing in:

CPUBurn:
http://users.ev1.net/~redelm/

I've only tried it on Linux but the page lists *BSD too. This package
also includes a memory tester. I usually run 1 CPUburn process per CPU
and as many memory testers as I have RAM. If you try to load too many,
the newest process will segfault (since it can't allocate memory), which is
harmless. Run this for at least 24 hours.

memtest86:
http://www.memtest86.com/

when you boot it, go to the options screen and turn on all tests, and run
it through once or twice. With your system I'd expect 1 pass of all tests
to be done in about 20 hours.

most of my servers that run IDE have DMA/33 controllers, the few that have
faster ones all use Promise ATA/100 cards or 3ware 6800 series raid cards.
I haven't trusted recent AMD/VIA/Intel IDE chips for a while.

nate




To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-questions in the body of the message



Re: Followup to fallback to PIO mode on dual processor AMD systems

2003-01-02 Thread Bruce Evans
On Thu, 2 Jan 2003, Bruce Campbell wrote:

> At present, I don't suspect bad media because the error message is
> "WRITE command timeout tag=0 serv=0" which doesn't suggest a specific
> sector/track etc, and running with UDMA33 instead of UDMA100 makes the problem
> appear to vanish.

The fallback is clearly wrong because it turns isolated media errors
into pessimized i/o for the whole disk at best, system hangs during
resets next best, and system crashes at worst.  I keep a disk with bad
media on line for testing some of this, and zap the fallback using the
following patch (hope this is complete; it was edited from a larger
patch).

%%%
Index: ata-disk.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/ata/ata-disk.c,v
retrieving revision 1.139
diff -u -2 -r1.139 ata-disk.c
--- ata-disk.c  17 Dec 2002 16:26:22 -  1.139
+++ ata-disk.c  18 Dec 2002 01:03:37 -
@@ -597,5 +606,5 @@
             else {
                 ata_dmainit(adp->device, ata_pmode(adp->device->param), -1, -1);
-                printf(" falling back to PIO mode\n");
+                printf(" NOT falling back to PIO mode\n");
             }
             TAILQ_INSERT_HEAD(&adp->device->channel->ata_queue, request, chain);
@@ -603,4 +612,5 @@
     }
 
+#if 0
     /* if using DMA, try once again in PIO mode */
     if (request->flags & ADR_F_DMA_USED) {
@@ -613,4 +623,5 @@
         return ATA_OP_FINISHED;
     }
+#endif
 
     request->flags |= ADR_F_ERROR;
%%%

Bruce





Re: Followup to fallback to PIO mode on dual processor AMD systems

2003-01-02 Thread Bruce Campbell
Quoting Bruce Evans [EMAIL PROTECTED]:
> On Thu, 2 Jan 2003, Bruce Campbell wrote:
>
> > At present, I don't suspect bad media because the error message is
> > "WRITE command timeout tag=0 serv=0" which doesn't suggest a specific
> > sector/track etc, and running with UDMA33 instead of UDMA100 makes the
> > problem appear to vanish.
>
> The fallback is clearly wrong because it turns isolated media errors
> into pessimized i/o for the whole disk at best, system hangs during
> resets next best, and system crashes at worst.  I keep a disk with bad
> media on line for testing some of this, and zap the fallback using the
> following patch (hope this is complete; it was edited from a larger
> patch).

Thanks for the patch.  Under moderate load, I am seeing occasional
instances of:

/kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
/kernel: ata0: resetting devices .. done

and everything keeps on working normally via DMA, i.e. it does not drop to PIO.

The more menacing case is this:

Dec 30 23:26:59 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:26:59 /kernel: ata0: resetting devices .. done
Dec 30 23:26:59 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 /kernel: ata0: resetting devices .. done
Dec 30 23:27:00 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 /kernel: ata0: resetting devices .. done
Dec 30 23:27:00 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 /kernel: ad0: timeout waiting for cmd=ef s=d0 e=00
Dec 30 23:27:00 /kernel: ad0: trying fallback to PIO mode
Dec 30 23:27:00 /kernel: ata0: resetting devices .. done

So it appears it would no longer work with DMA, but it would work with PIO.
If it is manually set back to UDMA with the atacontrol command, it times
out again and falls back to PIO.

However, after a soft reboot, all is well again.
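
To tell the two cases above apart in a long log, it may help to count how many timeouts occur before the driver gives up on DMA. The following is only a rough sketch: the function name `timeouts_before_fallback` is made up for illustration, and the matched substrings are copied from the log excerpt above, not from the driver source, so they are assumptions about the message format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of FreeBSD): count the "command timeout"
 * lines seen before the first "fallback to PIO mode" line.  Both
 * substrings are taken from the log excerpt in this message.
 * Returns -1 if the log never reaches a PIO fallback.
 */
int
timeouts_before_fallback(const char *const lines[], size_t nlines)
{
    int timeouts = 0;

    assert(lines != NULL || nlines == 0);

    for (size_t i = 0; i < nlines; i++) {
        if (strstr(lines[i], "fallback to PIO mode") != NULL)
            return timeouts;
        if (strstr(lines[i], "command timeout") != NULL)
            timeouts++;
    }
    return -1;
}
```

Fed the ten log lines above, this would report four timeouts before the fallback; fed the two-line "harmless" case, it would return -1 because DMA is never abandoned.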

 


-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889






Re: Followup to fallback to PIO mode on dual processor AMD systems

2003-01-02 Thread Barney Wolff
On Fri, Jan 03, 2003 at 06:36:29AM +1100, Bruce Evans wrote:
 
> The fallback is clearly wrong because it turns isolated media errors
> into pessimized i/o for the whole disk at best, system hangs during
> resets next best, and system crashes at worst.  I keep a disk with bad
> media on line for testing some of this, and zap the fallback using the
> following patch (hope this is complete; it was edited from a larger
> patch).

Perhaps the right answer is to test uptime and do the fallback if the
error happens in the first minute, at least for permanently-mounted
disks.  In any case, retries in the current mode should be exhausted
first.
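
As a rough illustration of that policy (exhaust retries in the current mode, then allow the PIO fallback only shortly after boot), here is a small standalone sketch. It is not driver code: `dma_error_policy`, the enum, and both constants are hypothetical names and values chosen for the example, with "first minute" taken as 60 seconds and the retry limit picked arbitrarily.

```c
#include <assert.h>

/* Hypothetical policy knobs; neither value comes from the ata driver. */
#define MAX_DMA_RETRIES      3   /* retries in the current mode */
#define FALLBACK_WINDOW_SECS 60  /* the "first minute" after boot */

enum dma_action {
    DMA_RETRY,          /* reissue the request in the current DMA mode */
    DMA_FALLBACK_PIO,   /* downgrade the device to PIO */
    DMA_FAIL_REQUEST    /* report the error and leave the mode alone */
};

/*
 * Decide what to do after a DMA request has failed.
 * retries_done: how many times this request was already retried.
 * uptime_secs:  seconds since boot.
 */
enum dma_action
dma_error_policy(int retries_done, long uptime_secs)
{
    assert(retries_done >= 0 && uptime_secs >= 0);

    /* Retries in the current mode are exhausted first. */
    if (retries_done < MAX_DMA_RETRIES)
        return DMA_RETRY;
    /* Fall back only if the error shows up soon after boot. */
    if (uptime_secs < FALLBACK_WINDOW_SECS)
        return DMA_FALLBACK_PIO;
    return DMA_FAIL_REQUEST;
}
```

With these assumed values, a third failed retry at 5 seconds of uptime would drop to PIO, while the same failure after an hour would only fail the request and leave the disk in DMA mode.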

-- 
Barney Wolff http://www.databus.com/bwresume.pdf
I'm available by contract or FT, in the NYC metro area or via the 'Net.
