Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout

2013-06-04 Thread Alexander Motin

On 03.06.2013 23:22, Jeremy Chadwick wrote:

On Mon, Jun 03, 2013 at 03:06:53PM +0100, Mike Pumford wrote:

Ian Lepore wrote:

On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote:

Steven Hartland wrote:
   Have you checked your sata cables and psu outputs?
  
   Both of these could be the underlying cause of poor signalling.

I can't easily check that because it is a cheap rented
server in a remote location.

But I don't believe it is bad cabling or PSU anyway, or
otherwise the problem would occur intermittently all the
time if the load on the disks is sufficiently high.
But it only occurs at tags=3 and above.  At tags=2 it does
not occur at all, no matter how hard I hammer on the disks.

At the moment I'm inclined to believe that it is either
a bug in the HDD firmware or in the controller.  The disks
aren't exactly new, they're 400 GB Samsung ones that are
several years old.  I think it's not uncommon to have bugs
in the NCQ implementation in such disks.

The only thing that puzzles me is the fact that the problem
also disappears completely when I reduce the SATA rev from
II to I, even at tags=32.



It seems to me that you dismiss signaling problems too quickly.
Consider the possibilities... A bad cable leads to intermittent errors
at higher speeds.  When NCQ is disabled or limited the software handles
these errors pretty much transparently.  When NCQ is not limited and
there are many outstanding requests, suddenly the error handling in the
software breaks down somehow and a minor recoverable problem becomes an
in-your-face error.


It could also be a software bug in the way CAM handles the failure
of NCQ commands. When command queueing is used on a SCSI drive and a
queued command fails only that command fails. A queued command
failure on a SATA device fails ALL currently queued commands. I've
not looked at the code but do the SATA CAM drivers do the right
thing here?


Quoting T13/2015-D ATA8-ACS2 WD spec:

If an error occurs while the device is processing an NCQ command, then
the device shall return command aborted for all NCQ commands that are in
the queue and shall return command aborted for any new commands, except
a READ LOG EXT command requesting log address 10h, until the device
completes a READ LOG EXT command requesting log address 10h (i.e.,
reading the NCQ Command Error log) without error.

While I can't easily provide an answer to your question, I can tell you
that sys/dev/ahci/ahci.c does execute READ LOG EXT (command 0x2f) for
certain scenarios (the code is in function ahci_issue_recovery()).


I am not aware of any flaws in the present CAM ATA error recovery logic.
Sending READ LOG EXT is indeed implemented at the ahci(4) driver level
(as in siis(4) and mvs(4)), since it was complicated or impossible to do
in shared code: the higher levels have no idea about the tag allocation
done by the lower-level drivers.



The one person who can answer this question is mav@, who is now CC'd.


Fewer commands queued makes it less likely that multiple commands
will be in progress when a failure occurs.  A lower link rate also
makes you more immune to signal failures.


He isn't seeing SATA-level signal/link failure; the AHCI driver would
complain about that, and those messages aren't there.  Unless, of
course, those messages are only visible when verbose booting is enabled
(I hope not).


Just a curious point of history: I had an old system with an NVIDIA
MCP55 chipset where Linux had worked well, but FreeBSD had problems with
SATA: all disk transfers were really slow, without any errors being
reported, and after some time the system started to hang. That series of
chipsets had a long history of problems, so for a while I looked for a
way to handle it in software. But after many experiments I accidentally
found out that disabling six small but very powerful fans worked around
the problem. I checked the PSU voltages, and they were fine. Switching
the fans to a separate PSU also helped. Finally I replaced the system's
main PSU with a different one and the problems went away. My best guess
was that the capacitors in that PSU, due to old age, were unable to
filter the fans' electrical noise, which started to interfere with SATA
and later with other signals. The same PSU now works perfectly fine in
the same case with a smaller Atom-based motherboard.


I am not saying that the ahci(4) driver is perfect, but hardware issues
are always possible, even if the system worked fine before.


--
Alexander Motin
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout

2013-06-03 Thread Mike Pumford

Ian Lepore wrote:

On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote:

Steven Hartland wrote:
   Have you checked your sata cables and psu outputs?
  
   Both of these could be the underlying cause of poor signalling.

I can't easily check that because it is a cheap rented
server in a remote location.

But I don't believe it is bad cabling or PSU anyway, or
otherwise the problem would occur intermittently all the
time if the load on the disks is sufficiently high.
But it only occurs at tags=3 and above.  At tags=2 it does
not occur at all, no matter how hard I hammer on the disks.

At the moment I'm inclined to believe that it is either
a bug in the HDD firmware or in the controller.  The disks
aren't exactly new, they're 400 GB Samsung ones that are
several years old.  I think it's not uncommon to have bugs
in the NCQ implementation in such disks.

The only thing that puzzles me is the fact that the problem
also disappears completely when I reduce the SATA rev from
II to I, even at tags=32.



It seems to me that you dismiss signaling problems too quickly.
Consider the possibilities... A bad cable leads to intermittent errors
at higher speeds.  When NCQ is disabled or limited the software handles
these errors pretty much transparently.  When NCQ is not limited and
there are many outstanding requests, suddenly the error handling in the
software breaks down somehow and a minor recoverable problem becomes an
in-your-face error.

It could also be a software bug in the way CAM handles the failure of 
NCQ commands. When command queueing is used on a SCSI drive and a queued 
command fails only that command fails. A queued command failure on a 
SATA device fails ALL currently queued commands. I've not looked at the 
code but do the SATA CAM drivers do the right thing here?


Fewer commands queued makes it less likely that multiple commands will be 
in progress when a failure occurs.  A lower link rate also makes you 
more immune to signal failures.


Mike



Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout

2013-06-03 Thread Jeremy Chadwick
On Mon, Jun 03, 2013 at 03:06:53PM +0100, Mike Pumford wrote:
 Ian Lepore wrote:
 On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote:
 Steven Hartland wrote:
Have you checked your sata cables and psu outputs?
   
Both of these could be the underlying cause of poor signalling.
 
 I can't easily check that because it is a cheap rented
 server in a remote location.
 
 But I don't believe it is bad cabling or PSU anyway, or
 otherwise the problem would occur intermittently all the
 time if the load on the disks is sufficiently high.
 But it only occurs at tags=3 and above.  At tags=2 it does
 not occur at all, no matter how hard I hammer on the disks.
 
 At the moment I'm inclined to believe that it is either
 a bug in the HDD firmware or in the controller.  The disks
 aren't exactly new, they're 400 GB Samsung ones that are
 several years old.  I think it's not uncommon to have bugs
 in the NCQ implementation in such disks.
 
 The only thing that puzzles me is the fact that the problem
 also disappears completely when I reduce the SATA rev from
 II to I, even at tags=32.
 
 
 It seems to me that you dismiss signaling problems too quickly.
 Consider the possibilities... A bad cable leads to intermittent errors
 at higher speeds.  When NCQ is disabled or limited the software handles
 these errors pretty much transparently.  When NCQ is not limited and
 there are many outstanding requests, suddenly the error handling in the
 software breaks down somehow and a minor recoverable problem becomes an
 in-your-face error.
 
 It could also be a software bug in the way CAM handles the failure
 of NCQ commands. When command queueing is used on a SCSI drive and a
 queued command fails only that command fails. A queued command
 failure on a SATA device fails ALL currently queued commands. I've
 not looked at the code but do the SATA CAM drivers do the right
 thing here?

Quoting T13/2015-D ATA8-ACS2 WD spec:

If an error occurs while the device is processing an NCQ command, then
the device shall return command aborted for all NCQ commands that are in
the queue and shall return command aborted for any new commands, except
a READ LOG EXT command requesting log address 10h, until the device
completes a READ LOG EXT command requesting log address 10h (i.e.,
reading the NCQ Command Error log) without error.

While I can't easily provide an answer to your question, I can tell you
that sys/dev/ahci/ahci.c does execute READ LOG EXT (command 0x2f) for
certain scenarios (the code is in function ahci_issue_recovery()).
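
The rule quoted above, and the contrast with SCSI that Mike raised, can
be illustrated with a toy model. This is only a sketch in Python, not
driver code: the class and method names are made up for illustration and
do not correspond to anything in ahci(4) or CAM. It models nothing beyond
what the quoted paragraph states (one failure aborts the whole queue, and
new commands are refused until log 10h has been read).

```python
# Toy model of the NCQ error rule quoted from the spec.
# All names are hypothetical; this is not how ahci(4) or CAM are built.


class ToyNcqDevice:
    def __init__(self):
        self.queue = {}           # tag -> command description
        self.error_state = False  # stays True until log 10h is read

    def submit(self, tag, command):
        """Queue a command; returns 'queued' or 'aborted'."""
        if self.error_state:
            return "aborted"
        self.queue[tag] = command
        return "queued"

    def fail(self, tag):
        """One queued command fails: every queued command is aborted."""
        assert tag in self.queue
        aborted = sorted(self.queue)
        self.queue.clear()
        self.error_state = True
        return aborted

    def read_ncq_error_log(self):
        """Stands in for READ LOG EXT, log address 10h."""
        self.error_state = False


dev = ToyNcqDevice()
for tag in (0, 1, 2):
    dev.submit(tag, "WRITE FPDMA QUEUED")

aborted_tags = dev.fail(1)                       # failure on tag 1 only
rejected = dev.submit(3, "WRITE FPDMA QUEUED")   # refused: log not read yet
dev.read_ncq_error_log()
accepted = dev.submit(3, "WRITE FPDMA QUEUED")   # accepted again
print(aborted_tags, rejected, accepted)
```

In this model the host has to resubmit all three commands after a single
failure, which is why the number of outstanding tags matters for how
visible one error becomes.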

The one person who can answer this question is mav@, who is now CC'd.

 Fewer commands queued makes it less likely that multiple commands
 will be in progress when a failure occurs.  A lower link rate also
 makes you more immune to signal failures.

He isn't seeing SATA-level signal/link failure; the AHCI driver would
complain about that, and those messages aren't there.  Unless, of
course, those messages are only visible when verbose booting is enabled
(I hope not).

-- 
| Jeremy Chadwick                                 j...@koitsu.org |
| UNIX Systems Administrator               http://jdc.koitsu.org/ |
| Making life hard for others since 1977.            PGP 4BD6C0CB |



9.1-stable: ATI IXP600 AHCI: CAM timeout

2013-05-29 Thread Oliver Fromme
Hi,

Yesterday I downloaded the latest 9.1 snapshot (May 15th)
from ftp.freebsd.org and installed it on a machine that was
previously running Linux.  It works fine, except that I get
many of the following messages when there is heavy disk I/O,
e.g. when building world or ports:

ahcich0: Timeout on slot 23 port 0
ahcich0: is  cs f07f ss  rs  tfd c0 serr  cmd 0004bc17
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 c9 e0 40 04 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
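
For readers who want to know what the ACB bytes say, here is a small
Python sketch that decodes them. It rests on an assumption that is not
stated anywhere in this thread: that the twelve bytes are printed in the
order command, features, lba_low, lba_mid, lba_high, device,
lba_low_exp, lba_mid_exp, lba_high_exp, features_exp, sector_count,
sector_count_exp. It also uses the fact that for WRITE FPDMA QUEUED
(0x61) the transfer length is carried in the features fields. The
function name is made up.

```python
def decode_acb(acb):
    """Decode a 12-byte ACB dump, ASSUMING the field order described above."""
    b = [int(x, 16) for x in acb.split()]
    if len(b) != 12:
        raise ValueError("expected 12 bytes")
    command, features = b[0], b[1]
    # 48-bit LBA: the three low bytes, then the three "exp" bytes.
    lba = (b[2] | b[3] << 8 | b[4] << 16 |
           b[6] << 24 | b[7] << 32 | b[8] << 40)
    # For FPDMA commands the sector count lives in features/features_exp.
    sectors = features | b[9] << 8
    return {"command": command, "lba": lba, "sectors": sectors}


decoded = decode_acb("61 40 00 c9 e0 40 04 00 00 00 00 00")
print(hex(decoded["command"]), hex(decoded["lba"]), decoded["sectors"])
```

Under that assumption the first message is a 64-sector (32 KiB) queued
write at LBA 0x04e0c900, which lies well inside the 781422768 sectors the
disk reports.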

It happens for *both* ahcich0/ada0 and ahcich1/ada1 equally
often (it's a gmirror), sometimes even at exactly the same
time so the messages for ada0 and ada1 are interleaved in
the dmesg output.

The worst thing is that the whole system seems to freeze
completely for about 10 seconds each time it happens.
Other than that, I haven't seen any ill effects, i.e. no
processes dying and no panics (so far).  But the system is
quite unusable because of the freezes.

I'm pretty sure the hardware has no defects.  The machine
was running Linux fine until recently.

Are there any known issues with FreeBSD + ATI IXP600?

The kernel is the default GENERIC from the snapshot, the
only additional modules loaded are geom_mirror and linux.ko.
The dmesg messages related to disks are copied below, and
the full dmesg can be found here:
http://www.secnetix.de/olli/tmp/dmesg.nox.txt

Best regards
   Oliver

FreeBSD 9.1-STABLE #0: Mon May 13 05:10:23 UTC 2013
r...@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
..
ahci0: ATI IXP600 AHCI SATA controller port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xfe7ff800-0xfe7ffbff irq 22 at device 18.0 on pci0
ahci0: AHCI v1.10 with 4 3Gbps ports, Port Multiplier supported
ahcich0: AHCI channel at channel 0 on ahci0
ahcich1: AHCI channel at channel 1 on ahci0
ahcich2: AHCI channel at channel 2 on ahci0
ahcich3: AHCI channel at channel 3 on ahci0
..
..
(aprobe0:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:15:0): CAM status: Command timeout
(aprobe0:ahcich0:0:15:0): Error 5, Retries exhausted
(aprobe1:ahcich1:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe1:ahcich1:0:15:0): CAM status: Command timeout
(aprobe1:ahcich1:0:15:0): Error 5, Retries exhausted
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: SAMSUNG HD403LJ CT100-12 ATA-8 SATA 2.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 381554MB (781422768 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad4
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: SAMSUNG HD403LJ CT100-12 ATA-8 SATA 2.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 381554MB (781422768 512 byte sectors: 16H 63S/T 16383C)
ada1: Previously was known as ad6
..
GEOM_MIRROR: Device mirror/gm0 launched (2/2).
..
Trying to mount root from ufs:/dev/mirror/gm0s1a [rw]...
..
ahcich0: Timeout on slot 23 port 0
ahcich0: is  cs f07f ss  rs  tfd c0 serr  cmd 0004bc17
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 c9 e0 40 04 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
ahcich1: Timeout on slot 12 port 0
ahcich1: is  cs 8fff ss  rs  tfd 40 serr  cmd 0004ee17
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 80 85 e3 40 04 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Retrying command
ahcich1: Timeout on slot 2 port 0
ahcich1: is  cs  ss 001c rs 001c tfd 40 serr  cmd 0004e417
ahcich0: Timeout on slot 12 port 0
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 e0 04 e3 40 04 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
ahcich0: is  cs  ss 7000 rs 7000 tfd 40 serr  cmd 0004ee17
(ada1:ahcich1:0:0:0): Retrying command
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 e0 04 e3 40 04 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
pid 40615 (try), uid 0: exited on signal 10 (core dumped)
ahcich1: Timeout on slot 7 port 0
ahcich1: is  cs f07f ss  rs  tfd c0 serr  cmd 0004ac17
ahcich0: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 7d 92 40 02 00 00 00 00 00
Timeout on slot 19 port 0
(ada1:ahcich1:0:0:0): CAM status: Command timeout
ahcich0: is  cs ff07 ss  rs  tfd c0 serr  cmd 0004b817
(ada1:ahcich1:0:0:0): Retrying command
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 7d 92 40 02 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
ahcich1: Timeout on slot 12 port 0
ahcich1: is  cs  ss f000 rs f000 tfd 

Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout

2013-05-29 Thread Oliver Fromme
Now I have some more information ...

The problem disappears when I disable NCQ, i.e. set the
number of tags to 1 with camcontrol.  Using binary search
I found out that the problem also disappears with 2 tags,
but with 3 tags I get the same amount of errors as with
the default of 32 tags.
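
The binary search over the tag count can be sketched as follows. This is
a Python illustration only: shows_errors is a hypothetical stand-in for
the manual step of setting the tag count, loading the disks and watching
dmesg, and the search assumes that once a tag count fails, every larger
count fails too.

```python
def first_failing_tags(shows_errors, lo=1, hi=32):
    """Return the smallest tag count in [lo, hi] that shows errors,
    or None if even `hi` tags are clean.

    Assumes failures are monotonic in the tag count.
    """
    if not shows_errors(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if shows_errors(mid):
            hi = mid        # mid fails, so the threshold is mid or lower
        else:
            lo = mid + 1    # mid is clean, so the threshold is above mid
    return lo


# Stand-in for the behaviour reported in this message:
# 1 and 2 tags are clean, 3 and above show timeouts.
threshold = first_failing_tags(lambda tags: tags >= 3)
print(threshold)  # 3
```

With 32 possible values this needs about six test runs rather than up to
32.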

Interestingly, the problem also disappears when I reduce
the SATA level from II to I (i.e. from 3 to 1.5 Gbit/s),
even if the NCQ tags are left at the default of 32.

Now the question is:  Is it better to reduce the NCQ tags
from 32 to 2, or to reduce the SATA bandwidth from 3 Gbps
to 1.5 Gbps?  What is more likely to impact performance
on a mixed server with shell users, apache, sendmail, DNS
and a few other things?

Best regards
   Oliver


-- 
Oliver Fromme,  secnetix GmbH & Co. KG,  Marktplatz 29, 85567 Grafing
Handelsregister:  Amtsgericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsreg.: Amtsgericht München,
HRB 125758, Geschäftsführer:  Maik Bachmann,  Olaf Erb,  Ralf Gebhart

FreeBSD-Dienstleistungen/-Produkte + mehr: http://www.secnetix.de/bsd

In my experience the term transparent proxy is an oxymoron (like jumbo
shrimp).  Transparent proxies seem to vary from the distortions of a
funhouse mirror to barely translucent.  I really, really dislike them
when trying to figure out the corrective lenses needed with each of them.
-- R. Kevin Oberman, Network Engineer


Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout

2013-05-29 Thread Steven Hartland

Have you checked your sata cables and psu outputs?

Both of these could be the underlying cause of poor signalling.


- Original Message - 
From: Oliver Fromme o...@lurza.secnetix.de

To: freebsd-stable@FreeBSD.ORG
Sent: Wednesday, May 29, 2013 2:05 PM
Subject: Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout


Now I have some more information ...

The problem disappears when I disable NCQ, i.e. set the
number of tags to 1 with camcontrol.  Using binary search
I found out that the problem also disappears with 2 tags,
but with 3 tags I get the same amount of errors as with
the default of 32 tags.

Interestingly, the problem also disappears when I reduce
the SATA level from II to I (i.e. from 3 to 1.5 Gbit/s),
even if the NCQ tags are left at the default of 32.

Now the question is:  Is it better to reduce the NCQ tags
from 32 to 2, or to reduce the SATA bandwidth from 3 Gbps
to 1.5 Gbps?  What is more likely to impact performance
on a mixed server with shell users, apache, sendmail, DNS
and a few other things?

Best regards
  Oliver




Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout

2013-05-29 Thread Oliver Fromme
Steven Hartland wrote:
  Have you checked your sata cables and psu outputs?
  
  Both of these could be the underlying cause of poor signalling.

I can't easily check that because it is a cheap rented
server in a remote location.

But I don't believe it is bad cabling or PSU anyway, or
otherwise the problem would occur intermittently all the
time if the load on the disks is sufficiently high.
But it only occurs at tags=3 and above.  At tags=2 it does
not occur at all, no matter how hard I hammer on the disks.

At the moment I'm inclined to believe that it is either
a bug in the HDD firmware or in the controller.  The disks
aren't exactly new, they're 400 GB Samsung ones that are
several years old.  I think it's not uncommon to have bugs
in the NCQ implementation in such disks.

The only thing that puzzles me is the fact that the problem
also disappears completely when I reduce the SATA rev from
II to I, even at tags=32.

Best regards
   Oliver




Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout

2013-05-29 Thread Ian Lepore
On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote:
 Steven Hartland wrote:
   Have you checked your sata cables and psu outputs?
   
   Both of these could be the underlying cause of poor signalling.
 
 I can't easily check that because it is a cheap rented
 server in a remote location.
 
 But I don't believe it is bad cabling or PSU anyway, or
 otherwise the problem would occur intermittently all the
 time if the load on the disks is sufficiently high.
 But it only occurs at tags=3 and above.  At tags=2 it does
 not occur at all, no matter how hard I hammer on the disks.
 
 At the moment I'm inclined to believe that it is either
 a bug in the HDD firmware or in the controller.  The disks
 aren't exactly new, they're 400 GB Samsung ones that are
 several years old.  I think it's not uncommon to have bugs
 in the NCQ implementation in such disks.
 
 The only thing that puzzles me is the fact that the problem
 also disappears completely when I reduce the SATA rev from
 II to I, even at tags=32.
 

It seems to me that you dismiss signaling problems too quickly.
Consider the possibilities... A bad cable leads to intermittent errors
at higher speeds.  When NCQ is disabled or limited the software handles
these errors pretty much transparently.  When NCQ is not limited and
there are many outstanding requests, suddenly the error handling in the
software breaks down somehow and a minor recoverable problem becomes an
in-your-face error.

I'm not saying any of the foregoing is true, just that you should
consider the possibility that you're dealing with multiple problems
which are only loosely coupled, but together can seem like a single more
serious problem.  You don't know enough yet to casually dismiss
anything.

-- Ian




Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout

2013-05-29 Thread Oliver Fromme
Ian Lepore wrote:
  On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote:
   Steven Hartland wrote:
Have you checked your sata cables and psu outputs?

Both of these could be the underlying cause of poor signalling.
   
   I can't easily check that because it is a cheap rented
   server in a remote location.
   
   But I don't believe it is bad cabling or PSU anyway, or
   otherwise the problem would occur intermittently all the
   time if the load on the disks is sufficiently high.
   But it only occurs at tags=3 and above.  At tags=2 it does
   not occur at all, no matter how hard I hammer on the disks.
   
   At the moment I'm inclined to believe that it is either
   a bug in the HDD firmware or in the controller.  The disks
   aren't exactly new, they're 400 GB Samsung ones that are
   several years old.  I think it's not uncommon to have bugs
   in the NCQ implementation in such disks.
   
   The only thing that puzzles me is the fact that the problem
   also disappears completely when I reduce the SATA rev from
   II to I, even at tags=32.
  
  It seems to me that you dismiss signaling problems too quickly.
  Consider the possibilities... A bad cable leads to intermittent errors
  at higher speeds.  When NCQ is disabled or limited the software handles
  these errors pretty much transparently.  When NCQ is not limited and
  there are many outstanding requests, suddenly the error handling in the
  software breaks down somehow and a minor recoverable problem becomes an
  in-your-face error.
  
  I'm not saying any of the foregoing is true, just that you should
  consider the possibility that you're dealing with multiple problems
  which are only loosely coupled, but together can seem like a single more
  serious problem.  You don't know enough yet to casually dismiss
  anything.

Well ...  I also can't dismiss the possibility that there is
a mouse in the machine that is pulling the SATA cables twice
every minute.  :-)

But seriously ...  I don't see how bad cabling could cause
errors at tags=3 and no errors at all at tags=2.  It shouldn't
make a difference for the cables if there are two or three
tags used.  And by the way, it doesn't make a difference at
all whether I use tags=3 or tags=32; the rate of errors is the
same in both cases (about two per minute during buildworld).

I have googled a bit; the Samsung HD401LJ and HD403LJ don't
seem to be innocent ...  There are lots of pages mentioning
problems with NCQ and SATA I vs. II.

Best regards
   Oliver




Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout

2013-05-29 Thread Adam McDougall
On 05/29/13 10:21, Oliver Fromme wrote:
 Steven Hartland wrote:
   Have you checked your sata cables and psu outputs?
   
   Both of these could be the underlying cause of poor signalling.
 
 I can't easily check that because it is a cheap rented
 server in a remote location.
 
 But I don't believe it is bad cabling or PSU anyway, or
 otherwise the problem would occur intermittently all the
 time if the load on the disks is sufficiently high.
 But it only occurs at tags=3 and above.  At tags=2 it does
 not occur at all, no matter how hard I hammer on the disks.
 
 At the moment I'm inclined to believe that it is either
 a bug in the HDD firmware or in the controller.  The disks
 aren't exactly new, they're 400 GB Samsung ones that are
 several years old.  I think it's not uncommon to have bugs
 in the NCQ implementation in such disks.
 
 The only thing that puzzles me is the fact that the problem
 also disappears completely when I reduce the SATA rev from
 II to I, even at tags=32.
 
 Best regards
Oliver
 
 

Jeremy Chadwick knows of some hardware faults with the IXP600/700.
There may be more information in the freebsd-fs mailing list archives,
or you can discuss it with him:

http://docs.freebsd.org/cgi/mid.cgi?20130414194440.GB38338

That email mentions port multipliers, but the problems may extend beyond them.


Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout

2013-05-29 Thread Jeremy Chadwick
On Wed, May 29, 2013 at 10:09:14AM +0200, Oliver Fromme wrote:
 Hi,
 
 Yesterday I downloaded the latest 9.1 snapshot (May 15th)
 from ftp.freebsd.org and installed it on a machine that was
 previously running Linux.  It works fine, except that I get
 many of the following messages when there is heavy disk I/O,
 e.g. when building world or ports:
 
 ahcich0: Timeout on slot 23 port 0
 ahcich0: is  cs f07f ss  rs  tfd c0 serr  cmd 0004bc17
 (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 c9 e0 40 04 00 00 00 00 00
 (ada0:ahcich0:0:0:0): CAM status: Command timeout
 (ada0:ahcich0:0:0:0): Retrying command

The messages above indicate two things:

1) The AHCI driver is reporting an internal timeout when trying to speak
to the underlying device (disk) attached to whatever maps to ahcich0;
this is an AHCI-level timeout, and the 2nd line shows all of the
AHCI-level status conditions at that time,

2) CAM reports what it was trying to do when that happened,
specifically issue WRITE_FPDMA_QUEUED (an NCQ-based write to ada0),
which timed out after 30 seconds (kern.cam.ada.default_timeout).
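
As an aside on reading the status line, the tfd value can be decoded
with a few lines of Python. This is a sketch under one assumption: that
tfd is the port's task file data register, with the ATA status byte in
bits 7:0 and the ATA error byte in bits 15:8. The function name is made
up; the status bit names are the standard ATA ones.

```python
# ATA status register bits (assumed to be the low byte of tfd).
STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x08: "DRQ",   # data request
    0x01: "ERR",   # error
}


def decode_tfd(tfd):
    """Split a tfd value into (list of status flag names, error byte)."""
    status = tfd & 0xFF
    error = (tfd >> 8) & 0xFF
    flags = [name for bit, name in STATUS_BITS.items() if status & bit]
    return flags, error


print(decode_tfd(0xC0))  # (['BSY', 'DRDY'], 0)
print(decode_tfd(0x40))  # (['DRDY'], 0)
```

Read this way, the "tfd c0" in the message above means the device still
had BSY set when the timeout fired, and neither value seen in this
thread (c0 and 40) has the ERR bit set.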

 It happens for *both* ahcich0/ada0 and ahcich1/ada1 equally
 often (it's a gmirror), sometimes even at exactly the same
 time so the messages for ada0 and ada1 are interleaved in
 the dmesg output.

Both surprising and not surprising (to me anyway), on numerous levels.

 The worst thing is that the whole system seems to freeze
 completely for about 10 seconds each time it happens.
 Other than that, I haven't seen any ill effects, i.e. no
 processes dying and no panics (so far).  But the system is
 quite unusable because of the freezes.

There isn't much you can do about this.  I get the impression from your
statement this is the first time you've ever encountered an I/O timeout
in your life?  :-)  This is just how it works -- pretty much the entire
I/O subsystem (for the device(s) involved) stalls until a response to
the CDB gets received.  It's like this on all OSes, all systems; it's
how I/O works.

The AHCI driver may have different timeout settings; I haven't looked.

The same CDB gets re-submitted to the controller up to 5 times
(kern.cam.ada.retry_count will say 4, but it starts at 0 if I remember
right), in the hope that the I/O transaction will eventually go through.
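
Taking the figures above at face value (a retry count of 4 meaning 5
submissions, and the 30-second default timeout), the worst case for one
command that never completes is simple arithmetic. The Python sketch
below only restates those numbers; it ignores any recovery, reset or
different timeout handling in the AHCI driver, and the function name is
made up.

```python
def worst_case_stall(retry_count=4, timeout_s=30):
    """Upper bound on the time spent on one command that always times out.

    Assumes one initial submission plus `retry_count` retries, each
    waiting for the full timeout.
    """
    attempts = retry_count + 1
    return attempts, attempts * timeout_s


attempts, stall_s = worst_case_stall()
print(attempts, stall_s)  # 5 150
```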

Repeated device timeouts with no successful responses will eventually
cause CAM or AHCI (I forget which driver/subsystem) to drop the disk.
In your case, this could mean ada0 and ada1 eventually getting dropped,
which would induce a panic since you're using them for your root
filesystem.  (I wonder if there are readers of this thread who are
starting to see why I use a single disk for my main OS drive...)

 I'm pretty sure the hardware has no defects.  The machine
 was running Linux fine until recently.

 Are there any known issues with FreeBSD + ATI IXP600?

This is opening a can of worms, which I've discussed in the past.
Please see my posts to freebsd-fs and/or freebsd-stable archives
(another person in this thread mentioned it as well).

Fact: there is still not enough low-level, hard evidence at this time to
determine if the problem is with the AHCI driver, the AMD/ATI IXP600
controller, or Samsung disks.  The situations I have dealt with in the
past always were inconclusive.  There have been reports of problems with
non-Samsung disks as well, but the report count there is extremely low
in comparison.

Fact: You will find complaints on Linux lists about both the controller
and the drives as well (in combo).  Take that to mean whatever you wish.
Use Google and search for "SB600 HD403LJ Linux" or "SB600 Samsung Linux"
and see for yourself.

Fact: Samsung's SpinPoint series has had a troubling past of firmware
bugs.  Things have gotten better on their newer-ish drives, but the
slightly older ones, to me, seemed more like a learning experience
for engineers.  I am not picking on Samsung exclusively; all drive
vendors have had problems historically, there is no such thing as a
reliable drive vendor in this day and age.  You go with whatever works
for you/whatever your experiences justify.

All that said:

There is some code in sys/dev/ahci/ahci.c that indicates one-off
behaviour for the SB600/IXP600, pertaining to NCQ.  However this was
committed a long time ago (r196777 and r196796).  I look at this
code and I can think of one problem with it, but answers to my below
questions will provide what I need.

 The kernel is the default GENERIC from the snapshot, the
 only additional modules loaded are geom_mirror and linux.ko.
 The dmesg messages related to disks are copied below, and
 the full dmesg can be found here:
 http://www.secnetix.de/olli/tmp/dmesg.nox.txt
 
 Best regards
Oliver
 
 FreeBSD 9.1-STABLE #0: Mon May 13 05:10:23 UTC 2013
 r...@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
 ..
 ahci0: ATI IXP600 AHCI SATA controller port 
 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem