Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout
On 03.06.2013 23:22, Jeremy Chadwick wrote: On Mon, Jun 03, 2013 at 03:06:53PM +0100, Mike Pumford wrote: Ian Lepore wrote: On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote: Steven Hartland wrote:

Have you checked your sata cables and psu outputs? Both of these could be the underlying cause of poor signalling.

I can't easily check that because it is a cheap rented server in a remote location. But I don't believe it is bad cabling or PSU anyway, or otherwise the problem would occur intermittently all the time if the load on the disks is sufficiently high. But it only occurs at tags=3 and above. At tags=2 it does not occur at all, no matter how hard I hammer on the disks. At the moment I'm inclined to believe that it is either a bug in the HDD firmware or in the controller. The disks aren't exactly new, they're 400 GB Samsung ones that are several years old. I think it's not uncommon to have bugs in the NCQ implementation in such disks. The only thing that puzzles me is the fact that the problem also disappears completely when I reduce the SATA rev from II to I, even at tags=32.

It seems to me that you dismiss signaling problems too quickly. Consider the possibilities... A bad cable leads to intermittent errors at higher speeds. When NCQ is disabled or limited the software handles these errors pretty much transparently. When NCQ is not limited and there are many outstanding requests, suddenly the error handling in the software breaks down somehow and a minor recoverable problem becomes an in-your-face error.

It could also be a software bug in the way CAM handles the failure of NCQ commands. When command queueing is used on a SCSI drive and a queued command fails only that command fails. A queued command failure on a SATA device fails ALL currently queued commands. I've not looked at the code but do the SATA CAM drivers do the right thing here?
Quoting T13/2015-D ATA8-ACS2 WD spec: If an error occurs while the device is processing an NCQ command, then the device shall return command aborted for all NCQ commands that are in the queue and shall return command aborted for any new commands, except a READ LOG EXT command requesting log address 10h, until the device completes a READ LOG EXT command requesting log address 10h (i.e., reading the NCQ Command Error log) without error.

While I can't easily provide an answer to your question, I can tell you that sys/dev/ahci/ahci.c does execute READ LOG EXT (command 0x2f) for certain scenarios (the code is in function ahci_issue_recovery()).

I am not aware of any flaws in the present CAM ATA error recovery logic. Sending READ LOG EXT is indeed implemented at the ahci(4) driver level (same as in siis(4) and mvs(4)), since it was complicated/impossible to do in shared code because the higher levels have no idea about the tag allocation done by lower-level drivers.

The one person who can answer this question is mav@, who is now CC'd.

Fewer commands queued makes it less likely that multiple commands will be in progress when a failure occurs. A lower link rate also makes you more immune to signal failures.

He isn't seeing SATA-level signal/link failure; the AHCI driver would complain about that, and those messages aren't there. Unless, of course, those messages are only visible when verbose booting is enabled (I hope not).

Just a curious history point: I had one old system on an NVIDIA MCP55 chipset where Linux had worked well before, but FreeBSD had problems with SATA -- all disk transfers were really slow, but without reporting any errors, and after some point the system started to hang. That series of chipsets had a long history of problems, so for some time I was looking for some way to handle it in software. But after many experiments I accidentally found out that disabling 6 small but very powerful fans worked around the problem. I checked the PSU voltages, and they were fine. Switching the fans to a separate PSU also helped. Finally I just replaced the system's main PSU with a different one and the problems went away. My best guess was that the capacitors in that PSU, due to old age, were unable to filter the fans' electrical noise, which started to interfere with SATA and later other signals. Now the same PSU works perfectly fine in the same case with a smaller Atom-based motherboard without any issues.

I am not saying that the ahci(4) driver is perfect, but hardware issues are always possible even if the system worked fine before.

--
Alexander Motin
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout
Ian Lepore wrote: On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote: Steven Hartland wrote:

Have you checked your sata cables and psu outputs? Both of these could be the underlying cause of poor signalling.

I can't easily check that because it is a cheap rented server in a remote location. But I don't believe it is bad cabling or PSU anyway, or otherwise the problem would occur intermittently all the time if the load on the disks is sufficiently high. But it only occurs at tags=3 and above. At tags=2 it does not occur at all, no matter how hard I hammer on the disks. At the moment I'm inclined to believe that it is either a bug in the HDD firmware or in the controller. The disks aren't exactly new, they're 400 GB Samsung ones that are several years old. I think it's not uncommon to have bugs in the NCQ implementation in such disks. The only thing that puzzles me is the fact that the problem also disappears completely when I reduce the SATA rev from II to I, even at tags=32.

It seems to me that you dismiss signaling problems too quickly. Consider the possibilities... A bad cable leads to intermittent errors at higher speeds. When NCQ is disabled or limited the software handles these errors pretty much transparently. When NCQ is not limited and there are many outstanding requests, suddenly the error handling in the software breaks down somehow and a minor recoverable problem becomes an in-your-face error.

It could also be a software bug in the way CAM handles the failure of NCQ commands. When command queueing is used on a SCSI drive and a queued command fails only that command fails. A queued command failure on a SATA device fails ALL currently queued commands. I've not looked at the code but do the SATA CAM drivers do the right thing here?

Fewer commands queued makes it less likely that multiple commands will be in progress when a failure occurs. A lower link rate also makes you more immune to signal failures.

Mike
Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout
On Mon, Jun 03, 2013 at 03:06:53PM +0100, Mike Pumford wrote: Ian Lepore wrote: On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote: Steven Hartland wrote:

Have you checked your sata cables and psu outputs? Both of these could be the underlying cause of poor signalling.

I can't easily check that because it is a cheap rented server in a remote location. But I don't believe it is bad cabling or PSU anyway, or otherwise the problem would occur intermittently all the time if the load on the disks is sufficiently high. But it only occurs at tags=3 and above. At tags=2 it does not occur at all, no matter how hard I hammer on the disks. At the moment I'm inclined to believe that it is either a bug in the HDD firmware or in the controller. The disks aren't exactly new, they're 400 GB Samsung ones that are several years old. I think it's not uncommon to have bugs in the NCQ implementation in such disks. The only thing that puzzles me is the fact that the problem also disappears completely when I reduce the SATA rev from II to I, even at tags=32.

It seems to me that you dismiss signaling problems too quickly. Consider the possibilities... A bad cable leads to intermittent errors at higher speeds. When NCQ is disabled or limited the software handles these errors pretty much transparently. When NCQ is not limited and there are many outstanding requests, suddenly the error handling in the software breaks down somehow and a minor recoverable problem becomes an in-your-face error.

It could also be a software bug in the way CAM handles the failure of NCQ commands. When command queueing is used on a SCSI drive and a queued command fails only that command fails. A queued command failure on a SATA device fails ALL currently queued commands. I've not looked at the code but do the SATA CAM drivers do the right thing here?

Quoting T13/2015-D ATA8-ACS2 WD spec: If an error occurs while the device is processing an NCQ command, then the device shall return command aborted for all NCQ commands that are in the queue and shall return command aborted for any new commands, except a READ LOG EXT command requesting log address 10h, until the device completes a READ LOG EXT command requesting log address 10h (i.e., reading the NCQ Command Error log) without error.

While I can't easily provide an answer to your question, I can tell you that sys/dev/ahci/ahci.c does execute READ LOG EXT (command 0x2f) for certain scenarios (the code is in function ahci_issue_recovery()). The one person who can answer this question is mav@, who is now CC'd.

Fewer commands queued makes it less likely that multiple commands will be in progress when a failure occurs. A lower link rate also makes you more immune to signal failures.

He isn't seeing SATA-level signal/link failure; the AHCI driver would complain about that, and those messages aren't there. Unless, of course, those messages are only visible when verbose booting is enabled (I hope not).

--
| Jeremy Chadwick j...@koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |
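The rule quoted from the spec above can be illustrated with a toy model. This is only a sketch of the device-side behaviour as described in the quoted paragraph, not of how ahci(4) or CAM implement recovery; the class name NcqDevice and its methods are hypothetical and exist only for this illustration. The command code 0x2f and log address 10h are taken from the message above.

```python
READ_LOG_EXT = 0x2F      # command code mentioned in the message above
NCQ_ERROR_LOG = 0x10     # log address 10h, the NCQ Command Error log


class NcqDevice:
    """Toy model of a SATA device following the quoted NCQ error rule."""

    def __init__(self):
        self.queue = set()        # tags of outstanding NCQ commands
        self.error_state = False  # True until log 10h has been read

    def submit(self, tag):
        """Queue an NCQ command; new commands are aborted while in error state."""
        if self.error_state:
            return "aborted"
        self.queue.add(tag)
        return "queued"

    def fail(self, tag):
        """One queued command fails: every queued command is aborted with it."""
        if tag not in self.queue:
            raise KeyError(tag)
        aborted = sorted(self.queue)
        self.queue.clear()
        self.error_state = True
        return aborted

    def read_log_ext(self, log_address):
        """READ LOG EXT of log 10h is the only command that clears the error state."""
        if log_address == NCQ_ERROR_LOG:
            self.error_state = False
            return "ok"
        return "aborted" if self.error_state else "ok"


if __name__ == "__main__":
    dev = NcqDevice()
    for tag in (0, 1, 2):
        dev.submit(tag)
    print(dev.fail(1))                      # all three tags are aborted
    print(dev.submit(3))                    # rejected until recovery
    print(dev.read_log_ext(NCQ_ERROR_LOG))  # recovery step
    print(dev.submit(3))                    # accepted again
```

The model shows why a single failure is more visible with a deep queue: the number of commands that have to be retried grows with the number of tags in flight, which is the point Mike raises.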
9.1-stable: ATI IXP600 AHCI: CAM timeout
Hi,

Yesterday I downloaded the latest 9.1 snapshot (May 15th) from ftp.freebsd.org and installed it on a machine that was previously running Linux. It works fine, except that I get many of the following when there is heavy disk I/O, e.g. when building world or ports:

ahcich0: Timeout on slot 23 port 0
ahcich0: is cs f07f ss rs tfd c0 serr cmd 0004bc17
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 c9 e0 40 04 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command

It happens for *both* ahcich0/ada0 and ahcich1/ada1 equally often (it's a gmirror), sometimes even at exactly the same time so the messages for ada0 and ada1 are interleaved in the dmesg output.

The worst thing is that the whole system seems to freeze completely for about 10 seconds each time it happens. Other than that, I haven't seen any ill effects, i.e. no processes dying and no panics (so far). But the system is quite unusable because of the freezes.

I'm pretty sure the hardware has no defects. The machine was running Linux fine until recently. Are there any known issues with FreeBSD + ATI IXP600?

The kernel is the default GENERIC from the snapshot, the only additional modules loaded are geom_mirror and linux.ko. The dmesg messages related to disks are copied below, and the full dmesg can be found here:
http://www.secnetix.de/olli/tmp/dmesg.nox.txt

Best regards
Oliver

FreeBSD 9.1-STABLE #0: Mon May 13 05:10:23 UTC 2013 r...@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
..
ahci0: ATI IXP600 AHCI SATA controller port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xfe7ff800-0xfe7ffbff irq 22 at device 18.0 on pci0
ahci0: AHCI v1.10 with 4 3Gbps ports, Port Multiplier supported
ahcich0: AHCI channel at channel 0 on ahci0
ahcich1: AHCI channel at channel 1 on ahci0
ahcich2: AHCI channel at channel 2 on ahci0
ahcich3: AHCI channel at channel 3 on ahci0
..
..
(aprobe0:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:15:0): CAM status: Command timeout
(aprobe0:ahcich0:0:15:0): Error 5, Retries exhausted
(aprobe1:ahcich1:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe1:ahcich1:0:15:0): CAM status: Command timeout
(aprobe1:ahcich1:0:15:0): Error 5, Retries exhausted
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: SAMSUNG HD403LJ CT100-12 ATA-8 SATA 2.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 381554MB (781422768 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad4
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: SAMSUNG HD403LJ CT100-12 ATA-8 SATA 2.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 381554MB (781422768 512 byte sectors: 16H 63S/T 16383C)
ada1: Previously was known as ad6
..
GEOM_MIRROR: Device mirror/gm0 launched (2/2).
..
Trying to mount root from ufs:/dev/mirror/gm0s1a [rw]...
..
ahcich0: Timeout on slot 23 port 0
ahcich0: is cs f07f ss rs tfd c0 serr cmd 0004bc17
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 c9 e0 40 04 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
ahcich1: Timeout on slot 12 port 0
ahcich1: is cs 8fff ss rs tfd 40 serr cmd 0004ee17
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 80 85 e3 40 04 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Retrying command
ahcich1: Timeout on slot 2 port 0
ahcich1: is cs ss 001c rs 001c tfd 40 serr cmd 0004e417
ahcich0: Timeout on slot 12 port 0
(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 e0 04 e3 40 04 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: Command timeout
ahcich0: is cs ss 7000 rs 7000 tfd 40 serr cmd 0004ee17
(ada1:ahcich1:0:0:0): Retrying command
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 e0 04 e3 40 04 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
pid 40615 (try), uid 0: exited on signal 10 (core dumped)
ahcich1: Timeout on slot 7 port 0
ahcich1: is cs f07f ss rs tfd c0 serr cmd 0004ac17
ahcich0: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 7d 92 40 02 00 00 00 00 00
Timeout on slot 19 port 0
(ada1:ahcich1:0:0:0): CAM status: Command timeout
ahcich0: is cs ff07 ss rs tfd c0 serr cmd 0004b817
(ada1:ahcich1:0:0:0): Retrying command
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 7d 92 40 02 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
ahcich1: Timeout on slot 12 port 0
ahcich1: is cs ss f000 rs f000 tfd
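To check the claim that both channels time out about equally often, the timeout reports can be counted per channel. The following is a minimal sketch that only assumes the "ahcichN: Timeout on slot S port P" wording seen in the output above; the function names are made up for this illustration. Reports that the kernel interleaved with other messages (as happens once in the output above) will not match the pattern and are simply missed.

```python
import re
from collections import Counter

# Matches driver lines such as "ahcich0: Timeout on slot 23 port 0".
TIMEOUT_RE = re.compile(r"\b(ahcich\d+): Timeout on slot (\d+) port (\d+)")


def timed_out_slots(log_text):
    """Return (channel, slot) pairs in the order they appear in the log."""
    return [(m.group(1), int(m.group(2))) for m in TIMEOUT_RE.finditer(log_text)]


def count_timeouts(log_text):
    """Return a Counter mapping each channel to its number of timeout reports."""
    return Counter(channel for channel, _slot in timed_out_slots(log_text))


# Four timeout reports copied from the message above.
SAMPLE = """\
ahcich0: Timeout on slot 23 port 0
ahcich1: Timeout on slot 12 port 0
ahcich1: Timeout on slot 2 port 0
ahcich0: Timeout on slot 12 port 0
"""

if __name__ == "__main__":
    for channel, count in sorted(count_timeouts(SAMPLE).items()):
        print(channel, count)
```

Because the pattern is searched anywhere in the text, the same functions also work on a log that has been flattened into a single line.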
Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout
Now I have some more information ...

The problem disappears when I disable NCQ, i.e. set the number of tags to 1 with camcontrol. Using binary search I found out that the problem also disappears with 2 tags, but with 3 tags I get the same amount of errors as with the default of 32 tags.

Interestingly, the problem also disappears when I reduce the SATA level from II to I (i.e. from 3 to 1.5 Gbit/s), even if the NCQ tags are left at the default of 32.

Now the question is: Is it better to reduce the NCQ tags from 32 to 2, or to reduce the SATA bandwidth from 3 Gbps to 1.5 Gbps? What is more likely to impact performance on a mixed server with shell users, apache, sendmail, DNS and a few other things?

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Commercial register: Amtsgericht Muenchen, HRA 74606, Management: secnetix Verwaltungsgesellsch. mbH, Commercial register: Amtsgericht München, HRB 125758, Managing directors: Maik Bachmann, Olaf Erb, Ralf Gebhart
FreeBSD services/products + more: http://www.secnetix.de/bsd

In my experience the term transparent proxy is an oxymoron (like jumbo shrimp). Transparent proxies seem to vary from the distortions of a funhouse mirror to barely translucent. I really, really dislike them when trying to figure out the corrective lenses needed with each of them. -- R. Kevin Oberman, Network Engineer
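The binary search over the tag count described above can be sketched as follows. The predicate passed in is a stand-in: in practice each probe means setting the tag count with camcontrol, running the workload, and watching for timeouts, none of which is shown here. The function name is hypothetical, and the search assumes that once a tag count fails, every larger count fails too, which matches what was observed (errors at 3 and at 32, none at 1 and 2).

```python
def smallest_failing_tags(fails, lo=1, hi=32):
    """Return the smallest tag count in [lo, hi] for which fails(tags) is true.

    Assumes failures are monotonic: if a tag count fails, all larger ones do.
    Returns None when even the highest tag count works.
    """
    if not fails(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid        # the threshold is at mid or below
        else:
            lo = mid + 1    # the threshold is above mid
    return lo


if __name__ == "__main__":
    # Stand-in for a real test run; mirrors the behaviour reported above.
    def reported_behaviour(tags):
        return tags >= 3

    print(smallest_failing_tags(reported_behaviour))
```

With 32 possible values this needs at most six test runs instead of up to 32, which matters when each run means hammering the disks for several minutes.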
Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout
Have you checked your sata cables and psu outputs? Both of these could be the underlying cause of poor signalling.

- Original Message - From: Oliver Fromme o...@lurza.secnetix.de To: freebsd-stable@FreeBSD.ORG Sent: Wednesday, May 29, 2013 2:05 PM Subject: Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout

Now I have some more information ...

The problem disappears when I disable NCQ, i.e. set the number of tags to 1 with camcontrol. Using binary search I found out that the problem also disappears with 2 tags, but with 3 tags I get the same amount of errors as with the default of 32 tags.

Interestingly, the problem also disappears when I reduce the SATA level from II to I (i.e. from 3 to 1.5 Gbit/s), even if the NCQ tags are left at the default of 32.

Now the question is: Is it better to reduce the NCQ tags from 32 to 2, or to reduce the SATA bandwidth from 3 Gbps to 1.5 Gbps? What is more likely to impact performance on a mixed server with shell users, apache, sendmail, DNS and a few other things?

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Commercial register: Amtsgericht Muenchen, HRA 74606, Management: secnetix Verwaltungsgesellsch. mbH, Commercial register: Amtsgericht München, HRB 125758, Managing directors: Maik Bachmann, Olaf Erb, Ralf Gebhart
FreeBSD services/products + more: http://www.secnetix.de/bsd

In my experience the term transparent proxy is an oxymoron (like jumbo shrimp). Transparent proxies seem to vary from the distortions of a funhouse mirror to barely translucent. I really, really dislike them when trying to figure out the corrective lenses needed with each of them. -- R. Kevin Oberman, Network Engineer
Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout
Steven Hartland wrote:

Have you checked your sata cables and psu outputs? Both of these could be the underlying cause of poor signalling.

I can't easily check that because it is a cheap rented server in a remote location. But I don't believe it is bad cabling or PSU anyway, or otherwise the problem would occur intermittently all the time if the load on the disks is sufficiently high. But it only occurs at tags=3 and above. At tags=2 it does not occur at all, no matter how hard I hammer on the disks.

At the moment I'm inclined to believe that it is either a bug in the HDD firmware or in the controller. The disks aren't exactly new, they're 400 GB Samsung ones that are several years old. I think it's not uncommon to have bugs in the NCQ implementation in such disks.

The only thing that puzzles me is the fact that the problem also disappears completely when I reduce the SATA rev from II to I, even at tags=32.

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Commercial register: Amtsgericht Muenchen, HRA 74606, Management: secnetix Verwaltungsgesellsch. mbH, Commercial register: Amtsgericht München, HRB 125758, Managing directors: Maik Bachmann, Olaf Erb, Ralf Gebhart
FreeBSD services/products + more: http://www.secnetix.de/bsd

People still program in C. People keep writing shell scripts. *Most* people don't realize the shortcomings of the tools they are using because they a) don't reflect on their workflows and they are b) too lazy to check out alternatives to realize there is help. -- Simon 'corecode' Schubert
Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout
On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote: Steven Hartland wrote:

Have you checked your sata cables and psu outputs? Both of these could be the underlying cause of poor signalling.

I can't easily check that because it is a cheap rented server in a remote location. But I don't believe it is bad cabling or PSU anyway, or otherwise the problem would occur intermittently all the time if the load on the disks is sufficiently high. But it only occurs at tags=3 and above. At tags=2 it does not occur at all, no matter how hard I hammer on the disks. At the moment I'm inclined to believe that it is either a bug in the HDD firmware or in the controller. The disks aren't exactly new, they're 400 GB Samsung ones that are several years old. I think it's not uncommon to have bugs in the NCQ implementation in such disks. The only thing that puzzles me is the fact that the problem also disappears completely when I reduce the SATA rev from II to I, even at tags=32.

It seems to me that you dismiss signaling problems too quickly. Consider the possibilities... A bad cable leads to intermittent errors at higher speeds. When NCQ is disabled or limited the software handles these errors pretty much transparently. When NCQ is not limited and there are many outstanding requests, suddenly the error handling in the software breaks down somehow and a minor recoverable problem becomes an in-your-face error.

I'm not saying any of the foregoing is true, just that you should consider the possibility that you're dealing with multiple problems which are only loosely coupled, but together can seem like a single more serious problem. You don't know enough yet to casually dismiss anything.

-- Ian
Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout
Ian Lepore wrote: On Wed, 2013-05-29 at 16:21 +0200, Oliver Fromme wrote: Steven Hartland wrote:

Have you checked your sata cables and psu outputs? Both of these could be the underlying cause of poor signalling.

I can't easily check that because it is a cheap rented server in a remote location. But I don't believe it is bad cabling or PSU anyway, or otherwise the problem would occur intermittently all the time if the load on the disks is sufficiently high. But it only occurs at tags=3 and above. At tags=2 it does not occur at all, no matter how hard I hammer on the disks. At the moment I'm inclined to believe that it is either a bug in the HDD firmware or in the controller. The disks aren't exactly new, they're 400 GB Samsung ones that are several years old. I think it's not uncommon to have bugs in the NCQ implementation in such disks. The only thing that puzzles me is the fact that the problem also disappears completely when I reduce the SATA rev from II to I, even at tags=32.

It seems to me that you dismiss signaling problems too quickly. Consider the possibilities... A bad cable leads to intermittent errors at higher speeds. When NCQ is disabled or limited the software handles these errors pretty much transparently. When NCQ is not limited and there are many outstanding requests, suddenly the error handling in the software breaks down somehow and a minor recoverable problem becomes an in-your-face error.

I'm not saying any of the foregoing is true, just that you should consider the possibility that you're dealing with multiple problems which are only loosely coupled, but together can seem like a single more serious problem. You don't know enough yet to casually dismiss anything.

Well ... I also can't dismiss the possibility that there is a mouse in the machine that is pulling the SATA cables twice every minute. :-)

But seriously ... I don't see how bad cabling could cause errors at tags=3 and no errors at all at tags=2. It shouldn't make a difference for the cables if there are two or three tags used. And by the way, it doesn't make a difference at all whether I use tags=3 or tags=32; the rate of errors is the same in both cases (about two per minute during buildworld).

I have googled a bit; the Samsung HD401LJ and HD403LJ don't seem to be innocent ... There are lots of pages mentioning problems with NCQ and SATA I vs. II.

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Commercial register: Amtsgericht Muenchen, HRA 74606, Management: secnetix Verwaltungsgesellsch. mbH, Commercial register: Amtsgericht München, HRB 125758, Managing directors: Maik Bachmann, Olaf Erb, Ralf Gebhart
FreeBSD services/products + more: http://www.secnetix.de/bsd

A misleading benchmark test can accomplish in minutes what years of good engineering can never do. -- Dilbert (2009-03-02)
Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout
On 05/29/13 10:21, Oliver Fromme wrote: Steven Hartland wrote:

Have you checked your sata cables and psu outputs? Both of these could be the underlying cause of poor signalling.

I can't easily check that because it is a cheap rented server in a remote location. But I don't believe it is bad cabling or PSU anyway, or otherwise the problem would occur intermittently all the time if the load on the disks is sufficiently high. But it only occurs at tags=3 and above. At tags=2 it does not occur at all, no matter how hard I hammer on the disks. At the moment I'm inclined to believe that it is either a bug in the HDD firmware or in the controller. The disks aren't exactly new, they're 400 GB Samsung ones that are several years old. I think it's not uncommon to have bugs in the NCQ implementation in such disks. The only thing that puzzles me is the fact that the problem also disappears completely when I reduce the SATA rev from II to I, even at tags=32.

Best regards
Oliver

Jeremy Chadwick knows of some hardware faults with the IXP600/700; there may be more information in the freebsd-fs mailing list archives, or you can discuss it with him:
http://docs.freebsd.org/cgi/mid.cgi?20130414194440.GB38338

That email mentions port multipliers, but the problems may extend beyond them.
Re: 9.1-stable: ATI IXP600 AHCI: CAM timeout
On Wed, May 29, 2013 at 10:09:14AM +0200, Oliver Fromme wrote: Hi, Yesterday I have downloaded the latest 9.1 snapshot (May 15th) from ftp.freebsd.org and installed it on a machine that was previously running Linux. It works fine, except that I get many the following when there is heavy disk I/O, e.g. when building world or ports: ahcich0: Timeout on slot 23 port 0 ahcich0: is cs f07f ss rs tfd c0 serr cmd 0004bc17 (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 00 c9 e0 40 04 00 00 00 00 00 (ada0:ahcich0:0:0:0): CAM status: Command timeout (ada0:ahcich0:0:0:0): Retrying command The messages above indicate two things: 1) The AHCI driver is reporting an internal timeout when trying to speak to the underlying device (disk) attached to whatever maps to ahcich0; this is an AHCI-level timeout, and the 2nd line shows all of the AHCI-level status conditions at that time, 2) CAM reports what it was trying to do when that happened, specifically issue WRITE_FPDMA_QUEUED (an NCQ-based write to ada0), which timed out after 30 seconds (kern.cam.ada.default_timeout). It happens for *both* ahcich0/ada0 and ahcich1/ada1 equally often (it's a gmirror), sometimes even at exactly the same time so the messages for ada0 and ada1 are interleaved in the dmesg output. Both surprising and not surprising (to me anyway), on numerous levels. The worst thing is that the whole system seems to freeze completely for about 10 seconds each time it happens. Other than that, I haven't seen any ill effects, i.e. no processes dying and no panics (so far). But the system is quite unusable because of the freezes. There isn't much you can do about this. I get the impression from your statement this is the first time you've ever encountered an I/O timeout in your life? :-) This is just how it works -- pretty much the entire I/O subsystem (for the device(s) involved) stalls until a response to the CDB gets received. It's like this on all OSes, all systems; it's how I/O works. 
The AHCI driver may have different timeout settings; I haven't looked. The same CDB gets resubmitted to the controller 5 times (kern.cam.ada.retry_count will say 4, but it starts at 0 if I remember right), in hopes that the I/O transaction will eventually go through. Repeated device timeouts with no successful responses will eventually cause CAM or AHCI (I forget which driver/subsystem) to drop the disk. In your case, this could mean ada0 and ada1 eventually getting dropped, which would induce a panic since you're using them for your root filesystem. (I wonder if there are readers of this thread who are starting to see why I use a single disk for my main OS drive...)

I'm pretty sure the hardware has no defects. The machine was running Linux fine until recently. Are there any known issues with FreeBSD + ATI IXP600?

This is opening a can of worms, which I've discussed in the past. Please see my posts to freebsd-fs and/or freebsd-stable archives (another person in this thread mentioned it as well).

Fact: there is still not enough low-level, hard evidence at this time to determine if the problem is with the AHCI driver, the AMD/ATI IXP600 controller, or Samsung disks. The situations I have dealt with in the past always were inconclusive. There have been reports of problems with non-Samsung disks as well, but the report count there is extremely low in comparison.

Fact: You will find complaints on Linux lists about both the controller and the drives as well (in combo). Take that to mean whatever you wish. Use Google and search for "SB600 HD403LJ Linux" or "SB600 Samsung Linux" and see for yourself.

Fact: Samsung's SpinPoint series has had a troubling past of firmware bugs. Things have gotten better on their newer-ish drives, but the slightly older ones, to me, seemed more like a learning experience for engineers. I am not picking on Samsung exclusively; all drive vendors have had problems historically, there is no such thing as a reliable drive vendor in this day and age.
You go with whatever works for you/whatever your experiences justify. All that said: There is some code in sys/dev/ahci/ahci.c that indicates one-off behaviour for the SB600/IXP600, pertaining to NCQ. However this was committed a long time ago (r196777 and r196796). I look at this code and I can think of one problem with it, but answers to my below questions will provide what I need. The kernel is the default GENERIC from the snapshot, the only additional modules loaded are geom_mirror and linux.ko. The dmesg messages related to disks are copied below, and the full dmesg can be found here: http://www.secnetix.de/olli/tmp/dmesg.nox.txt Best regards Oliver FreeBSD 9.1-STABLE #0: Mon May 13 05:10:23 UTC 2013 r...@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64 .. ahci0: ATI IXP600 AHCI SATA controller port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem