Re: problems with AHCI on FreeBSD 8.2
Yesterday I backed up the sensitive data on the pool and decided to just break stuff on purpose ;) I used dd to write over the sector that smartctl had marked as faulty, then ran a smartctl short test. I repeated the process several times until smartctl gave no errors at all on ada3. After that I left the pool doing a scrub, and it seems to have repaired the integrity of the pool:

--
[root@zaibach ~]# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 398K in 10h39m with 0 errors on Thu Feb 16 09:15:59 2012
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada2p1  ONLINE       0     0     0
            ada1p1  ONLINE       0     0     0
            ada3p1  ONLINE       0     0    11
            ada0p1  ONLINE       0     0     0
--

Funnily enough, though, I got an AHCI timeout on another drive, /dev/ada2.

--
Feb 16 04:08:23 zaibach kernel: ahcich2: Timeout on slot 15 port 0
Feb 16 04:08:23 zaibach kernel: ahcich2: is cs 0004 ss 00078000 rs 00078000 tfd c0 serr cmd 0004d217
--

At least a short smartctl test on /dev/ada2 doesn't seem to complain this time.

On Thu, Feb 16, 2012 at 5:48 AM, John wrote:
> Jeremy Chadwick wrote:
>>
>> CRC errors ...
>>
>> I have no real advice for tracking this kind of problem down. The most
>> common response is "replace cables", which isn't necessarily the root
>> cause. I have no advice or tips on how to track down interference
>> issues, or how to truly examine a disk PCB or controller PCB for the
>> latter item. "Flaky traces" on a PCB could cause this sort of thing.
>> Folks in the EE field would know more about these issues; I am not an EE
>> person.
>>
>> Since the attribute increased on both drives simultaneously (I have to
>> assume simultaneously?), it's more likely that the problem is not with
>> SATA cables or the drives but the controller on the motherboard. I'd
>> recommend replacing the motherboard. I make no guarantees this will fix
>> anything however, but it is the "common point" for both of your drives.
>
> This EE agrees with your advice. I would add that if replacing the
> motherboard fails to fix the problem, then replace the power supply.
> Even with extremely high-end test equipment, you likely would never be
> able to see the failure occur, for at least two reasons: the most likely
> failure mode is inside a single IC, and adding probes would alter the
> environment enough to change the failure mode.
>
> John Theus
> TheUs Group
> TheUsGroup.com
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
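A note on reading a `zpool status` report like the one in this message: after a scrub, the thing to watch is any pool member with a non-zero READ, WRITE or CKSUM counter. The following is only a minimal sketch, assuming the usual five-column config table; the function name `zpool_error_devs` is made up for illustration and is not part of any ZFS tool.

```shell
# Print every pool member whose READ, WRITE or CKSUM counter is non-zero.
# Reads `zpool status` output on stdin.
# Hypothetical helper name; not a ZFS command.
zpool_error_devs() {
    awk '$2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ &&
         NF >= 5 && ($3 + $4 + $5) > 0 { print $1 }'
}
```

It would be used as `zpool status tank | zpool_error_devs`. Matching on the STATE column keeps the header row and the free-text status lines out of the result.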
Re: problems with AHCI on FreeBSD 8.2
Jeremy Chadwick wrote:
>
> CRC errors ...
>
> I have no real advice for tracking this kind of problem down. The most
> common response is "replace cables", which isn't necessarily the root
> cause. I have no advice or tips on how to track down interference
> issues, or how to truly examine a disk PCB or controller PCB for the
> latter item. "Flaky traces" on a PCB could cause this sort of thing.
> Folks in the EE field would know more about these issues; I am not an EE
> person.
>
> Since the attribute increased on both drives simultaneously (I have to
> assume simultaneously?), it's more likely that the problem is not with
> SATA cables or the drives but the controller on the motherboard. I'd
> recommend replacing the motherboard. I make no guarantees this will fix
> anything however, but it is the "common point" for both of your drives.

This EE agrees with your advice. I would add that if replacing the motherboard fails to fix the problem, then replace the power supply. Even with extremely high-end test equipment, you likely would never be able to see the failure occur, for at least two reasons: the most likely failure mode is inside a single IC, and adding probes would alter the environment enough to change the failure mode.

John Theus
TheUs Group
TheUsGroup.com
Re: problems with AHCI on FreeBSD 8.2
On Wed, Feb 15, 2012 at 07:17:57PM +0100, Victor Balada Diaz wrote:
> On Tue, Feb 14, 2012 at 06:16:01AM -0800, Jeremy Chadwick wrote:
> > Thanks. Both your drives look overall fine, sort-of. I'll outline my
> > concern points, and ask for some more info:
> >
> > * ada0 has 28 CRC errors, while ada1 has 2. These drives have been in
> >   use for 4688 hours and 4583 hours (respectively), which is roughly 6
> >   months for each drive. CRC errors usually result in transparent
> >   retransmits, but this can sometimes cause I/O delays (especially if
> >   the CRC errors are repeated).
> >
> > If the timeout messages recur in the future, please run the commands I
> > gave you above once more and provide the output. I can then compare the
> > old to the new and see if there is anything of interest.
>
> I've made it fail again. You can see the smartctl -a output. CRC errors
> are increasing. But I'm not sure what it really means. Is the HD broken?
> Both? At the same time?

CRC errors indicate one of the following, in no particular order:

* Physical cabling problems (number of reasons/possibilities here are
  too many to list)
* Dirty/dusty SATA connectors (cables/drive/host controller)
* Electrical interference (badly shielded cables, etc.)
* Physical electronic/electrical problems (disk PCB, host controller
  PCB, etc.)

The important thing to remember about CRCs is that they indicate a hardware-level problem between the host controller and the controller chip on the drive. They do not indicate problems with the drive's cache (those are tracked in attribute 184), and they do not indicate software-level problems (e.g. driver bugs, etc.).

I have no real advice for tracking this kind of problem down. The most common response is "replace cables", which isn't necessarily the root cause. I have no advice or tips on how to track down interference issues, or how to truly examine a disk PCB or controller PCB for the latter item. "Flaky traces" on a PCB could cause this sort of thing. Folks in the EE field would know more about these issues; I am not an EE person.

Since the attribute increased on both drives simultaneously (I have to assume simultaneously?), it's more likely that the problem is not with SATA cables or the drives but the controller on the motherboard. I'd recommend replacing the motherboard. I make no guarantees this will fix anything however, but it is the "common point" for both of your drives.

There really isn't anything else I can do going forward. This is pretty much where the buck stops for me, and is validation as to why each and every problem/issue has to be handled individually.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
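Comparing an old and a new `smartctl` report for CRC errors comes down to comparing one raw counter. Here is a minimal sketch, assuming the drive reports interface CRC errors as attribute 199 (UDMA_CRC_Error_Count); that is the ID smartmontools conventionally uses, but the attribute line itself is not shown in this thread. `crc_count` and `crc_delta` are hypothetical helper names.

```shell
# Print the RAW_VALUE (10th column) of attribute 199 from the attribute
# table that `smartctl -A` prints. Reads the report on stdin.
# Assumption: the drive uses ID 199 for its interface CRC error counter.
crc_count() {
    awk '$1 == 199 { print $10 }'
}

# Print how much the counter grew between two readings: crc_delta OLD NEW
crc_delta() {
    echo $(( $2 - $1 ))
}
```

A typical use would be to save `smartctl -A /dev/ada0 | crc_count` before and after reproducing the timeouts and pass both numbers to `crc_delta`. A counter that grows on several drives at once points at something the drives share, which is the reasoning behind suspecting the controller.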
Re: problems with AHCI on FreeBSD 8.2
On Tue, Feb 14, 2012 at 06:16:01AM -0800, Jeremy Chadwick wrote:
> Thanks. Both your drives look overall fine, sort-of. I'll outline my
> concern points, and ask for some more info:
>
> * ada0 has 28 CRC errors, while ada1 has 2. These drives have been in
>   use for 4688 hours and 4583 hours (respectively), which is roughly 6
>   months for each drive. CRC errors usually result in transparent
>   retransmits, but this can sometimes cause I/O delays (especially if the
>   CRC errors are repeated).
>
> If the timeout messages recur in the future, please run the commands I
> gave you above once more and provide the output. I can then compare the
> old to the new and see if there is anything of interest.

I've made it fail again. You can see the smartctl -a output. CRC errors are increasing. But I'm not sure what it really means. Is the HD broken? Both? At the same time?

# smartctl -a /dev/ada0
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.1-RELEASE-p5 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F2 EG
Device Model:     SAMSUNG HD154UI
Serial Number:    S24EJ9BB200080
LU WWN Device Id: 5 0024e9 2047cb78f
Firmware Version: 1AG01118
User Capacity:    1,500,301,910,016 bytes [1.50 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Wed Feb 15 18:11:31 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (18863) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  33) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   072   072   011    Pre-fail  Always       -       9330
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       22
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       13677
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4716
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       22
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error
Re: problems with AHCI on FreeBSD 8.2
On Tue, Feb 14, 2012 at 04:42:47PM -0700, Scott Long wrote:
> On Feb 14, 2012, at 4:34 PM, Victor Balada Diaz wrote:
> > On Tue, Feb 14, 2012 at 03:09:58PM -0800, Jeremy Chadwick wrote:
> >> I took a stab at this, but I don't feel confident this is the proper
> >> solution/method. I worry there's some sort of chicken-or-the-egg
> >> condition here (quirk setup/matching comes *after* SATA capabilities
> >> detection), or that it makes the code messier. Need mav@'s
> >> recommendations on this.
> >>
> >> Below is for RELENG_8. I should note I haven't tested if this works, or
> >> even compiles -- normally I don't provide such patches without testing
> >> so I apologise in advance / user beware.
> >
> > You're amazingly fast. Thanks for all your help :)
> >
> > You start applying the quirks before
> >
> >    snprintf(announce_buf, sizeof(announce_buf),
> >        "kern.cam.ada.%d.quirks", periph->unit_number);
> >    quirks = softc->quirks;
> >    TUNABLE_INT_FETCH(announce_buf, &quirks);
> >
> > So you're breaking quirk setting at boot time.
> >
> > See my attached patch. I can confirm it works for me.
> >
> > Regards.
>
> I don't think that disabling NCQ entirely is the right solution. It's a tag
> starvation issue in the firmware, not a complete failure, and it can be dealt
> with in the CAM XPT scheduler fairly efficiently. Alexander and I talked
> about this recently, and though we differ on the details, a tag hack is not
> in order, IMHO. In the short term, try just using "camcontrol tags ada0 -N
> 1" to limit the concurrent commands to 1.
>
> Scott

It seems that changing the tags on both disks doesn't fix the issue:

(ada0:ahcich0:0:0:0): Request requeued
(ada0:ahcich0:0:0:0): Retrying command
ahcich1: Timeout on slot 0
ahcich1: is cs 0001 ss rs 0001 tfd c0 serr
ahcich1: AHCI reset...
ahcich1: SATA connect time=0ms status=0123
ahcich1: ready wait time=18ms
ahcich1: AHCI reset done: device found
(ada1:ahcich1:0:0:0): Request requeued
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): Command timed out
(ada1:ahcich1:0:0:0): Retrying command
ahcich0: Timeout on slot 30
ahcich0: is cs c000 ss rs c000 tfd c0 serr
ahcich0: AHCI reset...
ahcich0: SATA connect time=0ms status=0123
ahcich0: ready wait time=18ms
ahcich0: AHCI reset done: device found

The only difference is that now I get the "Request requeued" message.

--
The most convincing proof that intelligent life exists on other planets is that they have not tried to contact us.
Re: problems with AHCI on FreeBSD 8.2
On Wed, Feb 15, 2012 at 10:52 AM, Jeremy Chadwick wrote:
> Sorry, I missed the in-line part of your post at the top where you said:
>
>> > Interesting. I have 9 SAMSUNG HD154UI 1AG01118 in my raidz setup,
>> > haven't had a problem with any of them yet (touch wood).
>
> So that would be you using the same firmware (or so I'd like to believe,
> but see my previous explanations) as others.
>

Yes, that draws my ire too. How can you update the firmware and not change the firmware revision? It is crazy.

It's possible that even amongst my drives there are different revisions, as not all drives were bought at the same time.

Cheers

Tom
Re: problems with AHCI on FreeBSD 8.2
Jeremy Chadwick wrote on 15.02.2012 11:42 (localtime):
> On Wed, Feb 15, 2012 at 10:19:37AM +, Tom Evans wrote:
>> On Tue, Feb 14, 2012 at 7:52 PM, Jeremy Chadwick wrote:
>>> On Tue, Feb 14, 2012 at 08:31:23PM +0100, Oscar Prieto wrote:
>>>> I used to have tons of ahci errors in my 4-disk raidz1 of HD154UIs
>>>> when the rig was built a year or so ago (with 8.0-RELEASE), but
>>>> they disappeared after tuning ZFS.
>>>>
>>>> Sadly, I also got a new timeout a few days ago, followed by
>>>> smartctl errors that I have not checked yet but I guess could be
>>>> legit; I still have to test/swap cables and give it a try.
>>
>> Interesting. I have 9 SAMSUNG HD154UI 1AG01118 in my raidz setup,
>> haven't had a problem with any of them yet (touch wood).
>>
>>> Further details which pertain to Samsung drives:
>>>
>>> In your case, you run smartd(8), which periodically hits the drive with
>>> SMART requests, pulling attribute data down and parsing it. I believe
>>> your model is fine for this, but for similar Samsung models, I must
>>> strongly advise against this. There are well-documented problems with
>>> Samsung firmwares and SMART behaviour which can result in data loss (yes
>>> you read that right). Please see smartmontools' Wiki page on the matter
>>> for full details. Just make sure you're running a fixed firmware:
>>>
>>> http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
>>>
>> Yikes, I have just this week installed a HD204UI. From that page,
>> drives manufactured after December 2010 should not be affected, which
>> is fortunate as the linked firmware page doesn't seem to exist
>> anymore, Samsung no longer seem to offer support for their drives and
>> point you at Seagate, whose site (of course!) only has downloads for
>> current Seagate drives.
>>
>> Hmm, reading later on in the thread there is a patch to mark certain
>> drives as having flaky NCQ - in the patch it is for the SAMSUNG
>> HD154UI. As I mentioned before, I have 9 SAMSUNG HD154UI, all of which
>> use ahci(4) and NCQ, and all work perfectly, no timeouts. This is
>> using 9-STABLE.
>>
>> I suspect that there may be more going on than 'flaky NCQ', and that
>> perhaps disabling NCQ masks the real issue.
> It could simply be a firmware bug in the drive, which is what some
> others have alluded to (and I'm in agreement with). I would love to say
> "compare firmware versions on your drives", except there is real
> in-the-field proof that firmware version strings often do not get
> updated/changed between firmwares (at least in the case of some Seagate
> and Western Digital disks). Furthermore, NCQ can "play differently" with
> different AHCI controllers.
>
> That said, the disks / firmware versions mentioned by people involved in
> this thread / referenced threads are:
>
> * Victor Balada Diaz  -- SAMSUNG HD154UI, firmware 1AG01118
> * Claudius Herder     -- SAMSUNG HD753LJ, firmware 1AA01118
> * Oscar Prieto        -- SAMSUNG HD154UI, firmware 1AG01118
>   - NOTE: In Oscar's case, his drives exhibit other problems. I
>     would provide a link but the web archive for freebsd-stable does
>     not show my mail which contains analysis of the situation
> * Harald Schmalzbauer -- not provided, but hints at Samsung EG drives

-- SAMSUNG HD154UI, firmware 1AG01118

I still have them for "outsourcing" in one server, where they idle all the time.

Thanks,

-Harry
Re: problems with AHCI on FreeBSD 8.2
On Wed, Feb 15, 2012 at 02:42:05AM -0800, Jeremy Chadwick wrote:
> On Wed, Feb 15, 2012 at 10:19:37AM +, Tom Evans wrote:
> > On Tue, Feb 14, 2012 at 7:52 PM, Jeremy Chadwick wrote:
> > > On Tue, Feb 14, 2012 at 08:31:23PM +0100, Oscar Prieto wrote:
> > >> I used to have tons of ahci errors in my 4-disk raidz1 of HD154UIs
> > >> when the rig was built a year or so ago (with 8.0-RELEASE), but
> > >> they disappeared after tuning ZFS.
> > >>
> > >> Sadly, I also got a new timeout a few days ago, followed by
> > >> smartctl errors that I have not checked yet but I guess could be
> > >> legit; I still have to test/swap cables and give it a try.
> >
> > Interesting. I have 9 SAMSUNG HD154UI 1AG01118 in my raidz setup,
> > haven't had a problem with any of them yet (touch wood).
> >
> > > Further details which pertain to Samsung drives:
> > >
> > > In your case, you run smartd(8), which periodically hits the drive with
> > > SMART requests, pulling attribute data down and parsing it. I believe
> > > your model is fine for this, but for similar Samsung models, I must
> > > strongly advise against this. There are well-documented problems with
> > > Samsung firmwares and SMART behaviour which can result in data loss (yes
> > > you read that right). Please see smartmontools' Wiki page on the matter
> > > for full details. Just make sure you're running a fixed firmware:
> > >
> > > http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
> > >
> >
> > Yikes, I have just this week installed a HD204UI. From that page,
> > drives manufactured after December 2010 should not be affected, which
> > is fortunate as the linked firmware page doesn't seem to exist
> > anymore, Samsung no longer seem to offer support for their drives and
> > point you at Seagate, whose site (of course!) only has downloads for
> > current Seagate drives.
> >
> > Hmm, reading later on in the thread there is a patch to mark certain
> > drives as having flaky NCQ - in the patch it is for the SAMSUNG
> > HD154UI. As I mentioned before, I have 9 SAMSUNG HD154UI, all of which
> > use ahci(4) and NCQ, and all work perfectly, no timeouts. This is
> > using 9-STABLE.
> >
> > I suspect that there may be more going on than 'flaky NCQ', and that
> > perhaps disabling NCQ masks the real issue.
>
> It could simply be a firmware bug in the drive, which is what some
> others have alluded to (and I'm in agreement with). I would love to say
> "compare firmware versions on your drives", except there is real
> in-the-field proof that firmware version strings often do not get
> updated/changed between firmwares (at least in the case of some Seagate
> and Western Digital disks). Furthermore, NCQ can "play differently" with
> different AHCI controllers.
>
> That said, the disks / firmware versions mentioned by people involved in
> this thread / referenced threads are:
>
> * Victor Balada Diaz  -- SAMSUNG HD154UI, firmware 1AG01118
> * Claudius Herder     -- SAMSUNG HD753LJ, firmware 1AA01118
> * Oscar Prieto        -- SAMSUNG HD154UI, firmware 1AG01118
>   - NOTE: In Oscar's case, his drives exhibit other problems. I
>     would provide a link but the web archive for freebsd-stable does
>     not show my mail which contains analysis of the situation
> * Harald Schmalzbauer -- not provided, but hints at Samsung EG drives
>
> For this to be thorough, one would need to check which AHCI
> controllers are being used and compare those as well.
>
> I think Scott's theory is probably on-the-ball here, as it pertains to
> tag exhaustion, which would manifest itself in the described fashion:
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2012-February/066177.html
>
> I'd urge people experiencing this problem to issue the command Scott
> provided on all their Samsung disks and see if the problem goes away
> after that. If it does, great, and I acknowledge there is no
> loader.conf tunable for doing this, etc. etc. etc. so either make an
> rc.d script that does it after boot-up or something.

Sorry, I missed the in-line part of your post at the top where you said:

> > Interesting. I have 9 SAMSUNG HD154UI 1AG01118 in my raidz setup,
> > haven't had a problem with any of them yet (touch wood).

So that would be you using the same firmware (or so I'd like to believe, but see my previous explanations) as others.

It could be some AHCI<->NCQ drive implementation quirk. There was an example of this back in the day with Maxtor drives' NCQ implementation not behaving correctly on nVidia controllers, which Maxtor insisted was an nVidia problem yet released a drive firmware fix for. I'm one of the people this affected (on my desktop system), which is why I remember it.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.
Re: problems with AHCI on FreeBSD 8.2
On Wed, Feb 15, 2012 at 10:19:37AM +, Tom Evans wrote:
> On Tue, Feb 14, 2012 at 7:52 PM, Jeremy Chadwick wrote:
> > On Tue, Feb 14, 2012 at 08:31:23PM +0100, Oscar Prieto wrote:
> >> I used to have tons of ahci errors in my 4-disk raidz1 of HD154UIs
> >> when the rig was built a year or so ago (with 8.0-RELEASE), but
> >> they disappeared after tuning ZFS.
> >>
> >> Sadly, I also got a new timeout a few days ago, followed by
> >> smartctl errors that I have not checked yet but I guess could be
> >> legit; I still have to test/swap cables and give it a try.
>
> Interesting. I have 9 SAMSUNG HD154UI 1AG01118 in my raidz setup,
> haven't had a problem with any of them yet (touch wood).
>
> > Further details which pertain to Samsung drives:
> >
> > In your case, you run smartd(8), which periodically hits the drive with
> > SMART requests, pulling attribute data down and parsing it. I believe
> > your model is fine for this, but for similar Samsung models, I must
> > strongly advise against this. There are well-documented problems with
> > Samsung firmwares and SMART behaviour which can result in data loss (yes
> > you read that right). Please see smartmontools' Wiki page on the matter
> > for full details. Just make sure you're running a fixed firmware:
> >
> > http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
> >
>
> Yikes, I have just this week installed a HD204UI. From that page,
> drives manufactured after December 2010 should not be affected, which
> is fortunate as the linked firmware page doesn't seem to exist
> anymore, Samsung no longer seem to offer support for their drives and
> point you at Seagate, whose site (of course!) only has downloads for
> current Seagate drives.
>
> Hmm, reading later on in the thread there is a patch to mark certain
> drives as having flaky NCQ - in the patch it is for the SAMSUNG
> HD154UI. As I mentioned before, I have 9 SAMSUNG HD154UI, all of which
> use ahci(4) and NCQ, and all work perfectly, no timeouts. This is
> using 9-STABLE.
>
> I suspect that there may be more going on than 'flaky NCQ', and that
> perhaps disabling NCQ masks the real issue.

It could simply be a firmware bug in the drive, which is what some others have alluded to (and I'm in agreement with). I would love to say "compare firmware versions on your drives", except there is real in-the-field proof that firmware version strings often do not get updated/changed between firmwares (at least in the case of some Seagate and Western Digital disks). Furthermore, NCQ can "play differently" with different AHCI controllers.

That said, the disks / firmware versions mentioned by people involved in this thread / referenced threads are:

* Victor Balada Diaz  -- SAMSUNG HD154UI, firmware 1AG01118
* Claudius Herder     -- SAMSUNG HD753LJ, firmware 1AA01118
* Oscar Prieto        -- SAMSUNG HD154UI, firmware 1AG01118
  - NOTE: In Oscar's case, his drives exhibit other problems. I
    would provide a link but the web archive for freebsd-stable does
    not show my mail which contains analysis of the situation
* Harald Schmalzbauer -- not provided, but hints at Samsung EG drives

For this to be thorough, one would need to check which AHCI controllers are being used and compare those as well.

I think Scott's theory is probably on-the-ball here, as it pertains to tag exhaustion, which would manifest itself in the described fashion:

http://lists.freebsd.org/pipermail/freebsd-stable/2012-February/066177.html

I'd urge people experiencing this problem to issue the command Scott provided on all their Samsung disks and see if the problem goes away after that. If it does, great, and I acknowledge there is no loader.conf tunable for doing this, etc. etc. etc. so either make an rc.d script that does it after boot-up or something.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
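The boot-time workaround suggested in this message could look roughly like the sketch below. It assumes that `camcontrol devlist` prints one line per device in the form `<VENDOR MODEL FIRMWARE> at scbusN target N lun N (adaN,passN)`; the function names `samsung_disks` and `limit_samsung_tags` are made up for illustration. Note that one poster in this thread reported that limiting the tags did not stop the timeouts on his system, so treat this as an experiment rather than a fix.

```shell
# Print the adaN name of every device whose identify string starts with
# "<SAMSUNG ". Reads `camcontrol devlist` output on stdin.
# Hypothetical helper; the devlist line format is an assumption.
samsung_disks() {
    awk '/^<SAMSUNG / {
        if (match($0, /ada[0-9]+/))
            print substr($0, RSTART, RLENGTH)
    }'
}

# Limit every Samsung disk to one outstanding command (tag), using the
# camcontrol invocation quoted earlier in the thread.
limit_samsung_tags() {
    for disk in $(camcontrol devlist | samsung_disks); do
        camcontrol tags "$disk" -N 1
    done
}
```

Calling `limit_samsung_tags` from a local startup script would apply the limit after each boot. The parsing is kept separate from the `camcontrol tags` call so the device list can be checked by eye before anything is changed.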
Re: problems with AHCI on FreeBSD 8.2
On Tue, Feb 14, 2012 at 7:52 PM, Jeremy Chadwick wrote:
> On Tue, Feb 14, 2012 at 08:31:23PM +0100, Oscar Prieto wrote:
>> I used to have tons of ahci errors in my 4-disk raidz1 of HD154UIs
>> when the rig was built a year or so ago (with 8.0-RELEASE), but
>> they disappeared after tuning ZFS.
>>
>> Sadly, I also got a new timeout a few days ago, followed by
>> smartctl errors that I have not checked yet but I guess could be
>> legit; I still have to test/swap cables and give it a try.

Interesting. I have 9 SAMSUNG HD154UI 1AG01118 in my raidz setup, haven't had a problem with any of them yet (touch wood).

> Further details which pertain to Samsung drives:
>
> In your case, you run smartd(8), which periodically hits the drive with
> SMART requests, pulling attribute data down and parsing it. I believe
> your model is fine for this, but for similar Samsung models, I must
> strongly advise against this. There are well-documented problems with
> Samsung firmwares and SMART behaviour which can result in data loss (yes
> you read that right). Please see smartmontools' Wiki page on the matter
> for full details. Just make sure you're running a fixed firmware:
>
> http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
>

Yikes, I have just this week installed a HD204UI. From that page, drives manufactured after December 2010 should not be affected, which is fortunate as the linked firmware page doesn't seem to exist anymore, Samsung no longer seem to offer support for their drives and point you at Seagate, whose site (of course!) only has downloads for current Seagate drives.

Hmm, reading later on in the thread there is a patch to mark certain drives as having flaky NCQ - in the patch it is for the SAMSUNG HD154UI. As I mentioned before, I have 9 SAMSUNG HD154UI, all of which use ahci(4) and NCQ, and all work perfectly, no timeouts. This is using 9-STABLE.

I suspect that there may be more going on than 'flaky NCQ', and that perhaps disabling NCQ masks the real issue.

Cheers

Tom
Re: problems with AHCI on FreeBSD 8.2
[cc list somewhat trimmed]

on 15/02/2012 10:29 Victor Balada Diaz said the following:
> Indeed you're right. It's a hack.

Sorry for intruding and under-quoting... perhaps the following commit could be a model solution for your problem here?

http://svnweb.freebsd.org/base?view=revision&revision=231745

(with scsi -> ata, of course)

--
Andriy Gapon
Re: problems with AHCI on FreeBSD 8.2
On Tue, Feb 14, 2012 at 04:10:20PM -0800, Jeremy Chadwick wrote:
> I'm too tired to quite understand (in full) what's wrong with my patch,
> but I think you're referring to situations where someone would have
> kern.cam.ada.X.quirks set in loader.conf?
>
> If so, I believe that same situation would happen presently if someone
> set kern.cam.ada.X.quirks in their loader.conf to a value that did not
> contain bit #0 set to 1, and used one of the 4K sector disks listed in
> ada_quirk_table -- what's in loader.conf looks like it would overwrite
> whatever the kernel code bits chose automatically:
>
> 910         match = cam_quirkmatch((caddr_t)&cgd->ident_data,
> 911                                (caddr_t)ada_quirk_table,
> 912                                sizeof(ada_quirk_table)/sizeof(*ada_quirk_table),
> 913                                sizeof(*ada_quirk_table), ata_identify_match);
> 914         if (match != NULL)
> 915                 softc->quirks = ((struct ada_quirk_entry *)match)->quirks;
> 916         else
> 917                 softc->quirks = ADA_Q_NONE;
> ...
> 931         snprintf(announce_buf, sizeof(announce_buf),
> 932             "kern.cam.ada.%d.quirks", periph->unit_number);
> 933         quirks = softc->quirks;
> 934         TUNABLE_INT_FETCH(announce_buf, &quirks);
> 935         softc->quirks = quirks;
>
> I read this to mean:
>
> Lines 910-917 -- if there's a device ID string match in ada_quirk_table,
> set softc->quirks to the content of that struct entry. So, for example,
> 4K sector disks would set softc->quirks to 0x01.
>
> Lines 931-933 -- assign the "quirks" variable to what softc->quirks
> currently contains. Thus, for 4K sector disks, quirks = 0x01.
>
> Line 934 -- load into "quirks" variable the contents of loader.conf
> entries pertaining to kern.cam.ada.N.quirks, if set. If someone had an
> entry in loader.conf saying kern.cam.ada.N.quirks=0 then yes, this would
> overwrite what the kernel "auto-chose".
>
> Line 935 -- assign softc->quirks = quirks, thus softc->quirks = 0x00,
> assuming someone set it to such in loader.conf.

That's exactly what I meant.

> > See my attached patch. I can confirm it works for me.
>
> I thought of taking that approach, but for me it's "dirty". Here's what
> I mean by that:
>
> ADA_FLAG_CAN_NCQ gets set in softc->flags around line 892, but then
> roughly a hundred lines later, you clear the exact same flag you just
> set (based on quirk conditionals).
>
> I dunno how people feel about that, but my impression is that it's
> dirty/confusing. My opinion is to only set the bit once and not mess
> about with repeated if() statements that set it, then clear it, etc...

Indeed, you're right. It's a hack. It would be better to move quirk
evaluation before checking capabilities.

-- 
The most convincing proof that intelligent life exists on other planets is
that they have not tried to contact us.
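The ordering both posters converge on (table lookup, then loader.conf override, then capability detection) can be sketched as a small standalone C model. This is only an illustration of the control flow, not the ata_da.c code: the names `quirk_table`, `table_quirks`, `effective_quirks` and `device_flags` are hypothetical, the prefix comparison is a simplified stand-in for cam_quirkmatch(), and the `tunable` pointer stands in for what TUNABLE_INT_FETCH would read from loader.conf.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the ata_da.c names discussed in the thread. */
enum { Q_NONE = 0x00, Q_4K = 0x01, Q_NONCQ = 0x02 };
enum { FLAG_CAN_NCQ = 0x01 };

struct quirk_entry {
    const char *model_prefix;   /* simplified match: prefix compare only */
    int quirks;
};

static const struct quirk_entry quirk_table[] = {
    { "SAMSUNG HD154UI", Q_NONCQ },
    { "SAMSUNG HD155UI", Q_4K },
};

/* Step 1: table lookup, the role of lines 910-917 in the quoted code. */
static int table_quirks(const char *model)
{
    assert(model != NULL);
    for (size_t i = 0; i < sizeof(quirk_table) / sizeof(quirk_table[0]); i++) {
        const char *prefix = quirk_table[i].model_prefix;
        if (strncmp(model, prefix, strlen(prefix)) == 0)
            return quirk_table[i].quirks;
    }
    return Q_NONE;
}

/*
 * Step 2: a loader tunable, when present, replaces the table value
 * (the effect of lines 931-935). NULL means "not set in loader.conf".
 */
static int effective_quirks(const char *model, const int *tunable)
{
    int quirks = table_quirks(model);
    if (tunable != NULL)
        quirks = *tunable;
    return quirks;
}

/*
 * Step 3: capability detection runs last, so the NCQ flag is set at
 * most once and never has to be cleared afterwards.
 */
static int device_flags(const char *model, const int *tunable, int hw_ncq)
{
    int flags = 0;
    if (hw_ncq && !(effective_quirks(model, tunable) & Q_NONCQ))
        flags |= FLAG_CAN_NCQ;
    return flags;
}
```

In this model, calling `device_flags("SAMSUNG HD154UI", NULL, 1)` leaves NCQ off because of the table entry, while passing a tunable of 0 turns it back on, which is the loader.conf override behaviour described for lines 931-935.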
Re: problems with AHCI on FreeBSD 8.2
On Feb 14, 2012, at 4:34 PM, Victor Balada Diaz wrote: > On Tue, Feb 14, 2012 at 03:09:58PM -0800, Jeremy Chadwick wrote: >> On Tue, Feb 14, 2012 at 11:15:27PM +0100, Victor Balada Diaz wrote: >>> On Tue, Feb 14, 2012 at 06:17:19PM +0100, Harald Schmalzbauer wrote: schrieb Jeremy Chadwick am 14.02.2012 17:50 (localtime): > On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote: >> Hello, >> >> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still >> persists on FreeBSD 9.0 release. >> >> Switching from ahci to ataahci resolved the problem for me too. >> >> I'm using gmirror for swap, system is on a zpool and the problem first >> occurred during a zpool scrub, but it is easily reproducible with dd. >> >> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1} >> of=/dev/null is not an issue. >> Sometimes I need to power off the server because after a reboot one disk >> is still missing. >> >> I really would like to help in this issue, so let me know if you need >> any more information. > I find it interesting that, at least so far, the only people reporting > problems of this type with the ahci.ko driver are people using Samsung > disks. The only difference is that your models are F1s while the OPs > are F2s. I saw such timeouts long ago and mav@ had a look at my postings and he mentioned it could be a NCQ problem. I suspected the disks firmware. I never tracked it down further, because after replacing the Samsung (F3 in that case) disks with hitachi ones solved all my problems and gave a big performance kick as well (with zfs). You can find the discussion here: http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html >>> >>> You gave me a good idea: try to disable NCQ and see if that's the fault. So >>> i went and applied the attached patch. After it, i can no longer reproduce >>> the issue with ahci driver. 
>>>
>>> I know this is not a solution because it disables NCQ at controller level
>>> instead of disk level, but at least we know for sure where the problem is.
>>>
>>> I think the solution would be to add a new quirk ADA_Q_NONCQ in
>>> sys/cam/ata/ata_da.c.
>>> Quirks infrastructure is already built, so adding a new quirk for this
>>> seems easy.
>>>
>>> Is someone interested? Do you think there is a better solution?
>>>
>>> If someone is interested i can build a patch to add ADA_Q_NONCQ quirk and
>>> add my drives to it.
>>
>> I took a stab at this, but I don't feel confident this is the proper
>> solution/method. I worry there's some sort of chicken-or-the-egg
>> condition here (quirk setup/matching comes *after* SATA capabilities
>> detection), or that it makes the code messier. Need mav@'s
>> recommendations on this.
>>
>> Below is for RELENG_8. I should note I haven't tested if this works, or
>> even compiles -- normally I don't provide such patches without testing
>> so I apologise in advance / user beware.
>
> You're amazingly fast. Thanks for all your help :)
>
> You start applying the quirks before
>
>     snprintf(announce_buf, sizeof(announce_buf),
>         "kern.cam.ada.%d.quirks", periph->unit_number);
>     quirks = softc->quirks;
>     TUNABLE_INT_FETCH(announce_buf, &quirks);
>
> So you're breaking quirk setting at boot time.
>
> See my attached patch. I can confirm it works for me.
>
> Regards.
>

I don't think that disabling NCQ entirely is the right solution. It's a tag
starvation issue in the firmware, not a complete failure, and it can be dealt
with in the CAM XPT scheduler fairly efficiently. Alexander and I talked about
this recently, and though we differ on the details, a tag hack is not in
order, IMHO. In the short term, try just using "camcontrol tags ada0 -N 1" to
limit the concurrent commands to 1.

Scott
Re: problems with AHCI on FreeBSD 8.2
On Wed, Feb 15, 2012 at 12:34:20AM +0100, Victor Balada Diaz wrote: > On Tue, Feb 14, 2012 at 03:09:58PM -0800, Jeremy Chadwick wrote: > > On Tue, Feb 14, 2012 at 11:15:27PM +0100, Victor Balada Diaz wrote: > > > On Tue, Feb 14, 2012 at 06:17:19PM +0100, Harald Schmalzbauer wrote: > > > > schrieb Jeremy Chadwick am 14.02.2012 17:50 (localtime): > > > > > On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote: > > > > >> Hello, > > > > >> > > > > >> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it > > > > >> still > > > > >> persists on FreeBSD 9.0 release. > > > > >> > > > > >> Switching from ahci to ataahci resolved the problem for me too. > > > > >> > > > > >> I'm using gmirror for swap, system is on a zpool and the problem > > > > >> first > > > > >> occurred during a zpool scrub, but it is easily reproducible with dd. > > > > >> > > > > >> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1} > > > > >> of=/dev/null is not an issue. > > > > >> Sometimes I need to power off the server because after a reboot one > > > > >> disk > > > > >> is still missing. > > > > >> > > > > >> I really would like to help in this issue, so let me know if you need > > > > >> any more information. > > > > > I find it interesting that, at least so far, the only people reporting > > > > > problems of this type with the ahci.ko driver are people using Samsung > > > > > disks. The only difference is that your models are F1s while the OPs > > > > > are F2s. > > > > > > > > I saw such timeouts long ago and mav@ had a look at my postings and he > > > > mentioned it could be a NCQ problem. > > > > I suspected the disks firmware. > > > > I never tracked it down further, because after replacing the Samsung (F3 > > > > in that case) disks with hitachi ones solved all my problems and gave a > > > > big performance kick as well (with zfs). 
> > > > You can find the discussion here: > > > > http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html > > > > > > > > > > You gave me a good idea: try to disable NCQ and see if that's the fault. > > > So > > > i went and applied the attached patch. After it, i can no longer reproduce > > > the issue with ahci driver. > > > > > > I know this is not a solution because it disables NCQ at controller level > > > instead of disk level, but at least we know for sure where the problem is. > > > > > > I think the solution would be to add a new quirk ADA_Q_NONCQ in > > > sys/cam/ata/ata_da.c. > > > Quirks infraestructure is already built, so adding a new quirk for this > > > seems > > > easy. > > > > > > Is someone interested? Do you think there is a better solution? > > > > > > If someone is interested i can build a patch to add ADA_Q_NONCQ quirk and > > > add my drives > > > to it. > > > > I took a stab at this, but I don't feel confident this is the proper > > solution/method. I worry there's some sort of chicken-or-the-egg > > condition here (quirk setup/matching comes *after* SATA capabilities > > detection), or that it makes the code messier. Need mav@'s > > recommendations on this. > > > > Below is for RELENG_8. I should note I haven't tested if this works, or > > even compiles -- normally I don't provide such patches without testing > > so I apologise in advance / user beware. > > You're amazingly fast. Thanks for all your help :) > > You start applying the quirks before > > snprintf(announce_buf, sizeof(announce_buf), > "kern.cam.ada.%d.quirks", periph->unit_number); > quirks = softc->quirks; > TUNABLE_INT_FETCH(announce_buf, &quirks); > > So you're breaking quirk setting at boot time. I'm too tired to quite understand (in full) what's wrong with my patch, but I think you're referring to situations where someone would have kern.cam.ada.X.quirks set in loader.conf? 
If so, I believe that same situation would happen presently if someone set kern.cam.ada.X.quirks in their loader.conf to a value that did not contain bit #0 set to 1, and used one of the 4K sector disks listed in ada_quirk_table -- what's in loader.conf looks like it would overwrite whatever the kernel code bits chose automatically: 910 match = cam_quirkmatch((caddr_t)&cgd->ident_data, 911(caddr_t)ada_quirk_table, 912 sizeof(ada_quirk_table)/sizeof(*ada_quirk_table), 913sizeof(*ada_quirk_table), ata_identify_match); 914 if (match != NULL) 915 softc->quirks = ((struct ada_quirk_entry *)match)->quirks; 916 else 917 softc->quirks = ADA_Q_NONE; ... 931 snprintf(announce_buf, sizeof(announce_buf), 932 "kern.cam.ada.%d.quirks", periph->unit_number); 933 quirks = softc->quirks; 934 TUNABLE_INT_FETCH(announce_buf, &quirks); 935 softc->quirks = quirks; I read this to mean: Lines 910-917 -- if there's a device ID st
Re: problems with AHCI on FreeBSD 8.2
On Tue, Feb 14, 2012 at 03:09:58PM -0800, Jeremy Chadwick wrote: > On Tue, Feb 14, 2012 at 11:15:27PM +0100, Victor Balada Diaz wrote: > > On Tue, Feb 14, 2012 at 06:17:19PM +0100, Harald Schmalzbauer wrote: > > > schrieb Jeremy Chadwick am 14.02.2012 17:50 (localtime): > > > > On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote: > > > >> Hello, > > > >> > > > >> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it > > > >> still > > > >> persists on FreeBSD 9.0 release. > > > >> > > > >> Switching from ahci to ataahci resolved the problem for me too. > > > >> > > > >> I'm using gmirror for swap, system is on a zpool and the problem first > > > >> occurred during a zpool scrub, but it is easily reproducible with dd. > > > >> > > > >> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1} > > > >> of=/dev/null is not an issue. > > > >> Sometimes I need to power off the server because after a reboot one > > > >> disk > > > >> is still missing. > > > >> > > > >> I really would like to help in this issue, so let me know if you need > > > >> any more information. > > > > I find it interesting that, at least so far, the only people reporting > > > > problems of this type with the ahci.ko driver are people using Samsung > > > > disks. The only difference is that your models are F1s while the OPs > > > > are F2s. > > > > > > I saw such timeouts long ago and mav@ had a look at my postings and he > > > mentioned it could be a NCQ problem. > > > I suspected the disks firmware. > > > I never tracked it down further, because after replacing the Samsung (F3 > > > in that case) disks with hitachi ones solved all my problems and gave a > > > big performance kick as well (with zfs). > > > You can find the discussion here: > > > http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html > > > > > > > You gave me a good idea: try to disable NCQ and see if that's the fault. So > > i went and applied the attached patch. 
After it, i can no longer reproduce > > the issue with ahci driver. > > > > I know this is not a solution because it disables NCQ at controller level > > instead of disk level, but at least we know for sure where the problem is. > > > > I think the solution would be to add a new quirk ADA_Q_NONCQ in > > sys/cam/ata/ata_da.c. > > Quirks infraestructure is already built, so adding a new quirk for this > > seems > > easy. > > > > Is someone interested? Do you think there is a better solution? > > > > If someone is interested i can build a patch to add ADA_Q_NONCQ quirk and > > add my drives > > to it. > > I took a stab at this, but I don't feel confident this is the proper > solution/method. I worry there's some sort of chicken-or-the-egg > condition here (quirk setup/matching comes *after* SATA capabilities > detection), or that it makes the code messier. Need mav@'s > recommendations on this. > > Below is for RELENG_8. I should note I haven't tested if this works, or > even compiles -- normally I don't provide such patches without testing > so I apologise in advance / user beware. You're amazingly fast. Thanks for all your help :) You start applying the quirks before snprintf(announce_buf, sizeof(announce_buf), "kern.cam.ada.%d.quirks", periph->unit_number); quirks = softc->quirks; TUNABLE_INT_FETCH(announce_buf, &quirks); So you're breaking quirk setting at boot time. See my attached patch. I can confirm it works for me. Regards. -- La prueba más fehaciente de que existe vida inteligente en otros planetas, es que no han intentado contactar con nosotros. 

--- ata_da.c	2012-02-14 22:17:54.0 +0100
+++ ata_da.c	2012-02-14 22:58:05.0 +0100
@@ -91,6 +91,7 @@
 typedef enum {
 	ADA_Q_NONE		= 0x00,
 	ADA_Q_4K		= 0x01,
+	ADA_Q_NONCQ		= 0x02,
 } ada_quirks;
 
 typedef enum {
@@ -162,6 +163,14 @@
 		/*quirks*/ADA_Q_4K
 	},
 	{
+		/*
+		 * Samsung have NCQ broken:
+		 * http://lists.freebsd.org/pipermail/freebsd-stable/2012-February/066168.html
+		 */
+		{ T_DIRECT, SIP_MEDIA_FIXED, "*", "SAMSUNG HD154UI*", "*" },
+		/*quirks*/ADA_Q_NONCQ
+	},
+	{
 		/* Samsung Advanced Format (4k) drives */
 		{ T_DIRECT, SIP_MEDIA_FIXED, "*", "SAMSUNG HD155UI*", "*" },
 		/*quirks*/ADA_Q_4K
@@ -967,6 +976,10 @@
 	softc->disk->d_maxsize = maxio;
 	softc->disk->d_unit = periph->unit_number;
 	softc->disk->d_flags = 0;
+	/* Disable NCQ if needed */
+	if (softc->flags & ADA_FLAG_CAN_NCQ &&
+	    softc->quirks & ADA_Q_NONCQ)
+		softc->flags ^= ADA_FLAG_CAN_NCQ;
 	if (softc->flags & ADA_FLAG_CAN_FLUSHCACHE)
 		softc->disk->d_flags |= DISKFLAG_CANFLUSHCACHE;
 	if ((softc->flags & ADA_FLAG_CAN_TRIM) ||
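
A note on the last hunk of the patch above: it clears ADA_FLAG_CAN_NCQ with `^=`, which works there only because the surrounding `if` has already confirmed the bit is set. XOR toggles a bit, whereas AND with the complement clears it regardless of its current state. The following standalone sketch shows the difference; the helper names `toggle_flag` and `clear_flag` and the flag values are illustrative assumptions, not taken from ata_da.c.

```c
#include <assert.h>

/* Illustrative flag values; the real ones live in ata_da.c. */
enum { FLAG_CAN_NCQ = 0x0010, FLAG_CAN_TRIM = 0x0080 };

/* XOR flips the bit: set becomes clear, but clear becomes set. */
static unsigned toggle_flag(unsigned flags, unsigned bit)
{
    assert(bit != 0);
    return flags ^ bit;
}

/* AND-NOT always ends with the bit clear, whatever it was before. */
static unsigned clear_flag(unsigned flags, unsigned bit)
{
    assert(bit != 0);
    return flags & ~bit;
}
```

With the guard in place both forms give the same result; without it, `toggle_flag(FLAG_CAN_TRIM, FLAG_CAN_NCQ)` would switch NCQ on for a device that never advertised it, which is why the AND-NOT form is usually the safer way to spell "disable".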
Re: problems with AHCI on FreeBSD 8.2
On Tue, Feb 14, 2012 at 11:15:27PM +0100, Victor Balada Diaz wrote: > On Tue, Feb 14, 2012 at 06:17:19PM +0100, Harald Schmalzbauer wrote: > > schrieb Jeremy Chadwick am 14.02.2012 17:50 (localtime): > > > On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote: > > >> Hello, > > >> > > >> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still > > >> persists on FreeBSD 9.0 release. > > >> > > >> Switching from ahci to ataahci resolved the problem for me too. > > >> > > >> I'm using gmirror for swap, system is on a zpool and the problem first > > >> occurred during a zpool scrub, but it is easily reproducible with dd. > > >> > > >> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1} > > >> of=/dev/null is not an issue. > > >> Sometimes I need to power off the server because after a reboot one disk > > >> is still missing. > > >> > > >> I really would like to help in this issue, so let me know if you need > > >> any more information. > > > I find it interesting that, at least so far, the only people reporting > > > problems of this type with the ahci.ko driver are people using Samsung > > > disks. The only difference is that your models are F1s while the OPs > > > are F2s. > > > > I saw such timeouts long ago and mav@ had a look at my postings and he > > mentioned it could be a NCQ problem. > > I suspected the disks firmware. > > I never tracked it down further, because after replacing the Samsung (F3 > > in that case) disks with hitachi ones solved all my problems and gave a > > big performance kick as well (with zfs). > > You can find the discussion here: > > http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html > > > > You gave me a good idea: try to disable NCQ and see if that's the fault. So > i went and applied the attached patch. After it, i can no longer reproduce > the issue with ahci driver. 
> > I know this is not a solution because it disables NCQ at controller level > instead of disk level, but at least we know for sure where the problem is. > > I think the solution would be to add a new quirk ADA_Q_NONCQ in > sys/cam/ata/ata_da.c. > Quirks infraestructure is already built, so adding a new quirk for this seems > easy. > > Is someone interested? Do you think there is a better solution? > > If someone is interested i can build a patch to add ADA_Q_NONCQ quirk and add > my drives > to it. I took a stab at this, but I don't feel confident this is the proper solution/method. I worry there's some sort of chicken-or-the-egg condition here (quirk setup/matching comes *after* SATA capabilities detection), or that it makes the code messier. Need mav@'s recommendations on this. Below is for RELENG_8. I should note I haven't tested if this works, or even compiles -- normally I don't provide such patches without testing so I apologise in advance / user beware. -- | Jeremy Chadwick j...@parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, US | | Making life hard for others since 1977. 
PGP 4BD6C0CB |

diff -ruN /usr/src/sys/cam/ata/ata_da.c src/sys/cam/ata/ata_da.c
--- /usr/src/sys/cam/ata/ata_da.c	2012-02-10 17:22:25.0 -0800
+++ src/sys/cam/ata/ata_da.c	2012-02-14 15:07:07.988814133 -0800
@@ -90,7 +90,8 @@
 
 typedef enum {
 	ADA_Q_NONE		= 0x00,
-	ADA_Q_4K		= 0x01,
+	ADA_Q_4K		= 0x01,	/* 4k sectors */
+	ADA_Q_NONCQ		= 0x02,	/* device has flaky NCQ support */
 } ada_quirks;
 
 typedef enum {
@@ -162,6 +163,11 @@
 		/*quirks*/ADA_Q_4K
 	},
 	{
+		/* Samsung Spinpoint F2 EG (EcoGreen) drives */
+		{ T_DIRECT, SIP_MEDIA_FIXED, "*", "SAMSUNG HD154UI*", "*" },
+		/*quirks*/ADA_Q_NONCQ,
+	},
+	{
 		/* Samsung Advanced Format (4k) drives */
 		{ T_DIRECT, SIP_MEDIA_FIXED, "*", "SAMSUNG HD155UI*", "*" },
 		/*quirks*/ADA_Q_4K
@@ -887,9 +893,6 @@
 		softc->flags |= ADA_FLAG_CAN_FLUSHCACHE;
 	if (cgd->ident_data.support.command1 & ATA_SUPPORT_POWERMGT)
 		softc->flags |= ADA_FLAG_CAN_POWERMGT;
-	if (cgd->ident_data.satacapabilities & ATA_SUPPORT_NCQ &&
-	    (cgd->inq_flags & SID_DMA) && (cgd->inq_flags & SID_CmdQue))
-		softc->flags |= ADA_FLAG_CAN_NCQ;
 	if (cgd->ident_data.support_dsm & ATA_SUPPORT_DSM_TRIM) {
 		softc->flags |= ADA_FLAG_CAN_TRIM;
 		softc->trim_max_ranges = TRIM_MAX_RANGES;
@@ -916,6 +919,15 @@
 	else
 		softc->quirks = ADA_Q_NONE;
 
+	/*
+	 * Do not enable NCQ for devices which have the ADA_Q_NONCQ quirk.
+	 */
+	if (!(softc->quirks & ADA_Q_NONCQ)) {
+		if (cgd->ident_data.satacapabilities & ATA_SUPPORT_NCQ &&
+		    (cgd->i
Re: problems with AHCI on FreeBSD 8.2
Thank you again Jeremy, sure it helps! On Tue, Feb 14, 2012 at 9:31 PM, Jeremy Chadwick wrote: > On Tue, Feb 14, 2012 at 09:19:02PM +0100, Oscar Prieto wrote: >> Thank you Jeremy, i'm already checking your links. >> >> When i installed smartd i configured a daily short test and a weekly >> long one for all the drives while the machine remains mostly unused, >> never thought it could be a problem reading the documentation and info >> around. >> >> # /usr/local/etc/smartd.conf >> /dev/ada0 -a -o on -S on -s (S/../.././03|L/../../2/07) >> /dev/ada1 -a -o on -S on -s (S/../.././04|L/../../3/07) >> /dev/ada2 -a -o on -S on -s (S/../.././05|L/../../4/07) >> /dev/ada3 -a -o on -S on -s (S/../.././06|L/../../5/07) > > The problem is that, quite honestly, these do you zero good. All it does > is make a mess (per se) of the SMART self-test log. > > Take for example your situation with ada3: smartd(8) told you that the > number of pending sectors increased to 5, and uncorrected increased to > 1. That's really all you need to know at that point. If you want to > know the LBA numbers which are problematic, you can manually intervene. > > The point is: the drive itself is going to notice problematic or bad > sectors quicker than periodic short or long or surface scan tests will. > Let the drive do its thing normally and only use SMART tests when > there's indication something is wrong. > >> I'll remove the checks, do you advice for removing the daemon altogether? > > smartd(8) is useful because it keeps track of attributes which change in > value and logs data to syslog (if I remember right), thus you have an > exact time/date when an attribute changed. This is especially useful > for things pertaining to sector/physical media problems. > > As such, I tend to recommend folks using smartd(8) properly tune their > smartd.conf to only monitor specific attributes. 
This varies from drive > to drive, but the key ones are things like attributes 5, 10, 11, 192, > 193, 194 (if you want temperature logging), 196, 197, 198, 199, and 200. > I'm speaking strictly for Western Digital disks here. > > The stock defaults, if I remember right, are to "monitor everything", > which really doesn't work well given that so many vendors encode their > RAW_VALUE fields in proprietary/vendor-specific formats. People will > often monitor things like the Hardware_ECC_Recovered attribute and start > "freaking out" once day when the value goes from 0 to 838938239 or > something larger. Attribute data formats are not part of the ATA > standard, so vendors choose to encode them. Plus, not many admins that > I've run into (honest) know what that attribute actually means > disk-wise (hint: it's 100% normal for sector ECC to happen at all times; > magnetic media is not perfect, that's what the per-sector ECC section is > for!) > > However: people don't understand what SMART attribute acquisition > actually does behind the scenes -- it results in the disk having to read > from the HPA area (not user accessible or within LBA regions), which > means seeking + moving the arms to an area, reading, then reporting all > of this back. Thus, it impacts I/O performance. This is why I don't > use smartd(8) on any of our systems. But if I was to use it? I would > have it poll maybe every 120 minutes, rather than every 30. It all > depends on the system/load/etc.. I've seen people poll every 5 minutes > (I think they're absolutely crazy/paranoid). Their systems, their > problem. :-) > > Hope this helps. > > -- > | Jeremy Chadwick j...@parodius.com | > | Parodius Networking http://www.parodius.com/ | > | UNIX Systems Administrator Mountain View, CA, US | > | Making life hard for others since 1977. 
PGP 4BD6C0CB | > ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: problems with AHCI on FreeBSD 8.2
On Tue, Feb 14, 2012 at 06:17:19PM +0100, Harald Schmalzbauer wrote: > schrieb Jeremy Chadwick am 14.02.2012 17:50 (localtime): > > On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote: > >> Hello, > >> > >> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still > >> persists on FreeBSD 9.0 release. > >> > >> Switching from ahci to ataahci resolved the problem for me too. > >> > >> I'm using gmirror for swap, system is on a zpool and the problem first > >> occurred during a zpool scrub, but it is easily reproducible with dd. > >> > >> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1} > >> of=/dev/null is not an issue. > >> Sometimes I need to power off the server because after a reboot one disk > >> is still missing. > >> > >> I really would like to help in this issue, so let me know if you need > >> any more information. > > I find it interesting that, at least so far, the only people reporting > > problems of this type with the ahci.ko driver are people using Samsung > > disks. The only difference is that your models are F1s while the OPs > > are F2s. > > I saw such timeouts long ago and mav@ had a look at my postings and he > mentioned it could be a NCQ problem. > I suspected the disks firmware. > I never tracked it down further, because after replacing the Samsung (F3 > in that case) disks with hitachi ones solved all my problems and gave a > big performance kick as well (with zfs). > You can find the discussion here: > http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html > You gave me a good idea: try to disable NCQ and see if that's the fault. So i went and applied the attached patch. After it, i can no longer reproduce the issue with ahci driver. I know this is not a solution because it disables NCQ at controller level instead of disk level, but at least we know for sure where the problem is. I think the solution would be to add a new quirk ADA_Q_NONCQ in sys/cam/ata/ata_da.c. 
Quirks infrastructure is already built, so adding a new quirk for this seems easy.

Is someone interested? Do you think there is a better solution?

If someone is interested I can build a patch to add an ADA_Q_NONCQ quirk and add my drives to it.

Regards.

-- 
The most convincing proof that intelligent life exists on other planets is
that they have not tried to contact us.
Re: problems with AHCI on FreeBSD 8.2
On Tue, Feb 14, 2012 at 09:19:02PM +0100, Oscar Prieto wrote:
> Thank you Jeremy, i'm already checking your links.
>
> When i installed smartd i configured a daily short test and a weekly
> long one for all the drives while the machine remains mostly unused,
> never thought it could be a problem reading the documentation and info
> around.
>
> # /usr/local/etc/smartd.conf
> /dev/ada0 -a -o on -S on -s (S/../.././03|L/../../2/07)
> /dev/ada1 -a -o on -S on -s (S/../.././04|L/../../3/07)
> /dev/ada2 -a -o on -S on -s (S/../.././05|L/../../4/07)
> /dev/ada3 -a -o on -S on -s (S/../.././06|L/../../5/07)

The problem is that, quite honestly, these do you zero good. All they do
is make a mess (per se) of the SMART self-test log.

Take for example your situation with ada3: smartd(8) told you that the
number of pending sectors increased to 5, and uncorrected increased to
1. That's really all you need to know at that point. If you want to
know the LBA numbers which are problematic, you can manually intervene.

The point is: the drive itself is going to notice problematic or bad
sectors quicker than periodic short or long or surface scan tests will.
Let the drive do its thing normally and only use SMART tests when
there's indication something is wrong.

> I'll remove the checks, do you advise removing the daemon altogether?

smartd(8) is useful because it keeps track of attributes which change in
value and logs data to syslog (if I remember right), thus you have an
exact time/date when an attribute changed. This is especially useful
for things pertaining to sector/physical media problems.

As such, I tend to recommend folks using smartd(8) properly tune their
smartd.conf to only monitor specific attributes. This varies from drive
to drive, but the key ones are things like attributes 5, 10, 11, 192,
193, 194 (if you want temperature logging), 196, 197, 198, 199, and 200.
I'm speaking strictly for Western Digital disks here.

The stock defaults, if I remember right, are to "monitor everything",
which really doesn't work well given that so many vendors encode their
RAW_VALUE fields in proprietary/vendor-specific formats. People will
often monitor things like the Hardware_ECC_Recovered attribute and start
"freaking out" one day when the value goes from 0 to 838938239 or
something larger. Attribute data formats are not part of the ATA
standard, so vendors choose how to encode them. Plus, not many admins that
I've run into (honest) know what that attribute actually means
disk-wise (hint: it's 100% normal for sector ECC to happen at all times;
magnetic media is not perfect, that's what the per-sector ECC section is
for!)

However: people don't understand what SMART attribute acquisition
actually does behind the scenes -- it results in the disk having to read
from the HPA area (not user accessible or within LBA regions), which
means seeking + moving the arms to an area, reading, then reporting all
of this back. Thus, it impacts I/O performance. This is why I don't
use smartd(8) on any of our systems. But if I was to use it? I would
have it poll maybe every 120 minutes, rather than every 30. It all
depends on the system/load/etc.. I've seen people poll every 5 minutes
(I think they're absolutely crazy/paranoid). Their systems, their
problem. :-)

Hope this helps.

-- 
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |
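
The "only monitor specific attributes" advice above boils down to comparing two polls of the attribute table and reporting raw-value changes for a short watch list, ignoring everything else. Below is a minimal sketch of that idea in C. It is not how smartd(8) is implemented; `struct smart_attr`, `watched_ids`, `is_watched` and `changed_attrs` are hypothetical names, and the sketch assumes both snapshots list the same attributes in the same order.

```c
#include <assert.h>
#include <stddef.h>

/* One SMART attribute as seen in a single poll (hypothetical layout). */
struct smart_attr {
    int id;
    unsigned long long raw;
};

/* The attribute IDs named in the message above (Western Digital oriented). */
static const int watched_ids[] = { 5, 10, 11, 192, 193, 194, 196, 197, 198, 199, 200 };

static int is_watched(int id)
{
    for (size_t i = 0; i < sizeof(watched_ids) / sizeof(watched_ids[0]); i++)
        if (watched_ids[i] == id)
            return 1;
    return 0;
}

/*
 * Compare two polls of n attributes each and store the IDs of watched
 * attributes whose raw value changed. Returns how many IDs were stored;
 * changed_ids must have room for n entries.
 */
static size_t changed_attrs(const struct smart_attr *prev,
    const struct smart_attr *cur, size_t n, int *changed_ids)
{
    size_t count = 0;

    assert(prev != NULL && cur != NULL && changed_ids != NULL);
    for (size_t i = 0; i < n; i++) {
        if (cur[i].id != prev[i].id)
            continue;   /* snapshots are assumed aligned; skip if not */
        if (is_watched(cur[i].id) && cur[i].raw != prev[i].raw)
            changed_ids[count++] = cur[i].id;
    }
    return count;
}
```

With such a filter, a jump in attribute 195 (Hardware_ECC_Recovered) is ignored, while attribute 197 going from 0 to 5 pending sectors, as reported for ada3, is the one change that gets logged.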
Re: problems with AHCI on FreeBSD 8.2
Thank you Jeremy, I'm already checking your links.

When I installed smartd I configured a daily short test and a weekly
long one for all the drives while the machine remains mostly unused.
From reading the documentation and other info around, I never thought it
could be a problem.

# /usr/local/etc/smartd.conf
/dev/ada0 -a -o on -S on -s (S/../.././03|L/../../2/07)
/dev/ada1 -a -o on -S on -s (S/../.././04|L/../../3/07)
/dev/ada2 -a -o on -S on -s (S/../.././05|L/../../4/07)
/dev/ada3 -a -o on -S on -s (S/../.././06|L/../../5/07)

I'll remove the checks. Do you advise removing the daemon altogether?

On Tue, Feb 14, 2012 at 8:51 PM, Martin Sugioarto wrote:
> On Tue, 14 Feb 2012 20:24:32 +0100, Harald Schmalzbauer wrote:
>
>> I guess it's always the firmware of the EcoGreen models which cause
>> these problems. Your drive isn't EG...
>> I don't remember exactly the different model numbers, but I'm sure
>> they were all EcoGreen. The lower power consumption was the reason to
>> choose these specific drives (different capacities and F2/F3 series
>> tried), with acceptable performance loss - I thought. But it turned
>> out that EcoGreen and NCQ as well as RAIDZ demands don't fit
>> together...
>
> Hi,
>
> I intentionally did not buy any Eco or Green model because I don't like
> them (Load_Cycle_Count bugs and so on). I realized, I like to use 1 Watt
> more power but have the performance doubled.
>
> --
> Martin
Re: problems with AHCI on FreeBSD 8.2
On Tue, Feb 14, 2012 at 08:31:23PM +0100, Oscar Prieto wrote:
> I used to have tons of ahci errors in my 4 disk raidz1 worth of
> HD154UIs when the rig was built a year ago or so (with 8.0 Release),
> but they disappeared after tuning ZFS.
>
> Sadly I also got a new timeout days ago, followed by smartctl errors
> I still have not checked, but I guess they could be legit. I still
> have to test/swap cables and give it a try.

About your ada3 disk:

The below SMART errors indicate your disk does in fact have physical media problems -- 1 confirmed bad sector, and 5 which are "suspect". "Suspect" LBAs are unreadable until writes are issued to them. A write will induce the drive to re-analyse the sector at that LBA and determine if it's truly bad or not. A single LBA can actually take quite a long time to analyse (it depends on what the problem is), and may result in 30+ seconds of delay.

You can either let the drive figure it out over normal usage patterns, or you can do it manually yourself, time permitting. The SMART self-test log on the drive that shows read failures gives you the LBA numbers; try reading from those LBAs first. I can explain this procedure in another thread/offline/whatever. (Does anyone read what I write, re: don't hijack the thread? :-) )

About all of your disks:

All of your disks are undergoing regular/periodic SMART short and long tests. Please stop this; it really, truly does no good. You will experience performance hits during these tests.

About timeouts:

Timeouts seen on the controller and driver level can happen in this situation; this is universal. This is usually what features like Western Digital's TLER and Hitachi + Samsung's CCTL can help alleviate, but not fully solve. I think the ada(4) default timeout of 30 seconds is a decent value, to be quite honest, but I'm not sure what the AHCI driver timeout is. mav@ would need to clue me in, or I'd need to go look at the source. (Right now in my life is not a good time for me to be reviewing source code or looking at commits, sadly. Too much on my mind recently.)

I can discuss the TLER/CCTL stuff more at length if needed, but to be blatantly honest, I would rather not, and here's why: people begin to rely on these features to try and circumvent actual problems with their drives. Phrased differently: people on the Internet become incredibly focused on all of these timeout durations (TLER/CCTL vs. controller vs. driver vs. storage subsystem timeouts) and try to find some bizarre "perfect harmony" between them all. Instead, just leave them all alone and watch your drives for problems.

Further details which pertain to Samsung drives:

In your case, you run smartd(8), which periodically hits the drive with SMART requests, pulling attribute data down and parsing it. I believe your model is fine for this, but for similar Samsung models, I must strongly advise against this. There are well-documented problems with Samsung firmwares and SMART behaviour which can result in data loss (yes, you read that right). Please see smartmontools' Wiki page on the matter for full details. Just make sure you're running a fixed firmware:

http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

Regarding throughput of the drives being slow (30-40MBytes/sec across a gigE link):

This sounds more like a Samba tuning problem, but ZFS raidz isn't known for "amazing speed" per se. Please see a post of mine from a while back on how to tune Samba, which many followed up to with appreciation stating their throughput increased dramatically:

http://lists.freebsd.org/pipermail/freebsd-stable/2011-February/061642.html

I should follow up to that post with the following entry, because I've since updated my own smb.conf to tune things a bit better, and include comments as to the justifications:

#
# The below options increase throughput substantially.  Be aware
# that AIO support requires the aio.ko kernel module loaded,
# and Samba to be built with AIO enabled.  Important notes:
#
# 1) We explicitly disable sendfile(2) because it has known
# problems on ZFS, including resulting in 2x the amount of memory
# used on the machine (VM cache + ZFS cache).  For further details,
# see freebsd-fs or freebsd-stable thread, subject "8.1-STABLE:
# zfs and sendfile: problem still exists".
#
# 2) (2011/10/03) socket options SO_SNDBUF and SO_RCVBUF do not
# appear to matter on FreeBSD, or our sysctls somehow take care of
# this (or maybe AIO?).  The performance is the same with or without
# these two socket options on 8.2-STABLE.
#
# 3) (2011/10/03) My previously-mentioned "aio write behind" option
# is incorrect; see the official smb.conf(5) man page for the syntax.
# It's not a yes/no toggleable, thus serves no purpose.
#
socket options = TCP_NODELAY
use sendfile = no
min receivefile size = 16384
aio read size = 16384
aio write size = 16384

The rest is in the thread I linked. Hope this helps.

-- | Jeremy Chadwick
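On the suggestion above to try reading the LBAs listed in the SMART self-test log: the arithmetic is simply byte offset = LBA x sector size. Below is a minimal Python sketch of a single-sector read, assuming 512-byte logical sectors (as the smartctl output elsewhere in this thread reports for the HD154UI) and a Unix-like system where `os.pread` is available. `read_lba` is a hypothetical helper name, not part of any tool mentioned in the thread.

```python
import os

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def read_lba(device_path: str, lba: int, count: int = 1) -> bytes:
    """Read `count` sectors starting at `lba`, read-only.

    An unreadable sector is expected to surface as an OSError
    (typically EIO) after the drive's own retries run out.
    """
    fd = os.open(device_path, os.O_RDONLY)
    try:
        # Offset and length are both multiples of the sector size,
        # which raw disk devices generally require.
        return os.pread(fd, count * SECTOR_SIZE, lba * SECTOR_SIZE)
    finally:
        os.close(fd)
```

Called as, say, `read_lba("/dev/ada3", lba)` with an LBA taken from the self-test log, this either returns 512 bytes or raises after a possibly long delay. It only reads; writing to a suspect sector, as described above, destroys whatever was stored there and is a separate, deliberate step.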
Re: problems with AHCI on FreeBSD 8.2
On Tue, 14 Feb 2012 20:24:32 +0100,
Harald Schmalzbauer wrote:

> I guess it's always the firmware of the EcoGreen models which causes
> these problems. Your drive isn't EG...
> I don't remember exactly the different model numbers, but I'm sure
> they were all EcoGreen. The lower power consumption was the reason to
> choose these specific drives (different capacities and F2/F3 series
> tried), with acceptable performance loss - I thought. But it turned
> out that EcoGreen and NCQ as well as RAIDZ demands don't fit
> together...

Hi,

I intentionally did not buy any Eco or Green model because I don't like them (Load_Cycle_Count bugs and so on). I realized I would rather use 1 Watt more power and have the performance doubled.

--
Martin
Re: problems with AHCI on FreeBSD 8.2
Martin Sugioarto wrote on 14.02.2012 19:23 (localtime):
> On Tue, 14 Feb 2012 18:17:19 +0100,
> Harald Schmalzbauer wrote:
>
>>> I find it interesting that, at least so far, the only people
>>> reporting problems of this type with the ahci.ko driver are people
>>> using Samsung disks. The only difference is that your models are
>>> F1s while the OP's are F2s.
>> I saw such timeouts long ago and mav@ had a look at my postings and he
>> mentioned it could be an NCQ problem.
>> I suspected the disks' firmware.
>> I never tracked it down further, because replacing the Samsung
>> (F3 in that case) disks with Hitachi ones solved all my problems and
>> gave a big performance kick as well (with zfs).
>> You can find the discussion here:
>> http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html
> Hi,
>
> I just want to add here that I am using 2 drives of type "Samsung
> HD103SJ" (SpinPoint F3). And I did not have problems with ZFS or with
> UFS either (for several years now). Everything has been deployed on top
> of ada(4) since FreeBSD-8.
>
> Actually the speed is very good (sequential read at 140 MB/s and more).

I guess it's always the firmware of the EcoGreen models which causes these problems. Your drive isn't EG...
I don't remember exactly the different model numbers, but I'm sure they were all EcoGreen. The lower power consumption was the reason to choose these specific drives (different capacities and F2/F3 series tried), with acceptable performance loss - I thought. But it turned out that EcoGreen and NCQ as well as RAIDZ demands don't fit together...

-Harry
Re: problems with AHCI on FreeBSD 8.2
On Tue, 14 Feb 2012 18:17:19 +0100,
Harald Schmalzbauer wrote:

> > I find it interesting that, at least so far, the only people
> > reporting problems of this type with the ahci.ko driver are people
> > using Samsung disks. The only difference is that your models are
> > F1s while the OP's are F2s.
>
> I saw such timeouts long ago and mav@ had a look at my postings and he
> mentioned it could be an NCQ problem.
> I suspected the disks' firmware.
> I never tracked it down further, because replacing the Samsung
> (F3 in that case) disks with Hitachi ones solved all my problems and
> gave a big performance kick as well (with zfs).
> You can find the discussion here:
> http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html

Hi,

I just want to add here that I am using 2 drives of type "Samsung HD103SJ" (SpinPoint F3). And I did not have problems with ZFS or with UFS either (for several years now). Everything has been deployed on top of ada(4) since FreeBSD-8.

Actually the speed is very good (sequential read at 140 MB/s and more).

--
Martin
Re: problems with AHCI on FreeBSD 8.2
Jeremy Chadwick wrote on 14.02.2012 17:50 (localtime):
> On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote:
>> Hello,
>>
>> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still
>> persists on FreeBSD 9.0 release.
>>
>> Switching from ahci to ataahci resolved the problem for me too.
>>
>> I'm using gmirror for swap, system is on a zpool and the problem first
>> occurred during a zpool scrub, but it is easily reproducible with dd.
>>
>> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1}
>> of=/dev/null is not an issue.
>> Sometimes I need to power off the server because after a reboot one disk
>> is still missing.
>>
>> I really would like to help in this issue, so let me know if you need
>> any more information.
> I find it interesting that, at least so far, the only people reporting
> problems of this type with the ahci.ko driver are people using Samsung
> disks. The only difference is that your models are F1s while the OP's
> are F2s.

I saw such timeouts long ago and mav@ had a look at my postings and he mentioned it could be an NCQ problem.
I suspected the disks' firmware.
I never tracked it down further, because replacing the Samsung (F3 in that case) disks with Hitachi ones solved all my problems and gave a big performance kick as well (with zfs).
You can find the discussion here:
http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html

JFI

-Harry
Re: problems with AHCI on FreeBSD 8.2
On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote:
> Hello,
>
> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still
> persists on FreeBSD 9.0 release.
>
> Switching from ahci to ataahci resolved the problem for me too.
>
> I'm using gmirror for swap, system is on a zpool and the problem first
> occurred during a zpool scrub, but it is easily reproducible with dd.
>
> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1}
> of=/dev/null is not an issue.
> Sometimes I need to power off the server because after a reboot one disk
> is still missing.
>
> I really would like to help in this issue, so let me know if you need
> any more information.

I find it interesting that, at least so far, the only people reporting problems of this type with the ahci.ko driver are people using Samsung disks. The only difference is that your models are F1s while the OP's are F2s.

The only difference I can think of is that the ahci.ko driver may have more strict timeouts than the ata driver (ata driver includes ataahci; ataahci.ko != ahci.ko, as you know). You may be able to adjust these using loader.conf variables:

  kern.cam.ada.default_timeout
  kern.cam.ada.retry_count

I also imagine that hint.ahci.X.ccc might have some involvement here, but it's something I am not familiar with. mav@ would need to comment on this -- it's outside of my familiarity scope.

Furthermore, in your case, your ada1 disk has serious CRC-related problems, and your ada0 disk has seen similar, just at a much lower rate. ada1 should probably be replaced (along with cables, dusting out SATA ports, etc.), but keeping ada0 is probably fine.

The statistics for these are shown in the "smartctl -l sataphy" output, field labelled ID 0x0001, "Command failed due to ICRC error". These are SATA-level problems or physical problems which will manifest themselves as anomalies during any kind of I/O. The counters shown in ID 0x000a and 0x0009 are completely fine; these don't indicate any problems.

Your drives don't support GP log region 0x04, which is why "smartctl -l devstat" returns the errors it does. The errors you see coming from the kernel in this situation are 100% okay/acceptable; the drive itself is literally returning ABRT status to the inquiry submitted to it. Different drives from different vendors behave differently in this regard.

So, what I'm trying to say is, your problem looks different from the OP's. Let's not start a big "I have this problem too" thread; that has happened so many times over the years that when it happens I immediately bow out + stop participating in the thread.

> smartctl -l sataphy /dev/ada0
>
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x000a  2          150  Device-to-host register FISes sent due to a COMRESET
> 0x0001  2            3  Command failed due to ICRC error
> 0x0009  2          173  Transition from drive PhyRdy to drive PhyNRdy
>
> smartctl -l sataphy /dev/ada1
>
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x000a  2          155  Device-to-host register FISes sent due to a COMRESET
> 0x0001  2       65535+  Command failed due to ICRC error
> 0x0009  2          178  Transition from drive PhyRdy to drive PhyNRdy

--
| Jeremy Chadwick                                 j...@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
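To make the comparison of the two sataphy tables less error-prone, the counters can be pulled out mechanically. What follows is a small Python sketch, not a smartmontools feature: `parse_sataphy` and the regular expression are invented for illustration, the sample text is the ada1 table from the message above with its columns separated, and it assumes that a trailing "+" on a value means the counter has reached its maximum.

```python
import re

# One row: ID (hex), size in bytes, value with optional "+", description.
ROW = re.compile(r"^(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)(\+?)\s+(.+)$")


def parse_sataphy(text: str) -> dict[int, tuple[int, bool, str]]:
    """Map counter ID -> (value, saturated, description)."""
    counters = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:  # header and blank lines simply do not match
            ident, _size, value, plus, desc = m.groups()
            counters[int(ident, 16)] = (int(value), plus == "+", desc)
    return counters


SAMPLE_ADA1 = """\
ID      Size     Value  Description
0x000a  2          155  Device-to-host register FISes sent due to a COMRESET
0x0001  2       65535+  Command failed due to ICRC error
0x0009  2          178  Transition from drive PhyRdy to drive PhyNRdy
"""

if __name__ == "__main__":
    value, saturated, desc = parse_sataphy(SAMPLE_ADA1)[0x0001]
    print(f"{desc}: {value}{' (saturated)' if saturated else ''}")
```

A saturated 2-byte ICRC counter on ada1 against a value of 3 on ada0 is the contrast the advice above rests on.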
Re: problems with AHCI on FreeBSD 8.2
Hello,

I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still persists on FreeBSD 9.0 release.

Switching from ahci to ataahci resolved the problem for me too.

I'm using gmirror for swap, system is on a zpool and the problem first occurred during a zpool scrub, but it is easily reproducible with dd.

The timeouts only occur when writing to disks, dd if=/dev/ada{0|1} of=/dev/null is not an issue.
Sometimes I need to power off the server because after a reboot one disk is still missing.

I really would like to help in this issue, so let me know if you need any more information.

--
Claudius

dmesg:
--cut--
Jan 14 01:33:57 server kernel: ahcich0: Timeout on slot 7 port 0
Jan 14 01:33:57 server kernel: ahcich0: is cs 0080 ss rs 0080 tfd c0 serr cmd 0004c717
Jan 14 01:33:57 server kernel: ahcich1: Timeout on slot 31 port 0
Jan 14 01:33:57 server kernel: ahcich1: is cs 8000 ss rs 8000 tfd c0 serr cmd 0004df17
Jan 14 01:33:57 server kernel: ahcich0: Timeout on slot 7 port 0
Jan 14 01:33:57 server kernel: ahcich0: is cs f800 ss ff80 rs ff80 tfd c0 serr cmd 0004cb17
Jan 14 01:33:57 server kernel: ahcich1: Timeout on slot 31 port 0
Jan 14 01:33:57 server kernel: ahcich1: is cs 00f8 ss 80ff rs 80ff tfd c0 serr cmd 0004c317
Jan 14 01:33:57 server kernel: ahcich0: Timeout on slot 23 port 0
Jan 14 01:33:57 server kernel: ahcich0: is cs 0180 ss rs 0180 tfd c0 serr cmd 0004d717
Jan 14 01:33:57 server kernel: ahcich1: Timeout on slot 15 port 0
Jan 14 01:33:57 server kernel: ahcich1: is cs 00018000 ss rs 00018000 tfd c0 serr cmd 0004cf17
Jan 14 01:33:57 server kernel: ahcich1: Timeout on slot 17 port 0
Jan 14 01:33:57 server kernel: ahcich1: is cs 01f8 ss 01fe rs 01fe tfd c0 serr cmd 0004d317
Jan 14 01:33:57 server kernel: ahcich0: AHCI reset: device not ready after 31000ms (tfd = 0080)
Jan 14 01:33:57 server kernel: ahcich1: Timeout on slot 31 port 0
Jan 14 01:33:57 server kernel: ahcich1: is cs 8000 ss rs 8000 tfd c0 serr cmd 0004df17
Jan 14 01:33:57 server kernel: ahcich0: Timeout on slot 24 port 0
--cut--

smartctl -a /dev/ada0
smartctl 5.42 2011-10-20 r3458 [FreeBSD 9.0-RELEASE amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F1 DT
Device Model:     SAMSUNG HD753LJ
Serial Number:    S13UJDWS900110
LU WWN Device Id: 5 0024e9 0020d1bfa
Firmware Version: 1AA01118
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Tue Feb 14 16:32:58 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 9429) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 158) minutes.
Conveyance self-test routine
recommended polling time:        (  17) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
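With this many timeout lines in a single second, a per-channel tally is easier to read than the raw dmesg excerpt above. Here is a short Python sketch; `count_timeouts` is a hypothetical helper and the pattern is based only on the "ahcichN: Timeout on slot N" wording seen in this thread.

```python
import re
from collections import Counter

TIMEOUT = re.compile(r"(ahcich\d+): Timeout on slot (\d+)")


def count_timeouts(log_text: str) -> Counter:
    """Count 'Timeout on slot' messages per AHCI channel."""
    return Counter(m.group(1) for m in TIMEOUT.finditer(log_text))


def slots_by_channel(log_text: str) -> dict[str, set[int]]:
    """Collect the distinct command slots that timed out on each channel."""
    slots: dict[str, set[int]] = {}
    for m in TIMEOUT.finditer(log_text):
        slots.setdefault(m.group(1), set()).add(int(m.group(2)))
    return slots


SAMPLE = """\
Jan 14 01:33:57 server kernel: ahcich0: Timeout on slot 7 port 0
Jan 14 01:33:57 server kernel: ahcich1: Timeout on slot 31 port 0
Jan 14 01:33:57 server kernel: ahcich0: Timeout on slot 7 port 0
Jan 14 01:33:57 server kernel: ahcich1: Timeout on slot 15 port 0
"""

if __name__ == "__main__":
    print(count_timeouts(SAMPLE))
    print(slots_by_channel(SAMPLE))
```

Both channels showing timeouts across several different slots, as in the excerpt, says the stalls are not tied to one disk or one queued command; it does not by itself say whether the drives, the controller, or the driver is at fault.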
Re: problems with AHCI on FreeBSD 8.2
On Tue, Feb 14, 2012 at 06:16:01AM -0800, Jeremy Chadwick wrote:
[..]
>
> Thanks. Both your drives look overall fine, sort-of. I'll outline my
> concern points, and ask for some more info:
>
> * ada0 has 28 CRC errors, while ada1 has 2. These drives have been in
> use for 4688 hours and 4583 hours (respectively), which is roughly 6
> months for each drive. CRC errors usually result in transparent
> retransmits, but this can sometimes cause I/O delays (especially if the
> CRC errors are repeated).
>
> If the timeout messages recur in the future, please run the commands I
> gave you above once more and provide the output. I can then compare the
> old to the new and see if there is anything of interest.

I can force the error any time I want. It's 100% reproducible in my environment, so I'll do the tests and send you the smartctl -a output again.

>
> * Both drives had 2 long tests run on them a few days ago ("Extended
> offline" tests). Did you induce these manually? If so, were these
> tests running at the time you witnessed AHCI timeout errors on ada0?
> Short, long, and selective surface scan tests are supposed to be
> non-intrusive, but given the nature of the tests sometimes they can
> stall the I/O subsystem.

I ran the tests, but they were not running during the timeout problems. The only thing running on the disks was a newfs -J on a gjournal partition. Otherwise they're mostly idle.

>
> If you do tests of this nature, you should write down the exact
> dates/times when you ran them (at least from now on).
>
> If you didn't induce these, something must have, or possibly the drive
> itself did it (and if that's the case, convenient that it induces an
> entry in the self-test log!).
>
> I do have some familiarity with drives doing internal tests -- the best
> example are old IBM Deskstar drives executing ADM on their own,
> resulting in the drives spinning down and performing internal tests,
> which would subsequently be interrupted by ATA I/O, drive spins back up,
> etc. -- but took too long resulting in ATA timeouts on FreeBSD and
> Linux. I mailed IBM about this back in 2000 and got confirmation of the
> feature (which was also on their SCSI drives but defaulted to off); the
> feature was mysteriously removed in future drive models and still
> remains gone today:
>
> http://jdc.parodius.com/freebsd/ibm_email_aware_of_adm.txt
>
> I'm not saying your drives do this. I'm simply saying that if there is
> some form of automated test that runs on these drives which is
> transparent to the underlying ATA layer, then there is really nothing
> you can do about it, and timeouts are possible. The IBM ADM issue was
> only discovered after reviewing technical specifications/documentation
> and compared to their SCSI drives.

That's of course possible, but as the problem is 100% reproducible with the AHCI driver and is not with the ata driver, I guess this time it is not the drive's fault. We've also tested replacement disks and cables during the previous days. I guess the problem is some bad interaction with the AHCI driver.

>
> * Samsung has a notoriously bad reputation for firmware reliability on
> their SpinPoint drives, but I haven't read of anything bad about the F2
> series, just the F1, F3, and F4 models. I have very little (almost
> none) experience with these drives. I'm not boycotting their products,
> but I wouldn't be surprised if the timeout errors you saw were caused by
> something internal the drive was doing. There is absolutely zero
> visibility into this kind of problem on any layer (even if you had an
> ATA protocol analyser hooked up); you're completely at the mercy of the
> firmware. Just something to keep in mind when working with ANY kind of
> disk (MHDD, SSD, etc.).

I've seen reports on the FreeBSD lists and the smartmontools wiki about firmware problems with F4 drives manufactured before December 2010, but judging by Samsung's web page, it seems these drives are not affected. I hope we're not hitting a new bug.

More info:
http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

>
> All that said, could you please provide output from the following
> commands as well? These may return "not supported" errors, which is
> acceptable, but we have to check.
>
> * smartctl -l devstat /dev/ada0
> * smartctl -l sataphy /dev/ada0
> * smartctl -l devstat /dev/ada1
> * smartctl -l sataphy /dev/ada1
>

Thanks a lot for your help, Jeremy.

Attached is the output of the commands:

fe09# smartctl -l devstat /dev/ada0
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-STABLE amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

(pass0:ahcich0:0:0:0): READ_LOG_EXT. ACB: 2f 00 04 00 00 40 00 00 00 00 01 00
(pass0:ahcich0:0:0:0): CAM status: ATA Status Error
(pass0:ahcich0:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
(pass0:ahcich0:0:0:0): RES: 51 04 04 00 00 40 00 00 00 01 00
ATA_READ_LOG_EXT (addr=0x04:0x00, page=0, n=1) failed: Unknown
Re: problems with AHCI on FreeBSD 8.2
On Tue, Feb 14, 2012 at 02:54:35PM +0100, Victor Balada Diaz wrote:
> On Tue, Feb 14, 2012 at 02:05:13AM -0800, Jeremy Chadwick wrote:
> > On Tue, Feb 14, 2012 at 10:19:09AM +0100, Victor Balada Diaz wrote:
> > > We're having some troubles with AHCI under FreeBSD 8.2 and 8-STABLE. The
> > > error is:
> > >
> > > ahcich0: Timeout on slot 8
> > > ahcich0: is cs 0100 ss rs 0100 tfd c0 serr
> > > ahcich0: AHCI reset...
> > > ahcich0: SATA connect time=0ms status=0123
> > > ahcich0: ready wait time=18ms
> > > ahcich0: AHCI reset done: device found
> > > (ada0:ahcich0:0:0:0): Request requeued
> > > (ada0:ahcich0:0:0:0): Retrying command
> > > (ada0:ahcich0:0:0:0): Command timed out
> > > (ada0:ahcich0:0:0:0): Retrying command
> > > ahcich0: Timeout on slot 8
> > > ahcich0: is cs 007ff000 ss 007fff00 rs 007fff00 tfd c0 serr
> > > ahcich0: AHCI reset...
> > > ahcich0: SATA connect time=0ms status=0123
> > > ahcich0: ready wait time=84ms
> > > ahcich0: AHCI reset done: device found
> > > (ada0:ahcich0:0:0:0): Request requeued
> > > (ada0:ahcich0:0:0:0): Retrying command
> > > (ada0:ahcich0:0:0:0): Command timed out
> > > (ada0:ahcich0:0:0:0): Retrying command
> > > (ada0:ahcich0:0:0:0): Request requeued
> > > [...]
> > >
> > > If we use old ATA driver we have no problems. If we just use the first
> > > disk (ada0) with ahci, no problems either. If we use both disks (ada0
> > > and ada1) in gmirror setup with ahci, we got the above error. If we use
> > > both disks in gmirror with old ata driver, no problems.
> >
> > Please provide SMART statistics for both disks by installing
> > ports/sysutils/smartmontools (5.42 or newer please) and running
> > "smartctl -a" against both disks (ada0/ada1, or ad4/ad10 -- doesn't
> > matter which driver you're using). I will review the output.
>
> Just forgot to say that from time to time, after system hangs and i need
> to reboot, one of the disks is lost. It doesn't even show after a few reboots,
> nor on Linux live system.
>
> You can see smartctl output here:
>
> ada0:
>
> # smartctl -a /dev/ada0
> smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-STABLE amd64] (local build)
> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF INFORMATION SECTION ===
> Model Family:     SAMSUNG SpinPoint F2 EG
> Device Model:     SAMSUNG HD154UI
> Serial Number:    S24EJ9BB200080
> LU WWN Device Id: 5 0024e9 2047cb78f
> Firmware Version: 1AG01118
> User Capacity:    1,500,301,910,016 bytes [1.50 TB]
> Sector Size:      512 bytes logical/physical
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 3b
> Local Time is:    Tue Feb 14 13:51:18 2012 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x00) Offline data collection activity
>                                         was never started.
>                                         Auto Offline Data Collection: Disabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                 (18863) seconds.
> Offline data collection
> capabilities:                    (0x7b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   2) minutes.
> Extended self-test routine
> recommended polling time:        ( 255) minutes.
> Conveyance self-test routine
> recommended polling time:        (  33) minutes.
> SCT capabilities:              (0x003f) SCT Status supported.
>                                         SCT Error Recovery Control supported.
>                                         SCT Feature Control supported.
>
Re: problems with AHCI on FreeBSD 8.2
On Tue, Feb 14, 2012 at 02:05:13AM -0800, Jeremy Chadwick wrote:
> On Tue, Feb 14, 2012 at 10:19:09AM +0100, Victor Balada Diaz wrote:
> > We're having some troubles with AHCI under FreeBSD 8.2 and 8-STABLE. The
> > error is:
> >
> > ahcich0: Timeout on slot 8
> > ahcich0: is cs 0100 ss rs 0100 tfd c0 serr
> > ahcich0: AHCI reset...
> > ahcich0: SATA connect time=0ms status=0123
> > ahcich0: ready wait time=18ms
> > ahcich0: AHCI reset done: device found
> > (ada0:ahcich0:0:0:0): Request requeued
> > (ada0:ahcich0:0:0:0): Retrying command
> > (ada0:ahcich0:0:0:0): Command timed out
> > (ada0:ahcich0:0:0:0): Retrying command
> > ahcich0: Timeout on slot 8
> > ahcich0: is cs 007ff000 ss 007fff00 rs 007fff00 tfd c0 serr
> > ahcich0: AHCI reset...
> > ahcich0: SATA connect time=0ms status=0123
> > ahcich0: ready wait time=84ms
> > ahcich0: AHCI reset done: device found
> > (ada0:ahcich0:0:0:0): Request requeued
> > (ada0:ahcich0:0:0:0): Retrying command
> > (ada0:ahcich0:0:0:0): Command timed out
> > (ada0:ahcich0:0:0:0): Retrying command
> > (ada0:ahcich0:0:0:0): Request requeued
> > [...]
> >
> > If we use old ATA driver we have no problems. If we just use the first
> > disk (ada0) with ahci, no problems either. If we use both disks (ada0
> > and ada1) in gmirror setup with ahci, we got the above error. If we use
> > both disks in gmirror with old ata driver, no problems.
>
> Please provide SMART statistics for both disks by installing
> ports/sysutils/smartmontools (5.42 or newer please) and running
> "smartctl -a" against both disks (ada0/ada1, or ad4/ad10 -- doesn't
> matter which driver you're using). I will review the output.

I forgot to say that from time to time, after the system hangs and I need to reboot, one of the disks is lost. It doesn't show up even after a few reboots, nor on a Linux live system.

You can see smartctl output here:

ada0:

# smartctl -a /dev/ada0
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-STABLE amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F2 EG
Device Model:     SAMSUNG HD154UI
Serial Number:    S24EJ9BB200080
LU WWN Device Id: 5 0024e9 2047cb78f
Firmware Version: 1AG01118
User Capacity:    1,500,301,910,016 bytes [1.50 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Tue Feb 14 13:51:18 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (18863) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  33) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f
Re: problems with AHCI on FreeBSD 8.2
On Tue, Feb 14, 2012 at 10:19:09AM +0100, Victor Balada Diaz wrote:
> We're having some troubles with AHCI under FreeBSD 8.2 and 8-STABLE. The
> error is:
>
> ahcich0: Timeout on slot 8
> ahcich0: is cs 0100 ss rs 0100 tfd c0 serr
> ahcich0: AHCI reset...
> ahcich0: SATA connect time=0ms status=0123
> ahcich0: ready wait time=18ms
> ahcich0: AHCI reset done: device found
> (ada0:ahcich0:0:0:0): Request requeued
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): Command timed out
> (ada0:ahcich0:0:0:0): Retrying command
> ahcich0: Timeout on slot 8
> ahcich0: is cs 007ff000 ss 007fff00 rs 007fff00 tfd c0 serr
> ahcich0: AHCI reset...
> ahcich0: SATA connect time=0ms status=0123
> ahcich0: ready wait time=84ms
> ahcich0: AHCI reset done: device found
> (ada0:ahcich0:0:0:0): Request requeued
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): Command timed out
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): Request requeued
> [...]
>
> If we use old ATA driver we have no problems. If we just use the first
> disk (ada0) with ahci, no problems either. If we use both disks (ada0
> and ada1) in gmirror setup with ahci, we got the above error. If we use
> both disks in gmirror with old ata driver, no problems.

Please provide SMART statistics for both disks by installing ports/sysutils/smartmontools (5.42 or newer please) and running "smartctl -a" against both disks (ada0/ada1, or ad4/ad10 -- doesn't matter which driver you're using). I will review the output.

--
| Jeremy Chadwick                                 j...@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
Re: problems with AHCI on FreeBSD 8.2
On 02/14/12 11:19, Victor Balada Diaz wrote:
> We're having some troubles with AHCI under FreeBSD 8.2 and 8-STABLE. The
> error is:
>
> ahcich0: Timeout on slot 8
> ahcich0: is cs 0100 ss rs 0100 tfd c0 serr
> ahcich0: AHCI reset...
> ahcich0: SATA connect time=0ms status=0123
> ahcich0: ready wait time=18ms
> ahcich0: AHCI reset done: device found
> (ada0:ahcich0:0:0:0): Request requeued
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): Command timed out
> (ada0:ahcich0:0:0:0): Retrying command
> ahcich0: Timeout on slot 8
> ahcich0: is cs 007ff000 ss 007fff00 rs 007fff00 tfd c0 serr
> ahcich0: AHCI reset...
> ahcich0: SATA connect time=0ms status=0123
> ahcich0: ready wait time=84ms
> ahcich0: AHCI reset done: device found
> (ada0:ahcich0:0:0:0): Request requeued
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): Command timed out
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): Request requeued
> [...]
>
> If we use old ATA driver we have no problems. If we just use the first
> disk (ada0) with ahci, no problems either. If we use both disks (ada0
> and ada1) in gmirror setup with ahci, we got the above error. If we use
> both disks in gmirror with old ata driver, no problems.

In both cases the controller reports the command status as 0xc0, which means the device is busy with the command. For NCQ commands it means that the device is in the stage of processing the command itself, not head positioning or data transfer.

Enabling AHCI enables NCQ for the devices. That increases the load on both the devices and the controller, and it is difficult to say whose fault it is here.

SAMSUNG HD154UI disks, AFAIR, have 4k sectors, which may carry big performance penalties when accessing small/misaligned data. I am not sure how big that penalty can be in the worst case, especially since disks by default cache writes, hiding the real load level.

The relation to gmirror is harder to explain. Depending on how you created it and the partitions, it could cause more misaligned I/Os during rebuild. Using gmirror also doubles the concurrent load on the controller, but at this point I have nothing to blame it for.

--
Alexander Motin
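The "tfd c0" value discussed above can be decoded by hand. In the AHCI task file data register the low byte is the ATA status and the next byte is the ATA error, so 0xc0 is BSY (0x80) plus DRDY (0x40) with no error bits set. The Python sketch below shows the decoding; `decode_tfd` is a hypothetical helper, the name SERV for bit 4 follows the FreeBSD output quoted in this thread ("51 (DRDY SERV ERR)"), and the names used for bits 1 and 2 follow older ATA documents and should be treated as assumptions.

```python
# ATA status register bits, most significant first.
ATA_STATUS_BITS = [
    (0x80, "BSY"),   # device busy
    (0x40, "DRDY"),  # device ready
    (0x20, "DF"),    # device fault
    (0x10, "SERV"),  # service (named DSC in older documents)
    (0x08, "DRQ"),   # data request
    (0x04, "CORR"),  # assumption: legacy name, obsolete in newer specs
    (0x02, "IDX"),   # assumption: legacy name, obsolete in newer specs
    (0x01, "ERR"),   # error, details are in the error byte
]


def decode_tfd(tfd: int) -> tuple[list[str], int]:
    """Split a tfd value into (status flag names, error byte)."""
    status = tfd & 0xFF
    error = (tfd >> 8) & 0xFF
    names = [name for bit, name in ATA_STATUS_BITS if status & bit]
    return names, error


if __name__ == "__main__":
    print(decode_tfd(0xC0))    # the value from the timeout messages
    print(decode_tfd(0x0451))  # status 51, error 04, as in the devstat log
```

For the timeouts in this thread that gives BSY and DRDY with an error byte of zero: the drive had accepted the command and was still working on it when the driver gave up, which is what the explanation above describes.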