Re: problems with AHCI on FreeBSD 8.2

2012-02-16 Thread Oscar Prieto
Yesterday I backed up the sensitive data in the pool and decided
to just break stuff on purpose ;)

I used dd to write over the sector that smartctl marked as faulty and
ran a short smartctl test. I repeated the process several times
until smartctl gave no errors at all on ada3.
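
For reference, overwriting a single sector with dd comes down to using the sector's LBA as the seek offset. Below is a minimal Python sketch that only builds the command line and does not run anything. The LBA 123456 is a made-up placeholder (use the one smartctl reports), the helper names are hypothetical, and the 512-byte sector size is an assumption that matches what smartctl reports for these drives. Writing zeros destroys whatever was stored in that sector.

```python
def dd_zero_sector_command(device, lba, sector_size=512):
    """Build (but do not run) a dd command that zeroes exactly one sector."""
    if lba < 0:
        raise ValueError("LBA must be non-negative")
    return [
        "dd",
        "if=/dev/zero",
        "of={}".format(device),
        "bs={}".format(sector_size),  # one logical sector per block
        "seek={}".format(lba),        # skip this many output blocks first
        "count=1",                    # write a single block
    ]


def byte_offset(lba, sector_size=512):
    """Byte position of the sector on the device."""
    return lba * sector_size


# Hypothetical LBA; the device name is the one from this post.
print(" ".join(dd_zero_sector_command("/dev/ada3", 123456)))
print(byte_offset(123456))
```

Building the argument list separately makes it easy to double-check the device and offset before anything touches the disk.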

After that I left the pool running a scrub, and it seems to have repaired the
integrity of the pool:
--
[root@zaibach ~]# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 398K in 10h39m with 0 errors on Thu Feb 16 09:15:59 2012
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada2p1  ONLINE       0     0     0
            ada1p1  ONLINE       0     0     0
            ada3p1  ONLINE       0     0    11
            ada0p1  ONLINE       0     0     0
-
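
For anyone who wants to watch for this automatically, here is a rough Python sketch that scans the config table of a `zpool status` report for rows with non-zero READ/WRITE/CKSUM counters. The sample text is the table from the output above, spaced out by hand. The function name is made up for illustration, and abbreviated counters (for example with a K suffix) are deliberately skipped rather than parsed.

```python
SAMPLE = """\
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada2p1  ONLINE       0     0     0
            ada1p1  ONLINE       0     0     0
            ada3p1  ONLINE       0     0    11
            ada0p1  ONLINE       0     0     0
"""


def devices_with_errors(config_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    for line in config_text.splitlines():
        fields = line.split()
        # Data rows have exactly five columns; skip the header row.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name = fields[0]
        try:
            read, write, cksum = (int(value) for value in fields[2:])
        except ValueError:
            continue  # abbreviated counters are not handled in this sketch
        if read or write or cksum:
            flagged[name] = (read, write, cksum)
    return flagged


print(devices_with_errors(SAMPLE))
```

With the table above, only ada3p1 is flagged, with its 11 checksum errors.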

Funnily enough, I got an AHCI timeout on another drive, /dev/ada2.
-
Feb 16 04:08:23 zaibach kernel: ahcich2: Timeout on slot 15 port 0
Feb 16 04:08:23 zaibach kernel: ahcich2: is  cs 0004 ss 00078000 rs 00078000 tfd c0 serr  cmd 0004d217
---
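
The hex fields in that message are 32-bit masks with one bit per command slot, so they can be decoded into slot numbers. A small Python sketch follows. My understanding, which I have not checked against the driver source, is that `ss` mirrors the port's SATA active register and `rs` is the driver's own record of running slots. Several fields in the archived line look truncated, so only the complete value 00078000 is decoded here. The function name is hypothetical.

```python
def slots_from_mask(mask_hex):
    """Return the slot numbers whose bits are set in a 32-bit hex mask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(32) if mask & (1 << bit)]


# The ss/rs value from the timeout message above.
print(slots_from_mask("00078000"))
```

This yields slots 15 through 18, which includes slot 15 named in the "Timeout on slot 15" line, so four queued commands were outstanding when the timeout fired.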

At least a short smartctl test on /dev/ada2 doesn't seem to complain this time.

On Thu, Feb 16, 2012 at 5:48 AM, John  wrote:
> Jeremy Chadwick wrote:
>>
>> CRC errors ...
>>
>>I have no real advice for tracking this kind of problem down.  The most
>>common response is "replace cables", which isn't necessarily the root
>>cause.  I have no advice or tips on how to track down interference
>>issues, or how to truly examine a disk PCB or controller PCB for the
>>latter item.  "Flaky traces" on a PCB could cause this sort of thing.
>>Folks in the EE field would know more about these issues; I am not an EE
>>person.
>>
>>Since the attribute increased on both drives simultaneously (I have to
>>assume simultaneously?), it's more likely that the problem is not with
>>SATA cables or the drives but the controller on the motherboard.  I'd
>>recommend replacing the motherboard.  I make no guarantees this will fix
>>anything however, but it is the "common point" for both of your drives.
>
> This EE agrees with your advice. I would add that if replacing the
> motherboard fails to fix the problem, then replace the power supply. Even
> with extremely high-end test equipment, you likely would never be able to
> see the failure occur, for at least two reasons: the most likely failure
> mode is inside a single IC, and adding probes would alter the environment
> enough to change the failure mode.
>
> John Theus
> TheUs Group
> TheUsGroup.com
> ___
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: problems with AHCI on FreeBSD 8.2

2012-02-15 Thread John
Jeremy Chadwick wrote:
>
> CRC errors ...
>
>I have no real advice for tracking this kind of problem down.  The most
>common response is "replace cables", which isn't necessarily the root
>cause.  I have no advice or tips on how to track down interference
>issues, or how to truly examine a disk PCB or controller PCB for the
>latter item.  "Flaky traces" on a PCB could cause this sort of thing.
>Folks in the EE field would know more about these issues; I am not an EE
>person.
>
>Since the attribute increased on both drives simultaneously (I have to
>assume simultaneously?), it's more likely that the problem is not with
>SATA cables or the drives but the controller on the motherboard.  I'd
>recommend replacing the motherboard.  I make no guarantees this will fix
>anything however, but it is the "common point" for both of your drives.

This EE agrees with your advice. I would add that if replacing the motherboard
fails to fix the problem, then replace the power supply. Even with extremely
high-end test equipment, you likely would never be able to see the failure
occur, for at least two reasons: the most likely failure mode is inside a
single IC, and adding probes would alter the environment enough to change the
failure mode.

John Theus
TheUs Group
TheUsGroup.com


Re: problems with AHCI on FreeBSD 8.2

2012-02-15 Thread Jeremy Chadwick
On Wed, Feb 15, 2012 at 07:17:57PM +0100, Victor Balada Diaz wrote:
> On Tue, Feb 14, 2012 at 06:16:01AM -0800, Jeremy Chadwick wrote:
> > Thanks.  Both your drives look overall fine, sort-of.  I'll outline my
> > concern points, and ask for some more info:
> > 
> > * ada0 has 28 CRC errors, while ada1 has 2.  These drives have been in
> > use for 4688 hours and 4583 hours (respectively), which is roughly 6
> > months for each drive.  CRC errors usually result in transparent
> > retransmits, but this can sometimes cause I/O delays (especially if the
> > CRC errors are repeated).
> > 
> > If the timeout messages recur in the future, please run the commands I
> > gave you above once more and provide the output.  I can then compare the
> > old to the new and see if there is anything of interest.
> 
> I've made it fail again. You can see the smartctl -a output. CRC errors are
> increasing, but I'm not sure what that really means. Is the HD broken? Both?
> At the same time?

CRC errors indicate one of the following, in no particular order:

* Physical cabling problems (number of reasons/possibilities here are
  too many to list)
* Dirty/dusty SATA connectors (cables/drive/host controller)
* Electrical interference (badly shielded cables, etc.)
* Physical electronic/electrical problems (disk PCB, host controller
  PCB, etc.)

The important thing to remember about CRCs is that they indicate a
hardware-level problem between the host controller and the controller
chip on the drive.  They do not indicate problems with the drive's cache
(those are tracked in attribute 184), and they do not indicate
software-level problems (e.g. driver bugs, etc.).
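
Comparing an old and a new SMART snapshot for growth in the CRC counter can be scripted. The Python sketch below makes some assumptions: it expects one attribute per line in the ten-column layout smartctl prints, and the two sample rows are made up for illustration. Only the count of 28 comes from this thread. The attribute ID 199 and name UDMA_CRC_Error_Count are what smartmontools normally shows for the interface CRC counter, and the function names are hypothetical.

```python
def parse_attributes(smartctl_text):
    """Map attribute ID -> (name, raw value) from smartctl attribute rows."""
    attrs = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        if raw.isdigit():
            attrs[int(fields[0])] = (fields[1], int(raw))
    return attrs


def increased(old_text, new_text):
    """Return {name: (old_raw, new_raw)} for attributes whose raw value grew."""
    old = parse_attributes(old_text)
    new = parse_attributes(new_text)
    return {
        new[attr_id][0]: (old[attr_id][1], new[attr_id][1])
        for attr_id in new
        if attr_id in old and new[attr_id][1] > old[attr_id][1]
    }


# Illustrative rows only; 31 is an invented "later" value.
OLD = "199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 28\n"
NEW = "199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 31\n"
print(increased(OLD, NEW))
```

A growing raw value between snapshots is the signal of interest here; the normalized VALUE column can stay at 100 the whole time.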

I have no real advice for tracking this kind of problem down.  The most
common response is "replace cables", which isn't necessarily the root
cause.  I have no advice or tips on how to track down interference
issues, or how to truly examine a disk PCB or controller PCB for the
latter item.  "Flaky traces" on a PCB could cause this sort of thing.
Folks in the EE field would know more about these issues; I am not an EE
person.

Since the attribute increased on both drives simultaneously (I have to
assume simultaneously?), it's more likely that the problem is not with
SATA cables or the drives but the controller on the motherboard.  I'd
recommend replacing the motherboard.  I make no guarantees this will fix
anything however, but it is the "common point" for both of your drives.

There really isn't anything else I can do going forward.  This is pretty
much where the buck stops for me, and is validation as to why each and
every problem/issue has to be handled individually.

-- 
| Jeremy Chadwick  jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |



Re: problems with AHCI on FreeBSD 8.2

2012-02-15 Thread Victor Balada Diaz
On Tue, Feb 14, 2012 at 06:16:01AM -0800, Jeremy Chadwick wrote:
> Thanks.  Both your drives look overall fine, sort-of.  I'll outline my
> concern points, and ask for some more info:
> 
> * ada0 has 28 CRC errors, while ada1 has 2.  These drives have been in
> use for 4688 hours and 4583 hours (respectively), which is roughly 6
> months for each drive.  CRC errors usually result in transparent
> retransmits, but this can sometimes cause I/O delays (especially if the
> CRC errors are repeated).
> 
> If the timeout messages recur in the future, please run the commands I
> gave you above once more and provide the output.  I can then compare the
> old to the new and see if there is anything of interest.

I've made it fail again. You can see the smartctl -a output. CRC errors are
increasing, but I'm not sure what that really means. Is the HD broken? Both?
At the same time?

# smartctl -a /dev/ada0
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.1-RELEASE-p5 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F2 EG
Device Model:     SAMSUNG HD154UI
Serial Number:    S24EJ9BB200080
LU WWN Device Id: 5 0024e9 2047cb78f
Firmware Version: 1AG01118
User Capacity:    1,500,301,910,016 bytes [1.50 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Wed Feb 15 18:11:31 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (18863) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off
                                        support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  33) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   072   072   011    Pre-fail  Always       -       9330
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       22
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       13677
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4716
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       22
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error

Re: problems with AHCI on FreeBSD 8.2

2012-02-15 Thread Victor Balada Diaz
On Tue, Feb 14, 2012 at 04:42:47PM -0700, Scott Long wrote:
> On Feb 14, 2012, at 4:34 PM, Victor Balada Diaz wrote:
> > On Tue, Feb 14, 2012 at 03:09:58PM -0800, Jeremy Chadwick wrote:
> >> I took a stab at this, but I don't feel confident this is the proper
> >> solution/method.  I worry there's some sort of chicken-or-the-egg
> >> condition here (quirk setup/matching comes *after* SATA capabilities
> >> detection), or that it makes the code messier.  Need mav@'s
> >> recommendations on this.
> >> 
> >> Below is for RELENG_8.  I should note I haven't tested if this works, or
> >> even compiles -- normally I don't provide such patches without testing
> >> so I apologise in advance / user beware.
> > 
> > You're amazingly fast. Thanks for all your help :)
> > 
> > You start applying the quirks before 
> > 
> >snprintf(announce_buf, sizeof(announce_buf),
> >"kern.cam.ada.%d.quirks", periph->unit_number);
> >quirks = softc->quirks;
> >TUNABLE_INT_FETCH(announce_buf, &quirks);
> > 
> > So you're breaking quirk setting at boot time.
> > 
> > See my attached patch. I can confirm it works for me.
> > 
> > Regards.
> > 
> 
> I don't think that disabling NCQ entirely is the right solution.  It's a tag
> starvation issue in the firmware, not a complete failure, and it can be dealt
> with in the CAM XPT scheduler fairly efficiently.  Alexander and I talked
> about this recently, and though we differ on the details, a tag hack is not
> in order, IMHO.  In the short term, try just using
> "camcontrol tags ada0 -N 1" to limit the concurrent commands to 1.
> 
> Scott

Seems changing tags on both disks doesn't fix the issue:

(ada0:ahcich0:0:0:0): Request requeued
(ada0:ahcich0:0:0:0): Retrying command
ahcich1: Timeout on slot 0
ahcich1: is  cs 0001 ss  rs 0001 tfd c0 serr 
ahcich1: AHCI reset...
ahcich1: SATA connect time=0ms status=0123
ahcich1: ready wait time=18ms
ahcich1: AHCI reset done: device found
(ada1:ahcich1:0:0:0): Request requeued
(ada1:ahcich1:0:0:0): Retrying command
(ada1:ahcich1:0:0:0): Command timed out
(ada1:ahcich1:0:0:0): Retrying command
ahcich0: Timeout on slot 30
ahcich0: is  cs c000 ss  rs c000 tfd c0 serr 
ahcich0: AHCI reset...
ahcich0: SATA connect time=0ms status=0123
ahcich0: ready wait time=18ms
ahcich0: AHCI reset done: device found

The only difference is that now I get the "Request requeued" message.

-- 
The most convincing proof that intelligent life exists on other
planets is that they have not tried to contact us.


Re: problems with AHCI on FreeBSD 8.2

2012-02-15 Thread Tom Evans
On Wed, Feb 15, 2012 at 10:52 AM, Jeremy Chadwick wrote:
> Sorry, I missed the in-line part of your post at the top where you said:
>
>> > Interesting. I have 9 SAMSUNG HD154UI 1AG01118 in my raidz setup,
>> > haven't had a problem with any of them yet (touch wood).
>
> So that would be you using the same firmware (or so I'd like to believe,
> but see my previous explanations) as others.
>

Yes, that draws my ire too. How can you update the firmware and not
change the firmware revision? It's crazy. It's possible that even
amongst my drives there are different revisions, as not all drives
were bought at the same time.

Cheers

Tom


Re: problems with AHCI on FreeBSD 8.2

2012-02-15 Thread Harald Schmalzbauer
Jeremy Chadwick wrote on 15.02.2012 11:42 (localtime):
> That said, the disks / firmware versions mentioned by people involved in
> this thread / referenced threads are:
>
> * Victor Balada Diaz  -- SAMSUNG HD154UI, firmware 1AG01118
> * Claudius Herder     -- SAMSUNG HD753LJ, firmware 1AA01118
> * Oscar Prieto        -- SAMSUNG HD154UI, firmware 1AG01118
>   - NOTE: In Oscar's case, his drives exhibit other problems.  I
>     would provide a link but the web archive for freebsd-stable does
>     not show my mail which contains analysis of the situation
> * Harald Schmalzbauer -- not provided, but hints at Samsung EG drives
  --  SAMSUNG HD154UI, firmware 1AG01118
I still have them for "outsourcing" in one server, where they idle all
the time.

Thanks,

-Harry






Re: problems with AHCI on FreeBSD 8.2

2012-02-15 Thread Jeremy Chadwick


On Wed, Feb 15, 2012 at 02:42:05AM -0800, Jeremy Chadwick wrote:

Sorry, I missed the in-line part of your post at the top where you said:

> > Interesting. I have 9 SAMSUNG HD154UI 1AG01118 in my raidz setup,
> > haven't had a problem with any of them yet (touch wood).

So that would be you using the same firmware (or so I'd like to believe,
but see my previous explanations) as others.

It could be some AHCI<->NCQ drive implementation quirk.  There was an
example of this back in the day with Maxtor drives' NCQ implementation
not behaving correctly on nVidia controllers, which Maxtor insisted was
an nVidia problem yet released a drive firmware fix for.  I'm one of the
people this affected (on my desktop system), which is why I remember it.


Re: problems with AHCI on FreeBSD 8.2

2012-02-15 Thread Jeremy Chadwick
On Wed, Feb 15, 2012 at 10:19:37AM +, Tom Evans wrote:
> On Tue, Feb 14, 2012 at 7:52 PM, Jeremy Chadwick wrote:
> > On Tue, Feb 14, 2012 at 08:31:23PM +0100, Oscar Prieto wrote:
> >> I used to have tons of AHCI errors in my 4-disk raidz1 of
> >> HD154UIs when the rig was built a year or so ago (with 8.0-RELEASE),
> >> but they disappeared after tuning ZFS.
> >>
> >> Sadly, I also got a new timeout a few days ago, followed by smartctl
> >> errors that I still haven't checked. I guess they could be legit; I still
> >> have to test/swap cables and give it a try.
> 
> Interesting. I have 9 SAMSUNG HD154UI 1AG01118 in my raidz setup,
> haven't had a problem with any of them yet (touch wood).
> 
> > Further details which pertain to Samsung drives:
> >
> > In your case, you run smartd(8), which periodically hits the drive with
> > SMART requests, pulling attribute data down and parsing it.  I believe
> > your model is fine for this, but for similar Samsung models, I must
> > strongly advise against this.  There are well-documented problems with
> > Samsung firmwares and SMART behaviour which can result in data loss (yes
> > you read that right).  Please see smartmontools' Wiki page on the matter
> > for full details.  Just make sure you're running a fixed firmware:
> >
> > http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
> >
> 
> Yikes, I have just this week installed a HD204UI. From that page,
> drives manufactured after December 2010 should not be affected, which
> is fortunate as the linked firmware page doesn't seem to exist
> anymore, Samsung no longer seem to offer support for their drives and
> point you at Seagate, whose site (of course!) only has downloads for
> current Seagate drives.
> 
> 
> Hmm reading later on in the thread there is a patch to mark certain
> drives as having flaky NCQ - in the patch it is for the SAMSUNG
> HD154UI. As I mentioned before, I have 9 SAMSUNG HD154UI, all of which
> use ahci(4) and NCQ, and all work perfectly, no timeouts. This is
> using 9-STABLE.
> 
> I suspect that there may be more going on than 'flaky NCQ', and that
> perhaps disabling NCQ masks the real issue.

It could simply be a firmware bug in the drive, which is what some
others have alluded to (and I'm in agreement with).  I would love to say
"compare firmware versions on your drives", except there is real
in-the-field proof that firmware version strings often do not get
updated/changed between firmwares (at least in the case of some Seagate
and Western Digital disks).  Furthermore, NCQ can "play differently" with
different AHCI controllers.

That said, the disks / firmware versions mentioned by people involved in
this thread / referenced threads are:

* Victor Balada Diaz  -- SAMSUNG HD154UI, firmware 1AG01118
* Claudius Herder     -- SAMSUNG HD753LJ, firmware 1AA01118
* Oscar Prieto        -- SAMSUNG HD154UI, firmware 1AG01118
  - NOTE: In Oscar's case, his drives exhibit other problems.  I
    would provide a link but the web archive for freebsd-stable does
    not show my mail which contains analysis of the situation
* Harald Schmalzbauer -- not provided, but hints at Samsung EG drives

For this to be thorough, one would need to check what all AHCI
controllers are being used and compare those as well.

I think Scott's theory is probably on-the-ball here, as it pertains to
tag exhaustion, which would manifest itself in the described fashion:

http://lists.freebsd.org/pipermail/freebsd-stable/2012-February/066177.html

I'd urge people experiencing this problem to issue the command Scott
provided on all their Samsung disks and see if the problem goes away
after that.  If it does, great, and I acknowledge there is no
loader.conf tunable for doing this, etc. etc. etc. so either make an
rc.d script that does it after boot-up or something.
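
As a starting point for such a boot-time script, here is a hedged Python sketch that builds one `camcontrol tags <disk> -N 1` invocation per disk. The disk list is an assumption (ada0 and ada1 are simply the names that appear in this thread) and the function name is hypothetical. The commands are only printed; actually running them needs root on the FreeBSD host.

```python
def tag_limit_commands(disks, tags=1):
    """Return one `camcontrol tags <disk> -N <tags>` argv list per disk."""
    if tags < 1:
        raise ValueError("tags must be at least 1")
    return [["camcontrol", "tags", disk, "-N", str(tags)] for disk in disks]


# Assumed disk names; adjust to the Samsung disks actually present.
for argv in tag_limit_commands(["ada0", "ada1"]):
    print(" ".join(argv))
    # To apply the limit, pass argv to subprocess.run(argv, check=True)
    # as root on the affected machine.
```

Since the setting does not survive a reboot, whatever applies it has to run on every boot-up.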




Re: problems with AHCI on FreeBSD 8.2

2012-02-15 Thread Tom Evans
On Tue, Feb 14, 2012 at 7:52 PM, Jeremy Chadwick wrote:
> On Tue, Feb 14, 2012 at 08:31:23PM +0100, Oscar Prieto wrote:
>> I used to have tons of AHCI errors in my 4-disk raidz1 of
>> HD154UIs when the rig was built a year or so ago (with 8.0-RELEASE),
>> but they disappeared after tuning ZFS.
>>
>> Sadly, I also got a new timeout a few days ago, followed by smartctl
>> errors that I still haven't checked. I guess they could be legit; I still
>> have to test/swap cables and give it a try.

Interesting. I have 9 SAMSUNG HD154UI 1AG01118 in my raidz setup,
haven't had a problem with any of them yet (touch wood).

> Further details which pertain to Samsung drives:
>
> In your case, you run smartd(8), which periodically hits the drive with
> SMART requests, pulling attribute data down and parsing it.  I believe
> your model is fine for this, but for similar Samsung models, I must
> strongly advise against this.  There are well-documented problems with
> Samsung firmwares and SMART behaviour which can result in data loss (yes
> you read that right).  Please see smartmontools' Wiki page on the matter
> for full details.  Just make sure you're running a fixed firmware:
>
> http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
>

Yikes, I have just this week installed a HD204UI. From that page,
drives manufactured after December 2010 should not be affected, which
is fortunate as the linked firmware page doesn't seem to exist
anymore, Samsung no longer seem to offer support for their drives and
point you at Seagate, whose site (of course!) only has downloads for
current Seagate drives.


Hmm reading later on in the thread there is a patch to mark certain
drives as having flaky NCQ - in the patch it is for the SAMSUNG
HD154UI. As I mentioned before, I have 9 SAMSUNG HD154UI, all of which
use ahci(4) and NCQ, and all work perfectly, no timeouts. This is
using 9-STABLE.

I suspect that there may be more going on than 'flaky NCQ', and that
perhaps disabling NCQ masks the real issue.

Cheers

Tom


Re: problems with AHCI on FreeBSD 8.2

2012-02-15 Thread Andriy Gapon

[cc list somewhat trimmed]

on 15/02/2012 10:29 Victor Balada Diaz said the following:
> Indeed you're right. It's a hack.

Sorry for intruding and under-quoting... perhaps the following commit could be a
model solution for your problem here?
http://svnweb.freebsd.org/base?view=revision&revision=231745
(with scsi -> ata , of course)

-- 
Andriy Gapon


Re: problems with AHCI on FreeBSD 8.2

2012-02-15 Thread Victor Balada Diaz
On Tue, Feb 14, 2012 at 04:10:20PM -0800, Jeremy Chadwick wrote:
> I'm too tired to quite understand (in full) what's wrong with my patch,
> but I think you're referring to situations where someone would have
> kern.cam.ada.X.quirks set in loader.conf?
> 
> If so, I believe that same situation would happen presently if someone
> set kern.cam.ada.X.quirks in their loader.conf to a value that did not
> contain bit #0 set to 1, and used one of the 4K sector disks listed in
> ada_quirk_table -- what's in loader.conf looks like it would overwrite
> whatever the kernel code bits chose automatically:
> 
>  910         match = cam_quirkmatch((caddr_t)&cgd->ident_data,
>  911             (caddr_t)ada_quirk_table,
>  912             sizeof(ada_quirk_table)/sizeof(*ada_quirk_table),
>  913             sizeof(*ada_quirk_table), ata_identify_match);
>  914         if (match != NULL)
>  915                 softc->quirks = ((struct ada_quirk_entry *)match)->quirks;
>  916         else
>  917                 softc->quirks = ADA_Q_NONE;
>  ...
>  931         snprintf(announce_buf, sizeof(announce_buf),
>  932             "kern.cam.ada.%d.quirks", periph->unit_number);
>  933         quirks = softc->quirks;
>  934         TUNABLE_INT_FETCH(announce_buf, &quirks);
>  935         softc->quirks = quirks;
> 
> I read this to mean:
> 
> Lines 910-917 -- if there's a device ID string match in ada_quirk_table,
> set softc->quirks to the content of that struct entry.  So, for example,
> 4K sector disks would set softc->quirks to 0x01.
> 
> Lines 931-933 -- assign the "quirks" variable to what softc->quirks
> currently contains.  Thus, for 4K sector disks, quirks = 0x01.
> 
> Line 934 -- load into "quirks" variable the contents of loader.conf
> entries pertaining to kern.cam.ada.N.quirks, if set.  If someone had an
> entry in loader.conf saying kern.cam.ada.N.quirks=0 then yes, this would
> overwrite what the kernel "auto-chose".
> 
> Line 935 -- assign softc->quirks = quirks, thus softc->quirks = 0x00,
> assuming someone set it to such in loader.conf.

That's exactly what I meant.
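
To make the override concrete, here is a minimal userland sketch of that
ordering (not the kernel code itself). fetch_tunable() and
effective_quirks() are hypothetical stand-ins for TUNABLE_INT_FETCH() and
lines 933-935; the quirk values are the ones from the patches in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Quirk bits as they appear in the patches in this thread. */
#define ADA_Q_NONE	0x00
#define ADA_Q_4K	0x01
#define ADA_Q_NONCQ	0x02

/*
 * Hypothetical userland stand-in for TUNABLE_INT_FETCH(): if the tunable
 * is set (non-NULL string), overwrite *value; otherwise leave it alone.
 */
static void
fetch_tunable(const char *tunable, int *value)
{
	assert(value != NULL);
	if (tunable != NULL)
		*value = (int)strtol(tunable, NULL, 0);
}

/*
 * Mirrors lines 933-935: start from the quirk table result, then let a
 * loader.conf value replace it wholesale (it is not OR-ed in).
 */
static int
effective_quirks(int table_quirks, const char *tunable)
{
	int quirks = table_quirks;

	fetch_tunable(tunable, &quirks);
	return (quirks);
}
```

With a table default of ADA_Q_4K, effective_quirks(ADA_Q_4K, "0") gives 0,
so the tunable silently drops the 4K quirk; passing NULL (tunable unset)
keeps the table value.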

> > See my attached patch. I can confirm it works for me.
> 
> I thought of taking that approach, but for me it's "dirty".  Here's what
> I mean by that:
> 
> ADA_FLAG_CAN_NCQ gets set in softc->flags around line 892, but then
> roughly a hundred lines later, you clear the exact same flag you just
> set (based on quirk conditionals).
> 
> I dunno how people feel about that, but my impression is that it's
> dirty/confusing.  My opinion is to only set the bit once and not mess
> about with repeated if() statements that set it, then clear it, etc...

Indeed you're right. It's a hack. It would be better to move quirk
evaluation before the capabilities check.
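
Roughly, that reordering could look like the sketch below: resolve the
quirks first, then set the NCQ flag a single time. This is a userland
illustration under assumptions, not a tested patch: ncq_flags() is a
hypothetical helper, device_supports_ncq stands in for the
ATA_SUPPORT_NCQ / SID_DMA / SID_CmdQue checks, and the ADA_FLAG_CAN_NCQ
value is a placeholder.

```c
#include <stdbool.h>

#define ADA_Q_NONCQ		0x02	/* quirk bit from the patches in this thread */
#define ADA_FLAG_CAN_NCQ	0x0008	/* placeholder value for illustration */

/*
 * Hypothetical helper: 'quirks' is assumed to be final already (table
 * match plus any loader.conf override), so the NCQ flag is decided once
 * and never has to be cleared afterwards.
 */
static unsigned int
ncq_flags(unsigned int flags, unsigned int quirks, bool device_supports_ncq)
{
	if (device_supports_ncq && !(quirks & ADA_Q_NONCQ))
		flags |= ADA_FLAG_CAN_NCQ;
	return (flags);
}
```

The point of the design is that there is no set-then-clear sequence: a
drive with the quirk simply never gets the flag.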

-- 
The most convincing proof that intelligent life exists on other
planets is that they have not tried to contact us.


Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Scott Long

On Feb 14, 2012, at 4:34 PM, Victor Balada Diaz wrote:

> On Tue, Feb 14, 2012 at 03:09:58PM -0800, Jeremy Chadwick wrote:
>> On Tue, Feb 14, 2012 at 11:15:27PM +0100, Victor Balada Diaz wrote:
>>> On Tue, Feb 14, 2012 at 06:17:19PM +0100, Harald Schmalzbauer wrote:
>>>> Jeremy Chadwick wrote on 14.02.2012 17:50 (localtime):
>>>>> On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still
>>>>>> persists on FreeBSD 9.0 release.
>>>>>>
>>>>>> Switching from ahci to ataahci resolved the problem for me too.
>>>>>>
>>>>>> I'm using gmirror for swap, system is on a zpool and the problem first
>>>>>> occurred during a zpool scrub, but it is easily reproducible with dd.
>>>>>>
>>>>>> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1}
>>>>>> of=/dev/null is not an issue.
>>>>>> Sometimes I need to power off the server because after a reboot one disk
>>>>>> is still missing.
>>>>>>
>>>>>> I really would like to help in this issue, so let me know if you need
>>>>>> any more information.
>>>>> I find it interesting that, at least so far, the only people reporting
>>>>> problems of this type with the ahci.ko driver are people using Samsung
>>>>> disks.  The only difference is that your models are F1s while the OPs
>>>>> are F2s.
>>>>
>>>> I saw such timeouts long ago and mav@ had a look at my postings and he
>>>> mentioned it could be a NCQ problem.
>>>> I suspected the disks firmware.
>>>> I never tracked it down further, because after replacing the Samsung (F3
>>>> in that case) disks with hitachi ones solved all my problems and gave a
>>>> big performance kick as well (with zfs).
>>>> You can find the discussion here:
>>>> http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html
>>>>
>>> 
>>> You gave me a good idea: try to disable NCQ and see if that's the fault. So
>>> i went and applied the attached patch. After it, i can no longer reproduce
>>> the issue with ahci driver.
>>> 
>>> I know this is not a solution because it disables NCQ at controller level
>>> instead of disk level, but at least we know for sure where the problem is.
>>> 
>>> I think the solution would be to add a new quirk ADA_Q_NONCQ in 
>>> sys/cam/ata/ata_da.c.
>>> Quirks infraestructure is already built, so adding a new quirk for this 
>>> seems
>>> easy.
>>> 
>>> Is someone interested? Do you think there is a better solution?
>>> 
>>> If someone is interested i can build a patch to add ADA_Q_NONCQ quirk and 
>>> add my drives
>>> to it.
>> 
>> I took a stab at this, but I don't feel confident this is the proper
>> solution/method.  I worry there's some sort of chicken-or-the-egg
>> condition here (quirk setup/matching comes *after* SATA capabilities
>> detection), or that it makes the code messier.  Need mav@'s
>> recommendations on this.
>> 
>> Below is for RELENG_8.  I should note I haven't tested if this works, or
>> even compiles -- normally I don't provide such patches without testing
>> so I apologise in advance / user beware.
> 
> You're amazingly fast. Thanks for all your help :)
> 
> You start applying the quirks before 
> 
>snprintf(announce_buf, sizeof(announce_buf),
>"kern.cam.ada.%d.quirks", periph->unit_number);
>quirks = softc->quirks;
>TUNABLE_INT_FETCH(announce_buf, &quirks);
> 
> So you're breaking quirk setting at boot time.
> 
> See my attached patch. I can confirm it works for me.
> 
> Regards.
> 

I don't think that disabling NCQ entirely is the right solution.  It's a tag
starvation issue in the firmware, not a complete failure, and it can be dealt
with in the CAM XPT scheduler fairly efficiently.  Alexander and I talked about
this recently, and though we differ on the details, a tag hack is not in order,
IMHO.  In the short term, try just using "camcontrol tags ada0 -N 1" to limit
the concurrent commands to 1.

Scott




Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Jeremy Chadwick
On Wed, Feb 15, 2012 at 12:34:20AM +0100, Victor Balada Diaz wrote:
> On Tue, Feb 14, 2012 at 03:09:58PM -0800, Jeremy Chadwick wrote:
> > On Tue, Feb 14, 2012 at 11:15:27PM +0100, Victor Balada Diaz wrote:
> > > On Tue, Feb 14, 2012 at 06:17:19PM +0100, Harald Schmalzbauer wrote:
> > > > Jeremy Chadwick wrote on 14.02.2012 17:50 (localtime):
> > > > > On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote:
> > > > >> Hello,
> > > > >>
> > > > >> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it 
> > > > >> still
> > > > >> persists on FreeBSD 9.0 release.
> > > > >>
> > > > >> Switching from ahci to ataahci resolved the problem for me too.
> > > > >>
> > > > >> I'm using gmirror for swap, system is on a zpool and the problem 
> > > > >> first
> > > > >> occurred during a zpool scrub, but it is easily reproducible with dd.
> > > > >>
> > > > >> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1}
> > > > >> of=/dev/null is not an issue.
> > > > >> Sometimes I need to power off the server because after a reboot one 
> > > > >> disk
> > > > >> is still missing.
> > > > >>
> > > > >> I really would like to help in this issue, so let me know if you need
> > > > >> any more information.
> > > > > I find it interesting that, at least so far, the only people reporting
> > > > > problems of this type with the ahci.ko driver are people using Samsung
> > > > > disks.  The only difference is that your models are F1s while the OPs
> > > > > are F2s.
> > > > 
> > > > I saw such timeouts long ago and mav@ had a look at my postings and he
> > > > mentioned it could be a NCQ problem.
> > > > I suspected the disks firmware.
> > > > I never tracked it down further, because after replacing the Samsung (F3
> > > > in that case) disks with hitachi ones solved all my problems and gave a
> > > > big performance kick as well (with zfs).
> > > > You can find the discussion here:
> > > > http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html
> > > > 
> > > 
> > > You gave me a good idea: try to disable NCQ and see if that's the fault. 
> > > So
> > > i went and applied the attached patch. After it, i can no longer reproduce
> > > the issue with ahci driver.
> > > 
> > > I know this is not a solution because it disables NCQ at controller level
> > > instead of disk level, but at least we know for sure where the problem is.
> > > 
> > > I think the solution would be to add a new quirk ADA_Q_NONCQ in 
> > > sys/cam/ata/ata_da.c.
> > > Quirks infraestructure is already built, so adding a new quirk for this 
> > > seems
> > > easy.
> > > 
> > > Is someone interested? Do you think there is a better solution?
> > > 
> > > If someone is interested i can build a patch to add ADA_Q_NONCQ quirk and 
> > > add my drives
> > > to it.
> > 
> > I took a stab at this, but I don't feel confident this is the proper
> > solution/method.  I worry there's some sort of chicken-or-the-egg
> > condition here (quirk setup/matching comes *after* SATA capabilities
> > detection), or that it makes the code messier.  Need mav@'s
> > recommendations on this.
> > 
> > Below is for RELENG_8.  I should note I haven't tested if this works, or
> > even compiles -- normally I don't provide such patches without testing
> > so I apologise in advance / user beware.
> 
> You're amazingly fast. Thanks for all your help :)
> 
> You start applying the quirks before 
> 
> snprintf(announce_buf, sizeof(announce_buf),
> "kern.cam.ada.%d.quirks", periph->unit_number);
> quirks = softc->quirks;
> TUNABLE_INT_FETCH(announce_buf, &quirks);
> 
> So you're breaking quirk setting at boot time.

I'm too tired to quite understand (in full) what's wrong with my patch,
but I think you're referring to situations where someone would have
kern.cam.ada.X.quirks set in loader.conf?

If so, I believe that same situation would happen presently if someone
set kern.cam.ada.X.quirks in their loader.conf to a value that did not
contain bit #0 set to 1, and used one of the 4K sector disks listed in
ada_quirk_table -- what's in loader.conf looks like it would overwrite
whatever the kernel code bits chose automatically:

 910         match = cam_quirkmatch((caddr_t)&cgd->ident_data,
 911                                (caddr_t)ada_quirk_table,
 912                                sizeof(ada_quirk_table)/sizeof(*ada_quirk_table),
 913                                sizeof(*ada_quirk_table), ata_identify_match);
 914         if (match != NULL)
 915                 softc->quirks = ((struct ada_quirk_entry *)match)->quirks;
 916         else
 917                 softc->quirks = ADA_Q_NONE;
 ...
 931         snprintf(announce_buf, sizeof(announce_buf),
 932             "kern.cam.ada.%d.quirks", periph->unit_number);
 933         quirks = softc->quirks;
 934         TUNABLE_INT_FETCH(announce_buf, &quirks);
 935         softc->quirks = quirks;

I read this to mean:

Lines 910-917 -- if there's a device ID st

Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Victor Balada Diaz
On Tue, Feb 14, 2012 at 03:09:58PM -0800, Jeremy Chadwick wrote:
> On Tue, Feb 14, 2012 at 11:15:27PM +0100, Victor Balada Diaz wrote:
> > On Tue, Feb 14, 2012 at 06:17:19PM +0100, Harald Schmalzbauer wrote:
> > > Jeremy Chadwick wrote on 14.02.2012 17:50 (localtime):
> > > > On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote:
> > > >> Hello,
> > > >>
> > > >> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it 
> > > >> still
> > > >> persists on FreeBSD 9.0 release.
> > > >>
> > > >> Switching from ahci to ataahci resolved the problem for me too.
> > > >>
> > > >> I'm using gmirror for swap, system is on a zpool and the problem first
> > > >> occurred during a zpool scrub, but it is easily reproducible with dd.
> > > >>
> > > >> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1}
> > > >> of=/dev/null is not an issue.
> > > >> Sometimes I need to power off the server because after a reboot one 
> > > >> disk
> > > >> is still missing.
> > > >>
> > > >> I really would like to help in this issue, so let me know if you need
> > > >> any more information.
> > > > I find it interesting that, at least so far, the only people reporting
> > > > problems of this type with the ahci.ko driver are people using Samsung
> > > > disks.  The only difference is that your models are F1s while the OPs
> > > > are F2s.
> > > 
> > > I saw such timeouts long ago and mav@ had a look at my postings and he
> > > mentioned it could be a NCQ problem.
> > > I suspected the disks firmware.
> > > I never tracked it down further, because after replacing the Samsung (F3
> > > in that case) disks with hitachi ones solved all my problems and gave a
> > > big performance kick as well (with zfs).
> > > You can find the discussion here:
> > > http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html
> > > 
> > 
> > You gave me a good idea: try to disable NCQ and see if that's the fault. So
> > i went and applied the attached patch. After it, i can no longer reproduce
> > the issue with ahci driver.
> > 
> > I know this is not a solution because it disables NCQ at controller level
> > instead of disk level, but at least we know for sure where the problem is.
> > 
> > I think the solution would be to add a new quirk ADA_Q_NONCQ in 
> > sys/cam/ata/ata_da.c.
> > Quirks infraestructure is already built, so adding a new quirk for this 
> > seems
> > easy.
> > 
> > Is someone interested? Do you think there is a better solution?
> > 
> > If someone is interested i can build a patch to add ADA_Q_NONCQ quirk and 
> > add my drives
> > to it.
> 
> I took a stab at this, but I don't feel confident this is the proper
> solution/method.  I worry there's some sort of chicken-or-the-egg
> condition here (quirk setup/matching comes *after* SATA capabilities
> detection), or that it makes the code messier.  Need mav@'s
> recommendations on this.
> 
> Below is for RELENG_8.  I should note I haven't tested if this works, or
> even compiles -- normally I don't provide such patches without testing
> so I apologise in advance / user beware.

You're amazingly fast. Thanks for all your help :)

You start applying the quirks before this code runs:

	snprintf(announce_buf, sizeof(announce_buf),
	    "kern.cam.ada.%d.quirks", periph->unit_number);
	quirks = softc->quirks;
	TUNABLE_INT_FETCH(announce_buf, &quirks);

So you're breaking quirk setting at boot time: a quirk set in loader.conf
is fetched too late to affect the NCQ decision.

See my attached patch. I can confirm it works for me.

Regards.

-- 
The most convincing proof that intelligent life exists on other
planets is that they have not tried to contact us.
--- ata_da.c	2012-02-14 22:17:54.0 +0100
+++ ata_da.c	2012-02-14 22:58:05.0 +0100
@@ -91,6 +91,7 @@
 typedef enum {
 	ADA_Q_NONE		= 0x00,
 	ADA_Q_4K		= 0x01,
+	ADA_Q_NONCQ		= 0x02,
 } ada_quirks;
 
 typedef enum {
@@ -162,6 +163,14 @@
 		/*quirks*/ADA_Q_4K
 	},
 	{
+		/* 
+		 * Samsung have NCQ broken:
+		 * http://lists.freebsd.org/pipermail/freebsd-stable/2012-February/066168.html
+		 */
+		{ T_DIRECT, SIP_MEDIA_FIXED, "*", "SAMSUNG HD154UI*", "*" },
+		/*quirks*/ADA_Q_NONCQ
+	},
+	{
 		/* Samsung Advanced Format (4k) drives */
 		{ T_DIRECT, SIP_MEDIA_FIXED, "*", "SAMSUNG HD155UI*", "*" },
 		/*quirks*/ADA_Q_4K
@@ -967,6 +976,10 @@
 	softc->disk->d_maxsize = maxio;
 	softc->disk->d_unit = periph->unit_number;
 	softc->disk->d_flags = 0;
+	/* Disable NCQ if needed */
+	if (softc->flags & ADA_FLAG_CAN_NCQ &&
+	softc->quirks & ADA_Q_NONCQ)
+	  softc->flags ^= ADA_FLAG_CAN_NCQ;
 	if (softc->flags & ADA_FLAG_CAN_FLUSHCACHE)
 		softc->disk->d_flags |= DISKFLAG_CANFLUSHCACHE;
 	if ((softc->flags & ADA_FLAG_CAN_TRIM) ||

Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Jeremy Chadwick
On Tue, Feb 14, 2012 at 11:15:27PM +0100, Victor Balada Diaz wrote:
> On Tue, Feb 14, 2012 at 06:17:19PM +0100, Harald Schmalzbauer wrote:
> > Jeremy Chadwick wrote on 14.02.2012 17:50 (localtime):
> > > On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote:
> > >> Hello,
> > >>
> > >> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still
> > >> persists on FreeBSD 9.0 release.
> > >>
> > >> Switching from ahci to ataahci resolved the problem for me too.
> > >>
> > >> I'm using gmirror for swap, system is on a zpool and the problem first
> > >> occurred during a zpool scrub, but it is easily reproducible with dd.
> > >>
> > >> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1}
> > >> of=/dev/null is not an issue.
> > >> Sometimes I need to power off the server because after a reboot one disk
> > >> is still missing.
> > >>
> > >> I really would like to help in this issue, so let me know if you need
> > >> any more information.
> > > I find it interesting that, at least so far, the only people reporting
> > > problems of this type with the ahci.ko driver are people using Samsung
> > > disks.  The only difference is that your models are F1s while the OPs
> > > are F2s.
> > 
> > I saw such timeouts long ago and mav@ had a look at my postings and he
> > mentioned it could be a NCQ problem.
> > I suspected the disks firmware.
> > I never tracked it down further, because after replacing the Samsung (F3
> > in that case) disks with hitachi ones solved all my problems and gave a
> > big performance kick as well (with zfs).
> > You can find the discussion here:
> > http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html
> > 
> 
> You gave me a good idea: try to disable NCQ and see if that's the fault. So
> i went and applied the attached patch. After it, i can no longer reproduce
> the issue with ahci driver.
> 
> I know this is not a solution because it disables NCQ at controller level
> instead of disk level, but at least we know for sure where the problem is.
> 
> I think the solution would be to add a new quirk ADA_Q_NONCQ in 
> sys/cam/ata/ata_da.c.
> Quirks infraestructure is already built, so adding a new quirk for this seems
> easy.
> 
> Is someone interested? Do you think there is a better solution?
> 
> If someone is interested i can build a patch to add ADA_Q_NONCQ quirk and add 
> my drives
> to it.

I took a stab at this, but I don't feel confident this is the proper
solution/method.  I worry there's some sort of chicken-or-the-egg
condition here (quirk setup/matching comes *after* SATA capabilities
detection), or that it makes the code messier.  Need mav@'s
recommendations on this.

Below is for RELENG_8.  I should note I haven't tested if this works, or
even compiles -- normally I don't provide such patches without testing
so I apologise in advance / user beware.

-- 
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |

diff -ruN /usr/src/sys/cam/ata/ata_da.c src/sys/cam/ata/ata_da.c
--- /usr/src/sys/cam/ata/ata_da.c   2012-02-10 17:22:25.0 -0800
+++ src/sys/cam/ata/ata_da.c	2012-02-14 15:07:07.988814133 -0800
@@ -90,7 +90,8 @@
 
 typedef enum {
 	ADA_Q_NONE		= 0x00,
-	ADA_Q_4K		= 0x01,
+	ADA_Q_4K		= 0x01,	/* 4k sectors */
+	ADA_Q_NONCQ		= 0x02,	/* device has flaky NCQ support */
 } ada_quirks;
 
 typedef enum {
@@ -162,6 +163,11 @@
 		/*quirks*/ADA_Q_4K
 	},
 	{
+		/* Samsung Spinpoint F2 EG (EcoGreen) drives */
+		{ T_DIRECT, SIP_MEDIA_FIXED, "*", "SAMSUNG HD154UI*", "*" },
+		/*quirks*/ADA_Q_NONCQ,
+	},
+	{
 		/* Samsung Advanced Format (4k) drives */
 		{ T_DIRECT, SIP_MEDIA_FIXED, "*", "SAMSUNG HD155UI*", "*" },
 		/*quirks*/ADA_Q_4K
@@ -887,9 +893,6 @@
 		softc->flags |= ADA_FLAG_CAN_FLUSHCACHE;
 	if (cgd->ident_data.support.command1 & ATA_SUPPORT_POWERMGT)
 		softc->flags |= ADA_FLAG_CAN_POWERMGT;
-	if (cgd->ident_data.satacapabilities & ATA_SUPPORT_NCQ &&
-	    (cgd->inq_flags & SID_DMA) && (cgd->inq_flags & SID_CmdQue))
-		softc->flags |= ADA_FLAG_CAN_NCQ;
 	if (cgd->ident_data.support_dsm & ATA_SUPPORT_DSM_TRIM) {
 		softc->flags |= ADA_FLAG_CAN_TRIM;
 		softc->trim_max_ranges = TRIM_MAX_RANGES;
@@ -916,6 +919,15 @@
 	else
 		softc->quirks = ADA_Q_NONE;
 
+	/*
+	 * Do not enable NCQ for devices which have the ADA_Q_NONCQ quirk.
+	 */
+	if (!(softc->quirks & ADA_Q_NONCQ)) {
+		if (cgd->ident_data.satacapabilities & ATA_SUPPORT_NCQ &&
+		    (cgd->i

Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Oscar Prieto
Thank you again Jeremy, sure it helps!

On Tue, Feb 14, 2012 at 9:31 PM, Jeremy Chadwick
 wrote:
> On Tue, Feb 14, 2012 at 09:19:02PM +0100, Oscar Prieto wrote:
>> Thank you Jeremy, I'm already checking your links.
>>
>> When I installed smartd I configured a daily short test and a weekly
>> long one for all the drives, to run while the machine is mostly unused.
>> From the documentation and other info around, I never thought it could
>> be a problem.
>>
>> # /usr/local/etc/smartd.conf
>> /dev/ada0 -a -o on -S on -s (S/../.././03|L/../../2/07)
>> /dev/ada1 -a -o on -S on -s (S/../.././04|L/../../3/07)
>> /dev/ada2 -a -o on -S on -s (S/../.././05|L/../../4/07)
>> /dev/ada3 -a -o on -S on -s (S/../.././06|L/../../5/07)
>
> The problem is that, quite honestly, these do you zero good.  All they do
> is make a mess (per se) of the SMART self-test log.
>
> Take for example your situation with ada3: smartd(8) told you that the
> number of pending sectors increased to 5, and uncorrected increased to
> 1.  That's really all you need to know at that point.  If you want to
> know the LBA numbers which are problematic, you can manually intervene.
>
> The point is: the drive itself is going to notice problematic or bad
> sectors quicker than periodic short or long or surface scan tests will.
> Let the drive do its thing normally and only use SMART tests when
> there's indication something is wrong.
>
>> I'll remove the checks. Do you advise removing the daemon altogether?
>
> smartd(8) is useful because it keeps track of attributes which change in
> value and logs data to syslog (if I remember right), thus you have an
> exact time/date when an attribute changed.  This is especially useful
> for things pertaining to sector/physical media problems.
>
> As such, I tend to recommend folks using smartd(8) properly tune their
> smartd.conf to only monitor specific attributes.  This varies from drive
> to drive, but the key ones are things like attributes 5, 10, 11, 192,
> 193, 194 (if you want temperature logging), 196, 197, 198, 199, and 200.
> I'm speaking strictly for Western Digital disks here.
>
> The stock defaults, if I remember right, are to "monitor everything",
> which really doesn't work well given that so many vendors encode their
> RAW_VALUE fields in proprietary/vendor-specific formats.  People will
> often monitor things like the Hardware_ECC_Recovered attribute and start
> "freaking out" one day when the value goes from 0 to 838938239 or
> something larger.  Attribute data formats are not part of the ATA
> standard, so vendors choose how to encode them.  Plus, not many admins that
> I've run into (honest) know what that attribute actually means
> disk-wise (hint: it's 100% normal for sector ECC to happen at all times;
> magnetic media is not perfect, that's what the per-sector ECC section is
> for!)
>
> However: people don't understand what SMART attribute acquisition
> actually does behind the scenes -- it results in the disk having to read
> from the HPA area (not user accessible or within LBA regions), which
> means seeking + moving the arms to an area, reading, then reporting all
> of this back.  Thus, it impacts I/O performance.  This is why I don't
> use smartd(8) on any of our systems.  But if I was to use it?  I would
> have it poll maybe every 120 minutes, rather than every 30.  It all
> depends on the system/load/etc.  I've seen people poll every 5 minutes
> (I think they're absolutely crazy/paranoid).  Their systems, their
> problem.  :-)
>
> Hope this helps.
>
> --
> | Jeremy Chadwick                                 j...@parodius.com |
> | Parodius Networking                     http://www.parodius.com/ |
> | UNIX Systems Administrator                 Mountain View, CA, US |
> | Making life hard for others since 1977.             PGP 4BD6C0CB |
>


Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Victor Balada Diaz
On Tue, Feb 14, 2012 at 06:17:19PM +0100, Harald Schmalzbauer wrote:
> Jeremy Chadwick wrote on 14.02.2012 17:50 (localtime):
> > On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote:
> >> Hello,
> >>
> >> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still
> >> persists on FreeBSD 9.0 release.
> >>
> >> Switching from ahci to ataahci resolved the problem for me too.
> >>
> >> I'm using gmirror for swap, system is on a zpool and the problem first
> >> occurred during a zpool scrub, but it is easily reproducible with dd.
> >>
> >> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1}
> >> of=/dev/null is not an issue.
> >> Sometimes I need to power off the server because after a reboot one disk
> >> is still missing.
> >>
> >> I really would like to help in this issue, so let me know if you need
> >> any more information.
> > I find it interesting that, at least so far, the only people reporting
> > problems of this type with the ahci.ko driver are people using Samsung
> > disks.  The only difference is that your models are F1s while the OPs
> > are F2s.
> 
> I saw such timeouts long ago and mav@ had a look at my postings and he
> mentioned it could be a NCQ problem.
> I suspected the disks firmware.
> I never tracked it down further, because after replacing the Samsung (F3
> in that case) disks with hitachi ones solved all my problems and gave a
> big performance kick as well (with zfs).
> You can find the discussion here:
> http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html
> 

You gave me a good idea: try to disable NCQ and see if that's the fault. So
I went and applied the attached patch. After it, I can no longer reproduce
the issue with the ahci driver.

I know this is not a solution because it disables NCQ at controller level
instead of disk level, but at least we know for sure where the problem is.

I think the solution would be to add a new quirk, ADA_Q_NONCQ, in
sys/cam/ata/ata_da.c. The quirks infrastructure is already built, so adding
a new quirk for this seems easy.

Is someone interested? Do you think there is a better solution?

If someone is interested I can build a patch to add the ADA_Q_NONCQ quirk and
add my drives to it.

Regards.
-- 
The most convincing proof that intelligent life exists on other
planets is that they have not tried to contact us.


Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Jeremy Chadwick
On Tue, Feb 14, 2012 at 09:19:02PM +0100, Oscar Prieto wrote:
> Thank you Jeremy, I'm already checking your links.
> 
> When I installed smartd I configured a daily short test and a weekly
> long one for all the drives, to run while the machine is mostly unused.
> From the documentation and other info around, I never thought it could
> be a problem.
> 
> # /usr/local/etc/smartd.conf
> /dev/ada0 -a -o on -S on -s (S/../.././03|L/../../2/07)
> /dev/ada1 -a -o on -S on -s (S/../.././04|L/../../3/07)
> /dev/ada2 -a -o on -S on -s (S/../.././05|L/../../4/07)
> /dev/ada3 -a -o on -S on -s (S/../.././06|L/../../5/07)

The problem is that, quite honestly, these do you zero good.  All they do
is make a mess (per se) of the SMART self-test log.

Take for example your situation with ada3: smartd(8) told you that the
number of pending sectors increased to 5, and uncorrected increased to
1.  That's really all you need to know at that point.  If you want to
know the LBA numbers which are problematic, you can manually intervene.

The point is: the drive itself is going to notice problematic or bad
sectors quicker than periodic short or long or surface scan tests will.
Let the drive do its thing normally and only use SMART tests when
there's indication something is wrong.

> I'll remove the checks. Do you advise removing the daemon altogether?

smartd(8) is useful because it keeps track of attributes which change in
value and logs data to syslog (if I remember right), thus you have an
exact time/date when an attribute changed.  This is especially useful
for things pertaining to sector/physical media problems.

As such, I tend to recommend folks using smartd(8) properly tune their
smartd.conf to only monitor specific attributes.  This varies from drive
to drive, but the key ones are things like attributes 5, 10, 11, 192,
193, 194 (if you want temperature logging), 196, 197, 198, 199, and 200.
I'm speaking strictly for Western Digital disks here.

The stock defaults, if I remember right, are to "monitor everything",
which really doesn't work well given that so many vendors encode their
RAW_VALUE fields in proprietary/vendor-specific formats.  People will
often monitor things like the Hardware_ECC_Recovered attribute and start
"freaking out" one day when the value goes from 0 to 838938239 or
something larger.  Attribute data formats are not part of the ATA
standard, so vendors choose how to encode them.  Plus, not many admins that
I've run into (honest) know what that attribute actually means
disk-wise (hint: it's 100% normal for sector ECC to happen at all times;
magnetic media is not perfect, that's what the per-sector ECC section is
for!)

However: people don't understand what SMART attribute acquisition
actually does behind the scenes -- it results in the disk having to read
from the HPA area (not user accessible or within LBA regions), which
means seeking + moving the arms to an area, reading, then reporting all
of this back.  Thus, it impacts I/O performance.  This is why I don't
use smartd(8) on any of our systems.  But if I was to use it?  I would
have it poll maybe every 120 minutes, rather than every 30.  It all
depends on the system/load/etc.  I've seen people poll every 5 minutes
(I think they're absolutely crazy/paranoid).  Their systems, their
problem.  :-)

Hope this helps.

-- 
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |



Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Oscar Prieto
Thank you Jeremy, I'm already checking your links.

When I installed smartd I configured a daily short test and a weekly
long one for all the drives, to run while the machine is mostly unused.
From the documentation and other info around, I never thought it could
be a problem.

# /usr/local/etc/smartd.conf
/dev/ada0 -a -o on -S on -s (S/../.././03|L/../../2/07)
/dev/ada1 -a -o on -S on -s (S/../.././04|L/../../3/07)
/dev/ada2 -a -o on -S on -s (S/../.././05|L/../../4/07)
/dev/ada3 -a -o on -S on -s (S/../.././06|L/../../5/07)

I'll remove the checks. Do you advise removing the daemon altogether?


On Tue, Feb 14, 2012 at 8:51 PM, Martin Sugioarto  wrote:
> On Tue, 14 Feb 2012 20:24:32 +0100,
> Harald Schmalzbauer wrote:
>
>> I guess it's always the firmware of the EcoGreen models which cause
>> these problems. Your drive isn't EG...
>> I don't remember exactly the different model numbers, but I'm sure
>> they were all EcoGreen. The lower power consumption was the reason to
>> choose these specific drives (different capacities and F2/F3 series
>> tried), with acceptable performance loss - I thought. But it turned
> >> out that EcoGreen and NCQ as well as RAIDZ demands don't fit
>> together...
>
> Hi,
>
> I intentionally did not buy any Eco or Green model because I don't like
> them (Load_Cycle_Count bugs and so on). I realized, I like to use 1 Watt
> more power but have the performance doubled.
>
> --
> Martin


Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Jeremy Chadwick
On Tue, Feb 14, 2012 at 08:31:23PM +0100, Oscar Prieto wrote:
> I used to have tons of ahci errors on my 4-disk raidz1 of
> HD154UIs when the rig was built a year ago or so (with 8.0 Release),
> but they disappeared after tuning ZFS.
> 
> Sadly I also got a new timeout days ago, followed by smartctl errors I
> still haven't checked, but I guess they could be legit; I still have to
> test/swap cables and give it a try.

About your ada3 disk:

The below SMART errors indicate your disk does in fact have physical
media problems -- 1 confirmed bad sector, and 5 which are "suspect".
"Suspect" LBAs are unreadable until writes are issued to them.  A write
will induce the drive to re-analyse the sector at that LBA and determine
if it's truly bad or not.  A single LBA can actually take quite a long
time to analyse (it depends on what the problem is), and may result in
30+ seconds of delay.  You can either let the drive figure it out over
normal usage patterns, or you can do it manually yourself, time
permitting.  Your drive that shows read failures in the SMART self-test
log gives you the LBA numbers; try reading from those LBAs first.  I can
explain this procedure in another thread/offline/whatever.  (Does anyone
read what I write, re: don't hijack the thread?  :-) )
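
As a rough illustration of reading a suspect LBA and then writing to it,
the sketch below only builds the dd command lines; it does not run them.
It assumes 512-byte sectors (as smartctl reports for the drives in this
thread), and the LBA shown is a placeholder, not one taken from any log
here. The write command destroys the data in that sector, so it is only
sensible when the data is redundant (e.g. in a raidz) or already lost.

```python
SECTOR_SIZE = 512  # bytes per LBA; assumption based on "512 bytes logical/physical"


def dd_read_lba(device, lba, count=1):
    """Command that reads `count` sectors starting at `lba` and discards them."""
    return "dd if=%s of=/dev/null bs=%d skip=%d count=%d" % (
        device, SECTOR_SIZE, lba, count)


def dd_write_lba(device, lba, count=1):
    """DESTRUCTIVE: command that overwrites `count` sectors at `lba` with zeros."""
    return "dd if=/dev/zero of=%s bs=%d seek=%d count=%d" % (
        device, SECTOR_SIZE, lba, count)


if __name__ == "__main__":
    lba = 123456789  # hypothetical; use the LBA from the SMART self-test log
    print(dd_read_lba("/dev/ada3", lba))   # try the read first
    print(dd_write_lba("/dev/ada3", lba))  # only if the read fails
```

Note that reads use skip= (offset into the input) while writes use seek=
(offset into the output); mixing them up would hit sector 0 instead.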

About all of your disks:

All of your disks are undergoing regular/periodic SMART short and long
tests.  Please stop this; it really, truly does no good.  You will
experience performance hits during these tests.

About timeouts:

Timeouts seen on the controller and driver level can happen in this
situation; this is universal.  This is usually what features like
Western Digital's TLER and Hitachi + Samsung's CCTL can help alleviate,
but not fully solve.  I think the ada(4) default timeout of 30 seconds
is a decent value, to be quite honest, but I'm not sure what the AHCI
driver timeout is.  mav@ would need to clue me in, or I'd need to go
look at the source.  (Right now in my life is not a good time for me to
be reviewing source code or looking at commits, sadly.  Too much on my
mind recently.)

I can discuss the TLER/CCTL stuff more at length if needed, but to be
blatantly honest, I would rather not and here's why: people begin to
rely on these features to try and circumvent actual problems with their
drives.  Phrased differently: people on the Internet become incredibly
focused on all of these timeout durations (TLER/CCTL vs. controller vs.
driver vs. storage subsystem timeouts) and try to find some bizarre
"perfect harmony" between them all.  Instead, just leave them all alone
and watch your drives for problems.

Further details which pertain to Samsung drives:

In your case, you run smartd(8), which periodically hits the drive with
SMART requests, pulling attribute data down and parsing it.  I believe
your model is fine for this, but for similar Samsung models, I must
strongly advise against this.  There are well-documented problems with
Samsung firmwares and SMART behaviour which can result in data loss (yes
you read that right).  Please see smartmontools' Wiki page on the matter
for full details.  Just make sure you're running a fixed firmware:

http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

Regarding throughput of the drives being slow (30-40MBytes/sec across a
gigE link):

This sounds more like a Samba tuning problem, but ZFS raidz isn't known
for "amazing speed" per se.  Please see a post of mine from a while back
on how to tune Samba, which many followed up to with appreciation
stating their throughput increased dramatically:

http://lists.freebsd.org/pipermail/freebsd-stable/2011-February/061642.html

I should follow up to that post with the following entry, because I've
since updated my own smb.conf to tune things a bit better, and include
comments as to the justifications:

#
# The below options increase throughput substantially.  Be aware
# that AIO support requires the aio.ko kernel module loaded,
# and Samba to be built with AIO enabled.  Important notes:
#
# 1) We explicitly disable sendfile(2) because it has known
# problems on ZFS, including resulting in 2x the amount of memory
# used on the machine (VM cache + ZFS cache).  For further details,
# see freebsd-fs or freebsd-stable thread, subject "8.1-STABLE:
# zfs and sendfile: problem still exists".
#
# 2) (2011/10/03) socket options SO_SNDBUF and SO_RCVBUF do not
# appear to matter on FreeBSD, or our sysctls somehow take care of
# this (or maybe AIO?).  The performance is the same with or without
# these two socket options on 8.2-STABLE.
#
# 3) (2011/10/03) My previously-mentioned "aio write behind" option
# is incorrect; see the official smb.conf(5) man page for the syntax.
# It's not a yes/no toggle, thus serves no purpose.
#
socket options = TCP_NODELAY
use sendfile = no
min receivefile size = 16384
aio read size = 16384
aio write size = 16384

The rest is in the thread I linked.

Hope this helps.

-- 
| Jeremy Chadwick   

Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Martin Sugioarto
Am Tue, 14 Feb 2012 20:24:32 +0100
schrieb Harald Schmalzbauer :

> I guess it's always the firmware of the EcoGreen models which cause
> these problems. Your drive isn't EG...
> I don't remember exactly the different model numbers, but I'm sure
> they were all EcoGreen. The lower power consumption was the reason to
> choose these specific drives (different capacities and F2/F3 series
> tried), with acceptable performance loss - I thought. But it turned
> out that EcoGreen and NCQ as well as RAIDZ demands don't fit
> together...

Hi,

I intentionally did not buy any Eco or Green model because I don't like
them (Load_Cycle_Count bugs and so on). I realized, I like to use 1 Watt
more power but have the performance doubled.

--
Martin


Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Harald Schmalzbauer
 schrieb Martin Sugioarto am 14.02.2012 19:23 (localtime):
> Am Tue, 14 Feb 2012 18:17:19 +0100
> schrieb Harald Schmalzbauer :
>
>>> I find it interesting that, at least so far, the only people
>>> reporting problems of this type with the ahci.ko driver are people
>>> using Samsung disks.  The only difference is that your models are
>>> F1s while the OP's are F2s.
>> I saw such timeouts long ago and mav@ had a look at my postings and he
>> mentioned it could be an NCQ problem.
>> I suspected the disks' firmware.
>> I never tracked it down further, because replacing the Samsung
>> (F3 in that case) disks with Hitachi ones solved all my problems and
>> gave a big performance kick as well (with zfs).
>> You can find the discussion here:
>> http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html
> Hi,
>
> I just want to add here that I am using 2 drives of type "Samsung
> HD103SJ" (SpinPoint F3). And I did not have problems with ZFS and with
> UFS either (for several years now). Everything has been deployed on
> top of ada(4) since FreeBSD-8.
>
> Actually the speed is very good (sequential read at 140 MB/s and more).

I guess it's always the firmware of the EcoGreen models which cause
these problems. Your drive isn't EG...
I don't remember exactly the different model numbers, but I'm sure they
were all EcoGreen. The lower power consumption was the reason to choose
these specific drives (different capacities and F2/F3 series tried),
with acceptable performance loss - I thought. But it turned out that
EcoGreen and NCQ as well as RAIDZ demands don't fit together...

-Harry





Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Martin Sugioarto
Am Tue, 14 Feb 2012 18:17:19 +0100
schrieb Harald Schmalzbauer :

> > I find it interesting that, at least so far, the only people
> > reporting problems of this type with the ahci.ko driver are people
> > using Samsung disks.  The only difference is that your models are
> > F1s while the OP's are F2s.
> 
> I saw such timeouts long ago and mav@ had a look at my postings and he
> mentioned it could be an NCQ problem.
> I suspected the disks' firmware.
> I never tracked it down further, because replacing the Samsung
> (F3 in that case) disks with Hitachi ones solved all my problems and
> gave a big performance kick as well (with zfs).
> You can find the discussion here:
> http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html

Hi,

I just want to add here that I am using 2 drives of type "Samsung
HD103SJ" (SpinPoint F3). And I did not have problems with ZFS and with
UFS either (for several years now). Everything has been deployed on
top of ada(4) since FreeBSD-8.

Actually the speed is very good (sequential read at 140 MB/s and more).

--
Martin


Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Harald Schmalzbauer
 schrieb Jeremy Chadwick am 14.02.2012 17:50 (localtime):
> On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote:
>> Hello,
>>
>> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still
>> persists on FreeBSD 9.0 release.
>>
>> Switching from ahci to ataahci resolved the problem for me too.
>>
>> I'm using gmirror for swap, system is on a zpool and the problem first
>> occurred during a zpool scrub, but it is easily reproducible with dd.
>>
>> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1}
>> of=/dev/null is not an issue.
>> Sometimes I need to power off the server because after a reboot one disk
>> is still missing.
>>
>> I really would like to help in this issue, so let me know if you need
>> any more information.
> I find it interesting that, at least so far, the only people reporting
> problems of this type with the ahci.ko driver are people using Samsung
> disks.  The only difference is that your models are F1s while the OP's
> are F2s.

I saw such timeouts long ago and mav@ had a look at my postings and he
mentioned it could be an NCQ problem.
I suspected the disks' firmware.
I never tracked it down further, because replacing the Samsung (F3 in
that case) disks with Hitachi ones solved all my problems and gave a
big performance kick as well (with zfs).
You can find the discussion here:
http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html

JFI

-Harry





Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Jeremy Chadwick
On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote:
> 
> Hello,
> 
> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still
> persists on FreeBSD 9.0 release.
> 
> Switching from ahci to ataahci resolved the problem for me too.
> 
> I'm using gmirror for swap, system is on a zpool and the problem first
> occurred during a zpool scrub, but it is easily reproducible with dd.
> 
> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1}
> of=/dev/null is not an issue.
> Sometimes I need to power off the server because after a reboot one disk
> is still missing.
> 
> I really would like to help in this issue, so let me know if you need
> any more information.

I find it interesting that, at least so far, the only people reporting
problems of this type with the ahci.ko driver are people using Samsung
disks.  The only difference is that your models are F1s while the OP's
are F2s.

The only difference I can think of is that the ahci.ko driver may have
more strict timeouts than the ata driver (ata driver includes ataahci;
ataahci.ko != ahci.ko, as you know).

You may be able to adjust these using loader.conf variables:

kern.cam.ada.default_timeout
kern.cam.ada.retry_count
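
As a sketch of what setting those two tunables could look like, the
snippet below renders /boot/loader.conf lines in the usual name="value"
form. The values passed in are placeholders for illustration only, not
recommendations, and the helper name is made up.

```python
def loader_conf_lines(timeout_seconds, retry_count):
    """Render loader.conf entries for the two ada(4) tunables named above."""
    if timeout_seconds <= 0 or retry_count < 0:
        raise ValueError("timeout must be positive and retry count non-negative")
    return [
        'kern.cam.ada.default_timeout="%d"' % timeout_seconds,
        'kern.cam.ada.retry_count="%d"' % retry_count,
    ]


if __name__ == "__main__":
    # Hypothetical values: a longer timeout than the 30 seconds mentioned
    # elsewhere in this thread, and an arbitrary retry count.
    for line in loader_conf_lines(60, 4):
        print(line)
```

Loader tunables only take effect at the next boot.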

I also imagine that hint.ahci.X.ccc might have some involvement here,
but it's something I am not familiar with.  mav@ would need to comment
on this -- it's outside of my familiarity scope.

Furthermore, in your case, your ada1 disk has serious CRC-related
problems, and your ada0 disk has seen similar ones, just at a much lower rate.
ada1 should probably be replaced (along with cables, dusting out SATA
ports, etc.), but keeping ada0 is probably fine.  The statistics for
these are shown in the "smartctl -l sataphy" output, field labelled ID
0x0001, "Command failed due to ICRC error".  These are SATA-level
problems or physical problems which will manifest themselves as
anomalies during any kind of I/O.

The counters shown in ID 0x000a and 0x0009 are completely fine; these
don't indicate any problems.
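
For comparing these counters between runs, a small parser can help. The
sketch below assumes the ID / Size / Value / Description column layout
that "smartctl -l sataphy" prints; the sample text uses the ada1 figures
from this thread, read as size 2 and value 65535+ (my interpretation of
the archived output). A trailing "+" is treated as a saturated counter.

```python
import re

# One counter row: hex ID, size in bytes, value (optionally with "+"), text.
ROW = re.compile(r"^(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)(\+?)\s+(.*)$")

ICRC_ID = 0x0001  # "Command failed due to ICRC error"


def parse_sataphy(text):
    """Map counter ID -> (value, saturated, description)."""
    counters = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            ident, _size, value, plus, desc = match.groups()
            counters[int(ident, 16)] = (int(value), bool(plus), desc)
    return counters


SAMPLE = """\
ID      Size  Value   Description
0x000a  2     155     Device-to-host register FISes sent due to a COMRESET
0x0001  2     65535+  Command failed due to ICRC error
0x0009  2     178     Transition from drive PhyRdy to drive PhyNRdy
"""

if __name__ == "__main__":
    value, saturated, desc = parse_sataphy(SAMPLE)[ICRC_ID]
    print("%s: %d%s" % (desc, value, " (saturated)" if saturated else ""))
```

A saturated ICRC counter means the real count is at least that high, so
only the difference between two non-saturated readings is meaningful.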

Your drives don't support GP log region 0x04, which is why "smartctl -l
devstat" returns the errors it does.  The errors you see coming from the
kernel in this situation are 100% okay/acceptable; the drive itself is
literally returning ABRT status to the inquiry submitted to it.  Different
drives from different vendors behave differently in this regard.

So, what I'm trying to say is, your problem looks different than the
OP's.  Let's not start a big "I have this problem too" thread; that has
happened so many times over the years that when it happens I immediately
bow out + stop participating in the thread.

> smartctl -l sataphy /dev/ada0
> 
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size  Value   Description
> 0x000a  2     150     Device-to-host register FISes sent due to a COMRESET
> 0x0001  2     3       Command failed due to ICRC error
> 0x0009  2     173     Transition from drive PhyRdy to drive PhyNRdy
> 
> smartctl -l sataphy /dev/ada1
> 
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size  Value   Description
> 0x000a  2     155     Device-to-host register FISes sent due to a COMRESET
> 0x0001  2     65535+  Command failed due to ICRC error
> 0x0009  2     178     Transition from drive PhyRdy to drive PhyNRdy

-- 
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |



Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Claudius Herder

Hello,

I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still
persists on FreeBSD 9.0 release.

Switching from ahci to ataahci resolved the problem for me too.

I'm using gmirror for swap, system is on a zpool and the problem first
occurred during a zpool scrub, but it is easily reproducible with dd.

The timeouts only occur when writing to disks, dd if=/dev/ada{0|1}
of=/dev/null is not an issue.
Sometimes I need to power off the server because after a reboot one disk
is still missing.

I really would like to help in this issue, so let me know if you need
any more information.

--
Claudius

dmesg:
--cut--
Jan 14 01:33:57 server kernel: ahcich0: Timeout on slot 7 port 0
Jan 14 01:33:57 server kernel: ahcich0: is  cs 0080 ss
 rs 0080 tfd c0 serr  cmd 0004c717
Jan 14 01:33:57 server kernel: ahcich1: Timeout on slot 31 port 0
Jan 14 01:33:57 server kernel: ahcich1: is  cs 8000 ss
 rs 8000 tfd c0 serr  cmd 0004df17
Jan 14 01:33:57 server kernel: ahcich0: Timeout on slot 7 port 0
Jan 14 01:33:57 server kernel: ahcich0: is  cs f800 ss
ff80 rs ff80 tfd c0 serr  cmd 0004cb17
Jan 14 01:33:57 server kernel: ahcich1: Timeout on slot 31 port 0
Jan 14 01:33:57 server kernel: ahcich1: is  cs 00f8 ss
80ff rs 80ff tfd c0 serr  cmd 0004c317
Jan 14 01:33:57 server kernel: ahcich0: Timeout on slot 23 port 0
Jan 14 01:33:57 server kernel: ahcich0: is  cs 0180 ss
 rs 0180 tfd c0 serr  cmd 0004d717
Jan 14 01:33:57 server kernel: ahcich1: Timeout on slot 15 port 0
Jan 14 01:33:57 server kernel: ahcich1: is  cs 00018000 ss
 rs 00018000 tfd c0 serr  cmd 0004cf17
Jan 14 01:33:57 server kernel: ahcich1: Timeout on slot 17 port 0
Jan 14 01:33:57 server kernel: ahcich1: is  cs 01f8 ss
01fe rs 01fe tfd c0 serr  cmd 0004d317
Jan 14 01:33:57 server kernel: ahcich0: AHCI reset: device not ready
after 31000ms (tfd = 0080)
Jan 14 01:33:57 server kernel: ahcich1: Timeout on slot 31 port 0
Jan 14 01:33:57 server kernel: ahcich1: is  cs 8000 ss
 rs 8000 tfd c0 serr  cmd 0004df17
Jan 14 01:33:57 server kernel: ahcich0: Timeout on slot 24 port 0
--cut--
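
To see at a glance which channels are affected in a log like the one
above, the timeouts can be tallied per channel. This is a minimal sketch
that only relies on the "ahcichN: Timeout on slot S" wording visible in
the messages in this thread (the " port P" suffix is treated as
optional); the sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

TIMEOUT = re.compile(r"(ahcich\d+): Timeout on slot (\d+)(?: port (\d+))?")


def count_timeouts(log_text):
    """Count 'Timeout on slot' messages per AHCI channel."""
    counts = Counter()
    for match in TIMEOUT.finditer(log_text):
        counts[match.group(1)] += 1
    return counts


SAMPLE = """\
Jan 14 01:33:57 server kernel: ahcich0: Timeout on slot 7 port 0
Jan 14 01:33:57 server kernel: ahcich1: Timeout on slot 31 port 0
Jan 14 01:33:57 server kernel: ahcich1: Timeout on slot 15 port 0
"""

if __name__ == "__main__":
    for channel, count in sorted(count_timeouts(SAMPLE).items()):
        print(channel, count)
```

Feeding it the contents of /var/log/messages would give totals per
channel over a longer period.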

smartctl -a /dev/ada0
smartctl 5.42 2011-10-20 r3458 [FreeBSD 9.0-RELEASE amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F1 DT
Device Model:     SAMSUNG HD753LJ
Serial Number:    S13UJDWS900110
LU WWN Device Id: 5 0024e9 0020d1bfa
Firmware Version: 1AA01118
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Tue Feb 14 16:32:58 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status:  (   0) The previous self-test routine
completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection:( 9429) seconds.
Offline data collection
capabilities:(0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off 
support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:(0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:(0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time:(   2) minutes.
Extended self-test routine
recommended polling time:( 158) minutes.
Conveyance self-test routine
recommended polling time:(  17) minutes.
SCT capabilities:  (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
 

Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Victor Balada Diaz
On Tue, Feb 14, 2012 at 06:16:01AM -0800, Jeremy Chadwick wrote:

[..]

> 
> Thanks.  Both your drives look overall fine, sort-of.  I'll outline my
> concern points, and ask for some more info:
> 
> * ada0 has 28 CRC errors, while ada1 has 2.  These drives have been in
> use for 4688 hours and 4583 hours (respectively), which is roughly 6
> months for each drive.  CRC errors usually result in transparent
> retransmits, but this can sometimes cause I/O delays (especially if the
> CRC errors are repeated).
> 
> If the timeout messages recur in the future, please run the commands I
> gave you above once more and provide the output.  I can then compare the
> old to the new and see if there is anything of interest.
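
The "roughly 6 months" figure quoted above can be checked with a quick
calculation. The sketch assumes the drives were powered on continuously
and uses an average month of 30.44 days; the function name is made up.

```python
def hours_to_months(power_on_hours, days_per_month=30.44):
    """Convert SMART power-on hours to months of continuous operation."""
    return power_on_hours / 24.0 / days_per_month


if __name__ == "__main__":
    for disk, hours in (("ada0", 4688), ("ada1", 4583)):
        print("%s: %d hours is about %.1f months"
              % (disk, hours, hours_to_months(hours)))
```

Both drives come out a little over six months, which matches the
estimate in the quoted text.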

I can force the error whenever I want. It's 100% reproducible in my
environment, so I'll do the tests and send you smartctl -a output again.

> 
> * Both drives had 2 long tests run on them a few days ago ("Extended
> offline" tests).  Did you induce these manually?  If so, were these
> tests running at the time you witnessed AHCI timeout errors on ada0?
> Short, long, and selective surface scan tests are supposed to be
> non-intrusive, but given the nature of the tests sometimes they can
> stall the I/O subsystem.

I ran the tests, but they were not running during the timeout problems.
The only thing running on the disks was a newfs -J on a gjournal partition.
Otherwise, the disks are mostly idle.

> 
> If you do tests of this nature, you should write down the exact
> dates/times when you ran them (at least from now on).
> 
> If you didn't induce these, something must have, or possibly the drive
> itself did it (and if that's the case, convenient that it induces an
> entry in the self-test log!).
> 
> I do have some familiarity with drives doing internal tests -- the best
> example are old IBM Deskstar drives executing ADM on their own,
> resulting in the drives spinning down and performing internal tests,
> which would subsequently be interrupted by ATA I/O, drive spins back up,
> etc. -- but took too long resulting in ATA timeouts on FreeBSD and
> Linux.  I mailed IBM about this back in 2000 and got confirmation of the
> feature (which was also on their SCSI drives but defaulted to off); the
> feature was mysteriously removed in future drive models and still
> remains gone today:
> 
> http://jdc.parodius.com/freebsd/ibm_email_aware_of_adm.txt
> 
> I'm not saying your drives do this.  I'm simply saying that if there is
> some form of automated test that runs on these drives which is
> transparent to the underlying ATA layer, then there is really nothing
> you can do about it, and timeouts are possible.  The IBM ADM issue was
> only discovered after reviewing technical specifications/documentation
> and compared to their SCSI drives.

That's of course possible, but as the problem is 100% reproducible with
the AHCI driver and not with the ata driver, I guess this time it is not
the drive's fault.

We've also tested replacement disks and cables during the previous days. I
guess the problem is some bad interaction with the AHCI driver.

> 
> * Samsung has a notoriously bad reputation for firmware reliability on
> their SpinPoint drives, but I haven't read of anything bad about the F2
> series, just the F1, F3, and F4 models.  I have very little (almost
> none) experience with these drives.  I'm not boycotting their products,
> but I wouldn't be surprised if the timeout errors you saw were caused by
> something internal the drive was doing.  There is absolutely zero
> visibility into this kind of problem on any layer (even if you had an
> ATA protocol analyser hooked up); you're completely at the mercy of the
> firmware.  Just something to keep in mind when working with ANY kind of
> disk (MHDD, SSD, etc.).

I've seen reports on the FreeBSD lists and the smartmontools wiki about
firmware problems with F4 drives manufactured before December 2010, but
according to Samsung's web page, it seems these drives are not affected.
I hope we're not hitting a new bug. More info:

http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

> 
> All that said, could you please provide output from the following
> commands as well?  These may return "not supported" errors, which is
> acceptable, but we have to check.
> 
> * smartctl -l devstat /dev/ada0
> * smartctl -l sataphy /dev/ada0
> * smartctl -l devstat /dev/ada1
> * smartctl -l sataphy /dev/ada1
> 
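
When the same logs have to be collected from several disks, the command
list requested above can be generated rather than typed. This sketch
only builds the argument lists for the smartctl invocations named in the
quoted text; the function name is hypothetical.

```python
from itertools import product


def smartctl_log_commands(devices, logs=("devstat", "sataphy")):
    """Return one 'smartctl -l LOG DEVICE' argument list per device and log."""
    return [["smartctl", "-l", log, device]
            for device, log in product(devices, logs)]


if __name__ == "__main__":
    for argv in smartctl_log_commands(["/dev/ada0", "/dev/ada1"]):
        print(" ".join(argv))
```

Each list can be passed to subprocess.run() as root to capture the
output, keeping in mind that "not supported" errors are expected on
drives without the corresponding log page.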

Thanks a lot for your help, Jeremy. Attached is the output of the commands:

fe09# smartctl -l devstat /dev/ada0
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-STABLE amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

(pass0:ahcich0:0:0:0): READ_LOG_EXT. ACB: 2f 00 04 00 00 40 00 00 00 00 01 00
(pass0:ahcich0:0:0:0): CAM status: ATA Status Error
(pass0:ahcich0:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
(pass0:ahcich0:0:0:0): RES: 51 04 04 00 00 40 00 00 00 01 00
ATA_READ_LOG_EXT (addr=0x04:0x00, page=0, n=1) failed: Unknown

Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Jeremy Chadwick
On Tue, Feb 14, 2012 at 02:54:35PM +0100, Victor Balada Diaz wrote:
> On Tue, Feb 14, 2012 at 02:05:13AM -0800, Jeremy Chadwick wrote:
> > On Tue, Feb 14, 2012 at 10:19:09AM +0100, Victor Balada Diaz wrote:
> > > We're having some troubles with AHCI under FreeBSD 8.2 and 8-STABLE. The 
> > > error is:
> > > 
> > > ahcich0: Timeout on slot 8
> > > ahcich0: is  cs 0100 ss  rs 0100 tfd c0 serr 
> > > 
> > > ahcich0: AHCI reset...
> > > ahcich0: SATA connect time=0ms status=0123
> > > ahcich0: ready wait time=18ms
> > > ahcich0: AHCI reset done: device found
> > > (ada0:ahcich0:0:0:0): Request requeued
> > > (ada0:ahcich0:0:0:0): Retrying command
> > > (ada0:ahcich0:0:0:0): Command timed out
> > > (ada0:ahcich0:0:0:0): Retrying command
> > > ahcich0: Timeout on slot 8
> > > ahcich0: is  cs 007ff000 ss 007fff00 rs 007fff00 tfd c0 serr 
> > > 
> > > ahcich0: AHCI reset...
> > > ahcich0: SATA connect time=0ms status=0123
> > > ahcich0: ready wait time=84ms
> > > ahcich0: AHCI reset done: device found
> > > (ada0:ahcich0:0:0:0): Request requeued
> > > (ada0:ahcich0:0:0:0): Retrying command
> > > (ada0:ahcich0:0:0:0): Command timed out
> > > (ada0:ahcich0:0:0:0): Retrying command
> > > (ada0:ahcich0:0:0:0): Request requeued
> > > [...]
> > > 
> > > If we use old ATA driver we have no problems. If we just use the first 
> > > disk (ada0) with ahci,
> > > no problems either. If we use both disks (ada0 and ada1) in gmirror setup 
> > > with ahci, we
> > > got the above error. If we use both disks in gmirror with old ata driver, 
> > > no problems.
> > 
> > Please provide SMART statistics for both disks by installing
> > ports/sysutils/smartmontools (5.42 or newer please) and running
> > "smartctl -a" against both disks (ada0/ada1, or ad4/ad10 -- doesn't
> > matter which driver you're using).  I will review the output.
> 
> I forgot to say that from time to time, after the system hangs and I need
> to reboot, one of the disks is lost. It doesn't show up even after a few
> reboots, nor on a Linux live system.
> 
> You can see smartctl output here:
>
> ada0:
> 
> # smartctl -a /dev/ada0
> smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-STABLE amd64] (local build)
> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
> 
> === START OF INFORMATION SECTION ===
> Model Family:     SAMSUNG SpinPoint F2 EG
> Device Model:     SAMSUNG HD154UI
> Serial Number:    S24EJ9BB200080
> LU WWN Device Id: 5 0024e9 2047cb78f
> Firmware Version: 1AG01118
> User Capacity:    1,500,301,910,016 bytes [1.50 TB]
> Sector Size:      512 bytes logical/physical
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 3b
> Local Time is:    Tue Feb 14 13:51:18 2012 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x00) Offline data collection activity
> was never started.
> Auto Offline Data Collection: 
> Disabled.
> Self-test execution status:  (   0) The previous self-test routine 
> completed
> without error or no self-test has 
> ever 
> been run.
> Total time to complete Offline 
> data collection:(18863) seconds.
> Offline data collection
> capabilities:(0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off 
> support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities:(0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability:(0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine 
> recommended polling time:(   2) minutes.
> Extended self-test routine
> recommended polling time:( 255) minutes.
> Conveyance self-test routine
> recommended polling time:(  33) minutes.
> SCT capabilities:  (0x003f) SCT Status supported.
> SCT Error Recovery Control supported.
> SCT Feature Control supported.
>

Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Victor Balada Diaz
On Tue, Feb 14, 2012 at 02:05:13AM -0800, Jeremy Chadwick wrote:
> On Tue, Feb 14, 2012 at 10:19:09AM +0100, Victor Balada Diaz wrote:
> > We're having some troubles with AHCI under FreeBSD 8.2 and 8-STABLE. The 
> > error is:
> > 
> > ahcich0: Timeout on slot 8
> > ahcich0: is  cs 0100 ss  rs 0100 tfd c0 serr 
> > 
> > ahcich0: AHCI reset...
> > ahcich0: SATA connect time=0ms status=0123
> > ahcich0: ready wait time=18ms
> > ahcich0: AHCI reset done: device found
> > (ada0:ahcich0:0:0:0): Request requeued
> > (ada0:ahcich0:0:0:0): Retrying command
> > (ada0:ahcich0:0:0:0): Command timed out
> > (ada0:ahcich0:0:0:0): Retrying command
> > ahcich0: Timeout on slot 8
> > ahcich0: is  cs 007ff000 ss 007fff00 rs 007fff00 tfd c0 serr 
> > 
> > ahcich0: AHCI reset...
> > ahcich0: SATA connect time=0ms status=0123
> > ahcich0: ready wait time=84ms
> > ahcich0: AHCI reset done: device found
> > (ada0:ahcich0:0:0:0): Request requeued
> > (ada0:ahcich0:0:0:0): Retrying command
> > (ada0:ahcich0:0:0:0): Command timed out
> > (ada0:ahcich0:0:0:0): Retrying command
> > (ada0:ahcich0:0:0:0): Request requeued
> > [...]
> > 
> > If we use old ATA driver we have no problems. If we just use the first disk 
> > (ada0) with ahci,
> > no problems either. If we use both disks (ada0 and ada1) in gmirror setup 
> > with ahci, we
> > got the above error. If we use both disks in gmirror with old ata driver, 
> > no problems.
> 
> Please provide SMART statistics for both disks by installing
> ports/sysutils/smartmontools (5.42 or newer please) and running
> "smartctl -a" against both disks (ada0/ada1, or ad4/ad10 -- doesn't
> matter which driver you're using).  I will review the output.

I forgot to say that from time to time, after the system hangs and I need
to reboot, one of the disks is lost. It doesn't show up even after a few
reboots, nor on a Linux live system.

You can see smartctl output here:

ada0:

# smartctl -a /dev/ada0
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-STABLE amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F2 EG
Device Model:     SAMSUNG HD154UI
Serial Number:    S24EJ9BB200080
LU WWN Device Id: 5 0024e9 2047cb78f
Firmware Version: 1AG01118
User Capacity:    1,500,301,910,016 bytes [1.50 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Tue Feb 14 13:51:18 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (18863) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off
                                        support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  33) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f

Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Jeremy Chadwick
On Tue, Feb 14, 2012 at 10:19:09AM +0100, Victor Balada Diaz wrote:
> We're having some troubles with AHCI under FreeBSD 8.2 and 8-STABLE.
> The error is:
> 
> ahcich0: Timeout on slot 8
> ahcich0: is  cs 0100 ss  rs 0100 tfd c0 serr 
> ahcich0: AHCI reset...
> ahcich0: SATA connect time=0ms status=0123
> ahcich0: ready wait time=18ms
> ahcich0: AHCI reset done: device found
> (ada0:ahcich0:0:0:0): Request requeued
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): Command timed out
> (ada0:ahcich0:0:0:0): Retrying command
> ahcich0: Timeout on slot 8
> ahcich0: is  cs 007ff000 ss 007fff00 rs 007fff00 tfd c0 serr 
> ahcich0: AHCI reset...
> ahcich0: SATA connect time=0ms status=0123
> ahcich0: ready wait time=84ms
> ahcich0: AHCI reset done: device found
> (ada0:ahcich0:0:0:0): Request requeued
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): Command timed out
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): Request requeued
> [...]
> 
> If we use old ATA driver we have no problems. If we just use the first
> disk (ada0) with ahci, no problems either. If we use both disks (ada0
> and ada1) in gmirror setup with ahci, we got the above error. If we use
> both disks in gmirror with old ata driver, no problems.

Please provide SMART statistics for both disks by installing
ports/sysutils/smartmontools (5.42 or newer please) and running
"smartctl -a" against both disks (ada0/ada1, or ad4/ad10 -- doesn't
matter which driver you're using).  I will review the output.

-- 
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: problems with AHCI on FreeBSD 8.2

2012-02-14 Thread Alexander Motin

On 02/14/12 11:19, Victor Balada Diaz wrote:

> We're having some troubles with AHCI under FreeBSD 8.2 and 8-STABLE.
> The error is:
>
> ahcich0: Timeout on slot 8
> ahcich0: is  cs 0100 ss  rs 0100 tfd c0 serr 
> ahcich0: AHCI reset...
> ahcich0: SATA connect time=0ms status=0123
> ahcich0: ready wait time=18ms
> ahcich0: AHCI reset done: device found
> (ada0:ahcich0:0:0:0): Request requeued
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): Command timed out
> (ada0:ahcich0:0:0:0): Retrying command
> ahcich0: Timeout on slot 8
> ahcich0: is  cs 007ff000 ss 007fff00 rs 007fff00 tfd c0 serr 
> ahcich0: AHCI reset...
> ahcich0: SATA connect time=0ms status=0123
> ahcich0: ready wait time=84ms
> ahcich0: AHCI reset done: device found
> (ada0:ahcich0:0:0:0): Request requeued
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): Command timed out
> (ada0:ahcich0:0:0:0): Retrying command
> (ada0:ahcich0:0:0:0): Request requeued
> [...]
>
> If we use old ATA driver we have no problems. If we just use the first
> disk (ada0) with ahci, no problems either. If we use both disks (ada0
> and ada1) in gmirror setup with ahci, we got the above error. If we use
> both disks in gmirror with old ata driver, no problems.


In both cases the controller reports the command status as 0xc0, which
means the device is busy with the command. For NCQ commands it means that
the device is in the stage of processing the command itself, not head
positioning or data transfer. Enabling AHCI enables NCQ for the devices.
That increases the load on both the devices and the controller, and it is
difficult to say whose fault it is. SAMSUNG HD154UI disks AFAIR have 4k
sectors, which may incur big performance penalties when accessing
small/misaligned data. I am not sure how big that penalty can be in the
worst case, especially since disks cache writes by default, hiding the
real load level. The relation to gmirror is harder to explain. Depending
on how you created it and its partitions, it could cause more misaligned
I/Os during rebuild. Using gmirror also doubles the concurrent load on the
controller, but at this point I have nothing to blame it for.
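
To make the 0xc0 status concrete, here is a minimal sketch that decodes
the "tfd" byte from the log lines, assuming it follows the classic ATA
status register layout (BSY, DRDY, DF, DRQ, ERR). The table and function
names are hypothetical, and bit 4 is left out because its meaning depends
on the command set in use.

```python
# Classic ATA status register bits. This layout is an assumption about how
# to read the "tfd" byte printed by the driver; bit 4 is omitted on purpose
# because its meaning is command-dependent.
ATA_STATUS_BITS = {
    0x80: "BSY",   # device is busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x08: "DRQ",   # data request
    0x01: "ERR",   # error
}


def decode_tfd_status(tfd):
    """Return the names of the status bits set in the low byte of tfd."""
    status = tfd & 0xFF
    return [name for mask, name in ATA_STATUS_BITS.items() if status & mask]


if __name__ == "__main__":
    # Value taken from the "tfd c0" field in the timeout messages.
    print(decode_tfd_status(0xC0))
```

For the value in the log, 0xc0 decodes to BSY and DRDY, which matches a
device that is still busy with the command when the timeout fires.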


--
Alexander Motin