On 2016-07-06 12:43, Chris Murphy wrote:
On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarn
<ahferro...@gmail.com> wrote:
On 2016-07-05 19:05, Chris Murphy wrote:

Related:
http://www.spinics.net/lists/raid/msg52880.html

Looks like there is some traction to figuring out what to do about
this, whether it's a udev rule or something that happens in the kernel
itself. Pretty much the only hardware setups unaffected by this are
those with enterprise or NAS drives. Every configuration using consumer
drives, whether single, linear/concat, or any software (mdadm, lvm,
Btrfs) RAID level, is adversely affected by this.

The thing I don't get about this is that while the per-device settings on a
given system are policy, the default value is not; it should be expected to
work correctly (though not necessarily optimally) on as many systems as
possible. So any claim that this should be fixed in udev is bogus by the
usual kernel rules.

Sure. But changing it in the kernel leads to what other consequences?
It fixes the problem under discussion but what problem will it
introduce? I think it's valid to explore this, at the least so
affected parties can be informed.

Also, the problem isn't instigated by Linux, rather by drive
manufacturers introducing a whole new kind of error recovery, with an
order of magnitude longer recovery time. Most hardware in the field now
probably consists of such drives. Even SSDs like my Samsung 840 EVO that
support SCT ERC have it disabled, therefore the top end recovery time
is undiscoverable in the device itself. Maybe it's buried in a spec.

So does it make sense to just set the default to 180? Or is there a
smarter way to do this? I don't know.
Just thinking about this:
1. People who are setting this somewhere will be functionally unaffected.
2. People using single disks which have lots of errors may or may not see an apparent degradation of performance, but will likely have the life expectancy of their device extended.
3. Individuals who are not setting this but should be will on average be no worse off than before, other than seeing a bigger performance hit on a disk error.
4. People with single disks which are new will see no functional change until the disk has an error.

In an ideal situation, what I'd want to see is:
1. If the device supports SCT ERC, set the SCSI command timer to a reasonable percentage over that (probably something like 25%, which would give roughly 10 seconds for the normal 7 second ERC timer).
2. If the device is actually a SCSI device, keep the 30 second timer (IIRC, this is reasonable for SCSI disks).
3. Otherwise, set the timer to 200 (we need a slight buffer over the expected disk timeout to account for things like latency outside of the disk).
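
Roughly, something like this sketch is what I have in mind (just an
illustration, nothing like it exists in the kernel or udev today; the 25%
margin, the 200 second fallback, and the use of smartctl/lsblk output to
classify devices are all assumptions picked for the example, and it would
need root plus smartmontools/util-linux installed):

#!/usr/bin/env python3
# Hypothetical sketch of the three-case timeout policy described above.
import math
import re
import subprocess
from pathlib import Path

SCSI_DEFAULT = 30      # seconds; today's kernel default, fine for real SCSI/SAS
NO_ERC_FALLBACK = 200  # seconds; buffer over a worst-case ~180 s consumer-drive recovery

def run(*cmd):
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def erc_read_deciseconds(dev):
    """SCT ERC read timer in deciseconds, or None if unsupported or disabled."""
    out = run("smartctl", "-l", "scterc", f"/dev/{dev}")
    m = re.search(r"Read:\s*(\d+)\s*\(", out)   # e.g. "Read:     70 (7.0 seconds)"
    return int(m.group(1)) if m else None

def choose_timeout(dev):
    erc = erc_read_deciseconds(dev)
    if erc:                                     # case 1: ERC enabled -> ERC plus ~25%
        return math.ceil(erc / 10 * 1.25)
    transport = run("lsblk", "-dno", "TRAN", f"/dev/{dev}").strip()
    if transport == "sas":                      # case 2: crude "real SCSI" check, keep 30 s
        return SCSI_DEFAULT
    return NO_ERC_FALLBACK                      # case 3: no usable ERC -> long fallback

if __name__ == "__main__":
    for sysdev in sorted(Path("/sys/block").glob("sd*")):
        t = choose_timeout(sysdev.name)
        (sysdev / "device" / "timeout").write_text(str(t))
        print(f"{sysdev.name}: command timer set to {t} s")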


I suspect, but haven't tested, that ZFS On Linux would be equally
affected, unless they're completely reimplementing their own block
layer (?) So there are quite a few parties now negatively impacted by
the current default behavior.

OTOH, I would not be surprised if the stance there is 'you get no support if
you're not using enterprise drives', not because of the project itself, but
because it's ZFS.  Part of their minimum recommended hardware requirements
is ECC RAM, so it wouldn't surprise me if enterprise storage devices are
there too.

http://open-zfs.org/wiki/Hardware
"Consistent performance requires hard drives that support error
recovery control. "

"Drives that lack such functionality can be expected to have
arbitrarily high limits. Several minutes is not impossible. Drives
with this functionality typically default to 7 seconds. ZFS does not
currently adjust this setting on drives. However, it is advisable to
write a script to set the error recovery time to a low value, such as
0.1 seconds until ZFS is modified to control it. This must be done on
every boot. "

They do not explicitly require enterprise drives, but they clearly
expect SCT ERC to be enabled and set to some sane value.
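
The kind of boot-time script they suggest would look roughly like this
(again just a sketch; I'm assuming smartctl is available and picking 7
seconds, the typical drive default mentioned in the quote, rather than
their 0.1 second suggestion, and simply skipping drives that reject the
command):

#!/usr/bin/env python3
# Sketch of a boot-time script along the lines the OpenZFS wiki suggests:
# set SCT ERC on every drive that supports it.  The 70,70 (7.0 s) value is
# an assumption here; the wiki itself suggests going as low as 0.1 s.
import subprocess
from pathlib import Path

ERC = 70  # deciseconds, i.e. 7.0 seconds for both reads and writes

for dev in sorted(p.name for p in Path("/sys/block").glob("sd*")):
    r = subprocess.run(["smartctl", "-l", f"scterc,{ERC},{ERC}", f"/dev/{dev}"],
                       capture_output=True, text=True)
    if "not supported" in (r.stdout + r.stderr).lower():
        # No SCT ERC: leave it alone and rely on a long kernel command timer instead.
        print(f"{dev}: SCT ERC not supported")
    else:
        print(f"{dev}: SCT ERC set to {ERC / 10:.1f} s")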

At least for Btrfs and ZFS, the mkfs is in a position to know all
parameters for properly setting SCT ERC and the SCSI command timer for
every device. Maybe it could create the udev rule? Single and raid0
profiles need to permit long recoveries, whereas raid1, 5, and 6 need to
set things for very short recoveries.

Possibly mdadm and lvm tools do the same thing.
I"m pretty certain they don't create rules, or even try to check the drive for SCT ERC support. The problem with doing this is that you can't be certain that your underlying device is actually a physical storage device or not, and thus you have to check more than just the SCT ERC commands, and many people (myself included) don't like tools doing things that modify the persistent functioning of their system that the tool itself is not intended to do (and messing with block layer settings falls into that category for a mkfs tool).
