On 12/1/2015 12:05 PM, Brendan Hide wrote:
On 11/30/2015 11:09 PM, Chris Murphy wrote:
On Mon, Nov 30, 2015 at 1:37 PM, Austin S Hemmelgarn
<ahferro...@gmail.com> wrote:
I've had multiple cases of disks that got one write error and then were
fine for more than a year before any further issues. My thought is to add
an option to retry that single write after some short delay (1-2s maybe),
and if it still fails, then mark the disk as failed.
Seems reasonable.
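
A minimal userspace sketch of the retry-once idea above, purely for illustration. It is not btrfs code: the function name, the `do_write` callback, and the injectable `sleep` parameter are all hypothetical, and the 1s default simply picks a value from the 1-2s range suggested.

```python
import time
from typing import Callable


def write_with_retry(
    do_write: Callable[[], bool],
    retry_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Attempt a write, retry it once after a short delay, then give up.

    do_write is a hypothetical callback returning True on success.
    Returns "ok", "ok_after_retry", or "failed"; on "failed" the caller
    would mark the disk as failed.
    """
    if do_write():
        return "ok"
    # A single write error: wait briefly, then retry that same write once.
    sleep(retry_delay_s)
    if do_write():
        return "ok_after_retry"
    # The retry also failed, so treat the disk as failed.
    return "failed"
```

Passing `sleep` in as a parameter is only there so the delay can be stubbed out when testing; a real implementation would use whatever delay mechanism its environment provides.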
I think I added this to the Project Ideas page on the wiki a *very* long
time ago:
https://btrfs.wiki.kernel.org/index.php/Project_ideas#False_alarm_on_bad_disk_-_rebuild_mitigation
"After a device is marked as unreliable, maintain the device within
the FS in order to confirm the issue persists. The device will still
contribute toward fs performance but will not be treated as if
contributing towards replication/reliability. If the device shows that
the given errors were a once-off issue then the device can be marked
as reliable once again. This will mitigate further unnecessary
rebalance. See
http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/ - "[Drive
Resurrection]" as an example of where this is a significant feature
for storage vendors."
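
The quoted idea can be read as a small state machine. The sketch below is one possible interpretation, not anything from btrfs itself: the class and method names, the `probation_ios` threshold for deciding an error was a once-off, and the separate FAILED state for a persisting issue are all assumptions made for illustration.

```python
from enum import Enum


class DeviceState(Enum):
    RELIABLE = "reliable"
    UNRELIABLE = "unreliable"  # kept in the FS, but on probation
    FAILED = "failed"          # assumed end state if the issue persists


class Device:
    def __init__(self, probation_ios: int = 1000):
        # probation_ios is a hypothetical tunable: how many clean I/Os
        # an unreliable device needs before it is trusted again.
        self.state = DeviceState.RELIABLE
        self.probation_ios = probation_ios
        self.clean_ios = 0

    def usable_for_io(self) -> bool:
        # An unreliable device still contributes toward performance.
        return self.state is not DeviceState.FAILED

    def counts_toward_redundancy(self) -> bool:
        # Only a reliable device counts toward replication/reliability.
        return self.state is DeviceState.RELIABLE

    def record_io(self, ok: bool) -> None:
        if self.state is DeviceState.FAILED:
            return
        if not ok:
            if self.state is DeviceState.RELIABLE:
                self.state = DeviceState.UNRELIABLE
                self.clean_ios = 0
            else:
                # An error while on probation: the issue persists.
                self.state = DeviceState.FAILED
        elif self.state is DeviceState.UNRELIABLE:
            self.clean_ios += 1
            if self.clean_ios >= self.probation_ios:
                # The earlier error looks like a once-off.
                self.state = DeviceState.RELIABLE
```

Whether a second error during probation should fail the device outright, or feed into an error threshold instead, is a policy choice the wiki text leaves open.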
Related, a separate section on that same page mentions a Jeff Mahoney.
Perhaps he should be consulted or his work should be looked into:
Take device with heavy IO errors offline or mark as "unreliable"
"Devices should be taken offline after they reach a given threshold of
IO errors. Jeff Mahoney works on handling EIO errors (among others),
this project can build on top of it."
Agreed. Maybe it would be an error rate (set by ratio)?
I was thinking of either:
a. A running count, using the current error counting mechanisms, with some
   max number allowed before the device gets kicked.
b. A count that decays over time; this would need two tunables (how long an
   error is considered, and how many are allowed).
OK.
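
To make the difference between options (a) and (b) concrete, here is a rough userspace sketch. It does not use the existing btrfs error counters; the class names, the `max_errors` and `window_s` tunables, and the choice of a sliding time window as the form of "decay" are all assumptions for illustration.

```python
from collections import deque


class RunningErrorCount:
    """Option (a): errors accumulate forever; kick once the max is exceeded."""

    def __init__(self, max_errors: int):
        self.max_errors = max_errors
        self.count = 0

    def record_error(self) -> bool:
        """Record one error; return True if the device should be kicked."""
        self.count += 1
        return self.count > self.max_errors


class DecayingErrorCount:
    """Option (b): only errors within the last window_s seconds count."""

    def __init__(self, max_errors: int, window_s: float):
        self.max_errors = max_errors  # tunable: how many are allowed
        self.window_s = window_s      # tunable: how long an error is considered
        self.timestamps = deque()

    def record_error(self, now: float) -> bool:
        """Record an error at time `now`; return True if the device should be kicked."""
        self.timestamps.append(now)
        # Forget errors that are older than the window.
        while now - self.timestamps[0] > self.window_s:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_errors
```

With `max_errors=2`, option (a) kicks the device on its third error no matter how far apart the errors are, while option (b) only does so if three errors land within the same window. That is the case that matters for a disk with one error followed by a clean year.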
--
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97