On 11/30/2015 11:09 PM, Chris Murphy wrote:
On Mon, Nov 30, 2015 at 1:37 PM, Austin S Hemmelgarn
<ahferro...@gmail.com> wrote:

I've had multiple cases of disks that got one write error and then were fine
for more than a year before any further issues.  My thought is to add an
option to retry that single write after a short delay (1-2s maybe), and if it
still fails, then mark the disk as failed.
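
For illustration only, that retry-once idea might look something like the
sketch below; every helper name here (submit_write(), short_delay(),
mark_device_failed(), the struct types) is made up, not an existing btrfs
function:

    int write_with_single_retry(struct hypothetical_dev *dev,
                                struct hypothetical_req *req)
    {
            if (submit_write(dev, req) == 0)
                    return 0;

            /* A single failed write alone shouldn't kick the device;
             * wait briefly (1-2s, per the suggestion above) and retry
             * the same write once. */
            short_delay();
            if (submit_write(dev, req) == 0)
                    return 0;

            /* Still failing after the retry: treat the device as bad. */
            mark_device_failed(dev);
            return -1;
    }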
Seems reasonable.
I think I added this to the Project Ideas page on the wiki a *very* long time ago
https://btrfs.wiki.kernel.org/index.php/Project_ideas#False_alarm_on_bad_disk_-_rebuild_mitigation

"After a device is marked as unreliable, maintain the device within the FS in order to confirm the issue persists. The device will still contribute toward fs performance but will not be treated as if contributing towards replication/reliability. If the device shows that the given errors were a once-off issue then the device can be marked as reliable once again. This will mitigate further unnecessary rebalance. See http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/ - "[Drive Resurrection]" as an example of where this is a significant feature for storage vendors."

Agreed. Maybe it could be an error-rate threshold (set as a ratio)?

I was thinking of either:
a. A running count, using the current error-counting mechanisms, with some
maximum number allowed before the device gets kicked.
b. A count that decays over time; this would need two tunables (how long an
error is counted, and how many are allowed). A rough sketch of this follows.
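
A sketch of option (b) in plain C, with entirely hypothetical names and the
two tunables as compile-time constants for simplicity (presumably they would
end up as mount options or sysfs knobs):

    #include <stdbool.h>
    #include <stdint.h>

    #define ERR_WINDOW_SECS  (7 * 24 * 60 * 60)  /* how long an error counts */
    #define ERR_MAX          8                   /* how many are allowed */

    struct dev_err_state {
            uint64_t err_times[ERR_MAX];  /* timestamps of recent errors */
            int nr_errs;
    };

    /* Record one new error at time 'now' (seconds); returns true if the
     * device has accumulated too many recent errors and should be kicked. */
    static bool record_error(struct dev_err_state *s, uint64_t now)
    {
            int i, kept = 0;

            /* The "decay": forget errors older than the window. */
            for (i = 0; i < s->nr_errs; i++)
                    if (now - s->err_times[i] < ERR_WINDOW_SECS)
                            s->err_times[kept++] = s->err_times[i];
            s->nr_errs = kept;

            if (s->nr_errs < ERR_MAX)
                    s->err_times[s->nr_errs++] = now;

            return s->nr_errs >= ERR_MAX;
    }

Option (a) would be roughly the same thing with the decay loop dropped, so
the count only ever grows until the maximum is hit.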

OK.

--
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
