On 11/30/2015 11:09 PM, Chris Murphy wrote:
On Mon, Nov 30, 2015 at 1:37 PM, Austin S Hemmelgarn
<ahferro...@gmail.com> wrote:

I've had multiple cases of disks that got one write error and then were fine
for more than a year before any further issues.  My thought is to add an
option to retry that single write after a short delay (1-2s maybe), and if it
still fails, then mark the disk as failed.
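
For illustration only, that retry-once idea might look something like the
sketch below; every helper name here (submit_write(), short_delay(),
mark_device_failed(), the struct types) is made up, not an existing btrfs
function:

    int write_with_single_retry(struct hypothetical_dev *dev,
                                struct hypothetical_req *req)
    {
            if (submit_write(dev, req) == 0)
                    return 0;

            /* A single failed write alone shouldn't kick the device;
             * wait briefly (1-2s, per the suggestion above) and retry
             * the same write once. */
            short_delay();
            if (submit_write(dev, req) == 0)
                    return 0;

            /* Still failing after the retry: treat the device as bad. */
            mark_device_failed(dev);
            return -1;
    }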
Seems reasonable.
I think I added this to the Project Ideas page on the wiki a *very* long time ago
https://btrfs.wiki.kernel.org/index.php/Project_ideas#False_alarm_on_bad_disk_-_rebuild_mitigation

"After a device is marked as unreliable, maintain the device within the FS in order to confirm the issue persists. The device will still contribute toward fs performance but will not be treated as if contributing towards replication/reliability. If the device shows that the given errors were a once-off issue then the device can be marked as reliable once again. This will mitigate further unnecessary rebalance. See http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/ - "[Drive Resurrection]" as an example of where this is a significant feature for storage vendors."

Agreed. Maybe it could be an error-rate threshold (set as a ratio)?

I was thinking of either:
a. A running count, using the current error-counting mechanisms, with some
maximum number allowed before the device gets kicked.
b. A count that decays over time; this would need two tunables (how long an
error is counted, and how many are allowed). A rough sketch of this follows.
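
A sketch of option (b) in plain C, with entirely hypothetical names and the
two tunables as compile-time constants for simplicity (presumably they would
end up as mount options or sysfs knobs):

    #include <stdbool.h>
    #include <stdint.h>

    #define ERR_WINDOW_SECS  (7 * 24 * 60 * 60)  /* how long an error counts */
    #define ERR_MAX          8                   /* how many are allowed */

    struct dev_err_state {
            uint64_t err_times[ERR_MAX];  /* timestamps of recent errors */
            int nr_errs;
    };

    /* Record one new error at time 'now' (seconds); returns true if the
     * device has accumulated too many recent errors and should be kicked. */
    static bool record_error(struct dev_err_state *s, uint64_t now)
    {
            int i, kept = 0;

            /* The "decay": forget errors older than the window. */
            for (i = 0; i < s->nr_errs; i++)
                    if (now - s->err_times[i] < ERR_WINDOW_SECS)
                            s->err_times[kept++] = s->err_times[i];
            s->nr_errs = kept;

            if (s->nr_errs < ERR_MAX)
                    s->err_times[s->nr_errs++] = now;

            return s->nr_errs >= ERR_MAX;
    }

Option (a) would be roughly the same thing with the decay loop dropped, so
the count only ever grows until the maximum is hit.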

OK.

--
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
