Hi,

here's how I managed to recover from a BTRFS replace panic, which
happened even on kernel 4.8.4.

The kernel didn't seem to handle our raid10 filesystem with a missing
device correctly (even though the filesystem passed a precautionary
scrub before we removed the device):
- replace didn't work and triggered a kernel panic,
- we saw PostgreSQL corruption (duplicate entries in indexes and write
errors) in database clusters using NoCoW as well as in those using CoW
(we run several clusters on this filesystem and configure them
differently based on our needs).

What finally worked was adding devices to the filesystem, balancing
(which relocated the data allocated to the missing device), and then
deleting the missing device. I added skip_balance in fstab in case
balance triggered a panic like replace did.
I didn't dare delete without balancing first, as I couldn't get
confirmation that skip_balance would keep the balance triggered by
delete from restarting at mount time (which could mean a panic each
time we tried to mount the filesystem). In the end it seems that
balancing before deleting does the same work: balance correctly detects
that it shouldn't use the missing device and reallocates all data
properly.
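
For anyone wanting to script the sequence described above, here is a
rough Python sketch that only builds and prints the commands (a dry
run; nothing is executed). The mount point and device paths are
placeholders, not the ones from our system, and the function name is
made up for illustration. It assumes the filesystem is already mounted
degraded with skip_balance, and it may not match the exact options used
during the actual recovery.

```python
import shlex


def recovery_plan(mountpoint, new_devices):
    """Return the add / balance / delete-missing commands, in order.

    Hypothetical helper: it only builds argument lists, it never runs them.
    """
    if not new_devices:
        raise ValueError("at least one new device is needed")
    return [
        # 1. Add the new device(s) to the mounted filesystem.
        ["btrfs", "device", "add", *new_devices, mountpoint],
        # 2. Balance, so data is reallocated away from the missing device.
        ["btrfs", "balance", "start", mountpoint],
        # 3. Only then drop the missing device from the filesystem.
        ["btrfs", "device", "delete", "missing", mountpoint],
    ]


if __name__ == "__main__":
    # Placeholder paths; replace with your own before using any of this.
    for cmd in recovery_plan("/mnt/data", ["/dev/sdx", "/dev/sdy"]):
        print(shlex.join(cmd))
```

Printing the commands first makes it possible to review each step (and
check it against the btrfs-progs version in use) before running
anything by hand.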

The sad result is that we are currently forced to check/restore most of
the data just because we had to replace a single disk: clearly BTRFS
can't handle itself properly until the missing device is completely
removed. That's not what I expected when using raid10 :-(

Best regards,

Lionel