On 2018-06-28 05:15, Qu Wenruo wrote:


On 2018-06-28 16:16, Andrei Borzenkov wrote:
On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:


On 2018-06-28 11:14, r...@georgianit.com wrote:


On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:


Please get yourself clear on what other RAID1 implementations actually do.

A drive failure, where the drive is still there when the computer reboots, is a
situation that *any* RAID 1 (or for that matter RAID 5, RAID 6, anything but
RAID 0) will recover from perfectly without breaking a sweat. Some will rebuild
the array automatically,

WOW, that's black magic, at least for RAID1.
Traditional RAID1 has no idea which copy is correct, unlike btrfs, which
has data checksums.

Never mind the other points; just tell me how to determine which copy is
correct.


When one drive fails, it is recorded in the metadata on the remaining
drives; typically a configuration generation number is increased. The next
time the array is assembled, the drive with the older generation is not
incorporated. Hardware controllers also keep this information in NVRAM and
so do not even depend on scanning the other disks.
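
To make the generation idea concrete, here is a minimal sketch (illustrative
C only, not actual md or controller code; the struct, field, and device names
are invented): at assembly time, any member whose recorded generation lags
the newest one seen is treated as stale and left out.

#include <stdint.h>
#include <stdio.h>

struct member {
    const char *dev;    /* device name, for messages           */
    uint64_t    gen;    /* generation recorded in its metadata */
};

static void assemble(const struct member *m, int n)
{
    uint64_t newest = 0;

    for (int i = 0; i < n; i++)
        if (m[i].gen > newest)
            newest = m[i].gen;

    for (int i = 0; i < n; i++) {
        if (m[i].gen < newest)
            printf("%s: generation %llu < %llu, stale -- not incorporated\n",
                   m[i].dev, (unsigned long long)m[i].gen,
                   (unsigned long long)newest);
        else
            printf("%s: generation %llu, incorporated\n", m[i].dev,
                   (unsigned long long)m[i].gen);
    }
}

int main(void)
{
    /* Hypothetical state after /dev/sdb sat out a few metadata updates. */
    const struct member members[] = {
        { "/dev/sda", 42 },
        { "/dev/sdb", 39 },
    };

    assemble(members, 2);
    return 0;
}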

Yep, the only possible way to detect such a case is from external info.

For device generation it's possible to enhance btrfs, but at least we
could start by detecting the mismatch and refusing to mount read-write, to
avoid possible further corruption.
But anyway, if one really cares about such a case, a hardware RAID
controller seems to be the only solution, as other software may have the
same problem.
LVM doesn't. It detects that one of the devices was gone for some period of time and marks the volume as degraded (and _might_, depending on how you have things configured, automatically re-sync). Not sure about MD, but I am willing to bet it properly detects this type of situation too.

And the hardware solution looks pretty interesting. Is the write to
NVRAM 100% atomic, even at power loss?
On a proper RAID controller, it's battery backed, and that battery backing provides enough power to also make sure that the state change is properly recorded in the event of power loss.


The only possibility is that the misbehaving device missed several
superblock updates, so we have a chance to detect that it's out of date.
But that doesn't always work.


Why should it not work, as long as any write to the array is suspended
until the superblock on the remaining devices is updated?

What happens if there is no generation gap in the device superblocks?

If one device got some of its (nodatacow) data written to disk while the
other device didn't, and neither of them reached the superblock update,
then there is no difference between the device superblocks, and thus no
way to detect which copy is correct.
Yes, but that should be a very small window (at least, once we finally quit serializing writes across devices), and it's a problem on existing RAID1 implementations too (and therefore isn't something we should be using as an excuse for not doing this).
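
To spell out the ordering rule being discussed, here's a rough sketch (my own
illustration, not md/LVM/BTRFS code; all the names and helpers are made up):
when a member drops out, the surviving members' superblocks get bumped and
flushed before any further data writes are allowed, which is what keeps that
window small.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct member {
    const char *dev;
    bool        present;
    uint64_t    sb_gen;      /* generation last flushed to this member */
};

struct array {
    uint64_t       gen;
    int            nr;
    struct member *members;
};

static int write_and_flush_sb(struct member *m, uint64_t gen)
{
    /* Stand-in for a real superblock write plus FLUSH/FUA barrier. */
    m->sb_gen = gen;
    printf("  %s: superblock flushed at generation %llu\n",
           m->dev, (unsigned long long)gen);
    return 0;
}

/* Returns 0 once data writes may resume in degraded mode. */
static int mark_degraded_before_writing(struct array *a, int failed)
{
    a->members[failed].present = false;
    a->gen++;                              /* record the failure event */

    for (int i = 0; i < a->nr; i++) {
        if (!a->members[i].present)
            continue;
        if (write_and_flush_sb(&a->members[i], a->gen))
            return -1;                     /* keep data writes suspended */
    }
    return 0;
}

int main(void)
{
    struct member m[] = {
        { "/dev/sda", true, 42 },
        { "/dev/sdb", true, 42 },
    };
    struct array a = { .gen = 42, .nr = 2, .members = m };

    printf("/dev/sdb failed; suspending data writes\n");
    if (mark_degraded_before_writing(&a, 1) == 0)
        printf("survivors are at generation %llu; data writes resume\n",
               (unsigned long long)a.gen);
    return 0;
}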


If you're talking about the missing generation check in btrfs, that's
valid, but it's far from a "major design flaw", as there are a lot of
cases where other RAID1 implementations (mdraid or LVM mirroring) can also
be affected (the split-brain case).


That's different. Yes, with software-based RAID there is usually no
way to detect an outdated copy if no other copies are present. Having
older valid data is still very different from corrupting newer data.

For the VDI case (or any VM image file format other than raw), though,
older valid data normally means corruption, unless the format has its own
write-ahead log.

Some file formats may detect such problems themselves if they have
internal checksums, but in any case older data normally means corruption,
especially when it's partly new and partly old.
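
As a toy illustration of what such "internal checksum" detection looks like
(the record layout, field names, and checksum here are invented stand-ins,
not what any real image format uses), a record that carries its own checksum
lets a reader notice a mix of old and new bytes even though the RAID layer
returned the block without complaint:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct record {
    uint64_t seq;           /* monotonically increasing sequence number */
    char     payload[16];
    uint32_t csum;          /* checksum over seq + payload              */
};

/* Tiny stand-in checksum (FNV-1a); real formats use CRC32C and the like. */
static uint32_t fnv1a(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t h = 2166136261u;
    while (len--) { h ^= *p++; h *= 16777619u; }
    return h;
}

static void seal(struct record *r)
{
    r->csum = fnv1a(r, offsetof(struct record, csum));
}

static int verify(const struct record *r)
{
    return fnv1a(r, offsetof(struct record, csum)) == r->csum;
}

int main(void)
{
    struct record r = { .seq = 101 };

    memcpy(r.payload, "new data", 9);
    seal(&r);

    /* Simulate the mixed old/new case: the payload comes back with the
     * old contents (as if read from the stale mirror) while seq and
     * csum are the new ones. */
    memcpy(r.payload, "old data", 9);

    printf("record %llu: checksum %s\n", (unsigned long long)r.seq,
           verify(&r) ? "OK" : "BAD (partly old, partly new)");
    return 0;
}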

On the other hand, with data COW and csum, btrfs can ensure that the whole
filesystem update is atomic (at least for a single device).
So the title, especially the "major design flaw" part, couldn't be more
wrong.
The title is excessive, but I'd agree it's a design flaw that BTRFS doesn't at least notice that the generation IDs are different and preferentially trust the device with the newer generation ID. The only special handling I can see that would be needed is around volumes mounted with the `nodatacow` option, which may not see generation changes for a very long time otherwise.
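
To make that suggestion concrete, something along these lines would do it (a
sketch only, not existing btrfs code; the function, enum, and device names
are all hypothetical): compare the generation stored in each device's
superblock at mount time, trust the newest copy, and either refuse a
read-write mount or queue the lagging device for a scrub instead of mixing
its stale blocks back in.

#include <stdint.h>
#include <stdio.h>

struct dev_sb {
    const char *dev;
    uint64_t    generation;   /* as read from that device's superblock */
};

enum mount_action { MOUNT_RW, MOUNT_REFUSE_RW, MOUNT_RW_AND_SCRUB };

static enum mount_action check_generations(const struct dev_sb *d, int n,
                                           int *stale_idx)
{
    int newest = 0, stale = -1;

    for (int i = 1; i < n; i++)
        if (d[i].generation > d[newest].generation)
            newest = i;

    for (int i = 0; i < n; i++)
        if (d[i].generation < d[newest].generation)
            stale = i;

    *stale_idx = stale;
    if (stale < 0)
        return MOUNT_RW;                  /* all copies agree */

    /* Policy choice: conservative (refuse rw) or self-healing (prefer
     * the newer copy and scrub the stale one).  Caveat from above:
     * nodatacow data may not bump generations for a long time, so the
     * size of the gap understates how stale the device may be. */
    return MOUNT_RW_AND_SCRUB;
}

int main(void)
{
    const struct dev_sb devs[] = {
        { "/dev/sda", 31337 },
        { "/dev/sdb", 31290 },            /* missed a few commits */
    };
    int stale;

    switch (check_generations(devs, 2, &stale)) {
    case MOUNT_RW:
        printf("generations match, mounting rw\n");
        break;
    case MOUNT_RW_AND_SCRUB:
        printf("%s is stale, trusting the newer copy and scrubbing it\n",
               devs[stale].dev);
        break;
    case MOUNT_REFUSE_RW:
        printf("generation mismatch, refusing rw mount\n");
        break;
    }
    return 0;
}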


others will automatically kick out the misbehaving drive. *None* of them will
take back the drive with old data and start commingling that data with the
good copy. This behaviour from BTRFS is completely abnormal, and defeats even
the most basic expectations of RAID.

RAID1 can only tolerate one missing device; it has nothing to do with
error detection.
And it's impossible to detect such a case without extra help.

Your expectation is completely wrong.


Well ... somehow it is my experience as well ... :)

Acceptable, but that doesn't really apply to software-based RAID1.

Thanks,
Qu



I'm not the one who has to clear his expectations here.