On 10/31/2014 05:15 AM, Zack Coffey wrote:
Sadly I think I understand now.
So by adding the second drive, BTRFS saw it as an extension of the data (a la
JBOD-ish?). Even though I thought I was only adding RAID1 for metadata, I
was also adding to the data storage.
I assume that even though chunk-recover reports healthy chunks, there's
little to no way to actually get them?
Yes.
The chunks are "good" in that they are well defined, but in your case
they point to a place that no longer exists. Sort of like if you took
the card catalog out of a library and then burned down the library. The
catalog is still correct; it just no longer has any books to back it up.
Or more correctly, you bought a second building, moved half of your
books over there, made a complete copy of the card catalog, put that in
the second building... and then burned that second building down. So the
copy of the card catalog is still valid, but half of the books have been
burned.
You are making a couple of problematic assumptions about what terms
mean, and what level of abstractions they involve, that may mess you up
going forward. Here's a "quick" re-primer.
JBOD == Just a Bunch Of Disks. This is just a designation for putting
disks in a computer without any special hardware. That is, when you put
disks in your computer, it's JBOD. It only stops being JBOD when you add
_dedicated_ hardware controllers for things like RAID operation. This
designation puts it in contrast to dedicated storage systems of much
higher complexity that are available from specialty manufacturers, such
as IBM DASD (Direct Access Storage Device), a NAS (network attached
storage) server, or a hardware RAID solution from someone like Sun.
RAID == Redundant Array of Inexpensive Disks. The reason "striping" is
"RAID-0" is that there is no redundancy in that layout. The zero
designation was created after the original RAID-1 through RAID-5 and
before RAID-6.
Pure concatenation was already well known before the whole attempt to
standardize how to think about and implement the more complex layouts.
Pure concatenation is how, for instance, one would zip a bunch of stuff
onto successive floppies. It's also how adding banks of RAM worked
before memory controllers and line-fetch interleaving and all that. It's
the "a longer tape is more storage, a second tape is even more storage" model.
(They didn't make a "RAID minus 1" designation for concatenation as that
was getting absurd).
So every linux system you will ever build that has more than zero disks
(or equivalent slow storage like SSDs) that doesn't have special
dedicated storage processors is a JBOD.
A hardware RAID is typically a dedicated appliance with storage
elements (usually disks, often pricey) that are often matched by size
and transfer dynamics, and often backed by a substantial block of
non-volatile or battery-backed RAM that will survive
reboots/crashes in such a way as to be considered "nonvolatile" over a
reasonable period of time, etc. That is, it's not _Just_ a bunch of
inexpensive disks.
(Disclaimer: arguable statements follow...)
BTRFS is _not_ a RAID at all. Nor is it a storage management system.
BTRFS is a file system that _can_ selectively implement various RAID
layout modes and can operate without a separate storage management system.
So a "real storage management system", such as the Logical Volume
Manager (LVM), does things in layers. In LVM, for instance, to make a
RAID volume I have to adopt the physical storage (the lvm pv* commands),
associate it with its peers (the lvm vg* commands), and then create
logical volumes (the lvm lv* commands).
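As a sketch of those three layers (device names, sizes, and the volume
group name here are placeholders, and all of this needs root and real
block devices):

```shell
# Layer 1: adopt the physical storage (pv* commands)
pvcreate /dev/sda1 /dev/sdb1

# Layer 2: associate the physical volumes with their peers (vg* commands)
vgcreate myvg /dev/sda1 /dev/sdb1

# Layer 3: carve a logical volume out of the group (lv* commands)
lvcreate --name mylv --size 100G myvg

# Only now does a filesystem enter the picture
mkfs.ext4 /dev/myvg/mylv
```

Note the strict ordering: each layer only knows about the layer directly
beneath it.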
In a "real RAID management system", such as with mdadm, I have to match
the partitioning or media sizes and then join them into the semantic
array layouts. That is, I have to design the layout, and pre-match the
storage "with deliberate intent" before bringing the storage into the
mix. For instance if I "make a RAID-5 device" the RAID-ness exists
"before" the storage, at least in concept.
For Example:
  mdadm --create /dev/md23 --level=raid5 --raid-devices=4 \
    /dev/sda /dev/sdb /dev/sdc /dev/sdd
The raid "comes to exist" as /dev/md23
It is given a personality of type raid5
It is given a geometry of four devices
Then that entity is _imposed_ on each of four drives.
Now in practical terms this happens all at once, but in terms of intent
and design it is in a strict order of declaration. And because I did it
all at once I didn't have to specify the size of the array or the sizes
of the chunks of the array. The program got to "peek ahead" at the media
and back-figure the size and such.
Compare this to what you did with BTRFS.
You made a file system on a storage device.
Then you said "here's some more space".
Then you said "hey file system, rearrange yourself to use this space,
and while you are at it, go ahead and spread the metadata around as if
it were a raid."
So the expansion of storage happened first, and separately, in the btrfs
device add activity. The "balance" operation was a declaration of "don't
just own the new space, figure out how best to use it."
You just also applied the metadata filter to say, by the way, I want a
full copy of the metadata on both the old and the new spaces.
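In command form, that sequence was roughly this (the device name and
mount point are placeholders):

```shell
# Step one: hand the filesystem more raw space. Nothing is rearranged yet;
# the filesystem merely "owns" the new device.
btrfs device add /dev/sdb /mnt

# Step two: tell it to rearrange itself across that space, converting
# only the metadata (-m) to the raid1 profile along the way.
btrfs balance start -mconvert=raid1 /mnt
```

The two steps are genuinely separate: you can add a device and never
balance, in which case new allocations simply start using the new space.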
A non-trivial storage layout might have a number of disks, with a volume
manager, an encryption manager, and an array manager, all layered to
create an expanse of storage that a file system could _then_ be placed
atop.
BTRFS is _way_ more flexible than mdadm. And it is way less into fixed
boundaries. It can, for instance, change its mind about how things are
laid out without having to go offline for a protracted period of time.
BTRFS' design philosophy seems built around the idea of being able to
add non-volatile storage into a filesystem "naked" (unpartitioned), or
add partitions of same at will, and have one layer of logic deal with
the whole mess.
So BTRFS' idea of RAID/single layout for metadata and data is not "disk
centric"; it's pure semantics that are _aware_ of storage boundaries.
That's why you can have your metadata at a different RAID level than
your data.
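You can even declare that split at creation time (a sketch; the device
names are placeholders):

```shell
# Metadata mirrored across both devices, data merely concatenated:
# lose one disk and the metadata survives, but half the data does not.
mkfs.btrfs -m raid1 -d single /dev/sdc /dev/sdd
```

Which is, incidentally, almost exactly the layout the original poster
ended up with.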
The idea is that you can take the dedicated layers that exist (such as
dm-crypt or LVM) as you need them to manage space, but then not need to
have the hard boundaries that complicate the semantic layout of the
space if you don't want/need them.
The other systems are still important. For instance (absent hardware
encryption) it's _way_ more efficient to impose a RAID-3, 4, 5, or 6 on
the raw disks, then encrypt that RAID, then put a filesystem on top of
the encryption, than it is to encrypt the multiple drives and then build
those RAIDs above the encryption.
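That efficient layering looks roughly like this (one crypto pass over
the assembled array instead of one per disk; device and mapping names
are placeholders):

```shell
# RAID first, on the raw disks
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
  /dev/sda /dev/sdb /dev/sdc

# Encryption second, one layer over the single assembled array
cryptsetup luksFormat /dev/md0
cryptsetup open /dev/md0 secure_md0

# Filesystem last, on top of the encryption
mkfs.ext4 /dev/mapper/secure_md0
```

Done the other way around (encrypt each disk, then RAID the mappings),
every block of parity math drags three-plus encryption operations with it.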
The TL;DR is that you have to be really careful about the semantic
structures. A lot of the terms and ideas overlap at different layers.
That means that the terms have a lot of slack in their meanings. Like
when people talk about "the network", a lot hinges on what different
people mean by words like "local".
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html