Re: BTRFS messes up snapshot LV with origin

Robert White Fri, 28 Nov 2014 23:55:37 -0800

On 11/28/2014 08:59 PM, Zygo Blaxell wrote:

On Fri, Nov 28, 2014 at 06:05:48PM +0100, Goffredo Baroncelli wrote:

On 11/27/2014 05:15 AM, Zygo Blaxell wrote:

This is a weakness of the current udev and asynchronous device hotplug
concept:  there is no notion of bus enumeration in progress, so we can be
trying to assemble multi-device storage before we have all the devices
visible.  Assembly of aggregate storage (whatever it is--btrfs, md,
lvm2...) has to wait until all known storage buses are fully enumerated
in order to detect if there are duplicates.


It is more complex than that. Some devices may appear after the "1st" bus
enumeration.


That case is well handled already--a new enumeration will start with the
second (and all later) hotplug events.

The problem arises when we try to assemble disk arrays before the
known end of the "1st" (or any) enumeration.  There is no way for an
enumerating agent to tell other agents "this is definitely not the
complete list of devices yet, other devices may be inserted imminently"
and defer all the multi-device assembly until the address space of the
enumerating bus is fully covered.

mdadm has an "attached" but not "started" state for arrays that handles
this condition during incremental assembly (see "mdadm --incremental
/dev/whatever").

To slightly misuse the vocabulary, as each partition is encountered and
submitted to the system it's checked for a superblock. If one is found
then it has the identity of an array encoded on it and if that array
doesn't exist it is allocated, otherwise the device is added to the
existent array. The array is only started if all the devices are
accounted for unless an option is added to allow earlier starts, and
even then "enough" of the devices must be present to make sense (e.g.
only one device missing from a RAID5, or a correct pair of devices for a
RAID10 etc.)
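A rough sketch of that incremental-assembly bookkeeping, in Python for
brevity. This is not mdadm's code; the superblock is modelled as a plain
dict and the field names ("uuid", "expected", "max_missing") are made up
for illustration. A simple missing-device count also does not capture
the "correct pair" rule for RAID10.

```python
from dataclasses import dataclass, field


@dataclass
class Array:
    """An 'attached' array, which may or may not be 'started' yet."""
    uuid: str
    expected: int       # member count recorded in the (hypothetical) superblock
    max_missing: int    # members that may be absent for a degraded start
    members: set = field(default_factory=set)
    started: bool = False

    def try_start(self, allow_degraded=False):
        missing = self.expected - len(self.members)
        if missing == 0 or (allow_degraded and missing <= self.max_missing):
            self.started = True
        return self.started


arrays = {}  # uuid -> Array, whether started or not


def incremental_add(device, superblock, allow_degraded=False):
    """Handle one newly seen device; superblock is None or a dict."""
    if superblock is None:
        return None  # no superblock, so not an array member
    arr = arrays.get(superblock["uuid"])
    if arr is None:
        # First member seen: allocate the array in the attached state.
        arr = Array(superblock["uuid"], superblock["expected"],
                    superblock["max_missing"])
        arrays[arr.uuid] = arr
    arr.members.add(device)
    arr.try_start(allow_degraded)
    return arr
```

Calling incremental_add() once per hotplug event leaves the array
attached until the last member shows up, or until something calls
try_start(allow_degraded=True) with "enough" members present.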

So we'd need a "partially assembled but not started" state and some
ioctls to do things like force-start or force-disown a filesystem that
cannot be "finished" automatically.

That sort of thing is very easy to do with devices because devices don't
have to be opened and can reject an open attempt, or at least the
read/writes after an open and such.

Unfortunately a filesystem can really only exist as a mounted thing, and
can really only be controlled by remounting thereafter. The most
efficient way to do this would be to have an alternate file system
operations structure that was filled mostly with dummy operations that
would return ENOENT and friends. The remount that finally fulfilled
the file system's requirements would then switch out that struct for the
fully functional one. That remount would need an "adddev=" and some
other such options (much like AUFS adds layers).
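The ops-table swap could look something like the following toy model.
It is Python rather than kernel C, and every name in it (StubOps,
RealOps, PendingFs, the "needed" count) is invented for illustration;
"adddev" is the hypothetical remount option from the paragraph above,
not an existing btrfs option.

```python
import errno


class StubOps:
    """Dummy operations used while the filesystem is not assembled."""

    def listdir(self, path):
        raise OSError(errno.ENOENT, "filesystem not assembled yet")


class RealOps:
    """Stand-in for the fully functional operations table."""

    def __init__(self, devices):
        self.devices = devices

    def listdir(self, path):
        return []  # a real implementation would read from self.devices


class PendingFs:
    def __init__(self, needed):
        self.needed = needed      # assumed: device count the fs requires
        self.devices = []
        self.ops = StubOps()      # starts out with the dummy table

    def remount(self, adddev):
        """Model of a remount carrying the hypothetical 'adddev=' option."""
        if adddev not in self.devices:
            self.devices.append(adddev)
        if len(self.devices) >= self.needed and isinstance(self.ops, StubOps):
            self.ops = RealOps(list(self.devices))  # swap in the real table

    def listdir(self, path):
        return self.ops.listdir(path)
```

Callers only ever go through fs.listdir(), so they see ENOENT right up
until the remount that supplies the last required device.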

It's all doable. But it stretches the "mount" paradigm to near breaking.
You would need an operation that looked like "mount -t btrfs -o
do_we_need_this /dev/whatever /this/datum/means/nothing" to match and
attach a device "wherever it goes" or you might end up needing to do the
Cartesian product of trial attachments of each new device to all active
filesystems to match it up, which is an ugly external scripting requirement.

As for waiting for the address space to be fully covered: meh. If a
ready-or-not, or ready-enough, status is established in the file system
it would be undesirable for it to know anything about any other subsystem.

We don't care if enumeration is "done"; we only care if we have a
rational set of storage, and whether that rational set is "enough" to be
fully ready, enough to be only read-ready, or just plain not enough.
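As a sketch, that three-way decision needs nothing but the filesystem's
own view of its devices. The thresholds here are an assumption for
illustration (a tolerable number of missing devices maps to read-ready),
not a statement of what btrfs does.

```python
from enum import Enum


class Readiness(Enum):
    NOT_ENOUGH = "not enough"
    READ_READY = "read-ready"
    FULLY_READY = "fully ready"


def classify(present, expected, tolerated_missing):
    """Classify a device set using only counts the filesystem itself knows."""
    missing = expected - present
    if missing <= 0:
        return Readiness.FULLY_READY
    if missing <= tolerated_missing:
        return Readiness.READ_READY   # assumed: degraded means read-only
    return Readiness.NOT_ENOUGH
```

Note that nothing here asks udev or the bus whether enumeration has
finished.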


In theory, the idempotent mount command could be

mount -t btrfs some-uuid-instead-of-device /mount/point
mount -t btrfs some-other-uuid-here /other/mount/point

to create the zero-devices involved entity, followed by

mount -t btrfs -o trydev /dev/something /this/bit/is/ignored

repeated for all possible somethings. /mount/point and
/other/mount/point would be returning ENOENT for their contents until
they were ready-enough.
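A toy model of that sequence, again in Python. "trydev" is the
hypothetical option from the example above, and the "needed" count is
passed in by hand here only to keep the sketch short; a real
implementation would presumably learn it from the superblock.

```python
pending = {}  # fs UUID -> {"mountpoint", "needed", "devices"}


def mount_by_uuid(uuid, mountpoint, needed):
    """Create the zero-device entity; repeating the call changes nothing."""
    return pending.setdefault(
        uuid, {"mountpoint": mountpoint, "needed": needed, "devices": set()}
    )


def trydev(device, device_uuid):
    """Attach device to the declared filesystem it belongs to, if any."""
    fs = pending.get(device_uuid)
    if fs is None:
        return None  # no declared filesystem wants this device
    fs["devices"].add(device)  # a set, so repeated attempts are harmless
    return fs


def is_ready(uuid):
    fs = pending[uuid]
    return len(fs["devices"]) >= fs["needed"]
```

Because the lookup is keyed on the UUID read from the device, each
device costs one attempt rather than one attempt per active filesystem.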

In practice this is very impure compared to how mdadm has the /dev/md-namespace in which to build its devices before any actual mount is possible.

