On 11/28/2014 08:59 PM, Zygo Blaxell wrote:
On Fri, Nov 28, 2014 at 06:05:48PM +0100, Goffredo Baroncelli wrote:
On 11/27/2014 05:15 AM, Zygo Blaxell wrote:
This is a weakness of the current udev and asynchronous device hotplug
concept:  there is no notion of bus enumeration in progress, so we can be
trying to assemble multi-device storage before we have all the devices
visible.  Assembly of aggregate storage (whatever it is--btrfs, md,
lvm2...) has to wait until all known storage buses are fully enumerated
in order to detect if there are duplicates.

It is more complex than that. Some devices may appear after the "1st" bus
enumeration.

That case is well handled already--a new enumeration will start with the
second (and all later) hotplug events.

The problem arises when we try to assemble disk arrays before the
known end of the "1st" (or any) enumeration.  There is no way for an
enumerating agent to tell other agents "this is definitely not the
complete list of devices yet, other devices may be inserted imminently"
and defer all the multi-device assembly until the address space of the
enumerating bus is fully covered.

mdadm has an "attached" but not "started" state for arrays that handles this condition during incremental assembly (see "mdadm --incremental /dev/whatever").

To slightly misuse the vocabulary: as each partition is encountered and submitted to the system, it is checked for a superblock. If one is found, it has the identity of an array encoded on it. If that array doesn't exist yet it is allocated; otherwise the device is added to the existing array. The array is started only once all the devices are accounted for, unless an option is added to allow earlier starts, and even then "enough" of the devices must be present to make sense (e.g. only one device missing from a RAID5, or a correct pair of devices for a RAID10).
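The incremental logic described above can be sketched roughly as follows. This is a simplified model for illustration, not mdadm's actual code: the Superblock and Array types, the MAX_MISSING table, and the allow_degraded flag are all hypothetical names, and RAID10 is left out because whether it can start depends on which devices are missing, not just how many.

```python
from dataclasses import dataclass, field


@dataclass
class Superblock:
    """Hypothetical stand-in for the metadata found on a member device."""
    array_uuid: str
    level: str          # e.g. "raid5"
    total_devices: int  # how many members the array expects


@dataclass
class Array:
    sb: Superblock
    members: set = field(default_factory=set)
    started: bool = False   # attached-but-not-started until this flips

    def missing(self) -> int:
        return self.sb.total_devices - len(self.members)


# Assumed number of members each level can lose and still start.
MAX_MISSING = {"raid0": 0, "raid5": 1, "raid6": 2}


def can_start(array: Array, allow_degraded: bool) -> bool:
    missing = array.missing()
    if missing == 0:
        return True                      # every member is accounted for
    if not allow_degraded:
        return False                     # keep waiting for more devices
    return missing <= MAX_MISSING.get(array.sb.level, 0)


def incremental_add(arrays: dict, device: str, sb, allow_degraded=False):
    """Attach one device; start its array only when it is ready."""
    if sb is None:
        return None                      # no superblock: not a member
    array = arrays.get(sb.array_uuid)
    if array is None:                    # first member seen: allocate
        array = arrays[sb.array_uuid] = Array(sb)
    array.members.add(device)
    if not array.started and can_start(array, allow_degraded):
        array.started = True
    return array
```

Calling incremental_add once per discovered device, in any order, leaves each array attached but unstarted until the last required member shows up.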

So we'd need a "partially assembled but not started" state and some ioctls to do things like force-start or force-disown a filesystem that cannot be "finished" automatically.

That sort of thing is very easy to do with devices, because devices don't have to be opened and can reject an open attempt, or at least reject the reads and writes that follow an open.

Unfortunately a filesystem can really only exist as a mounted thing, and can really only be controlled by remounting thereafter. The most efficient way to do this would be to have an alternate file system operations structure filled mostly with dummy operations that return ENOENT and friends. The remount that finally fulfilled the file system's requirements would then switch out that struct for the fully functional one. That remount would need an "adddev=" and some other such options (much like AUFS adds layers).
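As a rough userspace illustration of that operations-table swap (a sketch only; the real thing would be a kernel operations struct in C, and StubOps, RealOps, Mount, and remount_adddev are hypothetical names, as is the "adddev=" option they model):

```python
import errno


class StubOps:
    """Placeholder operations used while the filesystem is incomplete."""

    def lookup(self, name):
        raise OSError(errno.ENOENT, "filesystem not assembled yet")

    def readdir(self):
        raise OSError(errno.ENOENT, "filesystem not assembled yet")


class RealOps:
    """Stands in for the fully functional operations table."""

    def __init__(self, entries):
        self._entries = entries

    def lookup(self, name):
        if name not in self._entries:
            raise OSError(errno.ENOENT, name)
        return self._entries[name]

    def readdir(self):
        return sorted(self._entries)


class Mount:
    def __init__(self, fs_uuid, devices_needed, contents):
        self.fs_uuid = fs_uuid
        self.devices_needed = devices_needed
        self.contents = contents      # what the assembled fs would hold
        self.devices = set()
        self.ops = StubOps()          # every call fails until assembled

    def remount_adddev(self, device):
        """Models a remount carrying the hypothetical "adddev=" option."""
        self.devices.add(device)
        if len(self.devices) >= self.devices_needed:
            self.ops = RealOps(self.contents)   # swap in the working table
```

Callers always go through mount.ops, so they see ENOENT until the final remount_adddev call replaces the table in one step.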

It's all doable. But it stretches the "mount" paradigm to near breaking. You would need an operation that looked like "mount -t btrfs -o do_we_need_this /dev/whatever /this/datum/means/nothing" to match and attach a device "wherever it goes", or you might end up needing to do the Cartesian product of trial attachments of each new device to all active filesystems to match it up, which is an ugly external scripting requirement.

As for waiting for the address space to be fully covered: meh. If a ready-or-not, or ready-enough, status is established in the file system, it would be undesirable for it to know anything about any other subsystem.

We don't care if enumeration is "done"; we only care whether we have a rational set of storage, and whether that rational set is "enough" to be fully ready, enough to be only read-ready, or just plain not enough.
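That three-way decision depends only on what the filesystem itself knows about its own members. A minimal sketch, assuming the filesystem can count its expected and present devices and knows how many it can lose (the names Readiness, classify, and tolerated_missing are made up for illustration):

```python
from enum import Enum


class Readiness(Enum):
    NOT_ENOUGH = "not enough"
    READ_READY = "read-ready"
    FULLY_READY = "fully ready"


def classify(present: int, expected: int, tolerated_missing: int) -> Readiness:
    """Decide readiness from device counts alone, with no bus knowledge."""
    missing = expected - present
    if missing <= 0:
        return Readiness.FULLY_READY     # complete set of storage
    if missing <= tolerated_missing:
        return Readiness.READ_READY      # degraded but still readable
    return Readiness.NOT_ENOUGH          # keep waiting, or give up
```

Nothing here asks whether bus enumeration has finished; the answer changes only when another device is attached.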

In theory, the idempotent mount command could be

mount -t btrfs some-uuid-instead-of-device /mount/point
mount -t btrfs some-other-uuid-here /other/mount/point

to create the entity with zero devices attached, followed by

mount -t btrfs -o trydev /dev/something /this/bit/is/ignored

repeated for all possible somethings. /mount/point and /other/mount/point would be returning ENOENT for their contents until they were ready-enough.

In practice this is very impure compared to how mdadm has the /dev/md- namespace in which to build its devices before any actual mount is possible.