Hey,
I thought it would be a good time to play around a bit with btrfs
in the usual hotplug setup, so we can adapt things, if needed,
before they are finalized.

At first look, it seems very promising, and I really like the idea
that the state of the (possibly incomplete) device tree is kept in the
kernel, and not in a file somewhere in userspace, as we usually see
for all sorts of multi-volume/multi-device setups. It should make
things much easier than usual.

Like with every other subsystem, people will expect btrfs to just work
with hotpluggable devices, without much configuration and explicit
setup after device connect. To assemble a mountable volume, we will
need to find the (possibly several independent) devices containing the
btrfs data.
This is currently done by scanning all block devices in /dev and
inspecting their content. That works fine for simple, common setups
where all block devices behave normally, and it is also required in
some cases, like recovery and rescue situations.
But it will just not work in several "advanced" setups. We may open
devices which do not return the requested data, but hang in the
kernel waiting for a timeout before they can return an error. There
are boxes out there with tens of thousands of devices, and some, or
many, of them may not work as expected.
In such setups, we cannot open all the devices sequentially from a
single thread; that just asks for real trouble.
We already ran into such problems on big boxes with
mount-by-label/uuid.

To emulate such behavior, one just needs to do:
  $ modprobe scsi_debug max_luns=8 num_parts=2
  $ echo 1 > /sys/module/scsi_debug/parameters/every_nth
  $ echo 4 > /sys/module/scsi_debug/parameters/opts

  $ ls -l /sys/class/block/ | wc -l
  45

Any single-threaded scan of the /dev device nodes will now take
~2 hours to return to the caller.
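The pattern that runs into this looks roughly like the following
sketch; scan_sequentially and the 4 KiB read size are illustrative,
not any specific tool:

```shell
# Naive single-threaded scanner pattern: read the first bytes of
# every node in a directory, one after the other. A single device
# that hangs in the kernel stalls the whole loop for its full
# timeout before the next node is even opened.
scan_sequentially() {
    dir=$1
    for node in "$dir"/*; do
        [ -e "$node" ] || continue
        if head -c 4096 "$node" >/dev/null 2>&1; then
            echo "probed $node"
        else
            echo "failed $node"
        fi
    done
}
```

With well-behaved devices the loop finishes instantly; with the
scsi_debug error injection above, every stalled read adds its full
timeout to the total runtime.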

To work around these problems, udev probes all devices in a separate
process and puts the probing results asynchronously in the udev
database, and possibly into symlinks somewhere in /dev. The probing
is fully parallelized. We currently support huge boxes with many
disks that need to probe ~4000 block devices in parallel to get a
reasonable bootup/setup time.
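The effect can be sketched in the shell; probe_parallel and the
16-worker limit are made up for illustration, udev itself forks one
worker process per device event:

```shell
# Probe all nodes under a directory with up to 16 concurrent
# workers, so one stuck device no longer blocks the others.
probe_parallel() {
    find "$1" -mindepth 1 -maxdepth 1 -print |
        xargs -r -P 16 -n 1 sh -c \
            'head -c 4096 "$0" >/dev/null 2>&1 && echo "probed $0"'
}
```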

That way, all found volume-metadata is immediately available, and
hanging probing-processes will not block the probing of other devices.
They will time out, or return the data at a later point when it
becomes available.

As a first naive attempt to integrate with udev's async probing, I did:
  $ cat /etc/udev/rules.d/80-btrfs.rules
  SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="btrfs", \
    SYMLINK+="btrfs/$env{ID_FS_UUID_ENC}/$env{ID_FS_UUID_SUB_ENC}"

Connecting devices with btrfs volumes will now create:
  $ tree /dev/btrfs/
  /dev/btrfs/
  |-- 0cdedd75-2d03-41e6-a1eb-156c0920a021
  |   |-- 897fac06-569c-4f45-a0b9-a1f91a9564d4 -> ../../sda10
  |   `-- aac20975-b642-4650-b65b-b92ce22616f2 -> ../../sda9
  `-- a1ec970a-2463-414e-864c-2eb8ac4e1cf2
      |-- 4d1f1fff-4c6b-4b87-8486-36f58abc0610 -> ../../sdb2
      `-- e7fe3065-c39f-4295-a099-a89e839ae350 -> ../../sdb1
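A tool could then enumerate that tree with something as simple as the
following sketch; list_btrfs_volumes is a made-up name, and the
directory layout is the one created by the rule above:

```shell
# Print every btrfs volume UUID found in the symlink tree, together
# with the resolved device nodes of its members.
list_btrfs_volumes() {
    base=${1:-/dev/btrfs}
    for vol in "$base"/*/; do
        [ -d "$vol" ] || continue
        echo "volume $(basename "$vol"):"
        for dev in "$vol"*; do
            [ -e "$dev" ] || continue
            echo "  $(basename "$dev") -> $(readlink -f "$dev")"
        done
    done
}
```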

That way, tools could look up all currently active volumes, and it
would also be easy to make the volume known to the kernel at the
same time we recognize it.

To update these links after mkfs.btrfs, the formatting tool would need
to send a change event to the kernel, like:
  echo change > /sys/dev/block/<maj>:<min>/uevent

We should make sure that at least such problems are known, and that
we have thought about the infrastructure needed to solve them at any
later point. If changes would make supporting such setups easier,
even if it's not implemented now, it would be nice to make them
before things are finalized.

Let me know what you think.

Thanks,
Kay