Thanks for the comments.. more below.

On 9/4/19 11:48 PM, David Sterba wrote:
On Tue, Mar 19, 2019 at 06:00:19PM +0800, Anand Jain wrote:
RFC patch as of now, appreciate your comments. This patch set has
been tested.

Function call chain  __btrfs_map_block()->find_live_mirror() uses
thread pid to determine the %mirror_num when the mirror_num=0.
The pid based mirror_num extrapolation has following disadvantages
  . A large single-process read IO will read only from one disk.

Please note that the pid refers to the metadata kernel threads that
submit the IO, so on average the IO is serviced by different threads
with random PID. In the end it gets distributed over multiple drives.
But it's nott ideal, no argument about that.

 ok.

  . In a worst scenario all processes read accessing the FS could have
        either odd or even pid, the read IO gets skewed.
  . There is no deterministic way of knowing/controlling which copy will
        be used for reading.
  . May see performance variations for a given set of multi process
        workload ran at different times.
So we need other types of readmirror policies. This patch introduces a framework so that we can add more policies, and
converts the existing %pid into as a configurable parameter using the
property. And also provides a transient readmirror mount option, so that
this property can be applied for the read io during mount and for
readonly FS.

I think the mirror selection by pid was a temporary solution (that
stuck, as it always happens). On average it does not work bad. Once we
have a better way then it can be dropped. So I'm not convinced it needs
to be one of the configuration options.

 Ok. Thanks for confirming, we could replace pid with something better,
 as of now we have devid-based and device-qdepth based, IMO qdepth based
 should be default as I commented in the respective patch.

What exactly do you mean by 'transient'? I see that it could be needed
for mount or read-only, though I don't see the usefulness of that.

 I mean to say does not persist across reboot. To override the
 original setting lets say device-qdepth based, but momentarily
 to avoid known read errors from a disk.

  For example:
    btrfs property set <mnt> readmirror pid
    btrfs property set <mnt> readmirror ""
    btrfs property set <mnt> readmirror devid<n>

    mount -o readmirror=pid <mnt>
    mount -o readmirror=devid<n> <mnt>

Please note:
  The property storage is an extented attributes of root inode
  (BTRFS_FS_TREE_OBJECTID), the other choice is a new ondisk KEY,
  which is open for comments.

A new key type is not going to be allocated for that, the extended
attributes for properties are there. Another question is where to attach
the property as this is a whole-filesystem property, we don't have an
example to follow so far.

 Ok no key.
 Yep its a whole filesystem property. As of now it goes to the root
 inode as an extended attribute. I find it reasonable.

  This patch uses the unused btrfs_device::type, and does not use the
  bitwise ops because as type is u64 and bitwise ops need u32, so we
  need 32bits of the btrfs_device::type. Which is a kind of messy.
  Its open for suggestion for any better way.

For a prototype and early testing, using the type is probably ok, but
I don't think it's right for permanent storage. Besides I'm sure that
the u64 will not suffice in the long run.

Hmm.. I don't know. I thought
  <upper 32 bits> for disk groups
  <lower 32 bits> for disk types
should suffice. Not too sure what I am missing.

Thanks, Anand.

Reply via email to