On Tue, Mar 19, 2019 at 06:00:19PM +0800, Anand Jain wrote:
> RFC patch as of now, appreciate your comments. This patch set has
> been tested.
> 
> Function call chain  __btrfs_map_block()->find_live_mirror() uses
> thread pid to determine the %mirror_num when the mirror_num=0.
>     
> The pid based mirror_num extrapolation has following disadvantages
>  . A large single-process read IO will read only from one disk.

Please note that the pid refers to the metadata kernel threads that
submit the IO, so on average the IO is serviced by different threads
with random PID. In the end it gets distributed over multiple drives.
But it's nott ideal, no argument about that.

>  . In a worst scenario all processes read accessing the FS could have
>        either odd or even pid, the read IO gets skewed.
>  . There is no deterministic way of knowing/controlling which copy will
>        be used for reading.
>  . May see performance variations for a given set of multi process
>        workload ran at different times.
>     
> So we need other types of readmirror policies.
>     
> This patch introduces a framework so that we can add more policies, and
> converts the existing %pid into as a configurable parameter using the
> property. And also provides a transient readmirror mount option, so that
> this property can be applied for the read io during mount and for
> readonly FS.

I think the mirror selection by pid was a temporary solution (that
stuck, as it always happens). On average it does not work bad. Once we
have a better way then it can be dropped. So I'm not convinced it needs
to be one of the configuration options.

What exactly do you mean by 'transient'? I see that it could be needed
for mount or read-only, though I don't see the usefulness of that.

>  For example:
>    btrfs property set <mnt> readmirror pid
>    btrfs property set <mnt> readmirror ""
>    btrfs property set <mnt> readmirror devid<n>
> 
>    mount -o readmirror=pid <mnt>
>    mount -o readmirror=devid<n> <mnt>
> 
> Please note:
>  The property storage is an extented attributes of root inode
>  (BTRFS_FS_TREE_OBJECTID), the other choice is a new ondisk KEY,
>  which is open for comments.

A new key type is not going to be allocated for that, the extended
attributes for properties are there. Another question is where to attach
the property as this is a whole-filesystem property, we don't have an
example to follow so far.

>  This patch uses the unused btrfs_device::type, and does not use the
>  bitwise ops because as type is u64 and bitwise ops need u32, so we
>  need 32bits of the btrfs_device::type. Which is a kind of messy.
>  Its open for suggestion for any better way.

For a prototype and early testing, using the type is probably ok, but
I don't think it's right for permanent storage. Besides I'm sure that
the u64 will not suffice in the long run.

Reply via email to