Thanks for the comments.. more below.
On 9/4/19 11:48 PM, David Sterba wrote:
On Tue, Mar 19, 2019 at 06:00:19PM +0800, Anand Jain wrote:
RFC patch as of now, appreciate your comments. This patch set has
been tested.
Function call chain __btrfs_map_block()->find_live_mirror() uses
thread pid to determine the %mirror_num when the mirror_num=0.
The pid based mirror_num extrapolation has following disadvantages
. A large single-process read IO will read only from one disk.
Please note that the pid refers to the metadata kernel threads that
submit the IO, so on average the IO is serviced by different threads
with random PID. In the end it gets distributed over multiple drives.
But it's nott ideal, no argument about that.
ok.
. In a worst scenario all processes read accessing the FS could have
either odd or even pid, the read IO gets skewed.
. There is no deterministic way of knowing/controlling which copy will
be used for reading.
. May see performance variations for a given set of multi process
workload ran at different times.
So we need other types of readmirror policies.
This patch introduces a framework so that we can add more policies, and
converts the existing %pid into as a configurable parameter using the
property. And also provides a transient readmirror mount option, so that
this property can be applied for the read io during mount and for
readonly FS.
I think the mirror selection by pid was a temporary solution (that
stuck, as it always happens). On average it does not work bad. Once we
have a better way then it can be dropped. So I'm not convinced it needs
to be one of the configuration options.
Ok. Thanks for confirming, we could replace pid with something better,
as of now we have devid-based and device-qdepth based, IMO qdepth based
should be default as I commented in the respective patch.
What exactly do you mean by 'transient'? I see that it could be needed
for mount or read-only, though I don't see the usefulness of that.
I mean to say does not persist across reboot. To override the
original setting lets say device-qdepth based, but momentarily
to avoid known read errors from a disk.
For example:
btrfs property set <mnt> readmirror pid
btrfs property set <mnt> readmirror ""
btrfs property set <mnt> readmirror devid<n>
mount -o readmirror=pid <mnt>
mount -o readmirror=devid<n> <mnt>
Please note:
The property storage is an extented attributes of root inode
(BTRFS_FS_TREE_OBJECTID), the other choice is a new ondisk KEY,
which is open for comments.
A new key type is not going to be allocated for that, the extended
attributes for properties are there. Another question is where to attach
the property as this is a whole-filesystem property, we don't have an
example to follow so far.
Ok no key.
Yep its a whole filesystem property. As of now it goes to the root
inode as an extended attribute. I find it reasonable.
This patch uses the unused btrfs_device::type, and does not use the
bitwise ops because as type is u64 and bitwise ops need u32, so we
need 32bits of the btrfs_device::type. Which is a kind of messy.
Its open for suggestion for any better way.
For a prototype and early testing, using the type is probably ok, but
I don't think it's right for permanent storage. Besides I'm sure that
the u64 will not suffice in the long run.
Hmm.. I don't know. I thought
<upper 32 bits> for disk groups
<lower 32 bits> for disk types
should suffice. Not too sure what I am missing.
Thanks, Anand.