> On 12 Sep 2019, at 5:50 PM, Josef Bacik <jo...@toxicpanda.com> wrote:
> 
> On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote:
>> 
>> 
>> Thanks for the comments. More below.
>> 
>> On 12/9/19 3:16 AM, Josef Bacik wrote:
>>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote:
>>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <jo...@toxicpanda.com> wrote:
>>>>> 
>>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
>>>>>> Function call chain  __btrfs_map_block()->find_live_mirror() uses
>>>>>> thread pid to determine the %mirror_num when the mirror_num=0.
>>>>>> 
>>>>>> This patch introduces a framework so that we can add policies to 
>>>>>> determine
>>>>>> the %mirror_num. And also adds the devid as the readmirror policy.
>>>>>> 
>>>>>> The new property is stored as an item in the device tree as show below.
>>>>>>     (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
>>>>>> 
>>>>>> To be able to set and get this new property also introduces new ioctls
>>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl argument
>>>>>> is defined as
>>>>>>         struct btrfs_ioctl_readmirror_args {
>>>>>>                 __u64 type; /* RW */
>>>>>>                 __u64 device_bitmap; /* RW */
>>>>>>         }
>>>>>> 
>>>>>> An usage example as follows:
>>>>>>         btrfs property set /btrfs readmirror devid:1,3
>>>>>>         btrfs property get /btrfs readmirror
>>>>>>           readmirror devid:1 3
>>>>>>         btrfs property set /btrfs readmirror ""
>>>>>>         btrfs property get /btrfs readmirror
>>>>>>           readmirror default
>>>>>> 
>>>>>> This patchset has been tested completely, however marked as RFC for the
>>>>>> following reasons and comments on them (or any other) are appreciated as
>>>>>> usual.
>>>>>>  . The new objectid is defined as
>>>>>>       #define BTRFS_READMIRROR_OBJECTID -1ULL
>>>>>>    Need consent we are fine to use this value, and with this value it
>>>>>>    shall be placed just before the DEV_STATS_OBJECTID item which is more
>>>>>>    frequently used only during the device errors.
>>>>>> 
>>>>>> .  I am using a u64 bitmap to represent the devices id, so the max device
>>>>>>    id that we could represent is 63, its a kind of limitation which 
>>>>>> should
>>>>>>    be addressed before integration, I wonder if there is any suggestion?
>>>>>>    Kindly note that, multiple ioctls with each time representing a set of
>>>>>>    device(s) is not a choice because we need to make sure the readmirror
>>>>>>    changes happens in a commit transaction.
>>>>>> 
>>>>>> v1->RFC v2:
>>>>>>   . Property is stored as a dev-tree item instead of root inode extended
>>>>>>     attribute.
>>>>>>   . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to 
>>>>>> BTRFS_DEV_STATE_READ_PREFERRED.
>>>>>>   . Changed format specifier from devid1,2,3.. to devid:1,2,3..
>>>>>> 
>>>>>> RFC->v1:
>>>>>>   Drops pid as one of the readmirror policy choices and as usual remains
>>>>>>   as default. And when the devid is reset the readmirror policy falls 
>>>>>> back
>>>>>>   to pid.
>>>>>>   Drops the mount -o readmirror idea, it can be added at a later point of
>>>>>>   time.
>>>>>>   Property now accepts more than 1 devid as readmirror device. As shown
>>>>>>   in the example above.
>>>>>> 
>>>>> 
>>>>> This is a lot of infrastructure
>> 
>>  Ok. Any idea on a better implementation?
>>  How about extended attribute approach? v1 patches proposed
>>  it, but it abused the extended attribute as commented here [1]
>>  and v2 got changed to an item-key.
>> 
>> [1]
>> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b993...@gmx.com/
>> 
> 
> That's a NAK on the prop interface.  This is a fs wide policy, not a
> directory/inode policy.
> 
>> 
>>>>> to just change which mirror we read to based on
>>>>> some arbitrary user policy.  I assume this is to solve the case where you 
>>>>> have
>>>>> slow and fast disks, so you can always read from the fast disk?  And then 
>>>>> it's
>>>>> only used in RAID1, so the very narrow usecase of having a RAID1 setup 
>>>>> with a
>>>>> SSD and a normal disk?  I'm not seeing a point to this much code for one
>>>>> particular obscure setup.  Thanks,
>>>>> 
>>>>> Josef
>>>> 
>>>> Not commenting on the code itself, but as a user I see this SSD RAID1
>>>> acceleration as a future much have feature. It's only obscure at the
>>>> moment because we don't have code to take advantage of it. But on
>>>> large btrfs filesystems with hundreds of GB of metadata, like I have
>>>> for backups, the usability of the filesystem is dramatically improved
>>>> having the metadata on an SSD( though currently only half of the time
>>>> due to the even/odd pid distribution.)
>>> 
>>> But that's different from a mirror.  100% it would be nice to say "put my
>>> metadata on the ssd, data elsewhere".  That's not what this patch is about, 
>>> this
>>> patch is specifically about changing which drive we choose in a mirrored 
>>> setup,
>>> which is super unlikely to mirror a SSD with a slow drive, cause it's just 
>>> going
>>> to be slow no matter what.  Sure we could make it so reads always go to the 
>>> SSD,
>>> but we can accomplish that by just adding a check for nonrotational in the 
>>> code,
>>> and then we don't have to encode all this nonsense in the file system.  
>>> Thanks,
>> 
>> I wrote about the readmirror policy framework here[2],
>> I forgot to link it here, sorry about that, my mistake.
>> 
>> [2]
>> 
>> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.j...@oracle.com/
>> 
>> Readmirror policy is for raid1, raid10 and future N way mirror.
>> Yes for now its only for raid1.
>> 
>> Here the idea is to create a framework so that readmirror policy
>> can be configured as needed. And nonrotational can be one such policy.
>> 
>> The example of hard-coded nonrotational policy does not work in case
>> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block
>> device, as all these are still nonrotational devices. So hard-coded
>> policy is not a good idea. If we have to hardcode then there is Q-depth
>> based readmirror routing is better (patch in the ML), but that is
>> not good enough, because some configs wants it based on the disk-LBA
>> so that SAN storage target cache is balanced and not duplicated.
>> So in short it must be a configurable policy.
>> 
> 
> Again, if you are mixing disk types you likely always want non-rotational, but
> still mixing different speed devices in a mirror setup is just asking for 
> weird
> latency problems.  I don't think solving this use case is necessary.  If you 
> mix
> ssd + network device in a serious production setup then you probably should be
> fired cause you don't know what you are doing.  Having the generic
> "nonrotational gets priority" is going to cover 99% of the actual use cases 
> that
> make sense.
> 
> The SAN usecase I can sort of see, but again I don't feel like it's a problem 
> we
> need to solve with on-disk format.  Add a priority to sysfs so you can change 
> it
> with udev or something on the fly.  Thanks,
> 
 
 Ok.
 Sysfs is fine however we need configuration to be persistent across reboots.
 Any idea?

Thanks, Anand

> Josef

Reply via email to