> On 12 Sep 2019, at 6:03 PM, Josef Bacik <jo...@toxicpanda.com> wrote:
>
> On Thu, Sep 12, 2019 at 06:00:21PM +0800, Anand Jain wrote:
>>
>>
>>> On 12 Sep 2019, at 5:50 PM, Josef Bacik <jo...@toxicpanda.com> wrote:
>>>
>>> On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote:
>>>>
>>>>
>>>> Thanks for the comments. More below.
>>>>
>>>> On 12/9/19 3:16 AM, Josef Bacik wrote:
>>>>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote:
>>>>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <jo...@toxicpanda.com> wrote:
>>>>>>>
>>>>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
>>>>>>>> Function call chain __btrfs_map_block()->find_live_mirror() uses
>>>>>>>> thread pid to determine the %mirror_num when the mirror_num=0.
>>>>>>>>
>>>>>>>> This patch introduces a framework so that we can add policies to
>>>>>>>> determine
>>>>>>>> the %mirror_num. And also adds the devid as the readmirror policy.
>>>>>>>>
>>>>>>>> The new property is stored as an item in the device tree as show below.
>>>>>>>> (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
>>>>>>>>
>>>>>>>> To be able to set and get this new property also introduces new ioctls
>>>>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl
>>>>>>>> argument
>>>>>>>> is defined as
>>>>>>>> struct btrfs_ioctl_readmirror_args {
>>>>>>>> __u64 type; /* RW */
>>>>>>>> __u64 device_bitmap; /* RW */
>>>>>>>> }
>>>>>>>>
>>>>>>>> An usage example as follows:
>>>>>>>> btrfs property set /btrfs readmirror devid:1,3
>>>>>>>> btrfs property get /btrfs readmirror
>>>>>>>> readmirror devid:1 3
>>>>>>>> btrfs property set /btrfs readmirror ""
>>>>>>>> btrfs property get /btrfs readmirror
>>>>>>>> readmirror default
>>>>>>>>
>>>>>>>> This patchset has been tested completely, however marked as RFC for the
>>>>>>>> following reasons and comments on them (or any other) are appreciated
>>>>>>>> as
>>>>>>>> usual.
>>>>>>>> . The new objectid is defined as
>>>>>>>> #define BTRFS_READMIRROR_OBJECTID -1ULL
>>>>>>>> Need consent we are fine to use this value, and with this value it
>>>>>>>> shall be placed just before the DEV_STATS_OBJECTID item which is more
>>>>>>>> frequently used only during the device errors.
>>>>>>>>
>>>>>>>> . I am using a u64 bitmap to represent the devices id, so the max
>>>>>>>> device
>>>>>>>> id that we could represent is 63, its a kind of limitation which
>>>>>>>> should
>>>>>>>> be addressed before integration, I wonder if there is any suggestion?
>>>>>>>> Kindly note that, multiple ioctls with each time representing a set
>>>>>>>> of
>>>>>>>> device(s) is not a choice because we need to make sure the readmirror
>>>>>>>> changes happens in a commit transaction.
>>>>>>>>
>>>>>>>> v1->RFC v2:
>>>>>>>> . Property is stored as a dev-tree item instead of root inode extended
>>>>>>>> attribute.
>>>>>>>> . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to
>>>>>>>> BTRFS_DEV_STATE_READ_PREFERRED.
>>>>>>>> . Changed format specifier from devid1,2,3.. to devid:1,2,3..
>>>>>>>>
>>>>>>>> RFC->v1:
>>>>>>>> Drops pid as one of the readmirror policy choices and as usual remains
>>>>>>>> as default. And when the devid is reset the readmirror policy falls
>>>>>>>> back
>>>>>>>> to pid.
>>>>>>>> Drops the mount -o readmirror idea, it can be added at a later point
>>>>>>>> of
>>>>>>>> time.
>>>>>>>> Property now accepts more than 1 devid as readmirror device. As shown
>>>>>>>> in the example above.
>>>>>>>>
>>>>>>>
>>>>>>> This is a lot of infrastructure
>>>>
>>>> Ok. Any idea on a better implementation?
>>>> How about extended attribute approach? v1 patches proposed
>>>> it, but it abused the extended attribute as commented here [1]
>>>> and v2 got changed to an item-key.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b993...@gmx.com/
>>>>
>>>
>>> That's a NAK on the prop interface. This is a fs wide policy, not a
>>> directory/inode policy.
>>>
>>>>
>>>>>>> to just change which mirror we read to based on
>>>>>>> some arbitrary user policy. I assume this is to solve the case where
>>>>>>> you have
>>>>>>> slow and fast disks, so you can always read from the fast disk? And
>>>>>>> then it's
>>>>>>> only used in RAID1, so the very narrow usecase of having a RAID1 setup
>>>>>>> with a
>>>>>>> SSD and a normal disk? I'm not seeing a point to this much code for one
>>>>>>> particular obscure setup. Thanks,
>>>>>>>
>>>>>>> Josef
>>>>>>
>>>>>> Not commenting on the code itself, but as a user I see this SSD RAID1
>>>>>> acceleration as a future much have feature. It's only obscure at the
>>>>>> moment because we don't have code to take advantage of it. But on
>>>>>> large btrfs filesystems with hundreds of GB of metadata, like I have
>>>>>> for backups, the usability of the filesystem is dramatically improved
>>>>>> having the metadata on an SSD( though currently only half of the time
>>>>>> due to the even/odd pid distribution.)
>>>>>
>>>>> But that's different from a mirror. 100% it would be nice to say "put my
>>>>> metadata on the ssd, data elsewhere". That's not what this patch is
>>>>> about, this
>>>>> patch is specifically about changing which drive we choose in a mirrored
>>>>> setup,
>>>>> which is super unlikely to mirror a SSD with a slow drive, cause it's
>>>>> just going
>>>>> to be slow no matter what. Sure we could make it so reads always go to
>>>>> the SSD,
>>>>> but we can accomplish that by just adding a check for nonrotational in
>>>>> the code,
>>>>> and then we don't have to encode all this nonsense in the file system.
>>>>> Thanks,
>>>>
>>>> I wrote about the readmirror policy framework here[2],
>>>> I forgot to link it here, sorry about that, my mistake.
>>>>
>>>> [2]
>>>>
>>>> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.j...@oracle.com/
>>>>
>>>> Readmirror policy is for raid1, raid10 and future N way mirror.
>>>> Yes for now its only for raid1.
>>>>
>>>> Here the idea is to create a framework so that readmirror policy
>>>> can be configured as needed. And nonrotational can be one such policy.
>>>>
>>>> The example of hard-coded nonrotational policy does not work in case
>>>> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block
>>>> device, as all these are still nonrotational devices. So hard-coded
>>>> policy is not a good idea. If we have to hardcode then there is Q-depth
>>>> based readmirror routing is better (patch in the ML), but that is
>>>> not good enough, because some configs wants it based on the disk-LBA
>>>> so that SAN storage target cache is balanced and not duplicated.
>>>> So in short it must be a configurable policy.
>>>>
>>>
>>> Again, if you are mixing disk types you likely always want non-rotational,
>>> but
>>> still mixing different speed devices in a mirror setup is just asking for
>>> weird
>>> latency problems. I don't think solving this use case is necessary. If
>>> you mix
>>> ssd + network device in a serious production setup then you probably should
>>> be
>>> fired cause you don't know what you are doing. Having the generic
>>> "nonrotational gets priority" is going to cover 99% of the actual use cases
>>> that
>>> make sense.
>>>
>>> The SAN usecase I can sort of see, but again I don't feel like it's a
>>> problem we
>>> need to solve with on-disk format. Add a priority to sysfs so you can
>>> change it
>>> with udev or something on the fly. Thanks,
>>>
>>
>> Ok.
>> Sysfs is fine however we need configuration to be persistent across reboots.
>> Any idea?
>>
>
> Udev rules. Thanks,
>
Josef, configs moving along with the luns in san seems to be more
easy to manage, otherwise when the host fails, each potential new
server has to be pre-configured with the udev rules.
Thanks, Anand
> Josef