> On 12 Sep 2019, at 6:03 PM, Josef Bacik <jo...@toxicpanda.com> wrote:
> 
> On Thu, Sep 12, 2019 at 06:00:21PM +0800, Anand Jain wrote:
>> 
>> 
>>> On 12 Sep 2019, at 5:50 PM, Josef Bacik <jo...@toxicpanda.com> wrote:
>>> 
>>> On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote:
>>>> 
>>>> 
>>>> Thanks for the comments. More below.
>>>> 
>>>> On 12/9/19 3:16 AM, Josef Bacik wrote:
>>>>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote:
>>>>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik <jo...@toxicpanda.com> wrote:
>>>>>>> 
>>>>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
>>>>>>>> Function call chain  __btrfs_map_block()->find_live_mirror() uses
>>>>>>>> thread pid to determine the %mirror_num when the mirror_num=0.
>>>>>>>> 
>>>>>>>> This patch introduces a framework so that we can add policies to 
>>>>>>>> determine
>>>>>>>> the %mirror_num. And also adds the devid as the readmirror policy.
>>>>>>>> 
>>>>>>>> The new property is stored as an item in the device tree as show below.
>>>>>>>>    (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, devid)
>>>>>>>> 
>>>>>>>> To be able to set and get this new property also introduces new ioctls
>>>>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The ioctl 
>>>>>>>> argument
>>>>>>>> is defined as
>>>>>>>>        struct btrfs_ioctl_readmirror_args {
>>>>>>>>                __u64 type; /* RW */
>>>>>>>>                __u64 device_bitmap; /* RW */
>>>>>>>>        }
>>>>>>>> 
>>>>>>>> An usage example as follows:
>>>>>>>>        btrfs property set /btrfs readmirror devid:1,3
>>>>>>>>        btrfs property get /btrfs readmirror
>>>>>>>>          readmirror devid:1 3
>>>>>>>>        btrfs property set /btrfs readmirror ""
>>>>>>>>        btrfs property get /btrfs readmirror
>>>>>>>>          readmirror default
>>>>>>>> 
>>>>>>>> This patchset has been tested completely, however marked as RFC for the
>>>>>>>> following reasons and comments on them (or any other) are appreciated 
>>>>>>>> as
>>>>>>>> usual.
>>>>>>>> . The new objectid is defined as
>>>>>>>>      #define BTRFS_READMIRROR_OBJECTID -1ULL
>>>>>>>>   Need consent we are fine to use this value, and with this value it
>>>>>>>>   shall be placed just before the DEV_STATS_OBJECTID item which is more
>>>>>>>>   frequently used only during the device errors.
>>>>>>>> 
>>>>>>>> .  I am using a u64 bitmap to represent the devices id, so the max 
>>>>>>>> device
>>>>>>>>   id that we could represent is 63, its a kind of limitation which 
>>>>>>>> should
>>>>>>>>   be addressed before integration, I wonder if there is any suggestion?
>>>>>>>>   Kindly note that, multiple ioctls with each time representing a set 
>>>>>>>> of
>>>>>>>>   device(s) is not a choice because we need to make sure the readmirror
>>>>>>>>   changes happens in a commit transaction.
>>>>>>>> 
>>>>>>>> v1->RFC v2:
>>>>>>>>  . Property is stored as a dev-tree item instead of root inode extended
>>>>>>>>    attribute.
>>>>>>>>  . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to 
>>>>>>>> BTRFS_DEV_STATE_READ_PREFERRED.
>>>>>>>>  . Changed format specifier from devid1,2,3.. to devid:1,2,3..
>>>>>>>> 
>>>>>>>> RFC->v1:
>>>>>>>>  Drops pid as one of the readmirror policy choices and as usual remains
>>>>>>>>  as default. And when the devid is reset the readmirror policy falls 
>>>>>>>> back
>>>>>>>>  to pid.
>>>>>>>>  Drops the mount -o readmirror idea, it can be added at a later point 
>>>>>>>> of
>>>>>>>>  time.
>>>>>>>>  Property now accepts more than 1 devid as readmirror device. As shown
>>>>>>>>  in the example above.
>>>>>>>> 
>>>>>>> 
>>>>>>> This is a lot of infrastructure
>>>> 
>>>> Ok. Any idea on a better implementation?
>>>> How about extended attribute approach? v1 patches proposed
>>>> it, but it abused the extended attribute as commented here [1]
>>>> and v2 got changed to an item-key.
>>>> 
>>>> [1]
>>>> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b993...@gmx.com/
>>>> 
>>> 
>>> That's a NAK on the prop interface.  This is a fs wide policy, not a
>>> directory/inode policy.
>>> 
>>>> 
>>>>>>> to just change which mirror we read to based on
>>>>>>> some arbitrary user policy.  I assume this is to solve the case where 
>>>>>>> you have
>>>>>>> slow and fast disks, so you can always read from the fast disk?  And 
>>>>>>> then it's
>>>>>>> only used in RAID1, so the very narrow usecase of having a RAID1 setup 
>>>>>>> with a
>>>>>>> SSD and a normal disk?  I'm not seeing a point to this much code for one
>>>>>>> particular obscure setup.  Thanks,
>>>>>>> 
>>>>>>> Josef
>>>>>> 
>>>>>> Not commenting on the code itself, but as a user I see this SSD RAID1
>>>>>> acceleration as a future much have feature. It's only obscure at the
>>>>>> moment because we don't have code to take advantage of it. But on
>>>>>> large btrfs filesystems with hundreds of GB of metadata, like I have
>>>>>> for backups, the usability of the filesystem is dramatically improved
>>>>>> having the metadata on an SSD( though currently only half of the time
>>>>>> due to the even/odd pid distribution.)
>>>>> 
>>>>> But that's different from a mirror.  100% it would be nice to say "put my
>>>>> metadata on the ssd, data elsewhere".  That's not what this patch is 
>>>>> about, this
>>>>> patch is specifically about changing which drive we choose in a mirrored 
>>>>> setup,
>>>>> which is super unlikely to mirror a SSD with a slow drive, cause it's 
>>>>> just going
>>>>> to be slow no matter what.  Sure we could make it so reads always go to 
>>>>> the SSD,
>>>>> but we can accomplish that by just adding a check for nonrotational in 
>>>>> the code,
>>>>> and then we don't have to encode all this nonsense in the file system.  
>>>>> Thanks,
>>>> 
>>>> I wrote about the readmirror policy framework here[2],
>>>> I forgot to link it here, sorry about that, my mistake.
>>>> 
>>>> [2]
>>>> 
>>>> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.j...@oracle.com/
>>>> 
>>>> Readmirror policy is for raid1, raid10 and future N way mirror.
>>>> Yes for now its only for raid1.
>>>> 
>>>> Here the idea is to create a framework so that readmirror policy
>>>> can be configured as needed. And nonrotational can be one such policy.
>>>> 
>>>> The example of hard-coded nonrotational policy does not work in case
>>>> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME block
>>>> device, as all these are still nonrotational devices. So hard-coded
>>>> policy is not a good idea. If we have to hardcode then there is Q-depth
>>>> based readmirror routing is better (patch in the ML), but that is
>>>> not good enough, because some configs wants it based on the disk-LBA
>>>> so that SAN storage target cache is balanced and not duplicated.
>>>> So in short it must be a configurable policy.
>>>> 
>>> 
>>> Again, if you are mixing disk types you likely always want non-rotational, 
>>> but
>>> still mixing different speed devices in a mirror setup is just asking for 
>>> weird
>>> latency problems.  I don't think solving this use case is necessary.  If 
>>> you mix
>>> ssd + network device in a serious production setup then you probably should 
>>> be
>>> fired cause you don't know what you are doing.  Having the generic
>>> "nonrotational gets priority" is going to cover 99% of the actual use cases 
>>> that
>>> make sense.
>>> 
>>> The SAN usecase I can sort of see, but again I don't feel like it's a 
>>> problem we
>>> need to solve with on-disk format.  Add a priority to sysfs so you can 
>>> change it
>>> with udev or something on the fly.  Thanks,
>>> 
>> 
>> Ok.
>> Sysfs is fine however we need configuration to be persistent across reboots.
>> Any idea?
>> 
> 
> Udev rules.  Thanks,
> 

 Josef, configs moving along with the luns in san seems to be more
 easy to manage, otherwise when the host fails, each potential new
 server has to be pre-configured with the udev rules. 

Thanks, Anand

> Josef 

Reply via email to