Re: [PATCH] fuse: avoid unnecessary spinlock bump

2022-04-07 Thread JeffleXu



On 4/7/22 10:10 PM, Vivek Goyal wrote:
> On Sat, Apr 02, 2022 at 06:32:50PM +0800, Jeffle Xu wrote:
>> Move dmap free worker kicker inside the critical region, so that extra
>> spinlock lock/unlock could be avoided.
>>
>> Suggested-by: Liu Jiang 
>> Signed-off-by: Jeffle Xu 
> 
> Looks good to me. Have you done any testing to make sure nothing is
> broken.

xfstests -g quick shows no regression. The tested virtiofs is mounted
with "dax=always".

-- 
Thanks,
Jeffle
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] fuse: avoid unnecessary spinlock bump

2022-04-08 Thread JeffleXu



On 4/8/22 7:25 PM, Vivek Goyal wrote:
> On Fri, Apr 08, 2022 at 10:36:40AM +0800, JeffleXu wrote:
>>
>>
>> On 4/7/22 10:10 PM, Vivek Goyal wrote:
>>> On Sat, Apr 02, 2022 at 06:32:50PM +0800, Jeffle Xu wrote:
>>>> Move dmap free worker kicker inside the critical region, so that extra
>>>> spinlock lock/unlock could be avoided.
>>>>
>>>> Suggested-by: Liu Jiang 
>>>> Signed-off-by: Jeffle Xu 
>>>
>>> Looks good to me. Have you done any testing to make sure nothing is
>>> broken.
>>
>> xfstests -g quick shows no regression. The tested virtiofs is mounted
>> with "dax=always".
> 
> I think xfstests might not trigger reclaim. You probably will have to
> run something like blogbench with a small dax window like 1G so that
> heavy reclaim happens.


Actually, I configured the DAX window to 8MB, i.e., 4 slots, when running
xfstests. Thus I think the reclaim path was most likely triggered.


> 
> For fun, I sometimes used to run it with a window of just say 16 dax
> ranges so that reclaim was so heavy that if there was a bug, it will
> show up.
> 

Yeah, my colleague once reported that a 4KB DAX window caused a hang in
our internal OS (which is based on 4.19; we backported virtiofs to
4.19). But then I found that this issue doesn't exist in the latest
upstream. The reason seems to be that in the upstream kernel,
devm_memremap_pages() called in virtio_fs_setup_dax() fails directly
since the DAX window (4KB) is not aligned with the sparse memory section.

-- 
Thanks,
Jeffle


Re: [PATCH] fuse: avoid unnecessary spinlock bump

2022-04-10 Thread JeffleXu



On 4/8/22 8:06 PM, Vivek Goyal wrote:
> On Fri, Apr 08, 2022 at 07:50:55PM +0800, JeffleXu wrote:
>>
>>
>> On 4/8/22 7:25 PM, Vivek Goyal wrote:
>>> On Fri, Apr 08, 2022 at 10:36:40AM +0800, JeffleXu wrote:
>>>>
>>>>
>>>> On 4/7/22 10:10 PM, Vivek Goyal wrote:
>>>>> On Sat, Apr 02, 2022 at 06:32:50PM +0800, Jeffle Xu wrote:
>>>>>> Move dmap free worker kicker inside the critical region, so that extra
>>>>>> spinlock lock/unlock could be avoided.
>>>>>>
>>>>>> Suggested-by: Liu Jiang 
>>>>>> Signed-off-by: Jeffle Xu 
>>>>>
>>>>> Looks good to me. Have you done any testing to make sure nothing is
>>>>> broken.
>>>>
>>>> xfstests -g quick shows no regression. The tested virtiofs is mounted
>>>> with "dax=always".
>>>
>>> I think xfstests might not trigger reclaim. You probably will have to
>>> run something like blogbench with a small dax window like 1G so that
>>> heavy reclaim happens.
>>
>>
>> Actually, I configured the DAX window to 8MB, i.e. 4 slots when running
>> xfstests. Thus I think the reclaim path is most likely triggered.
>>
>>
>>>
>>> For fun, I sometimes used to run it with a window of just say 16 dax
>>> ranges so that reclaim was so heavy that if there was a bug, it will
>>> show up.
>>>
>>
>> Yeah, my colleague had ever reported that a DAX window of 4KB will cause
>> hang in our internal OS (which is 4.19, we back ported virtiofs to
>> 4.19). But then I found that this issue doesn't exist in the latest
>> upstream. The reason seems that in the upstream kernel,
>> devm_memremap_pages() called in virtio_fs_setup_dax() will fail directly
>> since the dax window (4KB) is not aligned with the sparse memory section.
> 
> Given our default chunk size is 2MB (FUSE_DAX_SHIFT), may be it is not
> a bad idea to enforce some minimum cache window size. IIRC, even one
> range is not enough. Minimum 2 are required for reclaim to not deadlock.

Out of curiosity, why is a minimum of one range not adequate? In which
case is a minimum of two required?


-- 
Thanks,
Jeffle


Re: [PATCH] fuse: avoid unnecessary spinlock bump

2022-04-11 Thread JeffleXu



On 4/11/22 7:52 PM, Vivek Goyal wrote:
> On Mon, Apr 11, 2022 at 10:10:23AM +0800, JeffleXu wrote:
>>
>>
>> On 4/8/22 8:06 PM, Vivek Goyal wrote:
>>> On Fri, Apr 08, 2022 at 07:50:55PM +0800, JeffleXu wrote:
>>>>
>>>>
>>>> On 4/8/22 7:25 PM, Vivek Goyal wrote:
>>>>> On Fri, Apr 08, 2022 at 10:36:40AM +0800, JeffleXu wrote:
>>>>>>
>>>>>>
>>>>>> On 4/7/22 10:10 PM, Vivek Goyal wrote:
>>>>>>> On Sat, Apr 02, 2022 at 06:32:50PM +0800, Jeffle Xu wrote:
>>>>>>>> Move dmap free worker kicker inside the critical region, so that extra
>>>>>>>> spinlock lock/unlock could be avoided.
>>>>>>>>
>>>>>>>> Suggested-by: Liu Jiang 
>>>>>>>> Signed-off-by: Jeffle Xu 
>>>>>>>
>>>>>>> Looks good to me. Have you done any testing to make sure nothing is
>>>>>>> broken.
>>>>>>
>>>>>> xfstests -g quick shows no regression. The tested virtiofs is mounted
>>>>>> with "dax=always".
>>>>>
>>>>> I think xfstests might not trigger reclaim. You probably will have to
>>>>> run something like blogbench with a small dax window like 1G so that
>>>>> heavy reclaim happens.
>>>>
>>>>
>>>> Actually, I configured the DAX window to 8MB, i.e. 4 slots when running
>>>> xfstests. Thus I think the reclaim path is most likely triggered.
>>>>
>>>>
>>>>>
>>>>> For fun, I sometimes used to run it with a window of just say 16 dax
>>>>> ranges so that reclaim was so heavy that if there was a bug, it will
>>>>> show up.
>>>>>
>>>>
>>>> Yeah, my colleague had ever reported that a DAX window of 4KB will cause
>>>> hang in our internal OS (which is 4.19, we back ported virtiofs to
>>>> 4.19). But then I found that this issue doesn't exist in the latest
>>>> upstream. The reason seems that in the upstream kernel,
>>>> devm_memremap_pages() called in virtio_fs_setup_dax() will fail directly
>>>> since the dax window (4KB) is not aligned with the sparse memory section.
>>>
>>> Given our default chunk size is 2MB (FUSE_DAX_SHIFT), may be it is not
>>> a bad idea to enforce some minimum cache window size. IIRC, even one
>>> range is not enough. Minimum 2 are required for reclaim to not deadlock.
>>
>> Curiously, why minimum 1 range is not adequate? In which case minimum 2
>> are required?
> 
> Frankly speaking, right now I don't remember. I have vague memories
> of concluding in the past that 1 range is not sufficient. But if you
> like dive deeper, and try with one range and see if you can introduce
> deadlock.
> 

Alright, thanks.

-- 
Thanks,
Jeffle


Re: [PATCH] fuse: avoid unnecessary spinlock bump

2022-04-12 Thread JeffleXu



On 4/11/22 10:00 PM, Vivek Goyal wrote:
> On Mon, Apr 11, 2022 at 03:20:05PM +0200, Bernd Schubert wrote:
> 
> So for testing DAX, I have to rely on out of tree patches from qemu
> here if any changes in virtiofs client happen.
> 
> https://gitlab.com/virtio-fs/qemu/-/tree/virtio-fs-dev
> 
> Jeffle is probably relying on their own virtiofsd implementation for DAX
> testing.
> 

Actually I also use the C version of virtiofsd from the repository
described above for testing :)

-- 
Thanks,
Jeffle


Re: [PATCH RFC] virtio-blk: support per-device queue depth

2021-01-18 Thread JeffleXu


On 1/18/21 1:25 PM, Jason Wang wrote:
> 
> On 2021/1/18 上午11:58, Joseph Qi wrote:
>> module parameter 'virtblk_queue_depth' was firstly introduced for
>> testing/benchmarking purposes described in commit fc4324b4597c
>> ("virtio-blk: base queue-depth on virtqueue ringsize or module param").
>> Since we have different virtio-blk devices which have different
>> capabilities, it requires that we support per-device queue depth instead
>> of per-module. So defaultly use vq free elements if module parameter
>> 'virtblk_queue_depth' is not set.
> 
> 
> I wonder if it's better to use sysfs instead (or whether it has already
> had something like this in the blocker layer).
> 

"/sys/block//queue/nr_requests" indeed works, but isn't better to
set queue_depth according to the hardware capability at the very first?
AFAIK, nvme just set per-device queue_depth at initializing phase.

Thanks,
Jeffle

> 
> 
>>
>> Signed-off-by: Joseph Qi 
>> ---
>>   drivers/block/virtio_blk.c | 12 +++-
>>   1 file changed, 7 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>> index 145606d..f83a417 100644
>> --- a/drivers/block/virtio_blk.c
>> +++ b/drivers/block/virtio_blk.c
>> @@ -705,6 +705,7 @@ static int virtblk_probe(struct virtio_device *vdev)
>>   u32 v, blk_size, max_size, sg_elems, opt_io_size;
>>   u16 min_io_size;
>>   u8 physical_block_exp, alignment_offset;
>> +    unsigned int queue_depth;
>>     if (!vdev->config->get) {
>>   dev_err(&vdev->dev, "%s failure: config access disabled\n",
>> @@ -755,17 +756,18 @@ static int virtblk_probe(struct virtio_device
>> *vdev)
>>   goto out_free_vq;
>>   }
>>   -    /* Default queue sizing is to fill the ring. */
>> -    if (!virtblk_queue_depth) {
>> -    virtblk_queue_depth = vblk->vqs[0].vq->num_free;
>> +    if (likely(!virtblk_queue_depth)) {
>> +    queue_depth = vblk->vqs[0].vq->num_free;
>>   /* ... but without indirect descs, we use 2 descs per req */
>>   if (!virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC))
>> -    virtblk_queue_depth /= 2;
>> +    queue_depth /= 2;
>> +    } else {
>> +    queue_depth = virtblk_queue_depth;
>>   }
>>     memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
>>   vblk->tag_set.ops = &virtio_mq_ops;
>> -    vblk->tag_set.queue_depth = virtblk_queue_depth;
>> +    vblk->tag_set.queue_depth = queue_depth;
>>   vblk->tag_set.numa_node = NUMA_NO_NODE;
>>   vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
>>   vblk->tag_set.cmd_size =

-- 
Thanks,
Jeffle

Re: [PATCH RFC] virtio-blk: support per-device queue depth

2021-01-19 Thread JeffleXu


On 1/19/21 12:06 PM, Jason Wang wrote:
> 
> On 2021/1/19 上午9:33, JeffleXu wrote:
>>
>> On 1/18/21 1:25 PM, Jason Wang wrote:
>>> On 2021/1/18 上午11:58, Joseph Qi wrote:
>>>> module parameter 'virtblk_queue_depth' was firstly introduced for
>>>> testing/benchmarking purposes described in commit fc4324b4597c
>>>> ("virtio-blk: base queue-depth on virtqueue ringsize or module param").
>>>> Since we have different virtio-blk devices which have different
>>>> capabilities, it requires that we support per-device queue depth
>>>> instead
>>>> of per-module. So defaultly use vq free elements if module parameter
>>>> 'virtblk_queue_depth' is not set.
>>>
>>> I wonder if it's better to use sysfs instead (or whether it has already
>>> had something like this in the blocker layer).
>>>
>> "/sys/block//queue/nr_requests" indeed works, but isn't better to
>> set queue_depth according to the hardware capability at the very first?
>> AFAIK, nvme just set per-device queue_depth at initializing phase.
> 
> 
> I agree, the problem is that the current code may modify module parameter.

The module parameter 'virtblk_queue_depth' actually remains untouched.

Actually it is the original code before this patch that changes the
module parameter. When the module parameter is not set on the boot
cmdline (i.e., it defaults to 0), it will be initialized to the queue
depth of the vring of the first probed virtio-blk device, and will be
revealed to user space through
'/sys/module/virtio_blk/parameters/queue_depth'. I'm not sure whether
this behavior is reasonable or not.

The only side effect of this patch is that
'/sys/module/virtio_blk/parameters/queue_depth' will now stay '0' when
the module parameter is not set manually.


Thanks,
Jeffle


>>
>>>
>>>> Signed-off-by: Joseph Qi 
>>>> ---
>>>>    drivers/block/virtio_blk.c | 12 +++-
>>>>    1 file changed, 7 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>>>> index 145606d..f83a417 100644
>>>> --- a/drivers/block/virtio_blk.c
>>>> +++ b/drivers/block/virtio_blk.c
>>>> @@ -705,6 +705,7 @@ static int virtblk_probe(struct virtio_device
>>>> *vdev)
>>>>    u32 v, blk_size, max_size, sg_elems, opt_io_size;
>>>>    u16 min_io_size;
>>>>    u8 physical_block_exp, alignment_offset;
>>>> +    unsigned int queue_depth;
>>>>      if (!vdev->config->get) {
>>>>    dev_err(&vdev->dev, "%s failure: config access disabled\n",
>>>> @@ -755,17 +756,18 @@ static int virtblk_probe(struct virtio_device
>>>> *vdev)
>>>>    goto out_free_vq;
>>>>    }
>>>>    -    /* Default queue sizing is to fill the ring. */
>>>> -    if (!virtblk_queue_depth) {
>>>> -    virtblk_queue_depth = vblk->vqs[0].vq->num_free;
>>>> +    if (likely(!virtblk_queue_depth)) {
>>>> +    queue_depth = vblk->vqs[0].vq->num_free;
>>>>    /* ... but without indirect descs, we use 2 descs per req */
>>>>    if (!virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC))
>>>> -    virtblk_queue_depth /= 2;
>>>> +    queue_depth /= 2;
>>>> +    } else {
>>>> +    queue_depth = virtblk_queue_depth;
>>>>    }
>>>>      memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
>>>>    vblk->tag_set.ops = &virtio_mq_ops;
>>>> -    vblk->tag_set.queue_depth = virtblk_queue_depth;
>>>> +    vblk->tag_set.queue_depth = queue_depth;
>>>>    vblk->tag_set.numa_node = NUMA_NO_NODE;
>>>>    vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
>>>>    vblk->tag_set.cmd_size =

-- 
Thanks,
Jeffle

Re: [RFC PATCH 3/3] fuse: add per-file DAX flag

2021-07-15 Thread JeffleXu



On 7/16/21 8:51 AM, Vivek Goyal wrote:
> On Fri, Jul 16, 2021 at 08:40:29AM +0800, Liu Bo wrote:
>> On Thu, Jul 15, 2021 at 05:30:31PM +0800, Jeffle Xu wrote:
>>> Add one flag for fuse_attr.flags indicating if DAX shall be enabled for
>>> this file.
>>>
>>> When the per-file DAX flag changes for an *opened* file, the state of
>>> the file won't be updated until this file is closed and reopened later.
>>>
>>> Currently it is not implemented yet to change per-file DAX flag inside
>>> guest kernel, e.g., by chattr(1).
>>
>> Thanks for the patch, it looks good to me.
>>
>> I think it's a good starting point, what I'd like to discuss here is
>> whether we're going to let chattr to toggle the dax flag.
> 
> I have the same question. Why not take chattr approach as taken
> by ext4/xfs as well.
> 
> Vivek

Thanks.

We can implement the chattr approach as ext4/xfs do, if there is a use
case for it. It's an RFC patch, and I want to collect feedback as early
as possible.

-- 
Thanks,
Jeffle


Re: [RFC PATCH 3/3] fuse: add per-file DAX flag

2021-07-15 Thread JeffleXu



On 7/16/21 9:32 AM, Vivek Goyal wrote:
> On Fri, Jul 16, 2021 at 09:18:34AM +0800, JeffleXu wrote:
>>
>>
>> On 7/16/21 8:51 AM, Vivek Goyal wrote:
>>> On Fri, Jul 16, 2021 at 08:40:29AM +0800, Liu Bo wrote:
>>>> On Thu, Jul 15, 2021 at 05:30:31PM +0800, Jeffle Xu wrote:
>>>>> Add one flag for fuse_attr.flags indicating if DAX shall be enabled for
>>>>> this file.
>>>>>
>>>>> When the per-file DAX flag changes for an *opened* file, the state of
>>>>> the file won't be updated until this file is closed and reopened later.
>>>>>
>>>>> Currently it is not implemented yet to change per-file DAX flag inside
>>>>> guest kernel, e.g., by chattr(1).
>>>>
>>>> Thanks for the patch, it looks good to me.
>>>>
>>>> I think it's a good starting point, what I'd like to discuss here is
>>>> whether we're going to let chattr to toggle the dax flag.
>>>
>>> I have the same question. Why not take chattr approach as taken
>>> by ext4/xfs as well.
>>>
>>> Vivek
>>
>> Thanks.
>>
>> We can implement the chattr approach as ext4/xfs do, if we have this use
>> scenario. It's an RFC patch, and I want to collect more feedback as soon
>> as possible.
> 
> I guess chattr approach will allow client (as well as server) to control
> which files should be DAX. While this approach allows only server to
> specify which files should use DAX. Given currently we let client
> control whether to use dax or not (-o dax), it probably will make
> sense to use chattr based approach?

Yes, changing the per-file DAX flag from the guest side, such as by
chattr, may be needed for completeness. I will include this in the next
version.

> 
> I will look at the patches. Do you have a corresponding user space
> implementation somewhere so that I can test it?

Thanks. I have sent the corresponding patch in reply to your mail.

-- 
Thanks,
Jeffle


Re: [PATCH v2 0/4] virtiofs,fuse: support per-file DAX

2021-07-19 Thread JeffleXu



On 7/20/21 5:30 AM, Vivek Goyal wrote:
> On Fri, Jul 16, 2021 at 06:47:49PM +0800, Jeffle Xu wrote:
>> This patchset adds support of per-file DAX for virtiofs, which is
>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>
>> There are four related scenarios:
>> 1. Alloc inode: get per-file DAX flag from fuse_attr.flags. (patch 3)
>> 2. Per-file DAX flag changes when the file has been opened. (patch 3)
>> In this case, the dentry and inode are all marked as DONT_CACHE, and
>> the DAX state won't be updated until the file is closed and reopened
>> later.
>> 3. Users can change the per-file DAX flag inside the guest by chattr(1).
>> (patch 4)
>> 4. Create new files under directories with DAX enabled. When creating
>> new files in ext4/xfs on host, the new created files will inherit the
>> per-file DAX flag from the directory, and thus the new created files in
>> virtiofs will also inherit the per-file DAX flag if the fuse server
>> derives fuse_attr.flags from the underlying ext4/xfs inode's per-file
>> DAX flag.
> 
> Thinking little bit more about this from requirement perspective. I think
> we are trying to address two use cases here.
> 
> A. Client does not know which files DAX should be used on. Only server
>knows it and server passes this information to client. I suspect
>that's your primary use case.

Yes, this is the starting point of this patch set.

> 
> B. Client is driving which files are supposed to be using DAX. This is
>exactly same as the model ext4/xfs are using by storing a persistent
>flag on inode. 
> 
> Current patches seem to be a hybrid of both approach A and B. 
> 
> If we were to implement B, then fuse client probably needs to have the
> capability to query FS_XFLAG_DAX on inode and decide whether to
> enable DAX or not. (Without extra round trip). Or know it indirectly
> by extending GETATTR and requesting this explicitly.

If the guest queries whether a file is DAX capable via an extra GETATTR,
I'm afraid this will add an extra round trip.

Alternatively, if we add the DAX flag (ATTR_DAX) by extending LOOKUP, as
done by this patch set, then the FUSE server needs to set ATTR_DAX
according to the FS_XFLAG_DAX flag of the backend file, *to make a
FS_XFLAG_DAX flag previously set by the FUSE client work*. It then
becomes a *mandatory* requirement for FUSE server implementations, i.e.,
part of the FUSE protocol rather than implementation specific. The FUSE
server can still implement some other algorithm for deciding whether to
set ATTR_DAX or not, but it must set ATTR_DAX once the backend file is
flagged with FS_XFLAG_DAX.

Besides, as you said, the FUSE server then needs one extra
GETFLAGS/FSGETXATTR ioctl per LOOKUP in this case. To eliminate this
cost, we could negotiate the per-file DAX capability during FUSE_INIT.
Only when the per-file DAX capability is negotiated will the FUSE server
do the extra GETFLAGS/FSGETXATTR ioctl and set the ATTR_DAX flag when
handling LOOKUP.


Personally, I prefer the second way, i.e., extending LOOKUP (adding
ATTR_DAX), to eliminate the extra round trip.

> 
> If we were only implementing A, then server does not have a way to 
> tell client to enable DAX. Server can either look at FS_XFLAG_DAX
> and decide to enable DAX or use some other property. Given querying
> FS_XFLAG_DAX will be an extra ioctl() on every inode lookup/getattr,
> it probably will be a server option. But enabling on server does not
> mean client will enable it.

As I said previously, whether this per-file DAX capability is needed can
be negotiated. The guest can advertise the capability when '-o
dax=inode' is configured.

> 
> I think my primary concern with this patch right now is trying to
> figure out which requirement we are trying to cater to first and how
> to connect server and client well so they both understand what mode
> they are operating in and interact well.
> 


-- 
Thanks,
Jeffle


Re: [PATCH v2 2/4] fuse: Make DAX mount option a tri-state

2021-07-19 Thread JeffleXu



On 7/20/21 2:02 AM, Vivek Goyal wrote:
> On Fri, Jul 16, 2021 at 06:47:51PM +0800, Jeffle Xu wrote:
>> We add 'always', 'never', and 'inode' (default). '-o dax' continues to
>> operate the same which is equivalent to 'always'.
>>
>> By the time this patch is applied, 'inode' mode is actually equal to
>> 'always' mode, before the per-file DAX flag is introduced in the
>> following patch.
>>
>> Signed-off-by: Jeffle Xu 
>> ---
>>  fs/fuse/dax.c   | 13 -
>>  fs/fuse/fuse_i.h| 11 +--
>>  fs/fuse/inode.c |  2 +-
>>  fs/fuse/virtio_fs.c | 16 ++--
>>  4 files changed, 36 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
>> index c6f4e82e65f3..a478e824c2d0 100644
>> --- a/fs/fuse/dax.c
>> +++ b/fs/fuse/dax.c
>> @@ -70,6 +70,9 @@ struct fuse_inode_dax {
>>  };
>>  
>>  struct fuse_conn_dax {
>> +/** dax mode: FUSE_DAX_MOUNT_* (always, never or per-file) **/
>> +unsigned int mode;
> 
> Why "/**" ?

I copied this comment style from fuse in v4.19... Anyway, I will fix this.

> 
> How about make it something like "enum fuse_dax_mode mode" instead?
> 
> enum fuse_dax_mode dax_mode;

OK.

> 
>> +
>>  /* DAX device */
>>  struct dax_device *dev;
>>  
>> @@ -1288,7 +1291,8 @@ static int fuse_dax_mem_range_init(struct 
>> fuse_conn_dax *fcd)
>>  return ret;
>>  }
>>  
>> -int fuse_dax_conn_alloc(struct fuse_conn *fc, struct dax_device *dax_dev)
>> +int fuse_dax_conn_alloc(struct fuse_conn *fc, unsigned int mode,
>> +struct dax_device *dax_dev)
>>  {
>>  struct fuse_conn_dax *fcd;
>>  int err;
>> @@ -1301,6 +1305,7 @@ int fuse_dax_conn_alloc(struct fuse_conn *fc, struct 
>> dax_device *dax_dev)
>>  return -ENOMEM;
>>  
>>  spin_lock_init(&fcd->lock);
>> +fcd->mode = mode;
>>  fcd->dev = dax_dev;
>>  err = fuse_dax_mem_range_init(fcd);
>>  if (err) {
>> @@ -1339,10 +1344,16 @@ static const struct address_space_operations 
>> fuse_dax_file_aops  = {
>>  static bool fuse_should_enable_dax(struct inode *inode)
>>  {
>>  struct fuse_conn *fc = get_fuse_conn(inode);
>> +unsigned int mode;
>>  
>>  if (!fc->dax)
>>  return false;
>>  
>> +mode = fc->dax->mode;
>> +
>> +if (mode == FUSE_DAX_MOUNT_NEVER)
>> +return false;
>> +
>>  return true;
>>  }
>>  
>> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
>> index 07829ce78695..f29018323845 100644
>> --- a/fs/fuse/fuse_i.h
>> +++ b/fs/fuse/fuse_i.h
>> @@ -487,6 +487,12 @@ struct fuse_dev {
>>  struct list_head entry;
>>  };
>>  
>> +enum {
> And this becomes.
> 
> enum fuse_dax_mode {
> };

OK.

> 
>> +FUSE_DAX_MOUNT_INODE,
>> +FUSE_DAX_MOUNT_ALWAYS,
>> +FUSE_DAX_MOUNT_NEVER,
>> +};
> 
> How about getting rid of "MOUNT" and just do.
> 
>   FUSE_DAX_INODE,
>   FUSE_DAX_ALWAYS,
>   FUSE_DAX_NEVER,

OK.

> 
>> +
>>  struct fuse_fs_context {
>>  int fd;
>>  unsigned int rootmode;
>> @@ -503,7 +509,7 @@ struct fuse_fs_context {
>>  bool no_control:1;
>>  bool no_force_umount:1;
>>  bool legacy_opts_show:1;
>> -bool dax:1;
>> +unsigned int dax;
> 
> enum fuse_dax_mode dax_mode;

OK.

> 
>>  unsigned int max_read;
>>  unsigned int blksize;
>>  const char *subtype;
>> @@ -1242,7 +1248,8 @@ ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct 
>> iov_iter *to);
>>  ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
>>  int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma);
>>  int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start, u64 
>> dmap_end);
>> -int fuse_dax_conn_alloc(struct fuse_conn *fc, struct dax_device *dax_dev);
>> +int fuse_dax_conn_alloc(struct fuse_conn *fc, unsigned int mode,
>  ^^
>   enum fuse_dax_mode dax_mode

OK.

>> +struct dax_device *dax_dev);
>>  void fuse_dax_conn_free(struct fuse_conn *fc);
>>  bool fuse_dax_inode_alloc(struct super_block *sb, struct fuse_inode *fi);
>>  void fuse_dax_inode_init(struct inode *inode);
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index b9beb39a4a18..f6b46395edb2 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -1434,7 +1434,7 @@ int fuse_fill_super_common(struct super_block *sb, 
>> struct fuse_fs_context *ctx)
>>  sb->s_subtype = ctx->subtype;
>>  ctx->subtype = NULL;
>>  if (IS_ENABLED(CONFIG_FUSE_DAX)) {
>> -err = fuse_dax_conn_alloc(fc, ctx->dax_dev);
>> +err = fuse_dax_conn_alloc(fc, ctx->dax, ctx->dax_dev);
>>  if (err)
>>  goto err;
>>  }
>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
>> index 8f52cdaa8445..561f711d1945 100644
>> --- a/fs/fuse/virtio_fs.c
>> +++ b/fs/fuse/virtio_fs.c
>> @@ -88,12 +88,21 @@ struct virtio_fs_req_work {
>>  static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
>> 

Re: [PATCH v2 3/4] fuse: add per-file DAX flag

2021-07-19 Thread JeffleXu



On 7/20/21 3:44 AM, Vivek Goyal wrote:
> On Fri, Jul 16, 2021 at 06:47:52PM +0800, Jeffle Xu wrote:
>> Add one flag for fuse_attr.flags indicating if DAX shall be enabled for
>> this file.
>>
>> When the per-file DAX flag changes for an *opened* file, the state of
>> the file won't be updated until this file is closed and reopened later.
>>
>> Signed-off-by: Jeffle Xu 
>> ---
>>  fs/fuse/dax.c | 21 +
>>  fs/fuse/file.c|  4 ++--
>>  fs/fuse/fuse_i.h  |  5 +++--
>>  fs/fuse/inode.c   |  5 -
>>  include/uapi/linux/fuse.h |  5 +
>>  5 files changed, 31 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
>> index a478e824c2d0..0e862119757a 100644
>> --- a/fs/fuse/dax.c
>> +++ b/fs/fuse/dax.c
>> @@ -1341,7 +1341,7 @@ static const struct address_space_operations 
>> fuse_dax_file_aops  = {
>>  .invalidatepage = noop_invalidatepage,
>>  };
>>  
>> -static bool fuse_should_enable_dax(struct inode *inode)
>> +static bool fuse_should_enable_dax(struct inode *inode, unsigned int flags)
>>  {
>>  struct fuse_conn *fc = get_fuse_conn(inode);
>>  unsigned int mode;
>> @@ -1354,18 +1354,31 @@ static bool fuse_should_enable_dax(struct inode 
>> *inode)
>>  if (mode == FUSE_DAX_MOUNT_NEVER)
>>  return false;
>>  
>> -return true;
>> +if (mode == FUSE_DAX_MOUNT_ALWAYS)
>> +return true;
>> +
>> +WARN_ON(mode != FUSE_DAX_MOUNT_INODE);
>> +return flags & FUSE_ATTR_DAX;
>>  }
>>  
>> -void fuse_dax_inode_init(struct inode *inode)
>> +void fuse_dax_inode_init(struct inode *inode, unsigned int flags)
>>  {
>> -if (!fuse_should_enable_dax(inode))
>> +if (!fuse_should_enable_dax(inode, flags))
>>  return;
>>  
>>  inode->i_flags |= S_DAX;
>>  inode->i_data.a_ops = &fuse_dax_file_aops;
>>  }
>>  
>> +void fuse_dax_dontcache(struct inode *inode, bool newdax)
>> +{
>> +struct fuse_conn *fc = get_fuse_conn(inode);
>> +
>> +if (fc->dax && fc->dax->mode == FUSE_DAX_MOUNT_INODE &&
>> +IS_DAX(inode) != newdax)
>> +d_mark_dontcache(inode);
>> +}
>> +
> 
> This capability to mark an inode dontcache should probably be in a
> separate patch. These seem to logically two functionalities. One is
> enabling DAX on an inode. And second is making sure how soon you
> see the effect of that change and hence marking inode dontcache.

OK, sounds reasonable.

> 
> Not sure how useful this is. In cache=none mode we should get rid of
> inode ASAP. In cache=auto mode we will get rid of after 1 second (or
> after a user specified timeout). So only place this seems to be
> useful is cache=always.

Actually dontcache here is used to avoid dynamically switching between
the DAX and non-DAX states while a file is open. The complexity of
dynamic switching is that you have to clear the address_space, since
page cache pages and DAX entries cannot coexist in the address space.
Besides, inode->a_ops also needs to be changed dynamically.

With dontcache, dynamic switching is no longer needed and the DAX state
is decided only when the in-memory inode is initialized. The downside is
that the new DAX state won't take effect until the file is closed and
reopened later.

'cache=none' only invalidates the dentry, while the in-memory inode is
still there (with the address_space uncleared and a_ops unchanged).

Dynamic switching could be implemented, though it's not that
straightforward. Currently, ext4/xfs are both implemented in this
dontcache way, i.e., the new DAX state won't be seen until the file is
closed and reopened later.

-- 
Thanks,
Jeffle


Re: [PATCH v2 3/4] fuse: add per-file DAX flag

2021-07-20 Thread JeffleXu



On 7/20/21 2:41 AM, Vivek Goyal wrote:
> On Fri, Jul 16, 2021 at 06:47:52PM +0800, Jeffle Xu wrote:
>> Add one flag for fuse_attr.flags indicating if DAX shall be enabled for
>> this file.
>>
>> When the per-file DAX flag changes for an *opened* file, the state of
>> the file won't be updated until this file is closed and reopened later.
>>
>> Signed-off-by: Jeffle Xu 
> 
> [..]
>> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
>> index 36ed092227fa..90c9df10d37a 100644
>> --- a/include/uapi/linux/fuse.h
>> +++ b/include/uapi/linux/fuse.h
>> @@ -184,6 +184,9 @@
>>   *
>>   *  7.34
>>   *  - add FUSE_SYNCFS
>> + *
>> + *  7.35
>> + *  - add FUSE_ATTR_DAX
>>   */
>>  
>>  #ifndef _LINUX_FUSE_H
>> @@ -449,8 +452,10 @@ struct fuse_file_lock {
>>   * fuse_attr flags
>>   *
>>   * FUSE_ATTR_SUBMOUNT: Object is a submount root
>> + * FUSE_ATTR_DAX: Enable DAX for this file in per-file DAX mode
>>   */
>>  #define FUSE_ATTR_SUBMOUNT  (1 << 0)
>> +#define FUSE_ATTR_DAX   (1 << 1)
> 
> Generic fuse changes (addition of FUSE_ATTR_DAX) should probably in
> a separate patch. 

Got it.

> 
> I am not clear on one thing. If we are planning to rely on persistent
> inode attr (FS_XFLAG_DAX as per Documentation/filesystems/dax.rst), then
> why fuse server needs to communicate the state of that attr using a 
> flag? Can client directly query it?  I am not sure where at these
> attrs stored and if fuse protocol currently supports it.

There are two issues here.

1. FUSE server side: the algorithm for deciding whether DAX is enabled
for a file.

As I previously replied in [1], the FUSE server must enable DAX if the
backend file is flagged with FS_XFLAG_DAX, to make a FS_XFLAG_DAX flag
previously set by the FUSE client effective.

But I would argue that the FUSE server also has flexibility in the
algorithm implementation. Even if the guest queries FS_XFLAG_DAX by the
GETFLAGS/FSGETXATTR ioctl, the FUSE server can still enable DAX when the
backend file is not flagged with FS_XFLAG_DAX.


2. The protocol between server and client:

extending LOOKUP vs. LOOKUP plus a GETFLAGS/FSGETXATTR ioctl.

As I said in [1], the client can directly query the FS_XFLAG_DAX flag,
but there will be one more round trip.


[1]
https://lore.kernel.org/linux-fsdevel/031efb1d-7c0d-35fb-c147-dcc3b6cac...@linux.alibaba.com/T/#m3f3407158b2c028694c85d82d0d6bd0387f4e24e

> 
> What about flag STATX_ATTR_DAX. We probably should report that too
> in stat if we are using dax on the inode?
> 

VFS will automatically report STATX_ATTR_DAX if inode is in DAX mode,
e.g., in vfs_getattr_nosec().



-- 
Thanks,
Jeffle
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 3/4] fuse: add per-file DAX flag

2021-07-20 Thread JeffleXu



On 7/20/21 2:51 PM, JeffleXu wrote:
> 
> 
> On 7/20/21 3:44 AM, Vivek Goyal wrote:
>> On Fri, Jul 16, 2021 at 06:47:52PM +0800, Jeffle Xu wrote:
>>> Add one flag for fuse_attr.flags indicating if DAX shall be enabled for
>>> this file.
>>>
>>> When the per-file DAX flag changes for an *opened* file, the state of
>>> the file won't be updated until this file is closed and reopened later.
>>>
>>> Signed-off-by: Jeffle Xu 
>>> ---
>>>  fs/fuse/dax.c | 21 +
>>>  fs/fuse/file.c|  4 ++--
>>>  fs/fuse/fuse_i.h  |  5 +++--
>>>  fs/fuse/inode.c   |  5 -
>>>  include/uapi/linux/fuse.h |  5 +
>>>  5 files changed, 31 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
>>> index a478e824c2d0..0e862119757a 100644
>>> --- a/fs/fuse/dax.c
>>> +++ b/fs/fuse/dax.c
>>> @@ -1341,7 +1341,7 @@ static const struct address_space_operations 
>>> fuse_dax_file_aops  = {
>>> .invalidatepage = noop_invalidatepage,
>>>  };
>>>  
>>> -static bool fuse_should_enable_dax(struct inode *inode)
>>> +static bool fuse_should_enable_dax(struct inode *inode, unsigned int flags)
>>>  {
>>> struct fuse_conn *fc = get_fuse_conn(inode);
>>> unsigned int mode;
>>> @@ -1354,18 +1354,31 @@ static bool fuse_should_enable_dax(struct inode 
>>> *inode)
>>> if (mode == FUSE_DAX_MOUNT_NEVER)
>>> return false;
>>>  
>>> -   return true;
>>> +   if (mode == FUSE_DAX_MOUNT_ALWAYS)
>>> +   return true;
>>> +
>>> +   WARN_ON(mode != FUSE_DAX_MOUNT_INODE);
>>> +   return flags & FUSE_ATTR_DAX;
>>>  }
>>>  
>>> -void fuse_dax_inode_init(struct inode *inode)
>>> +void fuse_dax_inode_init(struct inode *inode, unsigned int flags)
>>>  {
>>> -   if (!fuse_should_enable_dax(inode))
>>> +   if (!fuse_should_enable_dax(inode, flags))
>>> return;
>>>  
>>> inode->i_flags |= S_DAX;
>>> inode->i_data.a_ops = &fuse_dax_file_aops;
>>>  }
>>>  
>>> +void fuse_dax_dontcache(struct inode *inode, bool newdax)
>>> +{
>>> +   struct fuse_conn *fc = get_fuse_conn(inode);
>>> +
>>> +   if (fc->dax && fc->dax->mode == FUSE_DAX_MOUNT_INODE &&
>>> +   IS_DAX(inode) != newdax)
>>> +   d_mark_dontcache(inode);
>>> +}
>>> +
>>
>> This capability to mark an inode dontcache should probably be in a
>> separate patch. These seem to logically two functionalities. One is
>> enabling DAX on an inode. And second is making sure how soon you
>> see the effect of that change and hence marking inode dontcache.
> 
> OK, sounds reasonable.
> 
>>
>> Not sure how useful this is. In cache=none mode we should get rid of
>> inode ASAP. In cache=auto mode we will get rid of after 1 second (or
>> after a user specified timeout). So only place this seems to be
>> useful is cache=always.
> 
> Actually dontcache here is used to avoid dynamic switching between DAX
> and non-DAX state while file is opened. The complexity of dynamic
> switching is that, you have to clear the address_space, since page cache
> and DAX entry can not coexist in the address space. Besides,
> inode->a_ops also needs to be changed dynamically.

FYI, here is some context on the complications of dynamically switching
the DAX state, from when Ira Weiny was working on per-file DAX for
ext4/xfs:

> At LSF/MM we discussed the difficulties of switching the DAX state of
> a file with active mappings / page cache.  It was thought the races
> could be avoided by limiting DAX state flips to 0-length files.
>
> However, this turns out to not be true.[3][5] This is because address
> space operations (a_ops) may be in use at any time the inode is
> referenced.

from Ira Weiny

[1]
https://patchwork.kernel.org/project/xfs/cover/20200407182958.568475-1-ira.we...@intel.com/
[2] https://lore.kernel.org/lkml/20200305155144.ga5...@lst.de/
[3] https://lore.kernel.org/lkml/20200401040021.GC56958@magnolia/
[4] https://lore.kernel.org/lkml/20200403182904.GP80283@magnolia/

> 
> With dontcache, dynamic switching is no longer needed and the DAX state
> will be decided only when inode (in memory) is initialized. The downside
> is that the new DAX state won't be updated until the file is closed and
> reopened later.
> 
> 'cache=none' only invalidates dentry, while the inode (in memory) is
> still there (with address_space uncleared and a_ops unchanged).
> 
> The dynamic switching may be done, though it's not such straightforward.
> Currently, ext4/xfs are all implemented in this dontcache way, i.e., the
> new DAX state won't be seen until the file is closed and reopened later.
> 

-- 
Thanks,
Jeffle


Re: [PATCH v2 0/4] virtiofs,fuse: support per-file DAX

2021-07-21 Thread JeffleXu



On 7/21/21 3:18 AM, Vivek Goyal wrote:
> On Tue, Jul 20, 2021 at 01:25:11PM +0800, JeffleXu wrote:
>>
>>
>> On 7/20/21 5:30 AM, Vivek Goyal wrote:
>>> On Fri, Jul 16, 2021 at 06:47:49PM +0800, Jeffle Xu wrote:
>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>>
>>>> There are four related scenarios:
>>>> 1. Alloc inode: get per-file DAX flag from fuse_attr.flags. (patch 3)
>>>> 2. Per-file DAX flag changes when the file has been opened. (patch 3)
>>>> In this case, the dentry and inode are all marked as DONT_CACHE, and
>>>> the DAX state won't be updated until the file is closed and reopened
>>>> later.
>>>> 3. Users can change the per-file DAX flag inside the guest by chattr(1).
>>>> (patch 4)
>>>> 4. Create new files under directories with DAX enabled. When creating
>>>> new files in ext4/xfs on host, the new created files will inherit the
>>>> per-file DAX flag from the directory, and thus the new created files in
>>>> virtiofs will also inherit the per-file DAX flag if the fuse server
>>>> derives fuse_attr.flags from the underlying ext4/xfs inode's per-file
>>>> DAX flag.
>>>
>>> Thinking little bit more about this from requirement perspective. I think
>>> we are trying to address two use cases here.
>>>
>>> A. Client does not know which files DAX should be used on. Only server
>>>knows it and server passes this information to client. I suspect
>>>that's your primary use case.
>>
>> Yes, this is the starting point of this patch set.
>>
>>>
>>> B. Client is driving which files are supposed to be using DAX. This is
>>>exactly same as the model ext4/xfs are using by storing a persistent
>>>flag on inode. 
>>>
>>> Current patches seem to be a hybrid of both approach A and B. 
>>>
>>> If we were to implement B, then fuse client probably needs to have the
>>> capability to query FS_XFLAG_DAX on inode and decide whether to
>>> enable DAX or not. (Without extra round trip). Or know it indirectly
>>> by extending GETATTR and requesting this explicitly.
>>
>> If guest queries if the file is DAX capable or not by an extra GETATTR,
>> I'm afraid this will add extra round trip.
>>
>> Or if we add the DAX flag (ATTR_DAX) by extending LOOKUP, as done by
>> this patch set, then the FUSE server needs to set ATTR_DAX according to
>> the FS_XFLAG_DAX of the backend files, *to make the FS_XFLAG_DAX flag
>> previously set by FUSE client work*. Then it becomes a *mandatory*
>> requirement when implementing FUSE server. i.e., it becomes part of the
>> FUSE protocol rather than implementation specific. FUSE server can still
>> implement some other algorithm deciding whether to set ATTR_DAX or not,
>> though it must set ATTR_DAX once the backend file is flagged with
>> FS_XFLAG_DAX.
>>
>> Besides, as you said, FUSE server needs to add one extra
>> GETFLAGS/FSGETXATTR ioctl per LOOKUP in this case. To eliminate this
>> cost, we could negotiate the per-file DAX capability during FUSE_INIT.
>> Only when the per-file DAX capability is negotiated, will the FUSE
>> server do extra GETFLAGS/FSGETXATTR ioctl and set ATTR_DAX flag when
>> doing LOOKUP.
>>
>>
>> Personally, I prefer the second way, i.e., by extending LOOKUP (adding
>> ATTR_DAX), to eliminate the extra round trip.
> 
> Negotiating a fuse feature say FUSE_FS_XFLAG_DAX makes sense. If
> client is mounted with "-o dax=inode", then client will negotitate
> this feature and if server does not support it, mount can fail.
> 
> But this probably will be binding on server that it needs to return
> the state of FS_XFLAG_DAX attr on inode upon lookup/getattr. I don't
> this will allow server to implement its own separate policy which
> does not follow FS_XFLAG_DAX xattr. 

That means the backend fs must be ext4/xfs supporting the per-file DAX
feature.

If given more freedom to construct its own policy, the FUSE server
could support per-file DAX on other backend filesystems, though setting
FS_XFLAG_DAX from inside the guest would always fail in that case.

> 
> IOW, I don't think server can choose to implement its own policy
> for enabling dax for "-o dax=inode".
> 
> If there is really a need to for something new where server needs
> to dynamically decide which inodes should use dax (and not use
> FS_XFLAG_FS), I guess that probably should be a separ

Re: [PATCH v2 3/4] fuse: add per-file DAX flag

2021-07-21 Thread JeffleXu



On 7/21/21 3:40 AM, Vivek Goyal wrote:
> On Tue, Jul 20, 2021 at 03:19:50PM +0800, JeffleXu wrote:
>>
>>
>> On 7/20/21 2:41 AM, Vivek Goyal wrote:
>>> On Fri, Jul 16, 2021 at 06:47:52PM +0800, Jeffle Xu wrote:
>>>> Add one flag for fuse_attr.flags indicating if DAX shall be enabled for
>>>> this file.
>>>>
>>>> When the per-file DAX flag changes for an *opened* file, the state of
>>>> the file won't be updated until this file is closed and reopened later.
>>>>
>>>> Signed-off-by: Jeffle Xu 
>>>
>>> [..]
>>>> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
>>>> index 36ed092227fa..90c9df10d37a 100644
>>>> --- a/include/uapi/linux/fuse.h
>>>> +++ b/include/uapi/linux/fuse.h
>>>> @@ -184,6 +184,9 @@
>>>>   *
>>>>   *  7.34
>>>>   *  - add FUSE_SYNCFS
>>>> + *
>>>> + *  7.35
>>>> + *  - add FUSE_ATTR_DAX
>>>>   */
>>>>  
>>>>  #ifndef _LINUX_FUSE_H
>>>> @@ -449,8 +452,10 @@ struct fuse_file_lock {
>>>>   * fuse_attr flags
>>>>   *
>>>>   * FUSE_ATTR_SUBMOUNT: Object is a submount root
>>>> + * FUSE_ATTR_DAX: Enable DAX for this file in per-file DAX mode
>>>>   */
>>>>  #define FUSE_ATTR_SUBMOUNT  (1 << 0)
>>>> +#define FUSE_ATTR_DAX (1 << 1)
>>>
>>> Generic fuse changes (addition of FUSE_ATTR_DAX) should probably in
>>> a separate patch. 
>>
>> Got it.
>>
>>>
>>> I am not clear on one thing. If we are planning to rely on persistent
>>> inode attr (FS_XFLAG_DAX as per Documentation/filesystems/dax.rst), then
>>> why fuse server needs to communicate the state of that attr using a 
>>> flag? Can client directly query it?  I am not sure where at these
>>> attrs stored and if fuse protocol currently supports it.
>>
>> There are two issues.
>>
>> 1. FUSE server side: Algorithm of deciding whether DAX is enabled for a
>> file.
>>
>> As I previously replied in [1], FUSE server must enable DAX if the
>> backend file is flagged with FS_XFLAG_DAX, to make the FS_XFLAG_DAX
>> previously set by FUSE client effective.
>>
>> But I will argue that FUSE server also has the flexibility of the
>> algorithm implementation. Even if guest queries FS_XFLAG_DAX by
>> GETFLAGS/FSGETXATTR ioctl, FUSE server can still enable DAX when the
>> backend file is not FS_XFLAG_DAX flagged.
>>
>>
>> 2. The protocol between server and client.
>>
>> extending LOOKUP vs. LOOKUP + GETFLAGS/FSGETXATTR ioctl
>>
>> As I said in [1], client can directly query the FS_XFLAG_DAX flag, but
>> there will be one more round trip.
>>
>>
>> [1]
>> https://lore.kernel.org/linux-fsdevel/031efb1d-7c0d-35fb-c147-dcc3b6cac...@linux.alibaba.com/T/#m3f3407158b2c028694c85d82d0d6bd0387f4e24e
>>
>>>
>>> What about flag STATX_ATTR_DAX. We probably should report that too
>>> in stat if we are using dax on the inode?
>>>
>>
>> VFS will automatically report STATX_ATTR_DAX if inode is in DAX mode,
>> e.g., in vfs_getattr_nosec().
> 
> Good to know. Given user will know which files are using dax and 
> which ones are not, it is even more important to define semantics
> properly. In what cases DAX will be driven by FS_XFLAGS_DAX attr
> and in what cases DAX will completely be driven by server.
> 
> May be we should divide it in two patch series. First patch series
> implements "-o dax=inode" and server follows FS_XFLAGS_DAX attr
> and reports during lookup/getattr/. 

Agreed, '-o dax=inode' with a policy based on the FS_XFLAG_DAX attr
could be implemented as the first step.

> 
> And once that is merged this can be ehanced with "-o dax=server" where
> server is free to choose what files dax should be used on. Only if
> this is still needed.

I also need to discuss our use case with my colleagues, and whether the
FS_XFLAG_DAX policy is enough.


-- 
Thanks,
Jeffle


Re: [PATCH v2 3/4] fuse: add per-file DAX flag

2021-07-21 Thread JeffleXu



On 7/21/21 3:27 AM, Vivek Goyal wrote:
> On Tue, Jul 20, 2021 at 02:51:34PM +0800, JeffleXu wrote:
>>
>>
>> On 7/20/21 3:44 AM, Vivek Goyal wrote:
>>> On Fri, Jul 16, 2021 at 06:47:52PM +0800, Jeffle Xu wrote:
>>>> Add one flag for fuse_attr.flags indicating if DAX shall be enabled for
>>>> this file.
>>>>
>>>> When the per-file DAX flag changes for an *opened* file, the state of
>>>> the file won't be updated until this file is closed and reopened later.
>>>>
>>>> Signed-off-by: Jeffle Xu 
>>>> ---
>>>>  fs/fuse/dax.c | 21 +
>>>>  fs/fuse/file.c|  4 ++--
>>>>  fs/fuse/fuse_i.h  |  5 +++--
>>>>  fs/fuse/inode.c   |  5 -
>>>>  include/uapi/linux/fuse.h |  5 +
>>>>  5 files changed, 31 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
>>>> index a478e824c2d0..0e862119757a 100644
>>>> --- a/fs/fuse/dax.c
>>>> +++ b/fs/fuse/dax.c
>>>> @@ -1341,7 +1341,7 @@ static const struct address_space_operations 
>>>> fuse_dax_file_aops  = {
>>>>.invalidatepage = noop_invalidatepage,
>>>>  };
>>>>  
>>>> -static bool fuse_should_enable_dax(struct inode *inode)
>>>> +static bool fuse_should_enable_dax(struct inode *inode, unsigned int 
>>>> flags)
>>>>  {
>>>>struct fuse_conn *fc = get_fuse_conn(inode);
>>>>unsigned int mode;
>>>> @@ -1354,18 +1354,31 @@ static bool fuse_should_enable_dax(struct inode 
>>>> *inode)
>>>>if (mode == FUSE_DAX_MOUNT_NEVER)
>>>>return false;
>>>>  
>>>> -  return true;
>>>> +  if (mode == FUSE_DAX_MOUNT_ALWAYS)
>>>> +  return true;
>>>> +
>>>> +  WARN_ON(mode != FUSE_DAX_MOUNT_INODE);
>>>> +  return flags & FUSE_ATTR_DAX;
>>>>  }
>>>>  
>>>> -void fuse_dax_inode_init(struct inode *inode)
>>>> +void fuse_dax_inode_init(struct inode *inode, unsigned int flags)
>>>>  {
>>>> -  if (!fuse_should_enable_dax(inode))
>>>> +  if (!fuse_should_enable_dax(inode, flags))
>>>>return;
>>>>  
>>>>inode->i_flags |= S_DAX;
>>>>inode->i_data.a_ops = &fuse_dax_file_aops;
>>>>  }
>>>>  
>>>> +void fuse_dax_dontcache(struct inode *inode, bool newdax)
>>>> +{
>>>> +  struct fuse_conn *fc = get_fuse_conn(inode);
>>>> +
>>>> +  if (fc->dax && fc->dax->mode == FUSE_DAX_MOUNT_INODE &&
>>>> +  IS_DAX(inode) != newdax)
>>>> +  d_mark_dontcache(inode);
>>>> +}
>>>> +
>>>
>>> This capability to mark an inode dontcache should probably be in a
>>> separate patch. These seem to logically two functionalities. One is
>>> enabling DAX on an inode. And second is making sure how soon you
>>> see the effect of that change and hence marking inode dontcache.
>>
>> OK, sounds reasonable.
>>
>>>
>>> Not sure how useful this is. In cache=none mode we should get rid of
>>> inode ASAP. In cache=auto mode we will get rid of after 1 second (or
>>> after a user specified timeout). So only place this seems to be
>>> useful is cache=always.
>>
>> Actually dontcache here is used to avoid dynamic switching between DAX
>> and non-DAX state while file is opened. The complexity of dynamic
>> switching is that, you have to clear the address_space, since page cache
>> and DAX entry can not coexist in the address space. Besides,
>> inode->a_ops also needs to be changed dynamically.
>>
>> With dontcache, dynamic switching is no longer needed and the DAX state
>> will be decided only when inode (in memory) is initialized. The downside
>> is that the new DAX state won't be updated until the file is closed and
>> reopened later.
>>
>> 'cache=none' only invalidates dentry, while the inode (in memory) is
>> still there (with address_space uncleared and a_ops unchanged).
> 
> Aha.., that's a good point.
>>
>> The dynamic switching may be done, though it's not such straightforward.
>> Currently, ext4/xfs are all implemented in this dontcache way, i.e., the
>> new DAX state won't be seen until the file is closed and reopened later.
> 
> Got it. Agreed that dontcache seems reasonable if file's DAX state
> has changed. Keep it in separate patch though with proper commit
> logs.
> 
> Also, please copy virtiofs list (virtio...@redhat.com) when you post
> patches next time.
> 

Got it. By the way, what's the git repository of virtiofsd? AFAIK,
virtiofsd included in qemu (g...@github.com:qemu/qemu.git) doesn't
support DAX yet?

-- 
Thanks,
Jeffle


Re: [PATCH v2 0/4] virtiofs,fuse: support per-file DAX

2021-08-03 Thread JeffleXu



On 7/21/21 10:42 PM, Vivek Goyal wrote:
> On Wed, Jul 21, 2021 at 08:48:57AM -0400, Vivek Goyal wrote:
> [..]
>>>> So is "dax=inode" enough for your needs? What's your requirement,
>>>> can you give little bit of more details.
>>>
>>> In our use case, the backend fs is something like SquashFS on host. The
>>> content of the file on host is downloaded *as needed*. When the file is
>>> not completely ready (completely downloaded), the guest will follow the
>>> normal IO routine, i.e., by FUSE_READ/FUSE_WRITE request. While the file
>>> is completely ready, per-file DAX is enabled for this file. IOW the FUSE
>>> server need to dynamically decide if per-file DAX shall be enabled,
>>> depending on if the file is completely downloaded.
>>
>> So you don't want to enable DAX yet because guest might fault on
>> a section of file which has not been downloaded yet?
>>
>> I am wondering if somehow user fault handling can help with this.
>> If we could handle faults for this file in user space, then you
>> should be able to download that particular page[s] and resolve
>> the fault?
> 
> Stefan mentioned that can't we block when fuse mmap request comes
> in and download corresponding section of file. Or do whatever you
> are doing in FUSE_READ. 
> 
> IOW, even if you enable dax in your use case on all files,
> FUSE_SETUPMAPPING request will give you control to make sure 
> file section being mmaped has been downloaded.
> 

Sorry for the late reply. I missed this mail as it was filtered into
the mailing-list folder.

The idea you mentioned may work. Anyway, the implementation details of
the FUSE server are not strongly bound to the FUSE protocol changes in
the kernel. The protocol only requires that the FUSE client be able to
store the FS_DAX_FL attr persistently in *some way*. The kernel changes
shall be general, regardless of whether the FUSE server bases its
policy on the FS_DAX_FL attr or on something else.


-- 
Thanks,
Jeffle


Re: [PATCH v3 0/8] fuse,virtiofs: support per-file DAX

2021-08-11 Thread JeffleXu
ping? Corresponding patches for virtiofsd are also included in this
patch set.

On 8/4/21 3:06 PM, Jeffle Xu wrote:
> changes since v2:
> - modify fuse_show_options() accordingly to make it compatible with
>   new tri-state mount option (patch 2)
> - extract FUSE protocol changes into one seperate patch (patch 3)
> - FUSE server/client need to negotiate if they support per-file DAX
>   (patch 4)
> - extract DONT_CACHE logic into patch 6/7
> 
> 
> This patchset adds support of per-file DAX for virtiofs, which is
> inspired by Ira Weiny's work on ext4[1] and xfs[2].
> 
> Any comment is welcome.
> 
> [1] commit 9cb20f94afcd ("fs/ext4: Make DAX mount option a tri-state")
> [2] commit 02beb2686ff9 ("fs/xfs: Make DAX mount option a tri-state")
> 
> v2: https://www.spinics.net/lists/linux-fsdevel/msg199584.html
> v1: https://www.spinics.net/lists/linux-virtualization/msg51008.html
> 
> Jeffle Xu (8):
>   fuse: add fuse_should_enable_dax() helper
>   fuse: Make DAX mount option a tri-state
>   fuse: support per-file DAX
>   fuse: negotiate if server/client supports per-file DAX
>   fuse: enable per-file DAX
>   fuse: mark inode DONT_CACHE when per-file DAX indication changes
>   fuse: support changing per-file DAX flag inside guest
>   fuse: show '-o dax=inode' option only when FUSE server supports
> 
>  fs/fuse/dax.c | 32 ++--
>  fs/fuse/file.c|  4 ++--
>  fs/fuse/fuse_i.h  | 22 ++
>  fs/fuse/inode.c   | 27 ++-
>  fs/fuse/ioctl.c   | 15 +--
>  fs/fuse/virtio_fs.c   | 16 ++--
>  include/uapi/linux/fuse.h |  9 -
>  7 files changed, 103 insertions(+), 22 deletions(-)
> 

-- 
Thanks,
Jeffle


Re: [Virtio-fs] [PATCH v4 0/8] fuse,virtiofs: support per-file DAX

2021-08-17 Thread JeffleXu



On 8/17/21 6:09 PM, Miklos Szeredi wrote:
> On Tue, 17 Aug 2021 at 11:32, Dr. David Alan Gilbert
>  wrote:
>>
>> * Miklos Szeredi (mik...@szeredi.hu) wrote:
>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu  wrote:

 This patchset adds support of per-file DAX for virtiofs, which is
 inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>
>>> Can you please explain the background of this change in detail?
>>>
>>> Why would an admin want to enable DAX for a particular virtiofs file
>>> and not for others?
>>
>> Where we're contending on virtiofs dax cache size it makes a lot of
>> sense; it's quite expensive for us to map something into the cache
>> (especially if we push something else out), so selectively DAXing files
>> that are expected to be hot could help reduce cache churn.
> 
> If this is a performance issue, it should be fixed in a way that
> doesn't require hand tuning like you suggest, I think.
> 
> I'm not sure what the  ext4/xfs case for per-file DAX is.  Maybe that
> can help understand the virtiofs case as well.
> 

Some hints on why ext4/xfs support per-file DAX can be found in [1] and [2].

"Boaz Harrosh wondered why someone might want to turn DAX off for a
persistent memory device. Hellwig said that the performance "could
suck"; Williams noted that the page cache could be useful for some
applications as well. Jan Kara pointed out that reads from persistent
memory are close to DRAM speed, but that writes are not; the page cache
could be helpful for frequent writes. Applications need to change to
fully take advantage of DAX, Williams said; part of the promise of
adding a flag is that users can do DAX on smaller granularities than a
full filesystem."

In summary, page cache is preferable in some cases, and thus a more
fine-grained way of controlling DAX is needed.


As for virtiofs, Dr. David Alan Gilbert has mentioned that various
files may compete for the limited DAX window resource.

Besides, supporting DAX for small files can be expensive. Small files
can consume the DAX window rapidly, and if they are accessed only once,
the cost of mmap/munmap on the host cannot be ignored.


[1]
https://lore.kernel.org/lkml/20200428002142.404144-1-ira.we...@intel.com/
[2] https://lwn.net/Articles/787973/

-- 
Thanks,
Jeffle


Re: [PATCH v4 0/8] fuse,virtiofs: support per-file DAX

2021-08-17 Thread JeffleXu



On 8/17/21 8:39 PM, Vivek Goyal wrote:
> On Tue, Aug 17, 2021 at 10:06:53AM +0200, Miklos Szeredi wrote:
>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu  wrote:
>>>
>>> This patchset adds support of per-file DAX for virtiofs, which is
>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>
>> Can you please explain the background of this change in detail?
>>
>> Why would an admin want to enable DAX for a particular virtiofs file
>> and not for others?
> 
> Initially I thought that they needed it because they are downloading
> files on the fly from server. So they don't want to enable dax on the file
> till file is completely downloaded. 

Right, it's our initial requirement.


> But later I realized that they should
> be able to block in FUSE_SETUPMAPPING call and make sure associated
> file section has been downloaded before returning and solve the problem.
> So that can't be the primary reason.

Say we want to access 4KB of a file inside the guest. If the access
goes through the FUSE request routine, the FUSE daemon only needs to
download that 4KB from the remote server. But if it goes through DAX,
the FUSE daemon needs to download a whole DAX window (e.g., 2MB) from
the remote server, a so-called amplification. We could decrease the DAX
window size, but it's a trade-off.

> 
> Other reason mentioned I think was that only certain files benefit
> from DAX. But not much details are there after that. It will be nice
> to hear a more concrete use case and more details about this usage.
> 

Apart from our internal requirement, more fine-grained control of DAX
is simply more general and flexible. Glad to hear more discussion from
the community.


-- 
Thanks,
Jeffle


Re: [Virtio-fs] [PATCH v4 6/8] fuse: mark inode DONT_CACHE when per-file DAX indication changes

2021-08-17 Thread JeffleXu



On 8/17/21 6:26 PM, Dr. David Alan Gilbert wrote:
> * Jeffle Xu (jeffl...@linux.alibaba.com) wrote:
>> When the per-file DAX indication changes while the file is still
>> *opened*, it is quite complicated and maybe fragile to dynamically
>> change the DAX state.
>>
>> Hence mark the inode and corresponding dentries as DONE_CACHE once the
> 
>  ^^
> typo as DONT ?
> 

Thanks. I will fix it.

> 
>> per-file DAX indication changes, so that the inode instance will be
>> evicted and freed as soon as possible once the file is closed and the
>> last reference to the inode is put. And then when the file gets reopened
>> next time, the inode will reflect the new DAX state.
>>
>> In summary, when the per-file DAX indication changes for an *opened*
>> file, the state of the file won't be updated until this file is closed
>> and reopened later.
>>
>> Signed-off-by: Jeffle Xu 
>> ---
>>  fs/fuse/dax.c| 9 +
>>  fs/fuse/fuse_i.h | 1 +
>>  fs/fuse/inode.c  | 3 +++
>>  3 files changed, 13 insertions(+)
>>
>> diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
>> index 30833f8d37dd..f7ede0be4e00 100644
>> --- a/fs/fuse/dax.c
>> +++ b/fs/fuse/dax.c
>> @@ -1364,6 +1364,15 @@ void fuse_dax_inode_init(struct inode *inode, 
>> unsigned int flags)
>>  inode->i_data.a_ops = &fuse_dax_file_aops;
>>  }
>>  
>> +void fuse_dax_dontcache(struct inode *inode, bool newdax)
>> +{
>> +struct fuse_conn *fc = get_fuse_conn(inode);
>> +
>> +if (fc->dax_mode == FUSE_DAX_INODE &&
>> +fc->perfile_dax && (!!IS_DAX(inode) != newdax))
>> +d_mark_dontcache(inode);
>> +}
>> +
>>  bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int 
>> map_alignment)
>>  {
>>  if (fc->dax && (map_alignment > FUSE_DAX_SHIFT)) {
>> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
>> index 7b7b4c208af2..56fe1c4d2136 100644
>> --- a/fs/fuse/fuse_i.h
>> +++ b/fs/fuse/fuse_i.h
>> @@ -1260,6 +1260,7 @@ void fuse_dax_conn_free(struct fuse_conn *fc);
>>  bool fuse_dax_inode_alloc(struct super_block *sb, struct fuse_inode *fi);
>>  void fuse_dax_inode_init(struct inode *inode, unsigned int flags);
>>  void fuse_dax_inode_cleanup(struct inode *inode);
>> +void fuse_dax_dontcache(struct inode *inode, bool newdax);
>>  bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int 
>> map_alignment);
>>  void fuse_dax_cancel_work(struct fuse_conn *fc);
>>  
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index 8080f78befed..8c9774c6a210 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -269,6 +269,9 @@ void fuse_change_attributes(struct inode *inode, struct 
>> fuse_attr *attr,
>>  if (inval)
>>  invalidate_inode_pages2(inode->i_mapping);
>>  }
>> +
>> +if (IS_ENABLED(CONFIG_FUSE_DAX))
>> +fuse_dax_dontcache(inode, attr->flags & FUSE_ATTR_DAX);
>>  }
>>  
>>  static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
>> -- 
>> 2.27.0
>>
>> ___
>> Virtio-fs mailing list
>> virtio...@redhat.com
>> https://listman.redhat.com/mailman/listinfo/virtio-fs
>>

-- 
Thanks,
Jeffle


Re: [PATCH v4 0/8] fuse,virtiofs: support per-file DAX

2021-08-17 Thread JeffleXu



On 8/17/21 10:08 PM, Miklos Szeredi wrote:
> On Tue, 17 Aug 2021 at 15:22, JeffleXu  wrote:
>>
>>
>>
>> On 8/17/21 8:39 PM, Vivek Goyal wrote:
>>> On Tue, Aug 17, 2021 at 10:06:53AM +0200, Miklos Szeredi wrote:
>>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu  wrote:
>>>>>
>>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>>
>>>> Can you please explain the background of this change in detail?
>>>>
>>>> Why would an admin want to enable DAX for a particular virtiofs file
>>>> and not for others?
>>>
>>> Initially I thought that they needed it because they are downloading
>>> files on the fly from server. So they don't want to enable dax on the file
>>> till file is completely downloaded.
>>
>> Right, it's our initial requirement.
>>
>>
>>> But later I realized that they should
>>> be able to block in FUSE_SETUPMAPPING call and make sure associated
>>> file section has been downloaded before returning and solve the problem.
>>> So that can't be the primary reason.
>>
>> Saying we want to access 4KB of one file inside guest, if it goes
>> through FUSE request routine, then the fuse daemon only need to download
>> this 4KB from remote server. But if it goes through DAX, then the fuse
>> daemon need to download the whole DAX window (e.g., 2MB) from remote
>> server, so called amplification. Maybe we could decrease the DAX window
>> size, but it's a trade off.
> 
> That could be achieved with a plain fuse filesystem on the host (which
> will get 4k READ requests for accesses to mapped area inside guest).
> Since this can be done selectively for files which are not yet
> downloaded, the extra layer wouldn't be a performance problem.

I'm not sure I fully understand your idea. In that case, the host
daemon only prepares 4KB while the guest thinks the whole DAX window
(e.g., 2MB) has been fully mapped. Then when the guest actually
accesses the remaining part (2MB - 4KB), a page fault is triggered, and
the host daemon is responsible for downloading that part?

> 
> Is there a reason why that wouldn't work?
> 
> Thanks,
> Miklos
> 

-- 
Thanks,
Jeffle
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Virtio-fs] [PATCH v4 0/8] fuse,virtiofs: support per-file DAX

2021-08-17 Thread JeffleXu



On 8/17/21 10:54 PM, Vivek Goyal wrote:
> On Tue, Aug 17, 2021 at 09:08:35PM +0800, JeffleXu wrote:
>>
>>
>> On 8/17/21 6:09 PM, Miklos Szeredi wrote:
>>> On Tue, 17 Aug 2021 at 11:32, Dr. David Alan Gilbert
>>>  wrote:
>>>>
>>>> * Miklos Szeredi (mik...@szeredi.hu) wrote:
>>>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu  
>>>>> wrote:
>>>>>>
>>>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>>>
>>>>> Can you please explain the background of this change in detail?
>>>>>
>>>>> Why would an admin want to enable DAX for a particular virtiofs file
>>>>> and not for others?
>>>>
>>>> Where we're contending on virtiofs dax cache size it makes a lot of
>>>> sense; it's quite expensive for us to map something into the cache
>>>> (especially if we push something else out), so selectively DAXing files
>>>> that are expected to be hot could help reduce cache churn.
>>>
>>> If this is a performance issue, it should be fixed in a way that
>>> doesn't require hand tuning like you suggest, I think.
>>>
>>> I'm not sure what the  ext4/xfs case for per-file DAX is.  Maybe that
>>> can help understand the virtiofs case as well.
>>>
>>
>> Some hints why ext4/xfs support per-file DAX can be found [1] and [2].
>>
>> "Boaz Harrosh wondered why someone might want to turn DAX off for a
>> persistent memory device. Hellwig said that the performance "could
>> suck"; Williams noted that the page cache could be useful for some
>> applications as well. Jan Kara pointed out that reads from persistent
>> memory are close to DRAM speed, but that writes are not; the page cache
>> could be helpful for frequent writes. Applications need to change to
>> fully take advantage of DAX, Williams said; part of the promise of
>> adding a flag is that users can do DAX on smaller granularities than a
>> full filesystem."
>>
>> In summary, page cache is preferable in some cases, and thus more fine
>> grained way of DAX control is needed.
> 
> In case of virtiofs, we are using page cache on host. So this probably
> is not a factor for us. Writes will go in page cache of host.
> 
>>
>>
>> As for virtiofs, Dr. David Alan Gilbert has mentioned that various files
>> may compete for limited DAX window resource.
>>
>> Besides, supporting DAX for small files can be expensive. Small files
>> can consume DAX window resource rapidly, and if small files are accessed
>> only once, the cost of mmap/munmap on host can not be ignored.
> 
> W.r.r access pattern, same applies to large files also. So if a section
> of large file is accessed only once, it will consume dax window as well
> and will have to be reclaimed.
> 
> Dax in virtiofs provides speed gain only if map file once and access
> it multiple times. If that pattern does not hold true, then dax does
> not seem to provide speed gains and in fact might be slower than
> non-dax.
> 
> So if there is a pattern where we know some files are accessed repeatedly
> while others are not, then enabling/disabling dax selectively will make
> sense. Question is how many workloads really know that and how will
> you make that decision. Do you have any data to back that up.

There's no precise performance data yet. Empirically, small files tend
to show worse performance with dax, while frequently accessed files
(such as .so libraries) behave better with dax.

> 
> W.r.t small file, is that a real concern. If that file is being accessed
> mutliple times, then we will still see the speed gain. Only down side
> is that there is little wastage of resources because our minimum dax
> mapping granularity is 2MB. I am wondering can we handle that by
> supporting other dax mapping granularities as well. say 256K and let
> users choose it.


-- 
Thanks,
Jeffle


Re: [PATCH v4 0/8] fuse,virtiofs: support per-file DAX

2021-08-17 Thread JeffleXu



On 8/17/21 10:57 PM, Vivek Goyal wrote:
> On Tue, Aug 17, 2021 at 09:22:53PM +0800, JeffleXu wrote:
>>
>>
>> On 8/17/21 8:39 PM, Vivek Goyal wrote:
>>> On Tue, Aug 17, 2021 at 10:06:53AM +0200, Miklos Szeredi wrote:
>>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu  wrote:
>>>>>
>>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>>
>>>> Can you please explain the background of this change in detail?
>>>>
>>>> Why would an admin want to enable DAX for a particular virtiofs file
>>>> and not for others?
>>>
>>> Initially I thought that they needed it because they are downloading
>>> files on the fly from server. So they don't want to enable dax on the file
>>> till file is completely downloaded. 
>>
>> Right, it's our initial requirement.
>>
>>
>>> But later I realized that they should
>>> be able to block in FUSE_SETUPMAPPING call and make sure associated
>>> file section has been downloaded before returning and solve the problem.
>>> So that can't be the primary reason.
>>
>> Saying we want to access 4KB of one file inside guest, if it goes
>> through FUSE request routine, then the fuse daemon only need to download
>> this 4KB from remote server. But if it goes through DAX, then the fuse
>> daemon need to download the whole DAX window (e.g., 2MB) from remote
>> server, so called amplification. Maybe we could decrease the DAX window
>> size, but it's a trade off.
> 
> Downloading 2MB chunk should not be a big issue (IMHO). 

Then the latency increases. Latency really matters in our use case.


> And if this
> turns out to be real concern, we could experiment with a smaller
> mapping granularity.
> 


-- 
Thanks,
Jeffle


Re: [Virtio-fs] [virtiofsd PATCH v4 3/4] virtiofsd: support per-file DAX negotiation in FUSE_INIT

2021-08-17 Thread JeffleXu



On 8/18/21 1:15 AM, Dr. David Alan Gilbert wrote:
> * Jeffle Xu (jeffl...@linux.alibaba.com) wrote:
>> In FUSE_INIT negotiating phase, server/client should advertise if it
>> supports per-file DAX.
>>
>> Once advertising support for per-file DAX feature, virtiofsd should
>> support storing FS_DAX_FL flag persistently passed by
>> FS_IOC_SETFLAGS/FS_IOC_FSSETXATTR ioctl, and set FUSE_ATTR_DAX in
>> FUSE_LOOKUP accordingly if the file is capable of per-file DAX.
>>
>> Currently only ext4/xfs since linux kernel v5.8 support storing
>> FS_DAX_FL flag persistently, and thus advertise support for per-file
>> DAX feature only when the backend fs type is ext4 and xfs.
> 
> I'm a little worried about the meaning of the flags we're storing and
> the fact we're storing them in the normal host DAX flags.
> 
> Doesn't this mean that we're using a single host flag to mean:
>   a) It can be mapped as DAX on the host if it was a real DAX device
>   b) We can map it as DAX inside the guest with virtiofs?

Yes, the side effect is that the host file is also dax enabled if the
backend fs is built upon a real nvdimm device.

The rationale here is that the fuse daemon shall be capable of *marking*
the file as dax capable *persistently*, so that it can later be informed
that this file is capable of dax.

I'm not sure if an xattr (extended attribute) would be a better option for this?


> 
> what happens when we're using usernamespaces for the guest?
> 
> Dave
> 
> 
>> Signed-off-by: Jeffle Xu 
>> ---
>>  tools/virtiofsd/fuse_common.h|  5 +
>>  tools/virtiofsd/fuse_lowlevel.c  |  6 ++
>>  tools/virtiofsd/passthrough_ll.c | 29 +
>>  3 files changed, 40 insertions(+)
>>
>> diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
>> index 8a75729be9..ee6fc64c23 100644
>> --- a/tools/virtiofsd/fuse_common.h
>> +++ b/tools/virtiofsd/fuse_common.h
>> @@ -372,6 +372,11 @@ struct fuse_file_info {
>>   */
>>  #define FUSE_CAP_HANDLE_KILLPRIV_V2 (1 << 28)
>>  
>> +/**
>> + * Indicates support for per-file DAX.
>> + */
>> +#define FUSE_CAP_PERFILE_DAX (1 << 29)
>> +
>>  /**
>>   * Ioctl flags
>>   *
>> diff --git a/tools/virtiofsd/fuse_lowlevel.c 
>> b/tools/virtiofsd/fuse_lowlevel.c
>> index 50fc5c8d5a..04a4f17423 100644
>> --- a/tools/virtiofsd/fuse_lowlevel.c
>> +++ b/tools/virtiofsd/fuse_lowlevel.c
>> @@ -2065,6 +2065,9 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
>>  if (arg->flags & FUSE_HANDLE_KILLPRIV_V2) {
>>  se->conn.capable |= FUSE_CAP_HANDLE_KILLPRIV_V2;
>>  }
>> +if (arg->flags & FUSE_PERFILE_DAX) {
>> +se->conn.capable |= FUSE_CAP_PERFILE_DAX;
>> +}
>>  #ifdef HAVE_SPLICE
>>  #ifdef HAVE_VMSPLICE
>>  se->conn.capable |= FUSE_CAP_SPLICE_WRITE | FUSE_CAP_SPLICE_MOVE;
>> @@ -2180,6 +2183,9 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
>>  if (se->conn.want & FUSE_CAP_POSIX_ACL) {
>>  outarg.flags |= FUSE_POSIX_ACL;
>>  }
>> +if (se->op.ioctl && (se->conn.want & FUSE_CAP_PERFILE_DAX)) {
>> +outarg.flags |= FUSE_PERFILE_DAX;
>> +}
>>  outarg.max_readahead = se->conn.max_readahead;
>>  outarg.max_write = se->conn.max_write;
>>  if (se->conn.max_background >= (1 << 16)) {
>> diff --git a/tools/virtiofsd/passthrough_ll.c 
>> b/tools/virtiofsd/passthrough_ll.c
>> index e170b17adb..5b6228210f 100644
>> --- a/tools/virtiofsd/passthrough_ll.c
>> +++ b/tools/virtiofsd/passthrough_ll.c
>> @@ -53,8 +53,10 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include "qemu/cutils.h"
>>  #include "passthrough_helpers.h"
>> @@ -136,6 +138,13 @@ enum {
>>  SANDBOX_CHROOT,
>>  };
>>  
>> +/* capability of storing DAX flag persistently */
>> +enum {
>> +DAX_CAP_NONE,  /* not supported */
>> +DAX_CAP_FLAGS, /* stored in flags (FS_IOC_GETFLAGS/FS_IOC_SETFLAGS) */
>> +DAX_CAP_XATTR, /* stored in xflags 
>> (FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR) */
>> +};
>> +
>>  typedef struct xattr_map_entry {
>>  char *key;
>>  char *prepend;
>> @@ -161,6 +170,7 @@ struct lo_data {
>>  int readdirplus_clear;
>>  int allow_direct_io;
>>  int announce_submounts;
>> +int perfile_dax_cap; /* capability of backend fs */
>>  bool use_statx;
>>  struct lo_inode root;
>>  GHashTable *inodes; /* protected by lo->mutex */
>> @@ -703,6 +713,10 @@ static void lo_init(void *userdata, struct 
>> fuse_conn_info *conn)
>>  conn->want &= ~FUSE_CAP_HANDLE_KILLPRIV_V2;
>>  lo->killpriv_v2 = 0;
>>  }
>> +
>> +if (conn->capable & FUSE_CAP_PERFILE_DAX && lo->perfile_dax_cap ) {
>> +conn->want |= FUSE_CAP_PERFILE_DAX;
>> +}
>>  }
>>  
>>  static void lo_getattr(fuse_req_t req, fuse_ino_t ino,
>> @@ -3800,6 +3814,7 @@ static void setup_root(struct lo_data *lo, struct 
>> lo_inode *root)
>>  int fd, res;
>>  struct stat stat;
>>  uint64_t mnt_id;
>> +struct statfs statfs;
>>  
>> 

Re: [Virtio-fs] [virtiofsd PATCH v4 4/4] virtiofsd: support per-file DAX in FUSE_LOOKUP

2021-08-17 Thread JeffleXu



On 8/18/21 3:00 AM, Dr. David Alan Gilbert wrote:
> * Jeffle Xu (jeffl...@linux.alibaba.com) wrote:
>> For passthrough, when the corresponding virtiofs in guest is mounted
>> with '-o dax=inode', advertise that the file is capable of per-file
>> DAX if the inode in the backend fs is marked with FS_DAX_FL flag.
>>
>> Signed-off-by: Jeffle Xu 
>> ---
>>  tools/virtiofsd/passthrough_ll.c | 43 
>>  1 file changed, 43 insertions(+)
>>
>> diff --git a/tools/virtiofsd/passthrough_ll.c 
>> b/tools/virtiofsd/passthrough_ll.c
>> index 5b6228210f..4cbd904248 100644
>> --- a/tools/virtiofsd/passthrough_ll.c
>> +++ b/tools/virtiofsd/passthrough_ll.c
>> @@ -171,6 +171,7 @@ struct lo_data {
>>  int allow_direct_io;
>>  int announce_submounts;
>>  int perfile_dax_cap; /* capability of backend fs */
>> +bool perfile_dax; /* enable per-file DAX or not */
>>  bool use_statx;
>>  struct lo_inode root;
>>  GHashTable *inodes; /* protected by lo->mutex */
>> @@ -716,6 +717,10 @@ static void lo_init(void *userdata, struct 
>> fuse_conn_info *conn)
>>  
>>  if (conn->capable & FUSE_CAP_PERFILE_DAX && lo->perfile_dax_cap ) {
>>  conn->want |= FUSE_CAP_PERFILE_DAX;
>> +lo->perfile_dax = 1;
>> +}
>> +else {
>> +lo->perfile_dax = 0;
>>  }
>>  }
>>  
>> @@ -983,6 +988,41 @@ static int do_statx(struct lo_data *lo, int dirfd, 
>> const char *pathname,
>>  return 0;
>>  }
>>  
>> +/*
>> + * If the file is marked with FS_DAX_FL or FS_XFLAG_DAX, then DAX should be
>> + * enabled for this file.
>> + */
>> +static bool lo_should_enable_dax(struct lo_data *lo, struct lo_inode *dir,
>> + const char *name)
>> +{
>> +int res, fd;
>> +int ret = false;;
>> +unsigned int attr;
>> +struct fsxattr xattr;
>> +
>> +if (!lo->perfile_dax)
>> +return false;
>> +
>> +/* Open file without O_PATH, so that ioctl can be called. */
>> +fd = openat(dir->fd, name, O_NOFOLLOW);
>> +if (fd == -1)
>> +return false;
> 
> Doesn't that defeat the whole benefit of using O_PATH - i.e. that we
> might stumble into a /dev node or something else we're not allowed to
> open?

As far as I know, virtiofsd will pivot_root/chroot to the source
directory, and can only access files inside the source directory
specified by "-o source=". Then where do these unexpected files come
from? Besides, the fd opened without O_PATH here is temporary and is used
only for the FS_IOC_GETFLAGS/FS_IOC_FSGETXATTR ioctls. It's closed when
the function returns.

> 
>> +if (lo->perfile_dax_cap == DAX_CAP_FLAGS) {
>> +res = ioctl(fd, FS_IOC_GETFLAGS, &attr);
>> +if (!res && (attr & FS_DAX_FL))
>> +ret = true;
>> +}
>> +else if (lo->perfile_dax_cap == DAX_CAP_XATTR) {
>> +res = ioctl(fd, FS_IOC_FSGETXATTR, &xattr);
>> +if (!res && (xattr.fsx_xflags & FS_XFLAG_DAX))
>> +ret = true;
>> +}
> 
> This all looks pretty expensive for each lookup.

Yes, it can be optimized somewhat once we agree on the way of storing
the dax flag persistently.

-- 
Thanks,
Jeffle


Re: [Virtio-fs] [PATCH v4 0/8] fuse,virtiofs: support per-file DAX

2021-08-18 Thread JeffleXu



On 8/17/21 10:54 PM, Vivek Goyal wrote:
[...]
>>
>> As for virtiofs, Dr. David Alan Gilbert has mentioned that various files
>> may compete for limited DAX window resource.
>>
>> Besides, supporting DAX for small files can be expensive. Small files
>> can consume DAX window resource rapidly, and if small files are accessed
>> only once, the cost of mmap/munmap on host can not be ignored.
> 
> W.r.r access pattern, same applies to large files also. So if a section
> of large file is accessed only once, it will consume dax window as well
> and will have to be reclaimed.
> 
> Dax in virtiofs provides speed gain only if map file once and access
> it multiple times. If that pattern does not hold true, then dax does
> not seem to provide speed gains and in fact might be slower than
> non-dax.
> 
> So if there is a pattern where we know some files are accessed repeatedly
> while others are not, then enabling/disabling dax selectively will make
> sense. Question is how many workloads really know that and how will
> you make that decision. Do you have any data to back that up.

Empirically, some files are naturally accessed only once, such as
configuration files under the /etc/ directory, .py and .js files, etc.
This is a real case we have met in the real world. Meanwhile, some other
files, such as .so libraries, are most likely accessed multiple times.
With the per-file DAX feature, administrators can decide on their own
which files shall be dax enabled, so that those files gain the most
benefit from dax while the others do not.

As for how we distinguish the file access pattern, besides the
intuitive insights described previously, we can develop more advanced
methods, e.g., scanning the DAX window map to find the hot files. With
this mechanism offered by the kernel, more advanced strategies can then
be developed.

> 
> W.r.t small file, is that a real concern. If that file is being accessed
> mutliple times, then we will still see the speed gain. Only down side
> is that there is little wastage of resources because our minimum dax
> mapping granularity is 2MB. I am wondering can we handle that by
> supporting other dax mapping granularities as well. say 256K and let
> users choose it.
> 


-- 
Thanks,
Jeffle


Re: [Virtio-fs] [virtiofsd PATCH v4 4/4] virtiofsd: support per-file DAX in FUSE_LOOKUP

2021-08-19 Thread JeffleXu



On 8/19/21 9:08 PM, Dr. David Alan Gilbert wrote:
> * JeffleXu (jeffl...@linux.alibaba.com) wrote:
>>
>>
>> On 8/18/21 3:00 AM, Dr. David Alan Gilbert wrote:
>>> * Jeffle Xu (jeffl...@linux.alibaba.com) wrote:
>>>> For passthrough, when the corresponding virtiofs in guest is mounted
>>>> with '-o dax=inode', advertise that the file is capable of per-file
>>>> DAX if the inode in the backend fs is marked with FS_DAX_FL flag.
>>>>
>>>> Signed-off-by: Jeffle Xu 
>>>> ---
>>>>  tools/virtiofsd/passthrough_ll.c | 43 
>>>>  1 file changed, 43 insertions(+)
>>>>
>>>> diff --git a/tools/virtiofsd/passthrough_ll.c 
>>>> b/tools/virtiofsd/passthrough_ll.c
>>>> index 5b6228210f..4cbd904248 100644
>>>> --- a/tools/virtiofsd/passthrough_ll.c
>>>> +++ b/tools/virtiofsd/passthrough_ll.c
>>>> @@ -171,6 +171,7 @@ struct lo_data {
>>>>  int allow_direct_io;
>>>>  int announce_submounts;
>>>>  int perfile_dax_cap; /* capability of backend fs */
>>>> +bool perfile_dax; /* enable per-file DAX or not */
>>>>  bool use_statx;
>>>>  struct lo_inode root;
>>>>  GHashTable *inodes; /* protected by lo->mutex */
>>>> @@ -716,6 +717,10 @@ static void lo_init(void *userdata, struct 
>>>> fuse_conn_info *conn)
>>>>  
>>>>  if (conn->capable & FUSE_CAP_PERFILE_DAX && lo->perfile_dax_cap ) {
>>>>  conn->want |= FUSE_CAP_PERFILE_DAX;
>>>> +  lo->perfile_dax = 1;
>>>> +}
>>>> +else {
>>>> +  lo->perfile_dax = 0;
>>>>  }
>>>>  }
>>>>  
>>>> @@ -983,6 +988,41 @@ static int do_statx(struct lo_data *lo, int dirfd, 
>>>> const char *pathname,
>>>>  return 0;
>>>>  }
>>>>  
>>>> +/*
>>>> + * If the file is marked with FS_DAX_FL or FS_XFLAG_DAX, then DAX should 
>>>> be
>>>> + * enabled for this file.
>>>> + */
>>>> +static bool lo_should_enable_dax(struct lo_data *lo, struct lo_inode *dir,
>>>> +   const char *name)
>>>> +{
>>>> +int res, fd;
>>>> +int ret = false;;
>>>> +unsigned int attr;
>>>> +struct fsxattr xattr;
>>>> +
>>>> +if (!lo->perfile_dax)
>>>> +  return false;
>>>> +
>>>> +/* Open file without O_PATH, so that ioctl can be called. */
>>>> +fd = openat(dir->fd, name, O_NOFOLLOW);
>>>> +if (fd == -1)
>>>> +return false;
>>>
>>> Doesn't that defeat the whole benefit of using O_PATH - i.e. that we
>>> might stumble into a /dev node or something else we're not allowed to
>>> open?
>>
>> As far as I know, virtiofsd will pivot_root/chroot to the source
>> directory, and can only access files inside the source directory
>> specified by "-o source=". Then where do these unexpected files come
>> from? Besides, fd opened without O_PATH here is temporary and used for
>> FS_IOC_GETFLAGS/FS_IOC_FSGETXATTR ioctl only. It's closed when the
>> function returns.
> 
> The guest is still allowed to mknod.
> See:
>https://lists.gnu.org/archive/html/qemu-devel/2021-01/msg05461.html
> 
> also it's legal to expose a root filesystem for a guest; the virtiofsd
> should *never* open a device other than O_PATH - and it's really tricky
> to do a check to see if it is a device in a race-free way.
> 

Fine, got it. However, the returned fd (opened without O_PATH) is only
used for the FS_IOC_GETFLAGS/FS_IOC_FSGETXATTR ioctls, and in most cases
these two ioctls should return -ENOTTY for special device files.

If it's really a security issue, then lo_inode_open() could be used to
get a temporary fd, i.e., check whether it's a special file before opening.
After all, FUSE_OPEN is also handled this way. Besides, I don't
understand what "race-free way" means.


-- 
Thanks,
Jeffle


Re: [PATCH v4 0/8] fuse,virtiofs: support per-file DAX

2021-09-02 Thread JeffleXu



On 8/17/21 10:08 PM, Miklos Szeredi wrote:
> On Tue, 17 Aug 2021 at 15:22, JeffleXu  wrote:
>>
>>
>>
>> On 8/17/21 8:39 PM, Vivek Goyal wrote:
>>> On Tue, Aug 17, 2021 at 10:06:53AM +0200, Miklos Szeredi wrote:
>>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu  wrote:
>>>>>
>>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>>
>>>> Can you please explain the background of this change in detail?
>>>>
>>>> Why would an admin want to enable DAX for a particular virtiofs file
>>>> and not for others?
>>>
>>> Initially I thought that they needed it because they are downloading
>>> files on the fly from server. So they don't want to enable dax on the file
>>> till file is completely downloaded.
>>
>> Right, it's our initial requirement.
>>
>>
>>> But later I realized that they should
>>> be able to block in FUSE_SETUPMAPPING call and make sure associated
>>> file section has been downloaded before returning and solve the problem.
>>> So that can't be the primary reason.
>>
>> Saying we want to access 4KB of one file inside guest, if it goes
>> through FUSE request routine, then the fuse daemon only need to download
>> this 4KB from remote server. But if it goes through DAX, then the fuse
>> daemon need to download the whole DAX window (e.g., 2MB) from remote
>> server, so called amplification. Maybe we could decrease the DAX window
>> size, but it's a trade off.
> 
> That could be achieved with a plain fuse filesystem on the host (which
> will get 4k READ requests for accesses to mapped area inside guest).
> Since this can be done selectively for files which are not yet
> downloaded, the extra layer wouldn't be a performance problem.
> 
> Is there a reason why that wouldn't work?

I didn't realize this mechanism (working around it from user space) before
sending this patch set.

After learning the virtualization and KVM internals, I find that, as Vivek
Goyal replied in [1], virtiofsd/qemu would need to somehow hook the user page
fault and then download the remaining part.

IMHO, this mechanism (implementing a plain fuse filesystem on the host,
as you proposed) seems a little sophisticated so far.

[1] https://lore.kernel.org/linux-fsdevel/yr08knp8co8lj...@redhat.com/


-- 
Thanks,
Jeffle


Re: [PATCH 0/2] virtiofs: miscellaneous fixes

2021-09-06 Thread JeffleXu
ping ...

On 8/12/21 1:46 PM, Jeffle Xu wrote:
> Some fixes or optimization for virtiofs, which are authored by Liu Bo.
> 
> Liu Bo (2):
>   virtio-fs: disable atomic_o_trunc if no_open is enabled
>   virtiofs: reduce lock contention on fpq->lock
> 
>  fs/fuse/file.c  | 11 +--
>  fs/fuse/virtio_fs.c |  3 ---
>  2 files changed, 9 insertions(+), 5 deletions(-)
> 

-- 
Thanks,
Jeffle


Re: [PATCH 1/2] fuse: disable atomic_o_trunc if no_open is enabled

2021-09-08 Thread JeffleXu



On 9/7/21 4:34 PM, Miklos Szeredi wrote:
> On Thu, 12 Aug 2021 at 07:46, Jeffle Xu  wrote:
>>
>> From: Liu Bo 
>>
>> When 'no_open' is used by virtiofsd, guest kernel won't send OPEN request
>> any more.  However, with atomic_o_trunc, SETATTR request is also omitted in
>> OPEN(O_TRUNC) so that the backend file is not truncated.  With a following
>> GETATTR, inode size on guest side is updated to be same with that on host
>> side, the end result is that O_TRUNC semantic is broken.
>>
>> This disables atomic_o_trunc as well if with no_open.
> 
> I don't quite get it why one would want to enable atomic_o_trunc with
> no_open in the first place?

Oops... We didn't realize that it could also be worked around on the fuse
daemon side. Please ignore this.

-- 
Thanks,
Jeffle


Re: [Virtio-fs] [virtiofsd PATCH v4 4/4] virtiofsd: support per-file DAX in FUSE_LOOKUP

2021-09-08 Thread JeffleXu



On 8/24/21 6:15 PM, Greg Kurz wrote:
> On Fri, 20 Aug 2021 13:03:23 +0800
> JeffleXu  wrote:
>>
>> Fine. Got it. However the returned fd (opened without O_PATH) is only
>> used for FS_IOC_GETFLAGS/FS_IOC_FSGETXATTR ioctl, while in most cases
>> for special device files, these two ioctls should return -ENOTTY.
>>
> 
> The actual problem is that a FIFO will cause openat() to block until
> the other end of the FIFO is open for writing...

Got it.

> 
>> If it's really a security issue, then lo_inode_open() could be used to
> 
> ... and cause a DoS on virtiofsd. So yes, this is a security issue and
> lo_inode_open() was introduced specifically to handle this.
> 
>> get a temporary fd, i.e., check if it's a special file before opening.
>> After all, FUSE_OPEN also handles in this way. Besides, I can't
>> understand what "race-free way" means.
>>
> 
> "race-free way" means a way that guarantees that file type
> cannot change between the time you check it and the time
> you open it (TOCTOU error). For example, doing a plain stat(),
> checking st_mode and proceeding to open() is wrong : nothing
> prevents the file to be unlinked and replaced by something
> else between stat() and open().
> 
> We avoid that by keeping O_PATH fds around and using
> lo_inode_open() instead of openat().

Thanks for the detailed explanation. Got it.

> 
> In your case, it seems that you should do the checking after
> you have an actual lo_inode for the target file, and pass
> that to lo_should_enable_dax() instead of the parent lo_inode
> and target name.
> 

Yes, that will be more reasonable. Thanks.

-- 
Thanks,
Jeffle


Re: [Virtio-fs] [PATCH v4 0/8] fuse,virtiofs: support per-file DAX

2021-09-16 Thread JeffleXu
Hi, I've added some performance statistics below.


On 8/17/21 8:40 PM, Vivek Goyal wrote:
> On Tue, Aug 17, 2021 at 10:32:14AM +0100, Dr. David Alan Gilbert wrote:
>> * Miklos Szeredi (mik...@szeredi.hu) wrote:
>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu  wrote:

 This patchset adds support of per-file DAX for virtiofs, which is
 inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>
>>> Can you please explain the background of this change in detail?
>>>
>>> Why would an admin want to enable DAX for a particular virtiofs file
>>> and not for others?
>>
>> Where we're contending on virtiofs dax cache size it makes a lot of
>> sense; it's quite expensive for us to map something into the cache
>> (especially if we push something else out), so selectively DAXing files
>> that are expected to be hot could help reduce cache churn.

Yes, the performance of dax can be limited when the DAX window is
small, as the dax window may be contended by multiple files.

I tested kernel compilation in virtiofs, emulating the scenario where a
lot of files contend for the dax window and trigger dax window reclaiming.

Environment setup:
- guest vCPU: 16
- time make vmlinux -j128

type    | cache  | cache-size | time
------- | ------ | ---------- | --------------
non-dax | always | --         | real 2m48.119s
dax     | always | 64M        | real 4m49.563s
dax     | always | 1G         | real 3m14.200s
dax     | always | 4G         | real 2m41.141s


It can be seen that there's a performance drop, compared to normal
buffered IO, when the dax window resource is restricted and dax window
reclaiming is triggered. The smaller the cache size, the worse the
performance. The performance drop is alleviated and eventually eliminated
as the cache size increases.

Though we may not compile kernels in virtiofs, we may indeed access a lot
of small files in virtiofs and suffer this performance drop.


> In that case probaly we should just make DAX window larger. I assume

Yes, as the DAX window gets larger, it is less likely that we can run
short of dax window resource.

However, it doesn't come without cost. The 'struct page' descriptors for
the dax window consume guest memory at a ratio of ~1.5% (64/4096 = ~1.5%,
since a page descriptor is 64 bytes, assuming 4K sized pages). That is,
every 1GB of cache size costs 16MB of guest memory. As the cache size
increases, the memory footprint of the page descriptors also increases,
which may offset the benefit dax gains by eliminating guest page cache.

In summary, the per-file dax feature tries to achieve a balance between
performance and memory overhead, by offering users finer-grained control
over dax.


> that selecting which files to turn DAX on, will itself will not be
> a trivial. Not sure what heuristics are being deployed to determine
> that. Will like to know more about it.

Currently we enable dax for hot and large blob files, while disabling
dax for other miscellaneous small files.



-- 
Thanks,
Jeffle


Re: [Virtio-fs] [PATCH v4 0/8] fuse,virtiofs: support per-file DAX

2021-09-17 Thread JeffleXu
Hi Vivek, Miklos,

On 9/16/21 4:21 PM, JeffleXu wrote:
> Hi, I add some performance statistics below.
> 
> 
> On 8/17/21 8:40 PM, Vivek Goyal wrote:
>> On Tue, Aug 17, 2021 at 10:32:14AM +0100, Dr. David Alan Gilbert wrote:
>>> * Miklos Szeredi (mik...@szeredi.hu) wrote:
>>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu  wrote:
>>>>>
>>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>>
>>>> Can you please explain the background of this change in detail?
>>>>
>>>> Why would an admin want to enable DAX for a particular virtiofs file
>>>> and not for others?
>>>
>>> Where we're contending on virtiofs dax cache size it makes a lot of
>>> sense; it's quite expensive for us to map something into the cache
>>> (especially if we push something else out), so selectively DAXing files
>>> that are expected to be hot could help reduce cache churn.
> 
> Yes, the performance of dax can be limited when the DAX window is
> limited, where dax window may be contended by multiple files.
> 
> I tested kernel compiling in virtiofs, emulating the scenario where a
> lot of files contending dax window and triggering dax window reclaiming.
> 
> Environment setup:
> - guest vCPU: 16
> - time make vmlinux -j128
> 
> type| cache  | cache-size | time
> --- | -- | -- | 
> non-dax | always |   --   | real 2m48.119s
> dax | always | 64M| real 4m49.563s
> dax | always |   1G   | real 3m14.200s
> dax | always |   4G   | real 2m41.141s
> 
> 
> It can be seen that there's performance drop, comparing to the normal
> buffered IO, when dax window resource is restricted and dax window
> relcaiming is triggered. The smaller the cache size is, the worse the
> performance is. The performance drop can be alleviated and eliminated as
> cache size increases.
> 
> Though we may not compile kernel in virtiofs, indeed we may access a lot
> of small files in virtiofs and suffer this performance drop.
> 
> 
>> In that case probaly we should just make DAX window larger. I assume
> 
> Yes, as the DAX window gets larger, it is less likely that we can run
> short of dax window resource.
> 
> However, it doesn't come without cost. The 'struct page' descriptors
> for the DAX window consume guest memory at a ratio of ~1.5% (64/4096,
> with 64-byte page descriptors and 4K pages). That is, every 1GB of
> cache costs 16MB of guest memory. As the cache size increases, the
> memory footprint of the page descriptors also increases, which may
> offset the benefit DAX gains by eliminating the guest page cache.
> 
> In summary, the per-file DAX feature tries to strike a balance between
> performance and memory overhead by offering users finer-grained control
> over DAX.
> 

I'm not sure whether this is adequate justification for introducing the
per-file DAX feature upstream; some feedback from the community is needed.

And if that's the case, I also want to know whether setting/clearing
S_DAX inside the guest is needed, since in our internal usage scenario,
setting S_DAX from the host daemon is adequate. If setting/clearing
S_DAX inside the guest can be omitted, then the negotiation during the
FUSE_INIT phase is not needed either. After all, we could rely entirely
on the FUSE_ATTR_DAX flag fed by the host daemon to decide whether DAX
shall be enabled for the corresponding file. The whole patch set would
also be somewhat simpler then.


-- 
Thanks,
Jeffle
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Virtio-fs] [PATCH v4 0/8] fuse,virtiofs: support per-file DAX

2021-09-22 Thread JeffleXu
Thanks for the reply and the suggestions. ;)


On 9/20/21 3:45 AM, Vivek Goyal wrote:
> On Thu, Sep 16, 2021 at 04:21:59PM +0800, JeffleXu wrote:
>> Hi, I add some performance statistics below.
>>
>>
>> On 8/17/21 8:40 PM, Vivek Goyal wrote:
>>> On Tue, Aug 17, 2021 at 10:32:14AM +0100, Dr. David Alan Gilbert wrote:
>>>> * Miklos Szeredi (mik...@szeredi.hu) wrote:
>>>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu  
>>>>> wrote:
>>>>>>
>>>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>>>
>>>>> Can you please explain the background of this change in detail?
>>>>>
>>>>> Why would an admin want to enable DAX for a particular virtiofs file
>>>>> and not for others?
>>>>
>>>> Where we're contending on virtiofs dax cache size, it makes a lot of
>>>> sense; it's quite expensive for us to map something into the cache
>>>> (especially if we push something else out), so selectively DAXing files
>>>> that are expected to be hot could help reduce cache churn.
>>
>> Yes, DAX performance can be limited when the DAX window itself is
>> limited and contended by multiple files.
>>
>> I tested kernel compilation in virtiofs, emulating a scenario where many
>> files contend for the DAX window and trigger DAX window reclaim.
>>
>> Environment setup:
>> - guest vCPU: 16
>> - time make vmlinux -j128
>>
>> type    | cache  | cache-size | time
>> ------- | ------ | ---------- | --------------
>> non-dax | always | --         | real 2m48.119s
>> dax     | always | 64M        | real 4m49.563s
>> dax     | always | 1G         | real 3m14.200s
>> dax     | always | 4G         | real 2m41.141s
>>
>>
>> It can be seen that there is a performance drop, compared to normal
>> buffered IO, when the DAX window resource is restricted and DAX window
>> reclaim is triggered. The smaller the cache size, the worse the
>> performance. The drop is alleviated, and eventually eliminated, as the
>> cache size increases.
>>
>> Though we may not compile kernels in virtiofs, we may well access a lot
>> of small files in virtiofs and suffer the same performance drop.
> 
> Hi Jeffle,
> 
> If you access a lot of big files, or a file bigger than the DAX window,
> you will still face a performance drop due to reclaim. IOW, if the data
> being accessed is bigger than the DAX window, then reclaim will trigger
> and a performance drop will be observed. So I think it's not fair to
> associate the performance drop specifically with small files as such.

Yes, it is. Actually, what I mean is that small files (smaller than the
DAX window chunk size) are likely to consume more DAX windows than large
files would for the same total amount of data.


> 
> What makes more sense is the memory-usage argument you used later in
> the email. That is, we have a fixed chunk size of 2MB, which means we
> use 512 * 64 = 32K of memory per chunk. So if a file is smaller than
> 32K, it might be better to just access it without DAX and incur the
> cost of page cache in the guest instead. Even this argument works only
> if the dax window is being utilized fully.

Yes, agreed. In this case, the point of per-file DAX is that an admin
can keep the overall DAX window size bounded while still sustaining
reasonable performance. At the very least, users are now capable of
tuning it.
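The break-even argument quoted above (2MB chunks, 512 * 64 = 32K per chunk) can be made concrete with a small calculation. A sketch using only the figures from the discussion:

```python
# Per-chunk descriptor cost and the resulting break-even file size:
# the DAX window is mapped in fixed 2 MiB chunks, and every mapped chunk
# pays for 512 struct page descriptors whether the file fills it or not.
CHUNK_SIZE = 2 * 1024 * 1024   # DAX mapping granularity (bytes)
PAGE_SIZE = 4096               # bytes per guest page
DESCRIPTOR_SIZE = 64           # bytes per struct page descriptor

pages_per_chunk = CHUNK_SIZE // PAGE_SIZE            # 512 pages
cost_per_chunk = pages_per_chunk * DESCRIPTOR_SIZE   # 32 KiB

# A file smaller than cost_per_chunk would occupy less guest page cache
# than the descriptors its DAX chunk costs, so DAX is a net loss for it.
print(f"descriptor cost per chunk: {cost_per_chunk // 1024} KiB")  # 32 KiB
```

This is where the "files smaller than 32K are better off without DAX" rule of thumb comes from.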

> 
> Anyway, I think Miklos already asked you to send patches so that the
> virtiofs daemon specifies which files to use DAX on. So are you
> planning to post patches again for that? (And drop the patches that
> read the per-inode dax attr from the filesystem in the guest.)

OK. I will send a new version, disabling DAX based on the file size on
the host daemon side. Besides, I believe the negotiation phase is no
longer needed either, since the hint for whether DAX shall be enabled is
now fed entirely by the host daemon, and the guest side needn't
set/clear the per-inode dax attr anymore.
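A size-based policy on the daemon side could be as simple as the following sketch. Only FUSE_ATTR_DAX itself comes from the discussion; the flag's bit position, the helper name, and the 32 KiB threshold (the per-chunk break-even figure derived above) are illustrative assumptions:

```python
# Sketch of a size-based per-file DAX policy on the host daemon side.
# FUSE_ATTR_DAX is the flag discussed above; its value here is a
# placeholder, not the real kernel UAPI bit.
FUSE_ATTR_DAX = 1 << 0

# Break-even point: the per-chunk struct page descriptor cost (32 KiB).
DAX_SIZE_THRESHOLD = 32 * 1024

def fuse_attr_flags(file_size):
    """Return the attr flags the daemon would report for a file."""
    flags = 0
    if file_size >= DAX_SIZE_THRESHOLD:
        flags |= FUSE_ATTR_DAX   # large enough that DAX pays off
    return flags

print(fuse_attr_flags(4 * 1024))     # 0 -> small file, leave DAX off
print(fuse_attr_flags(1024 * 1024))  # 1 -> large file, enable DAX
```

The guest then only has to honour the flag on a per-inode basis, with no FUSE_INIT negotiation required.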

-- 
Thanks,
Jeffle