Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync

2016-12-01 Thread Maxim Patlasov

On 12/01/2016 11:27 AM, Denis V. Lunev wrote:

> On 12/01/2016 10:09 PM, Maxim Patlasov wrote:
>> On 12/01/2016 12:06 AM, Dmitry Monakhov wrote:
>>> Maxim Patlasov writes:
>>>
>>>> Alexey,
>>>>
>>>> You're right. And while composing the patch I was well aware that it's
>>>> possible to rework fuse_sync_writes() using a counter instead of a
>>>> negative bias. But the problem with flush_mtime still exists anyway.
>>>> Think about it: we first read the mtime from the local inode, then
>>>> fill and submit an mtime-update request. From that point on, we don't
>>>> know exactly when the fuse daemon will apply that new mtime to its
>>>> metadata structures. If another mtime update is generated in-between
>>>> (e.g. "touch -d <date> file", or even simpler -- just a single direct
>>>> write implicitly updating mtime), we wouldn't know which of those two
>>>> mtime-update requests is processed by the fuse daemon first. That
>>>> comes from a general FUSE protocol limitation: when kernel fuse queues
>>>> request A, then request B, it cannot be sure whether they will be
>>>> processed by userspace as <A, B> or <B, A>.
>>>>
>>>> The big advantage of the patch I sent is that it's very simple and
>>>> straightforward, and presumably will remove 99% of the contention
>>>> between fsync and io_submit (assuming we spend most of the time
>>>> waiting for the userspace ACK of the FUSE_FSYNC request). There are
>>>> actually three questions to answer:
>>>>
>>>> 1) Must we really honor a crazy app that mixes a lot of fsyncs with a
>>>> lot of io_submits? The goal of fsync is to ensure that some state has
>>>> actually gone to the platters. An app that races io_submit-s with
>>>> fsync-s doesn't actually care which state reaches the platters. I'm
>>>> not sure it's reasonable to work very hard to achieve the best
>>>> possible performance for such a marginal app.
>>>
>>> Obviously, any filesystem behaves like this.
>>> Task A (mail-server) may perform write/fsync while task B (mysql) does
>>> a lot of io_submit-s.
>>> All that IO may happen in parallel; the fs guarantees only that
>>> metadata will be serialized. So all that concurrent IO flows to the
>>> block device, which has no i_mutex, so all IO indeed happens
>>> concurrently.
>>
>> It looks like you're comparing an app doing POSIX
>> open/read/write/fsync/close with an fs doing submit_bio. This is a
>> stretch. But OK, there is a similarity. But I don't think this rather
>> vague similarity proves anything.
>
> We are speaking about a VM process, which essentially
> re-submits IO from the guest to the host as above. For sure,
> QEMU and VM_app have this IO pattern. Thus this pattern MUST be
> optimized, as it is one of our main loads.


Yes, I agree. That's exactly why I wrote in the same email (next paragraph):

This really makes sense. If an app inside a VM loops over ordinary
direct writes, while another app (in the same VM) does fsync, it's not
fair to suspend the first app for a long while just because fuse holds
i_mutex for a long time somewhere deep in fuse_fsync.


Max



> That is why I think this case is not marginal
> and is important.
>
> Den




Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync

2016-12-01 Thread Denis V. Lunev
On 12/01/2016 10:09 PM, Maxim Patlasov wrote:
> On 12/01/2016 12:06 AM, Dmitry Monakhov wrote:
>
>> Maxim Patlasov writes:
>>
>>> Alexey,
>>>
>>> You're right. And while composing the patch I was well aware that it's
>>> possible to rework fuse_sync_writes() using a counter instead of a
>>> negative bias. But the problem with flush_mtime still exists anyway.
>>> Think about it: we first read the mtime from the local inode, then
>>> fill and submit an mtime-update request. From that point on, we don't
>>> know exactly when the fuse daemon will apply that new mtime to its
>>> metadata structures. If another mtime update is generated in-between
>>> (e.g. "touch -d <date> file", or even simpler -- just a single direct
>>> write implicitly updating mtime), we wouldn't know which of those two
>>> mtime-update requests is processed by the fuse daemon first. That
>>> comes from a general FUSE protocol limitation: when kernel fuse queues
>>> request A, then request B, it cannot be sure whether they will be
>>> processed by userspace as <A, B> or <B, A>.
>>>
>>> The big advantage of the patch I sent is that it's very simple and
>>> straightforward, and presumably will remove 99% of the contention
>>> between fsync and io_submit (assuming we spend most of the time
>>> waiting for the userspace ACK of the FUSE_FSYNC request). There are
>>> actually three questions to answer:
>>>
>>> 1) Must we really honor a crazy app that mixes a lot of fsyncs with a
>>> lot of io_submits? The goal of fsync is to ensure that some state has
>>> actually gone to the platters. An app that races io_submit-s with
>>> fsync-s doesn't actually care which state reaches the platters. I'm
>>> not sure it's reasonable to work very hard to achieve the best
>>> possible performance for such a marginal app.
>> Obviously, any filesystem behaves like this.
>> Task A (mail-server) may perform write/fsync while task B (mysql) does
>> a lot of io_submit-s.
>> All that IO may happen in parallel; the fs guarantees only that
>> metadata will be serialized. So all that concurrent IO flows to the
>> block device, which has no i_mutex, so all IO indeed happens
>> concurrently.
>
> It looks like you're comparing an app doing POSIX
> open/read/write/fsync/close with an fs doing submit_bio. This is a
> stretch. But OK, there is a similarity. But I don't think this rather
> vague similarity proves anything.

We are speaking about a VM process, which essentially
re-submits IO from the guest to the host as above. For sure,
QEMU and VM_app have this IO pattern. Thus this pattern MUST be
optimized, as it is one of our main loads.

That is why I think this case is not marginal
and is important.

Den

Den


Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync

2016-12-01 Thread Maxim Patlasov

On 12/01/2016 12:06 AM, Dmitry Monakhov wrote:

> Maxim Patlasov writes:
>
>> Alexey,
>>
>> You're right. And while composing the patch I was well aware that it's
>> possible to rework fuse_sync_writes() using a counter instead of a
>> negative bias. But the problem with flush_mtime still exists anyway.
>> Think about it: we first read the mtime from the local inode, then
>> fill and submit an mtime-update request. From that point on, we don't
>> know exactly when the fuse daemon will apply that new mtime to its
>> metadata structures. If another mtime update is generated in-between
>> (e.g. "touch -d <date> file", or even simpler -- just a single direct
>> write implicitly updating mtime), we wouldn't know which of those two
>> mtime-update requests is processed by the fuse daemon first. That
>> comes from a general FUSE protocol limitation: when kernel fuse queues
>> request A, then request B, it cannot be sure whether they will be
>> processed by userspace as <A, B> or <B, A>.
>>
>> The big advantage of the patch I sent is that it's very simple and
>> straightforward, and presumably will remove 99% of the contention
>> between fsync and io_submit (assuming we spend most of the time
>> waiting for the userspace ACK of the FUSE_FSYNC request). There are
>> actually three questions to answer:
>>
>> 1) Must we really honor a crazy app that mixes a lot of fsyncs with a
>> lot of io_submits? The goal of fsync is to ensure that some state has
>> actually gone to the platters. An app that races io_submit-s with
>> fsync-s doesn't actually care which state reaches the platters. I'm
>> not sure it's reasonable to work very hard to achieve the best
>> possible performance for such a marginal app.
>
> Obviously, any filesystem behaves like this.
> Task A (mail-server) may perform write/fsync while task B (mysql) does
> a lot of io_submit-s.
> All that IO may happen in parallel; the fs guarantees only that
> metadata will be serialized. So all that concurrent IO flows to the
> block device, which has no i_mutex, so all IO indeed happens
> concurrently.

It looks like you're comparing an app doing POSIX
open/read/write/fsync/close with an fs doing submit_bio. This is a
stretch. But OK, there is a similarity. But I don't think this rather
vague similarity proves anything.

> But when we deal with fs-in-file (loop/ploop/qemu-nbd), we face
> i_mutex on the file.

This really makes sense. If an app inside a VM loops over ordinary
direct writes, while another app (in the same VM) does fsync, it's not
fair to suspend the first app for a long while just because fuse holds
i_mutex for a long time somewhere deep in fuse_fsync.

> For a general filesystem (xfs/ext4)

What makes xfs/ext4 general? They are *local* filesystems with almost
direct access to the underlying HDD. Have you explored how CEPH
implements fsync?

> we grab i_mutex only on the write
> path; fsync is lockless.

What about the following part of ext4_sync_file:

        if (!journal) {
                ret = generic_file_fsync(file, start, end, datasync);

? Is it lockless?
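
For reference, a condensed sketch of generic_file_fsync() as it looks in a 3.10-era kernel (paraphrased, not a verbatim copy) -- the data flush runs with no lock held, but the metadata part takes i_mutex:

        int generic_file_fsync(struct file *file, loff_t start, loff_t end,
                               int datasync)
        {
                struct inode *inode = file->f_mapping->host;
                int err, ret;

                /* Data pages are flushed with no lock held... */
                err = filemap_write_and_wait_range(inode->i_mapping, start, end);
                if (err)
                        return err;

                /* ...but the metadata sync happens under i_mutex. */
                mutex_lock(&inode->i_mutex);
                ret = sync_mapping_buffers(inode->i_mapping);
                if (!(inode->i_state & I_DIRTY))
                        goto out;
                if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
                        goto out;
                err = sync_inode_metadata(inode, 1);
                if (ret == 0)
                        ret = err;
        out:
                mutex_unlock(&inode->i_mutex);
                return ret;
        }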


> But in the case of fuse we artificially introduce
> i_mutex inside fsync

exactly the same as generic_file_fsync() does

> which basically kills concurrency for the upper FS.
> As a result we get the SMP scalability of Linux v2.2, with a single
> mutex in the VFS.

You can't derive "kills concurrency" from "introduces i_mutex" as
simply as that. It really matters how long the mutex is held and how
often it contends with writes.

> BTW: I wonder why we care about mtime at all. For fs-in-file we can
> relax that, for example flushing mtime only on fsync, and not for
> fdatasync.

Good catch, makes sense!
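
A sketch of that relaxation, assuming names from 3.1x-era fuse (fuse_flush_mtime(), FUSE_I_MTIME_DIRTY and the fuse_fsync_common() call are taken as illustrative and may differ in vz7): flush mtime for fsync(2) only, since fdatasync(2) has no obligation to persist pure metadata such as timestamps.

        static int fuse_fsync_sketch(struct file *file, loff_t start,
                                     loff_t end, int datasync)
        {
                struct inode *inode = file->f_mapping->host;
                struct fuse_inode *fi = get_fuse_inode(inode);
                int err;

                /* Only a full fsync(2) must persist times. */
                if (!datasync && test_bit(FUSE_I_MTIME_DIRTY, &fi->state)) {
                        err = fuse_flush_mtime(file, false);
                        if (err)
                                return err;
                }
                return fuse_fsync_common(file, start, end, datasync, 0);
        }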






>> 2) Will the patch (in the form I sent it) break something? I think
>> not. If you know of some use case that can be broken, let's discuss it
>> in more detail.
>>
>> 3) Should we expect some noticeable (or significant) improvement in
>> performance comparing fuse_fsync with no locking at all vs. the
>> locking we have with that patch applied? I tend to think the answer is
>> "no", because handling FUSE_FSYNC is a notoriously heavy-weight
>> operation. If you disagree, let's first measure that difference in
>> performance (simply commenting out lock/unlock(i_mutex) in fuse_fsync)
>> and then start to think about whether it's really worthwhile to fully
>> rework the locking scheme to preserve flush_mtime correctness w/o
>> i_mutex.
>>
>> Thanks,
>>
>> Maxim
>>
>> On 11/30/2016 05:09 AM, Alexey Kuznetsov wrote:
>>> Sorry, I missed that the pair fuse_set_nowrite/fuse_release_writes
>>> can only be done under i_mutex.
>>>
>>> IMHO it is only due to a bad implementation.
>>> If fuse_set_nowrite were done with a separate
>>> count instead of adding a negative bias, it would
>>> be possible.
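
A minimal sketch of that counter-based alternative (all names here are illustrative assumptions, not the real fuse_inode layout): writers wait while nowrite > 0, and, unlike the negative-bias scheme, several tasks can hold the nowrite reference at once, so no i_mutex is needed to serialize them.

        struct fuse_inode_sketch {
                spinlock_t              lock;
                int                     writectr;  /* in-flight writeback requests */
                int                     nowrite;   /* > 0: new writeback is blocked */
                wait_queue_head_t       page_waitq;
        };

        static bool sketch_writes_drained(struct fuse_inode_sketch *fi)
        {
                bool drained;

                spin_lock(&fi->lock);
                drained = fi->writectr == 0;
                spin_unlock(&fi->lock);
                return drained;
        }

        static void sketch_set_nowrite(struct fuse_inode_sketch *fi)
        {
                spin_lock(&fi->lock);
                fi->nowrite++;                  /* block new writeback */
                spin_unlock(&fi->lock);
                wait_event(fi->page_waitq, sketch_writes_drained(fi));
        }

        static void sketch_release_writes(struct fuse_inode_sketch *fi)
        {
                spin_lock(&fi->lock);
                if (--fi->nowrite == 0)
                        wake_up(&fi->page_waitq);  /* let writers proceed again */
                spin_unlock(&fi->lock);
        }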


>>> On Wed, Nov 30, 2016 at 3:47 PM, Alexey Kuznetsov wrote:
>>>> Hello!
>>>>
>>>> I do not think you got it right.
>>>>
>>>> i_mutex in fsync is not about some atomicity;
>>>> it is about stopping the data feed while fsync executes,
>>>> to prevent livelock.
>>>>
>>>> I cannot say anything about the mtime update; it is just
>>>> voodoo magic to me.
>>>>
>>>> As for fsync semantics, I see two different ways:
>>>>
>>>> A.
>>>>
>>>> 1. Remove the useless write_inode_now. Its work is done
>>>>    by 

Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync

2016-12-01 Thread Dmitry Monakhov
Maxim Patlasov writes:

> Alexey,
>
> You're right. And while composing the patch I was well aware that it's
> possible to rework fuse_sync_writes() using a counter instead of a
> negative bias. But the problem with flush_mtime still exists anyway.
> Think about it: we first read the mtime from the local inode, then
> fill and submit an mtime-update request. From that point on, we don't
> know exactly when the fuse daemon will apply that new mtime to its
> metadata structures. If another mtime update is generated in-between
> (e.g. "touch -d <date> file", or even simpler -- just a single direct
> write implicitly updating mtime), we wouldn't know which of those two
> mtime-update requests is processed by the fuse daemon first. That
> comes from a general FUSE protocol limitation: when kernel fuse queues
> request A, then request B, it cannot be sure whether they will be
> processed by userspace as <A, B> or <B, A>.
>
>
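
That ordering limitation is easy to reproduce in any multithreaded daemon. A toy userspace illustration (plain pthreads, not FUSE code): two requests are queued in a fixed order, but whichever worker thread is scheduled last determines the final mtime.

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        static long applied_mtime;      /* the daemon's metadata copy */
        static pthread_mutex_t meta = PTHREAD_MUTEX_INITIALIZER;

        static void *worker(void *arg)
        {
                long req = (long)arg;           /* mtime carried by this request */

                usleep(rand() % 1000);          /* unpredictable processing delay */
                pthread_mutex_lock(&meta);
                applied_mtime = req;            /* last writer wins */
                pthread_mutex_unlock(&meta);
                return NULL;
        }

        int main(void)
        {
                pthread_t a, b;

                /* Request A is queued first, request B second... */
                pthread_create(&a, NULL, worker, (void *)1000L);
                pthread_create(&b, NULL, worker, (void *)2000L);
                pthread_join(a, NULL);
                pthread_join(b, NULL);
                /* ...yet the final mtime may come from either one. */
                printf("final mtime: %ld\n", applied_mtime);
                return 0;
        }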
> The big advantage of the patch I sent is that it's very simple and
> straightforward, and presumably will remove 99% of the contention
> between fsync and io_submit (assuming we spend most of the time
> waiting for the userspace ACK of the FUSE_FSYNC request). There are
> actually three questions to answer:
>
> 1) Must we really honor a crazy app that mixes a lot of fsyncs with a
> lot of io_submits? The goal of fsync is to ensure that some state has
> actually gone to the platters. An app that races io_submit-s with
> fsync-s doesn't actually care which state reaches the platters. I'm
> not sure it's reasonable to work very hard to achieve the best
> possible performance for such a marginal app.

Obviously, any filesystem behaves like this.
Task A (mail-server) may perform write/fsync while task B (mysql) does
a lot of io_submit-s.
All that IO may happen in parallel; the fs guarantees only that
metadata will be serialized. So all that concurrent IO flows to the
block device, which has no i_mutex, so all IO indeed happens
concurrently.
But when we deal with fs-in-file (loop/ploop/qemu-nbd), we face i_mutex
on the file. For a general filesystem (xfs/ext4) we grab i_mutex only
on the write path; fsync is lockless. But in the case of fuse we
artificially introduce i_mutex inside fsync, which basically kills
concurrency for the upper FS.
As a result we get the SMP scalability of Linux v2.2, with a single
mutex in the VFS.

BTW: I wonder why we care about mtime at all. For fs-in-file we can
relax that, for example flushing mtime only on fsync, and not for
fdatasync.

> 2) Will the patch (in the form I sent it) break something? I think
> not. If you know of some use case that can be broken, let's discuss it
> in more detail.
>
> 3) Should we expect some noticeable (or significant) improvement in
> performance comparing fuse_fsync with no locking at all vs. the
> locking we have with that patch applied? I tend to think the answer is
> "no", because handling FUSE_FSYNC is a notoriously heavy-weight
> operation. If you disagree, let's first measure that difference in
> performance (simply commenting out lock/unlock(i_mutex) in fuse_fsync)
> and then start to think about whether it's really worthwhile to fully
> rework the locking scheme to preserve flush_mtime correctness w/o
> i_mutex.
>
> Thanks,
>
> Maxim
>
> On 11/30/2016 05:09 AM, Alexey Kuznetsov wrote:
>> Sorry, I missed that the pair fuse_set_nowrite/fuse_release_writes
>> can only be done under i_mutex.
>>
>> IMHO it is only due to a bad implementation.
>> If fuse_set_nowrite were done with a separate
>> count instead of adding a negative bias, it would
>> be possible.
>>
>> On Wed, Nov 30, 2016 at 3:47 PM, Alexey Kuznetsov wrote:
>>> Hello!
>>>
>>> I do not think you got it right.
>>>
>>> i_mutex in fsync is not about some atomicity;
>>> it is about stopping the data feed while fsync executes,
>>> to prevent livelock.
>>>
>>> I cannot say anything about the mtime update; it is just
>>> voodoo magic to me.
>>>
>>> As for fsync semantics, I see two different ways:
>>>
>>> A.
>>>
>>> 1. Remove the useless write_inode_now. Its work is done
>>>    by filemap_write_and_wait_range(); there is no need to repeat it
>>>    under the mutex.
>>> 2. Move mutex_lock _after_ fuse_sync_writes(), which is essentially
>>>    the fuse continuation of filemap_write_and_wait_range().
>>> 3. i_mutex is preserved only around the fsync call.
>>>
>>> B.
>>> 1. Remove write_inode_now as well.
>>> 2. Remove i_mutex _completely_. (No idea about the mtime voodoo
>>>    though.)
>>> 3. Replace fuse_sync_writes() with fuse_set_nowrite()
>>>    and add a release after the call to FSYNC.
>>>
>>> Both prevent livelock. B is obviously optimal.
>>>
>>> But A preserves the historic fuse protocol semantics.
>>> E.g. I have no idea whether user space would survive a truncate
>>> racing with fsync. pstorage should survive, though this
>>> path was never tested.
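
A rough sketch of option A (an assumption of how the reordering could look, not the actual patch; fuse_send_fsync() is a hypothetical helper, and step 2 assumes the counter-based fuse_sync_writes() from the follow-up mail so that it is safe without i_mutex):

        static int fuse_fsync_reordered_sketch(struct file *file, loff_t start,
                                               loff_t end, int datasync)
        {
                struct inode *inode = file->f_mapping->host;
                int err;

                /* 1. Flush dirty pages with no lock held
                 *    (the write_inode_now call is dropped). */
                err = filemap_write_and_wait_range(file->f_mapping, start, end);
                if (err)
                        return err;

                /* 2. Wait until fuse writeback drains, still without i_mutex. */
                fuse_sync_writes(inode);

                /* 3. i_mutex now covers only the FUSE_FSYNC request itself. */
                mutex_lock(&inode->i_mutex);
                err = fuse_send_fsync(file, datasync);  /* hypothetical helper */
                mutex_unlock(&inode->i_mutex);
                return err;
        }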
>>>
>>> On Wed, Nov 30, 2016 at 4:02 AM, Maxim Patlasov wrote:
>>>> fuse_fsync_common() does need i_mutex for