Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync

2016-12-01 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Alexey,
>
>
> You're right. And while composing the patch I well understood that it's 
> possible to rework fuse_sync_writes() using a counter instead of 
> negative bias. But the problem with flush_mtime still exists anyway. 
> Think about it: we firstly acquire local mtime from local inode, then 
> fill and submit mtime-update-request. Since then, we don't know when 
> exactly fuse daemon will apply that new mtime to its metadata 
> structures. If another mtime-update is generated in-between (e.g. "touch 
> -d  file", or even simplier -- just a single direct write 
> implicitly updating mtime), we wouldn't know which of those two 
> mtime-update-requests are processed by fused first. That comes from a 
> general FUSE protocol limitation: when kernel fuse queues request A, 
> then request B, it cannot be sure if they will be processed by userspace 
> as  or .
>
>
> The big advantage of the patch I sent is that it's very simple, 
> straightforward and presumably will remove 99% of contention between 
> fsync and io_submit (assuming we spend most of time waiting for 
> userspace ACK for FUSE_FSYNC request. There are actually three questions 
> to answer:

>
>
> 1) Do we really must honor a crazy app who mixes a lot of fsyncs with a 
> lot of io_submits? The goal of fsync is to ensure that some state is 
> actually went to platters. An app who races io_submit-s with fsync-s 
> actually doesn't care which state will come to platters. I'm not sure 
> that it's reasonable to work very hard to achieve the best possible 
> performance for such a marginal app.
Obiously any filesystem behave like this.
Task A(mail-server) may perform write/fsync, task B(mysql) do a lot of 
io_submit-s
All that io may happens in parallel, fs guarantee only that metadata
will be serialized. So all that concurent IO flowa to blockdevice which
does no have i_mutex so all IO indeed happen concurrently.
But when we dealt with fs-in-file (loop/ploop/qemu-nbd) we face i_mutex
on file. For general filesystem (xfs/ext4) we grab i_mutex only on write
path, fsync is lockless. But int case of fuse we artificially introduce
i_mutex inside fsync which basically kill concurrency for upper FS.
As result we have SMP scalability as we have in Linux-v2.2 with single
mutex in VFS.

BTW: I'm wondering why do we care about mtime at all. for fs-in-file
we can relax that, for example flush mtime only on fsync, and not for fdatasync.

>
>
> 2) Will the patch (in the form I sent it) break something? I think no. 
> If you know some usecase that can be broken, let's discuss it in more 
> details.
>
>
> 3) Should we expect some noticeable (or significant) improvement in 
> performance comparing fuse_fsync with no locking at all vs. the locking 
> we have with that patch applied? I tend to think that the answer is "no" 
> because handling FUSE_FSYNC is notoriously heavy-weight operation. If 
> you disagree, let's firstly measure that difference in performance 
> (simply commenting out lock/unlock(i_mutex) in fuse_fsync) and then 
> start to think if it's really worthy to fully re-work locking scheme to 
> preserve flush_mtime correctness w/o i_mutex.
>
>
> Thanks,
>
> Maxim
>
>
> On 11/30/2016 05:09 AM, Alexey Kuznetsov wrote:
>> Sorry, missed that pair fuse_set_nowrite/fuse_release_writes
>> can be done only under i_mutex.
>>
>> IMHO it is only due to bad implementation.
>> If fuse_set_nowrite would be done with separate
>> count instead of adding negative bias, it would
>> be possible.
>>
>>
>> On Wed, Nov 30, 2016 at 3:47 PM, Alexey Kuznetsov  
>> wrote:
>>> Hello!
>>>
>>> I do not think you got it right.
>>>
>>> i_mutex in fsync is not about some atomicity,
>>> it is about stopping data feed while fsync is executed
>>> to prevent livelock.
>>>
>>> I cannot tell anything about mtime update, it is just some voodoo
>>> magic for me.
>>>
>>> What's about fsync semantics, I see two different ways:
>>>
>>> A.
>>>
>>> 1. Remove useless write_inode_now. Its work is done
>>>  by filemap_write_and_wait_range(), there is no need to repeat it
>>> under mutex.
>>> 2. move mutex_lock _after_  fuse_sync_writes(), which is essentially
>>>  fuse continuation forfilemap_write_and_wait_range().
>>> 3. i_mutex is preserved only around fsync call.
>>>
>>> B.
>>> 1. Remove  write_inode_now as well.
>>> 2. Remove i_mutex _completely_. (No idea about mtime voodo though)
>>> 2. Replace fuse_sync_writes() with fuse_set_nowrite()
>>>  and add release after call to FSYNC.
>>>
>>> Both prevent livelock. B is obviosly optimal.
>>>
>>> But A preserves historic fuse protocol semantics.
>>> F.e. I have no idea would user space survive truncate
>>> racing with fsync. pstorage should survice, though this
>>> path was never tested.
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 30, 2016 at 4:02 AM, Maxim Patlasov  
>>> wrote:
 fuse_fsync_common() does need i_mutex for fuse_sync_writes() and
 fuse_flush_mtime(). But when those operations are done, it's actual

[Devel] [PATCH RHEL7 COMMIT] ms/KVM: x86: expose MSR_TSC_AUX to userspace

2016-12-01 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.36.1.vz7.20.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.36.1.vz7.20.8
-->
commit 37f023b9287ec994f33a676f280ebd26816b2dfa
Author: Paolo Bonzini 
Date:   Thu Dec 1 13:51:12 2016 +0400

ms/KVM: x86: expose MSR_TSC_AUX to userspace

If we do not do this, it is not properly saved and restored across
migration.  Windows notices due to its self-protection mechanisms,
and is very upset about it (blue screen of death).

Cc: Radim Krcmar 
Cc: sta...@vger.kernel.org
Signed-off-by: Paolo Bonzini 

ms commit: 9dbe6cf ("KVM: x86: expose MSR_TSC_AUX to userspace")
https://jira.sw.ru/browse/PSBM-55531

Signed-off-by: Evgeny Yakovlev 
---
 arch/x86/kvm/x86.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b2754b..78ea28c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -953,7 +953,7 @@ static u32 msrs_to_save[] = {
MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
 #endif
MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
-   MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS
+   MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
 };
 
 static unsigned num_msrs_to_save;
@@ -4148,16 +4148,17 @@ static void kvm_init_msr_list(void)
 
/*
 * Even MSRs that are valid in the host may not be exposed
-* to the guests in some cases.  We could work around this
-* in VMX with the generic MSR save/load machinery, but it
-* is not really worthwhile since it will really only
-* happen with nested virtualization.
+* to the guests in some cases.
 */
switch (msrs_to_save[i]) {
case MSR_IA32_BNDCFGS:
if (!kvm_x86_ops->mpx_supported())
continue;
break;
+   case MSR_TSC_AUX:
+   if (!kvm_x86_ops->rdtscp_supported())
+   continue;
+   break;
default:
break;
}
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync

2016-12-01 Thread Maxim Patlasov

On 12/01/2016 12:06 AM, Dmitry Monakhov wrote:


Maxim Patlasov  writes:


Alexey,


You're right. And while composing the patch I well understood that it's
possible to rework fuse_sync_writes() using a counter instead of
negative bias. But the problem with flush_mtime still exists anyway.
Think about it: we firstly acquire local mtime from local inode, then
fill and submit mtime-update-request. Since then, we don't know when
exactly fuse daemon will apply that new mtime to its metadata
structures. If another mtime-update is generated in-between (e.g. "touch
-d  file", or even simplier -- just a single direct write
implicitly updating mtime), we wouldn't know which of those two
mtime-update-requests are processed by fused first. That comes from a
general FUSE protocol limitation: when kernel fuse queues request A,
then request B, it cannot be sure if they will be processed by userspace
as  or .


The big advantage of the patch I sent is that it's very simple,
straightforward and presumably will remove 99% of contention between
fsync and io_submit (assuming we spend most of time waiting for
userspace ACK for FUSE_FSYNC request. There are actually three questions
to answer:

1) Do we really must honor a crazy app who mixes a lot of fsyncs with a
lot of io_submits? The goal of fsync is to ensure that some state is
actually went to platters. An app who races io_submit-s with fsync-s
actually doesn't care which state will come to platters. I'm not sure
that it's reasonable to work very hard to achieve the best possible
performance for such a marginal app.

Obiously any filesystem behave like this.
Task A(mail-server) may perform write/fsync, task B(mysql) do a lot of 
io_submit-s
All that io may happens in parallel, fs guarantee only that metadata
will be serialized. So all that concurent IO flowa to blockdevice which
does no have i_mutex so all IO indeed happen concurrently.


Looks as you're comparing an app doing POSIX open/read/write/fsync/close 
with fs doing submit_bio. This is a stretch. But OK, there is a 
similarity. But I don't think this rather vague similarity proves something.



But when we dealt with fs-in-file (loop/ploop/qemu-nbd) we face i_mutex
on file.


This really makes sense. If an app inside a VM loops over ordinary 
direct writes, while another app (in the same VM) does fsync, it's not 
fair to suspend the first app for long while just because fuse holds 
i_mutex for long somewhere deep in fuse_fsync.



For general filesystem (xfs/ext4)


What makes xfs/ext4 general? They are *local* fs having almost direct 
access to underlying HDD. Have you explored how CEPH implements fsync?



we grab i_mutex only on write
path, fsync is lockless.


What about the following part of ext4_sync_file:


if (!journal) {
ret = generic_file_fsync(file, start, end, datasync);

? Is it lockless?


But int case of fuse we artificially introduce
i_mutex inside fsync


exactly the same as generic_file_fsync()


which basically kill concurrency for upper FS.
As result we have SMP scalability as we have in Linux-v2.2 with single
mutex in VFS.


You can't derive "kill concurrency" from "introduce i_mutex" as simple 
as that. It really does matter for how long the mutex is held and how 
often it contends with writes.




BTW: I'm wondering why do we care about mtime at all. for fs-in-file
we can relax that, for example flush mtime only on fsync, and not for fdatasync.


Good catch, makes sense!






2) Will the patch (in the form I sent it) break something? I think no.
If you know some usecase that can be broken, let's discuss it in more
details.


3) Should we expect some noticeable (or significant) improvement in
performance comparing fuse_fsync with no locking at all vs. the locking
we have with that patch applied? I tend to think that the answer is "no"
because handling FUSE_FSYNC is notoriously heavy-weight operation. If
you disagree, let's firstly measure that difference in performance
(simply commenting out lock/unlock(i_mutex) in fuse_fsync) and then
start to think if it's really worthy to fully re-work locking scheme to
preserve flush_mtime correctness w/o i_mutex.


Thanks,

Maxim


On 11/30/2016 05:09 AM, Alexey Kuznetsov wrote:

Sorry, missed that pair fuse_set_nowrite/fuse_release_writes
can be done only under i_mutex.

IMHO it is only due to bad implementation.
If fuse_set_nowrite would be done with separate
count instead of adding negative bias, it would
be possible.


On Wed, Nov 30, 2016 at 3:47 PM, Alexey Kuznetsov  wrote:

Hello!

I do not think you got it right.

i_mutex in fsync is not about some atomicity,
it is about stopping data feed while fsync is executed
to prevent livelock.

I cannot tell anything about mtime update, it is just some voodoo
magic for me.

What's about fsync semantics, I see two different ways:

A.

1. Remove useless write_inode_now. Its work is done
  by filemap_write_and_wait_range(), there is no need to repeat it
under mutex.
2. move mutex_lock _afte

Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync

2016-12-01 Thread Denis V. Lunev
On 12/01/2016 10:09 PM, Maxim Patlasov wrote:
> On 12/01/2016 12:06 AM, Dmitry Monakhov wrote:
>
>> Maxim Patlasov  writes:
>>
>>> Alexey,
>>>
>>>
>>> You're right. And while composing the patch I well understood that it's
>>> possible to rework fuse_sync_writes() using a counter instead of
>>> negative bias. But the problem with flush_mtime still exists anyway.
>>> Think about it: we firstly acquire local mtime from local inode, then
>>> fill and submit mtime-update-request. Since then, we don't know when
>>> exactly fuse daemon will apply that new mtime to its metadata
>>> structures. If another mtime-update is generated in-between (e.g.
>>> "touch
>>> -d  file", or even simplier -- just a single direct write
>>> implicitly updating mtime), we wouldn't know which of those two
>>> mtime-update-requests are processed by fused first. That comes from a
>>> general FUSE protocol limitation: when kernel fuse queues request A,
>>> then request B, it cannot be sure if they will be processed by
>>> userspace
>>> as  or .
>>>
>>>
>>> The big advantage of the patch I sent is that it's very simple,
>>> straightforward and presumably will remove 99% of contention between
>>> fsync and io_submit (assuming we spend most of time waiting for
>>> userspace ACK for FUSE_FSYNC request. There are actually three
>>> questions
>>> to answer:
>>>
>>> 1) Do we really must honor a crazy app who mixes a lot of fsyncs with a
>>> lot of io_submits? The goal of fsync is to ensure that some state is
>>> actually went to platters. An app who races io_submit-s with fsync-s
>>> actually doesn't care which state will come to platters. I'm not sure
>>> that it's reasonable to work very hard to achieve the best possible
>>> performance for such a marginal app.
>> Obiously any filesystem behave like this.
>> Task A(mail-server) may perform write/fsync, task B(mysql) do a lot
>> of io_submit-s
>> All that io may happens in parallel, fs guarantee only that metadata
>> will be serialized. So all that concurent IO flowa to blockdevice which
>> does no have i_mutex so all IO indeed happen concurrently.
>
> Looks as you're comparing an app doing POSIX
> open/read/write/fsync/close with fs doing submit_bio. This is a
> stretch. But OK, there is a similarity. But I don't think this rather
> vague similarity proves something.

we are speaking about VM process, which essentially
re-submits IO from the guest to host like above. For sure
QEMU and VM_app have this IO pattern. Thus this
pattern MUST be optimized as this is one of our
main loads.

That is why I think that this case is not marginal
and important.

Den
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync

2016-12-01 Thread Maxim Patlasov

On 12/01/2016 11:27 AM, Denis V. Lunev wrote:


On 12/01/2016 10:09 PM, Maxim Patlasov wrote:

On 12/01/2016 12:06 AM, Dmitry Monakhov wrote:


Maxim Patlasov  writes:


Alexey,


You're right. And while composing the patch I well understood that it's
possible to rework fuse_sync_writes() using a counter instead of
negative bias. But the problem with flush_mtime still exists anyway.
Think about it: we firstly acquire local mtime from local inode, then
fill and submit mtime-update-request. Since then, we don't know when
exactly fuse daemon will apply that new mtime to its metadata
structures. If another mtime-update is generated in-between (e.g.
"touch
-d  file", or even simplier -- just a single direct write
implicitly updating mtime), we wouldn't know which of those two
mtime-update-requests are processed by fused first. That comes from a
general FUSE protocol limitation: when kernel fuse queues request A,
then request B, it cannot be sure if they will be processed by
userspace
as  or .


The big advantage of the patch I sent is that it's very simple,
straightforward and presumably will remove 99% of contention between
fsync and io_submit (assuming we spend most of time waiting for
userspace ACK for FUSE_FSYNC request. There are actually three
questions
to answer:

1) Do we really must honor a crazy app who mixes a lot of fsyncs with a
lot of io_submits? The goal of fsync is to ensure that some state is
actually went to platters. An app who races io_submit-s with fsync-s
actually doesn't care which state will come to platters. I'm not sure
that it's reasonable to work very hard to achieve the best possible
performance for such a marginal app.

Obiously any filesystem behave like this.
Task A(mail-server) may perform write/fsync, task B(mysql) do a lot
of io_submit-s
All that io may happens in parallel, fs guarantee only that metadata
will be serialized. So all that concurent IO flowa to blockdevice which
does no have i_mutex so all IO indeed happen concurrently.

Looks as you're comparing an app doing POSIX
open/read/write/fsync/close with fs doing submit_bio. This is a
stretch. But OK, there is a similarity. But I don't think this rather
vague similarity proves something.

we are speaking about VM process, which essentially
re-submits IO from the guest to host like above. For sure
QEMU and VM_app have this IO pattern. Thus this
pattern MUST be optimized as this is one of our
main loads.


Yes, I agree. That's exactly why I wrote in the same email (next paragraph):

This really makes sense. If an app inside a VM loops over ordinary 
direct writes, while another app (in the same VM) does fsync, it's not 
fair to suspend the first app for long while just because fuse holds 
i_mutex for long somewhere deep in fuse_fsync. 


Max



That is why I think that this case is not marginal
and important.

Den


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz7] fuse: no mtime flush on fdatasync

2016-12-01 Thread Maxim Patlasov
fuse_fsync_common() may skip fuse_flush_mtime() if datasync=1 because
mtime is pure metadata and the content of file doesn't depend on it.

https://jira.sw.ru/browse/PSBM-55919

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 559dfd9..e5c4778 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -684,8 +684,8 @@ int fuse_fsync_common(struct file *file, loff_t start, 
loff_t end,
if (err)
goto out;
 
-   if (test_bit(FUSE_I_MTIME_UPDATED,
-&get_fuse_inode(inode)->state)) {
+   if (!datasync && test_bit(FUSE_I_MTIME_UPDATED,
+ &get_fuse_inode(inode)->state)) {
err = fuse_flush_mtime(file, false);
if (err)
goto out;

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel