Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync
On 12/01/2016 11:27 AM, Denis V. Lunev wrote: On 12/01/2016 10:09 PM, Maxim Patlasov wrote: On 12/01/2016 12:06 AM, Dmitry Monakhov wrote: Maxim Patlasovwrites: Alexey, You're right. And while composing the patch I well understood that it's possible to rework fuse_sync_writes() using a counter instead of negative bias. But the problem with flush_mtime still exists anyway. Think about it: we firstly acquire local mtime from local inode, then fill and submit mtime-update-request. Since then, we don't know when exactly fuse daemon will apply that new mtime to its metadata structures. If another mtime-update is generated in-between (e.g. "touch -d file", or even simplier -- just a single direct write implicitly updating mtime), we wouldn't know which of those two mtime-update-requests are processed by fused first. That comes from a general FUSE protocol limitation: when kernel fuse queues request A, then request B, it cannot be sure if they will be processed by userspace as or . The big advantage of the patch I sent is that it's very simple, straightforward and presumably will remove 99% of contention between fsync and io_submit (assuming we spend most of time waiting for userspace ACK for FUSE_FSYNC request. There are actually three questions to answer: 1) Do we really must honor a crazy app who mixes a lot of fsyncs with a lot of io_submits? The goal of fsync is to ensure that some state is actually went to platters. An app who races io_submit-s with fsync-s actually doesn't care which state will come to platters. I'm not sure that it's reasonable to work very hard to achieve the best possible performance for such a marginal app. Obiously any filesystem behave like this. Task A(mail-server) may perform write/fsync, task B(mysql) do a lot of io_submit-s All that io may happens in parallel, fs guarantee only that metadata will be serialized. So all that concurent IO flowa to blockdevice which does no have i_mutex so all IO indeed happen concurrently. Looks as you're comparing an app doing POSIX open/read/write/fsync/close with fs doing submit_bio. This is a stretch. But OK, there is a similarity. But I don't think this rather vague similarity proves something. we are speaking about VM process, which essentially re-submits IO from the guest to host like above. For sure QEMU and VM_app have this IO pattern. Thus this pattern MUST be optimized as this is one of our main loads. Yes, I agree. That's exactly why I wrote in the same email (next paragraph): This really makes sense. If an app inside a VM loops over ordinary direct writes, while another app (in the same VM) does fsync, it's not fair to suspend the first app for long while just because fuse holds i_mutex for long somewhere deep in fuse_fsync. Max That is why I think that this case is not marginal and important. Den ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync
On 12/01/2016 10:09 PM, Maxim Patlasov wrote: > On 12/01/2016 12:06 AM, Dmitry Monakhov wrote: > >> Maxim Patlasovwrites: >> >>> Alexey, >>> >>> >>> You're right. And while composing the patch I well understood that it's >>> possible to rework fuse_sync_writes() using a counter instead of >>> negative bias. But the problem with flush_mtime still exists anyway. >>> Think about it: we firstly acquire local mtime from local inode, then >>> fill and submit mtime-update-request. Since then, we don't know when >>> exactly fuse daemon will apply that new mtime to its metadata >>> structures. If another mtime-update is generated in-between (e.g. >>> "touch >>> -d file", or even simplier -- just a single direct write >>> implicitly updating mtime), we wouldn't know which of those two >>> mtime-update-requests are processed by fused first. That comes from a >>> general FUSE protocol limitation: when kernel fuse queues request A, >>> then request B, it cannot be sure if they will be processed by >>> userspace >>> as or . >>> >>> >>> The big advantage of the patch I sent is that it's very simple, >>> straightforward and presumably will remove 99% of contention between >>> fsync and io_submit (assuming we spend most of time waiting for >>> userspace ACK for FUSE_FSYNC request. There are actually three >>> questions >>> to answer: >>> >>> 1) Do we really must honor a crazy app who mixes a lot of fsyncs with a >>> lot of io_submits? The goal of fsync is to ensure that some state is >>> actually went to platters. An app who races io_submit-s with fsync-s >>> actually doesn't care which state will come to platters. I'm not sure >>> that it's reasonable to work very hard to achieve the best possible >>> performance for such a marginal app. >> Obiously any filesystem behave like this. >> Task A(mail-server) may perform write/fsync, task B(mysql) do a lot >> of io_submit-s >> All that io may happens in parallel, fs guarantee only that metadata >> will be serialized. So all that concurent IO flowa to blockdevice which >> does no have i_mutex so all IO indeed happen concurrently. > > Looks as you're comparing an app doing POSIX > open/read/write/fsync/close with fs doing submit_bio. This is a > stretch. But OK, there is a similarity. But I don't think this rather > vague similarity proves something. we are speaking about VM process, which essentially re-submits IO from the guest to host like above. For sure QEMU and VM_app have this IO pattern. Thus this pattern MUST be optimized as this is one of our main loads. That is why I think that this case is not marginal and important. Den ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync
On 12/01/2016 12:06 AM, Dmitry Monakhov wrote: Maxim Patlasovwrites: Alexey, You're right. And while composing the patch I well understood that it's possible to rework fuse_sync_writes() using a counter instead of negative bias. But the problem with flush_mtime still exists anyway. Think about it: we firstly acquire local mtime from local inode, then fill and submit mtime-update-request. Since then, we don't know when exactly fuse daemon will apply that new mtime to its metadata structures. If another mtime-update is generated in-between (e.g. "touch -d file", or even simplier -- just a single direct write implicitly updating mtime), we wouldn't know which of those two mtime-update-requests are processed by fused first. That comes from a general FUSE protocol limitation: when kernel fuse queues request A, then request B, it cannot be sure if they will be processed by userspace as or . The big advantage of the patch I sent is that it's very simple, straightforward and presumably will remove 99% of contention between fsync and io_submit (assuming we spend most of time waiting for userspace ACK for FUSE_FSYNC request. There are actually three questions to answer: 1) Do we really must honor a crazy app who mixes a lot of fsyncs with a lot of io_submits? The goal of fsync is to ensure that some state is actually went to platters. An app who races io_submit-s with fsync-s actually doesn't care which state will come to platters. I'm not sure that it's reasonable to work very hard to achieve the best possible performance for such a marginal app. Obiously any filesystem behave like this. Task A(mail-server) may perform write/fsync, task B(mysql) do a lot of io_submit-s All that io may happens in parallel, fs guarantee only that metadata will be serialized. So all that concurent IO flowa to blockdevice which does no have i_mutex so all IO indeed happen concurrently. Looks as you're comparing an app doing POSIX open/read/write/fsync/close with fs doing submit_bio. This is a stretch. But OK, there is a similarity. But I don't think this rather vague similarity proves something. But when we dealt with fs-in-file (loop/ploop/qemu-nbd) we face i_mutex on file. This really makes sense. If an app inside a VM loops over ordinary direct writes, while another app (in the same VM) does fsync, it's not fair to suspend the first app for long while just because fuse holds i_mutex for long somewhere deep in fuse_fsync. For general filesystem (xfs/ext4) What makes xfs/ext4 general? They are *local* fs having almost direct access to underlying HDD. Have you explored how CEPH implements fsync? we grab i_mutex only on write path, fsync is lockless. What about the following part of ext4_sync_file: if (!journal) { ret = generic_file_fsync(file, start, end, datasync); ? Is it lockless? But int case of fuse we artificially introduce i_mutex inside fsync exactly the same as generic_file_fsync() which basically kill concurrency for upper FS. As result we have SMP scalability as we have in Linux-v2.2 with single mutex in VFS. You can't derive "kill concurrency" from "introduce i_mutex" as simple as that. It really does matter for how long the mutex is held and how often it contends with writes. BTW: I'm wondering why do we care about mtime at all. for fs-in-file we can relax that, for example flush mtime only on fsync, and not for fdatasync. Good catch, makes sense! 2) Will the patch (in the form I sent it) break something? I think no. If you know some usecase that can be broken, let's discuss it in more details. 3) Should we expect some noticeable (or significant) improvement in performance comparing fuse_fsync with no locking at all vs. the locking we have with that patch applied? I tend to think that the answer is "no" because handling FUSE_FSYNC is notoriously heavy-weight operation. If you disagree, let's firstly measure that difference in performance (simply commenting out lock/unlock(i_mutex) in fuse_fsync) and then start to think if it's really worthy to fully re-work locking scheme to preserve flush_mtime correctness w/o i_mutex. Thanks, Maxim On 11/30/2016 05:09 AM, Alexey Kuznetsov wrote: Sorry, missed that pair fuse_set_nowrite/fuse_release_writes can be done only under i_mutex. IMHO it is only due to bad implementation. If fuse_set_nowrite would be done with separate count instead of adding negative bias, it would be possible. On Wed, Nov 30, 2016 at 3:47 PM, Alexey Kuznetsov wrote: Hello! I do not think you got it right. i_mutex in fsync is not about some atomicity, it is about stopping data feed while fsync is executed to prevent livelock. I cannot tell anything about mtime update, it is just some voodoo magic for me. What's about fsync semantics, I see two different ways: A. 1. Remove useless write_inode_now. Its work is done by
Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync
Maxim Patlasovwrites: > Alexey, > > > You're right. And while composing the patch I well understood that it's > possible to rework fuse_sync_writes() using a counter instead of > negative bias. But the problem with flush_mtime still exists anyway. > Think about it: we firstly acquire local mtime from local inode, then > fill and submit mtime-update-request. Since then, we don't know when > exactly fuse daemon will apply that new mtime to its metadata > structures. If another mtime-update is generated in-between (e.g. "touch > -d file", or even simplier -- just a single direct write > implicitly updating mtime), we wouldn't know which of those two > mtime-update-requests are processed by fused first. That comes from a > general FUSE protocol limitation: when kernel fuse queues request A, > then request B, it cannot be sure if they will be processed by userspace > as or . > > > The big advantage of the patch I sent is that it's very simple, > straightforward and presumably will remove 99% of contention between > fsync and io_submit (assuming we spend most of time waiting for > userspace ACK for FUSE_FSYNC request. There are actually three questions > to answer: > > > 1) Do we really must honor a crazy app who mixes a lot of fsyncs with a > lot of io_submits? The goal of fsync is to ensure that some state is > actually went to platters. An app who races io_submit-s with fsync-s > actually doesn't care which state will come to platters. I'm not sure > that it's reasonable to work very hard to achieve the best possible > performance for such a marginal app. Obiously any filesystem behave like this. Task A(mail-server) may perform write/fsync, task B(mysql) do a lot of io_submit-s All that io may happens in parallel, fs guarantee only that metadata will be serialized. So all that concurent IO flowa to blockdevice which does no have i_mutex so all IO indeed happen concurrently. But when we dealt with fs-in-file (loop/ploop/qemu-nbd) we face i_mutex on file. For general filesystem (xfs/ext4) we grab i_mutex only on write path, fsync is lockless. But int case of fuse we artificially introduce i_mutex inside fsync which basically kill concurrency for upper FS. As result we have SMP scalability as we have in Linux-v2.2 with single mutex in VFS. BTW: I'm wondering why do we care about mtime at all. for fs-in-file we can relax that, for example flush mtime only on fsync, and not for fdatasync. > > > 2) Will the patch (in the form I sent it) break something? I think no. > If you know some usecase that can be broken, let's discuss it in more > details. > > > 3) Should we expect some noticeable (or significant) improvement in > performance comparing fuse_fsync with no locking at all vs. the locking > we have with that patch applied? I tend to think that the answer is "no" > because handling FUSE_FSYNC is notoriously heavy-weight operation. If > you disagree, let's firstly measure that difference in performance > (simply commenting out lock/unlock(i_mutex) in fuse_fsync) and then > start to think if it's really worthy to fully re-work locking scheme to > preserve flush_mtime correctness w/o i_mutex. > > > Thanks, > > Maxim > > > On 11/30/2016 05:09 AM, Alexey Kuznetsov wrote: >> Sorry, missed that pair fuse_set_nowrite/fuse_release_writes >> can be done only under i_mutex. >> >> IMHO it is only due to bad implementation. >> If fuse_set_nowrite would be done with separate >> count instead of adding negative bias, it would >> be possible. >> >> >> On Wed, Nov 30, 2016 at 3:47 PM, Alexey Kuznetsov >> wrote: >>> Hello! >>> >>> I do not think you got it right. >>> >>> i_mutex in fsync is not about some atomicity, >>> it is about stopping data feed while fsync is executed >>> to prevent livelock. >>> >>> I cannot tell anything about mtime update, it is just some voodoo >>> magic for me. >>> >>> What's about fsync semantics, I see two different ways: >>> >>> A. >>> >>> 1. Remove useless write_inode_now. Its work is done >>> by filemap_write_and_wait_range(), there is no need to repeat it >>> under mutex. >>> 2. move mutex_lock _after_ fuse_sync_writes(), which is essentially >>> fuse continuation forfilemap_write_and_wait_range(). >>> 3. i_mutex is preserved only around fsync call. >>> >>> B. >>> 1. Remove write_inode_now as well. >>> 2. Remove i_mutex _completely_. (No idea about mtime voodo though) >>> 2. Replace fuse_sync_writes() with fuse_set_nowrite() >>> and add release after call to FSYNC. >>> >>> Both prevent livelock. B is obviosly optimal. >>> >>> But A preserves historic fuse protocol semantics. >>> F.e. I have no idea would user space survive truncate >>> racing with fsync. pstorage should survice, though this >>> path was never tested. >>> >>> >>> >>> >>> >>> On Wed, Nov 30, 2016 at 4:02 AM, Maxim Patlasov >>> wrote: fuse_fsync_common() does need i_mutex for