Re: [Devel] [PATCH rh7] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests
Dima, On 07/06/2016 04:58 AM, Dmitry Monakhov wrote: Maxim Patlasov writes: Commit 9f860e606 introduced an engine to delay fsync: doing fallocate(FALLOC_FL_CONVERT_UNWRITTEN) dio_post_submit marks io as PLOOP_IO_FSYNC_DELAYED to ensure that fsync happens later, when incoming FLUSH|FUA comes. That was deemed as important because (PSBM-47026): This optimization becomes more important due to the fact that customers tend to use pcompact heavily => ploop images grow each day. Now, we can easily re-use the engine to delay fsync for reloc requests as well. As explained in the description of commit 5aa3fe09: 1->read_data_from_old_post 2->write_to_new_pos ->sumbit_alloc ->submit_pad ->post_submit->convert_unwritten 3->update_index ->write_page with FLUSH|FUA 4->nullify_old_pos 5->issue_flush by the time of step 3 extent coversion is not yet stable because belongs to uncommitted transaction. But instead of doing fsync inside ->post_submit, we can fsync later, as the very first step of write_page for index_update. NAK from me. What is advantage of this patch? The advantage is the following: in case of BAT multi-updates, instead of doing many fsync-s (one per dio_post_submit), we'll do only one (when final ->write_page is called). Does it makes code more optimal? No Yes, it does. In the same sense as 9f860e606: saving some fsync-s. Does it makes main ploop more asynchronous? No. Correct, the patch optimizes ploop in the other way. It's not about making ploop more asynchronous. If you want to make optimization then it is reasonable to queue preq with PLOOP_IO_FSYNC_DELAYED to top_io->fsync_queue before processing PLOOP_E_DATA_WBI state for preq with FUA So sequence will looks like follows: ->sumbit_alloc ->submit_pad ->post_submit->convert_unwritten-> tag PLOOP_IO_FSYNC_DELAYED ->ploop_req_state_process case PLOOP_E_DATA_WBI: if (preq->start & PLOOP_IO_FSYNC_DELAYED_FL) { preq->start &= ~PLOOP_IO_FSYNC_DELAYED_FL list_add_tail(&preq->list, &top_io->fsync_queue) return; } ##Let fsync_thread do it's work ->ploop_req_state_process case LOOP_E_DATA_WBI: update_index->write_page with FUA (FLUSH is not required because we already done fsync) That's another type of optimization: making ploop more asynchronous. I thought about it, but didn't come to conclusion whether it's worthy w.r.t. adding more complexity to ploop-state-machine and possible bugs introduced with that. Thanks, Maxim https://jira.sw.ru/browse/PSBM-47026 Signed-off-by: Maxim Patlasov --- drivers/block/ploop/dev.c |4 ++-- drivers/block/ploop/io_direct.c | 25 - drivers/block/ploop/io_kaio.c |3 ++- drivers/block/ploop/map.c | 17 - include/linux/ploop/ploop.h |3 ++- 5 files changed, 42 insertions(+), 10 deletions(-) diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c index e5f010b..40768b6 100644 --- a/drivers/block/ploop/dev.c +++ b/drivers/block/ploop/dev.c @@ -4097,7 +4097,7 @@ static void ploop_relocate(struct ploop_device * plo) preq->bl.tail = preq->bl.head = NULL; preq->req_cluster = 0; preq->req_size = 0; - preq->req_rw = WRITE_SYNC|REQ_FUA; + preq->req_rw = WRITE_SYNC; preq->eng_state = PLOOP_E_ENTRY; preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A); preq->error = 0; @@ -4401,7 +4401,7 @@ static void ploop_relocblks_process(struct ploop_device *plo) preq->bl.tail = preq->bl.head = NULL; preq->req_cluster = ~0U; /* uninitialized */ preq->req_size = 0; - preq->req_rw = WRITE_SYNC|REQ_FUA; + preq->req_rw = WRITE_SYNC; preq->eng_state = PLOOP_E_ENTRY; preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S); preq->error = 0; diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c index 1086850..0a5fb15 100644 --- a/drivers/block/ploop/io_direct.c +++ b/drivers/block/ploop/io_direct.c @@ -1494,13 +1494,36 @@ dio_read_page(struct ploop_io * io, struct ploop_request * preq, static void dio_write_page(struct ploop_io * io, struct ploop_request * preq, - struct page * page, sector_t sec, unsigned long rw) + struct page * page, sector_t sec, unsigned long rw, + int do_fsync_if_delayed) { if (!(io->files.file->f_mode & FMODE_WRITE)) { PLOOP_FAIL_REQUEST(preq, -EBADF); return; } + if (do_fsync_if_delayed && + test_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state)) { + struct ploop_device * plo = io->plo; + u64 io_count; + int err; + + spin_lock_irq(&plo->lock); + io_count = io->io_count; + spin_unlock_irq(&plo->lock)
Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default
On 07/06/2016 02:26 AM, Vladimir Davydov wrote: On Tue, Jul 05, 2016 at 04:45:10PM -0700, Maxim Patlasov wrote: Vova, On 07/04/2016 11:03 AM, Maxim Patlasov wrote: On 07/04/2016 08:53 AM, Vladimir Davydov wrote: On Tue, Jun 28, 2016 at 03:48:54PM -0700, Maxim Patlasov wrote: ... @@ -643,6 +643,7 @@ static struct cgroup_subsys_state *ve_create(struct cgroup *cg) ve->odirect_enable = 2; ve->fsync_enable = 2; +ve->experimental_fs_enable = 2; For odirect_enable and fsync_enable, 2 means follow the host's config, 1 means enable unconditionally, and 0 means disable unconditionally. But we don't want to allow a user inside a CT to enable this feature, right? I thought it's OK to allow user inside CT to enable it if host sysadmin is OK about it. The same logic as for odirect: by default ve0->experimental_fs_enable = 0, so whatever user inside CT writes to this knob, the feature is disabled. If sysadmin writes '1' to ve0->..., the feature becomes enabled. If an user wants to voluntarily disable it inside CT, that's OK too. This is confusing. May be, we'd better add a new VE_FEATURE for the purpose? Not sure right now. I'll look at it and let you know later. Technically, it is very easy to implement new VE_FEATURE for overlayfs. But this approach is less flexible because we return EPERM from ve_write_64 if CT is running, and we'll need to involve userspace team to make the feature configurable and (possibly) persistent. Do you think it's worthy for something we'll get rid of soon anyway (I mean as soon as PSBM-47981 resolved)? Fair enough, not much point in introducing yet another feature for the purpose, at least right now, sysctl should do for the beginning. Come to think of it, do we really need this sysctl inside containers? I mean, by enabling this sysctl on the host we open a possible system-wide security hole, which a CT admin won't be able to mitigate by disabling overlayfs inside her CT. So why would she need it for? To prevent non-privileged CT users from mounting overlayfs inside a user ns? But overlayfs is not permitted to be mounted by a userns root anyway AFAICS. May be, just drop in-CT sysctl then? Currently, anyone who can login into CT as root may mount overlayfs, then try to exploit its weak sides. This is a problem. Until we ensure that overlayfs is production-ready (at least does not have obvious breaches), let's disable it by default (of course, if ve != ve0). Those who want to play with overlayfs at their own risk will enable it by turning on some knob on host system (ve == ve0). I don't think that mixing trusted (overlayfs-enabled) CTs and not trusted (overlayfs-disabled) CTs on the same physical node is important use-case for now. So, any simple system-wide knob must work. Essentially, the same scheme with odirect: by default it is '0' in ve0 and the root inside CT cannot turn it on; and if it is manually set to '1' in ve0, the behavior will depend on per-CT root willing. Thanks, Maxim ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests
Maxim Patlasov writes: > Commit 9f860e606 introduced an engine to delay fsync: doing > fallocate(FALLOC_FL_CONVERT_UNWRITTEN) dio_post_submit marks > io as PLOOP_IO_FSYNC_DELAYED to ensure that fsync happens > later, when incoming FLUSH|FUA comes. > > That was deemed as important because (PSBM-47026): > >> This optimization becomes more important due to the fact that customers tend >> to use pcompact heavily => ploop images grow each day. > > Now, we can easily re-use the engine to delay fsync for reloc > requests as well. As explained in the description of commit > 5aa3fe09: > >> 1->read_data_from_old_post >> 2->write_to_new_pos >> ->sumbit_alloc >> ->submit_pad >> ->post_submit->convert_unwritten >> 3->update_index ->write_page with FLUSH|FUA >> 4->nullify_old_pos >>5->issue_flush > > by the time of step 3 extent coversion is not yet stable because > belongs to uncommitted transaction. But instead of doing fsync > inside ->post_submit, we can fsync later, as the very first step > of write_page for index_update. NAK from me. What is advantage of this patch? Does it makes code more optimal? No Does it makes main ploop more asynchronous? No. If you want to make optimization then it is reasonable to queue preq with PLOOP_IO_FSYNC_DELAYED to top_io->fsync_queue before processing PLOOP_E_DATA_WBI state for preq with FUA So sequence will looks like follows: ->sumbit_alloc ->submit_pad ->post_submit->convert_unwritten-> tag PLOOP_IO_FSYNC_DELAYED ->ploop_req_state_process case PLOOP_E_DATA_WBI: if (preq->start & PLOOP_IO_FSYNC_DELAYED_FL) { preq->start &= ~PLOOP_IO_FSYNC_DELAYED_FL list_add_tail(&preq->list, &top_io->fsync_queue) return; } ##Let fsync_thread do it's work ->ploop_req_state_process case LOOP_E_DATA_WBI: update_index->write_page with FUA (FLUSH is not required because we already done fsync) > > https://jira.sw.ru/browse/PSBM-47026 > > Signed-off-by: Maxim Patlasov > --- > drivers/block/ploop/dev.c |4 ++-- > drivers/block/ploop/io_direct.c | 25 - > drivers/block/ploop/io_kaio.c |3 ++- > drivers/block/ploop/map.c | 17 - > include/linux/ploop/ploop.h |3 ++- > 5 files changed, 42 insertions(+), 10 deletions(-) > > diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c > index e5f010b..40768b6 100644 > --- a/drivers/block/ploop/dev.c > +++ b/drivers/block/ploop/dev.c > @@ -4097,7 +4097,7 @@ static void ploop_relocate(struct ploop_device * plo) > preq->bl.tail = preq->bl.head = NULL; > preq->req_cluster = 0; > preq->req_size = 0; > - preq->req_rw = WRITE_SYNC|REQ_FUA; > + preq->req_rw = WRITE_SYNC; > preq->eng_state = PLOOP_E_ENTRY; > preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A); > preq->error = 0; > @@ -4401,7 +4401,7 @@ static void ploop_relocblks_process(struct ploop_device > *plo) > preq->bl.tail = preq->bl.head = NULL; > preq->req_cluster = ~0U; /* uninitialized */ > preq->req_size = 0; > - preq->req_rw = WRITE_SYNC|REQ_FUA; > + preq->req_rw = WRITE_SYNC; > preq->eng_state = PLOOP_E_ENTRY; > preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S); > preq->error = 0; > diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c > index 1086850..0a5fb15 100644 > --- a/drivers/block/ploop/io_direct.c > +++ b/drivers/block/ploop/io_direct.c > @@ -1494,13 +1494,36 @@ dio_read_page(struct ploop_io * io, struct > ploop_request * preq, > > static void > dio_write_page(struct ploop_io * io, struct ploop_request * preq, > -struct page * page, sector_t sec, unsigned long rw) > +struct page * page, sector_t sec, unsigned long rw, > +int do_fsync_if_delayed) > { > if (!(io->files.file->f_mode & FMODE_WRITE)) { > PLOOP_FAIL_REQUEST(preq, -EBADF); > return; > } > > + if (do_fsync_if_delayed && > + test_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state)) { > + struct ploop_device * plo = io->plo; > + u64 io_count; > + int err; > + > + spin_lock_irq(&plo->lock); > + io_count = io->io_count; > + spin_unlock_irq(&plo->lock); > + > + err = io->ops->sync(io); > + if (err) { > + PLOOP_FAIL_REQUEST(preq, -EBADF); > + return; > + } > + > + spin_lock_irq(&plo->lock); > + if (io_count == io->io_count && !(io_count & 1)) > + clear_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state); > + spin_unlock_irq(&plo->lock); > + } > + > dio_io_page(io, rw | WRITE | REQ_SYNC, preq, page, sec); > } > > diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c > in
[Devel] [PATCH] propogate_mnt: Handle the first propogated copy being a slave
From: "Eric W. Biederman" When the first propgated copy was a slave the following oops would result: > BUG: unable to handle kernel NULL pointer dereference at 0010 > IP: [] propagate_one+0xbe/0x1c0 > PGD bacd4067 PUD bac66067 PMD 0 > Oops: [#1] SMP > Modules linked in: > CPU: 1 PID: 824 Comm: mount Not tainted 4.6.0-rc5userns+ #1523 > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 > task: 8800bb0a8000 ti: 8800bac3c000 task.ti: 8800bac3c000 > RIP: 0010:[] [] propagate_one+0xbe/0x1c0 > RSP: 0018:8800bac3fd38 EFLAGS: 00010283 > RAX: RBX: 8800bb77ec00 RCX: 0010 > RDX: RSI: 8800bb58c000 RDI: 8800bb58c480 > RBP: 8800bac3fd48 R08: 0001 R09: > R10: 1ca1 R11: 1c9d R12: > R13: 8800ba713800 R14: 8800bac3fda0 R15: 8800bb77ec00 > FS: 7f3c0cd9b7e0() GS:8800bfb0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 0010 CR3: bb79d000 CR4: 06e0 > Stack: > 8800bb77ec00 8800bac3fd88 811fbf85 > 8800bac3fd98 8800bb77f080 8800ba713800 8800bb262b40 > 8800bac3fdd8 811f1da0 > Call Trace: > [] propagate_mnt+0x105/0x140 > [] attach_recursive_mnt+0x120/0x1e0 > [] graft_tree+0x63/0x70 > [] do_add_mount+0x9b/0x100 > [] do_mount+0x2aa/0xdf0 > [] ? strndup_user+0x4e/0x70 > [] SyS_mount+0x75/0xc0 > [] do_syscall_64+0x4b/0xa0 > [] entry_SYSCALL64_slow_path+0x25/0x25 > Code: 00 00 75 ec 48 89 0d 02 22 22 01 8b 89 10 01 00 00 48 89 05 fd 21 22 01 > 39 8e 10 01 00 00 0f 84 e0 00 00 00 48 8b 80 d8 00 00 00 <48> 8b 50 10 48 89 > 05 df 21 22 01 48 89 15 d0 21 22 01 8b 53 30 > RIP [] propagate_one+0xbe/0x1c0 > RSP > CR2: 0010 > ---[ end trace 2725ecd95164f217 ]--- This oops happens with the namespace_sem held and can be triggered by non-root users. An all around not pleasant experience. To avoid this scenario when finding the appropriate source mount to copy stop the walk up the mnt_master chain when the first source mount is encountered. Further rewrite the walk up the last_source mnt_master chain so that it is clear what is going on. The reason why the first source mount is special is that it it's mnt_parent is not a mount in the dest_mnt propagation tree, and as such termination conditions based up on the dest_mnt mount propgation tree do not make sense. To avoid other kinds of confusion last_dest is not changed when computing last_source. last_dest is only used once in propagate_one and that is above the point of the code being modified, so changing the global variable is meaningless and confusing. Cc: sta...@vger.kernel.org fixes: f2ebb3a921c1ca1e2ddd9242e95a1989a50c4c68 ("smarter propagate_mnt()") Reported-by: Tycho Andersen Reviewed-by: Seth Forshee Tested-by: Seth Forshee Signed-off-by: "Eric W. Biederman" (cherry picked from commit 5ec0811d30378ae104f250bfc9b3640242d81e3f) Signed-off-by: Vladimir Davydov Fixes: CVE-2016-4581 Conflicts: fs/pnode.c --- fs/pnode.c | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/fs/pnode.c b/fs/pnode.c index 74f10c0e7e00..cc9ac074ba00 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -198,7 +198,7 @@ static struct mount *next_group(struct mount *m, struct mount *origin) /* all accesses are serialized by namespace_sem */ static struct user_namespace *user_ns; -static struct mount *last_dest, *last_source, *dest_master; +static struct mount *last_dest, *first_source, *last_source, *dest_master; static struct mountpoint *mp; static struct list_head *list; @@ -216,22 +216,23 @@ static int propagate_one(struct mount *m) type = CL_MAKE_SHARED; } else { struct mount *n, *p; + bool done; for (n = m; ; n = p) { p = n->mnt_master; - if (p == dest_master || IS_MNT_MARKED(p)) { - while (last_dest->mnt_master != p) { - last_source = last_source->mnt_master; - last_dest = last_source->mnt_parent; - } - if (n->mnt_group_id != last_dest->mnt_group_id || - (!n->mnt_group_id && -!last_dest->mnt_group_id)) { - last_source = last_source->mnt_master; - last_dest = last_source->mnt_parent; - } + if (p == dest_master || IS_MNT_MARKED(p)) break; - } } + do { + struct mount *parent = last_source-
Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default
On Tue, Jul 05, 2016 at 04:45:10PM -0700, Maxim Patlasov wrote: > Vova, > > > On 07/04/2016 11:03 AM, Maxim Patlasov wrote: > >On 07/04/2016 08:53 AM, Vladimir Davydov wrote: > > > >>On Tue, Jun 28, 2016 at 03:48:54PM -0700, Maxim Patlasov wrote: > >>... > >>>@@ -643,6 +643,7 @@ static struct cgroup_subsys_state > >>>*ve_create(struct cgroup *cg) > >>>ve->odirect_enable = 2; > >>> ve->fsync_enable = 2; > >>>+ve->experimental_fs_enable = 2; > >>For odirect_enable and fsync_enable, 2 means follow the host's config, 1 > >>means enable unconditionally, and 0 means disable unconditionally. But > >>we don't want to allow a user inside a CT to enable this feature, right? > > > >I thought it's OK to allow user inside CT to enable it if host sysadmin is > >OK about it. The same logic as for odirect: by default > >ve0->experimental_fs_enable = 0, so whatever user inside CT writes to this > >knob, the feature is disabled. If sysadmin writes '1' to ve0->..., the > >feature becomes enabled. If an user wants to voluntarily disable it inside > >CT, that's OK too. > > > >>This is confusing. May be, we'd better add a new VE_FEATURE for the > >>purpose? > > > >Not sure right now. I'll look at it and let you know later. > > Technically, it is very easy to implement new VE_FEATURE for overlayfs. But > this approach is less flexible because we return EPERM from ve_write_64 if > CT is running, and we'll need to involve userspace team to make the feature > configurable and (possibly) persistent. Do you think it's worthy for > something we'll get rid of soon anyway (I mean as soon as PSBM-47981 > resolved)? Fair enough, not much point in introducing yet another feature for the purpose, at least right now, sysctl should do for the beginning. Come to think of it, do we really need this sysctl inside containers? I mean, by enabling this sysctl on the host we open a possible system-wide security hole, which a CT admin won't be able to mitigate by disabling overlayfs inside her CT. So why would she need it for? To prevent non-privileged CT users from mounting overlayfs inside a user ns? But overlayfs is not permitted to be mounted by a userns root anyway AFAICS. May be, just drop in-CT sysctl then? ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v2] netfilter: nf_log: fix error on write NONE to logger choice sysctl
On Fri, Jul 01, 2016 at 04:53:54PM +0300, Pavel Tikhomirov wrote: > It is hard to unbind nf-logger: > > echo NONE > /proc/sys/net/netfilter/nf_log/0 > bash: echo: write error: No such file or directory > > sysctl -w net.netfilter.nf_log.0=NONE > sysctl: setting key "net.netfilter.nf_log.0": No such file or directory > net.netfilter.nf_log.0 = NONE > > You need explicitly send '\0', for instance like: > > echo -e "NONE\0" > /proc/sys/net/netfilter/nf_log/0 > > That seem to be strange, so fix it using proc_dostring. > > Now it works fine: >modprobe nfnetlink_log >echo nfnetlink_log > /proc/sys/net/netfilter/nf_log/0 >cat /proc/sys/net/netfilter/nf_log/0 >nfnetlink_log >echo NONE > /proc/sys/net/netfilter/nf_log/0 >cat /proc/sys/net/netfilter/nf_log/0 >NONE Applied, thanks. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel