Re: [Devel] [PATCH rhel7] procfs: always expose /proc/<pid>/map_files/ and make it readable
On Fri, May 20, 2016 at 05:19:27PM +0300, Konstantin Khorenko wrote:
> > This patch is a step for CRIU to work under non-root user. It allows
> > us to restore memory mappings without using memfd in userns.
>
> The idea is good, but there is a long, long way ahead, and it may become
> possible only after the next "Virtuozzo 8" kernel appears.
>
> => let's not include preparation patches until the feature of running CRIU
> under non-root works on the mainstream kernel at least.
>
> Surely this is valid only until such patches become required for some
> other reason as well.
>
> --
> Best regards,
>
> Konstantin Khorenko,
> Virtuozzo Linux Kernel Team
>
> On 05/16/2016 10:42 PM, Andrey Vagin wrote:
> > Acked-by: Andrey Vagin
> >
> > On Mon, May 16, 2016 at 11:28:51AM +0300, Cyrill Gorcunov wrote:
> > > This is a backport of commit
> > >
> > > ML: bdb4d100afe9818aebd1d98ced575c5ef143456c
> > >
> > > From: Calvin Owens
> > >
> > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN,
> > > and is only exposed if CONFIG_CHECKPOINT_RESTORE is set.
> > >
> > > Each mapped file region gets a symlink in /proc/<pid>/map_files/
> > > corresponding to the virtual address range at which it is mapped.
> > > The symlinks work like the symlinks in /proc/<pid>/fd/, so you can
> > > follow them to the backing file even if that backing file has been
> > > unlinked.
> > >
> > > Currently, files which are mapped, unlinked, and closed are
> > > impossible to stat() from userspace. Exposing /proc/<pid>/map_files/
> > > closes this functionality "hole".
> > >
> > > Not being able to stat() such files makes noticing and explicitly
> > > accounting for the space they use on the filesystem impossible. You
> > > can work around this by summing up the space used by every file in
> > > the filesystem and subtracting that total from what statfs() tells
> > > you, but that obviously isn't great, and it becomes unworkable once
> > > your filesystem becomes large enough.
> > >
> > > This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE,
> > > and adjusts the permissions enforced on it as follows:
> > >
> > > * proc_map_files_lookup()
> > > * proc_map_files_readdir()
> > > * map_files_d_revalidate()
> > >
> > >   Remove the CAP_SYS_ADMIN restriction, leaving only the current
> > >   restriction requiring PTRACE_MODE_READ. The information made
> > >   available to userspace by these three functions is already
> > >   available in /proc/PID/maps with MODE_READ, so I don't see any
> > >   reason to limit them any further (see below for more detail).
> > >
> > > * proc_map_files_follow_link()
> > >
> > >   This stub has been added, and requires that the user have
> > >   CAP_SYS_ADMIN in order to follow the links in map_files/, since
> > >   there was concern on LKML both about the potential for bypassing
> > >   permissions on ancestor directories in the path to files pointed
> > >   to, and about what happens with more exotic memory mappings
> > >   created by some drivers (i.e. dma-buf).
> > >
> > > In older versions of this patch, I changed every permission check in
> > > the four functions above to enforce MODE_ATTACH instead of MODE_READ.
> > > This was an oversight on my part, and after revisiting the discussion
> > > it seems that nobody was concerned about anything outside of what is
> > > made possible by ->follow_link(). So in this version, I've left the
> > > checks for PTRACE_MODE_READ as-is.
> > >
> > > [a...@linux-foundation.org: catch up with concurrent
> > >  proc_pid_follow_link() changes]
> > > Signed-off-by: Calvin Owens
> > > Reviewed-by: Kees Cook
> > > Cc: Andy Lutomirski
> > > Cc: Cyrill Gorcunov
> > > Cc: Joe Perches
> > > Cc: Kirill A. Shutemov
> > > Signed-off-by: Andrew Morton
> > > Signed-off-by: Linus Torvalds
> > > Signed-off-by: Cyrill Gorcunov
> > > ---
> > >
> > > Kostya, please wait for Ack from Andrew. The patch on its own is not
> > > bound to some of the bugs we're working on now, but it is useful in
> > > general and will probably help us with renaming of memfd-restored
> > > memory in criu (we use memfd to be able to restore anonymous shared
> > > memory in the userns case, but memfd mangles the backend name; we
> > > haven't found any problem with it yet, but having talked to Andrew,
> > > he agreed that we might need to do something about this, and this
> > > patch is the first step).
> > >
> > >  fs/proc/base.c | 44 +++-
> > >  1 file changed, 23 insertions(+), 21 deletions(-)
> > >
> > > Index: linux-pcs7.git/fs/proc/base.c
> > > ===================================================================
> > > --- linux-pcs7.git.orig/fs/proc/base.c
> > > +++ linux-pcs7.git/fs/proc/base.c
> > > @@ -1925,8
[Devel] [PATCH rh7] kmod: Allow netfilter conntrack inside VE
Netfilter conntrack modules are used during checkpoint (which is done on the
node), so the modules get autoloaded; but in case of migration the restore
starts inside veX, so we need to allow conntrack to be requested from VE
context. Thus add them into the whitelist. They were initially missed in
ebc70d73717f592c89ad992f77587d9e118bbee6.

https://jira.sw.ru/browse/PSBM-47359

CC: Vladimir Davydov
CC: Konstantin Khorenko
CC: Andrey Vagin
CC: Pavel Emelyanov
Signed-off-by: Cyrill Gorcunov
---
 kernel/kmod.c | 2 ++
 1 file changed, 2 insertions(+)

Index: linux-pcs7.git/kernel/kmod.c
===================================================================
--- linux-pcs7.git.orig/kernel/kmod.c
+++ linux-pcs7.git/kernel/kmod.c
@@ -392,6 +392,8 @@ static const char * const ve0_allowed_mo
 	/* nfnetlink */
 	"net-pf-16-proto-12",		/* PF_NETLINK, NETLINK_NETFILTER */
+	"nfnetlink-subsys-1",		/* NFNL_SUBSYS_CTNETLINK */
+	"nfnetlink-subsys-2",		/* NFNL_SUBSYS_CTNETLINK_EXP */
 
 	/* unix_diag */
 	"net-pf-16-proto-4-type-1",	/* PF_NETLINK, NETLINK_SOCK_DIAG, AF_LOCAL */

_______________________________________________
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ploop: avoid costly user-controllable kmalloc() calls
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.5
--> commit 1f61933478251e603117125316771469f9284788
Author: Andrey Ryabinin
Date: Fri May 20 18:27:26 2016 +0400

    ploop: avoid costly user-controllable kmalloc() calls

    Certain ploop ioctls copy large arrays from user space into kernel
    memory allocated via a kmalloc() call. A user-controllable kmalloc()
    size is bad, as it could taint the kernel (if size > MAX_ORDER).
    Allocation of large physically contiguous memory is also slow and
    unreliable in low-memory situations.

    So instead of preallocating memory and copying a big chunk of user
    memory into the kernel, we copy only the one element we need at the
    moment. This approach is much more reliable, may save us several MBs
    of kernel memory, and might even be faster.

    https://jira.sw.ru/browse/PSBM-47303

    Signed-off-by: Andrey Ryabinin
    Acked-by: Maxim Patlasov
---
 drivers/block/ploop/dev.c | 73 +++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 41 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 15e34e5..da35c5c 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4167,7 +4167,7 @@ static int ploop_freeblks_ioc(struct ploop_device *plo, unsigned long arg)
 {
 	struct ploop_delta *delta;
 	struct ploop_freeblks_ctl ctl;
-	struct ploop_freeblks_ctl_extent *extents;
+	struct ploop_freeblks_ctl_extent __user *extents;
 	struct ploop_freeblks_desc *fbd;
 	int i;
 	int rc = 0;
@@ -4184,34 +4184,35 @@ static int ploop_freeblks_ioc(struct ploop_device *plo, unsigned long arg)
 	if (copy_from_user(&ctl, (void*)arg, sizeof(ctl)))
 		return -EFAULT;
 
-	extents = kzalloc(sizeof(*extents) * ctl.n_extents, GFP_KERNEL);
-	if (!extents)
-		return -ENOMEM;
-
 	delta = ploop_top_delta(plo);
 	if (delta->level != ctl.level) {
 		rc = -EINVAL;
-		goto free_extents;
-	}
-
-	if (copy_from_user(extents, (u8*)arg + sizeof(ctl),
-			   sizeof(*extents) * ctl.n_extents)) {
-		rc = -EINVAL;
-		goto free_extents;
+		goto exit;
 	}
 
 	fbd = ploop_fb_init(plo);
 	if (!fbd) {
 		rc = -ENOMEM;
-		goto free_extents;
+		goto exit;
 	}
 
+	extents = (void __user *)(arg + sizeof(ctl));
+
 	for (i = 0; i < ctl.n_extents; i++) {
-		rc = ploop_fb_add_free_extent(fbd, extents[i].clu,
-					      extents[i].iblk, extents[i].len);
+		struct ploop_freeblks_ctl_extent extent;
+
+		if (copy_from_user(&extent, &extents[i],
+				   sizeof(extent))) {
+			rc = -EFAULT;
+			ploop_fb_fini(fbd, rc);
+			goto exit;
+		}
+
+		rc = ploop_fb_add_free_extent(fbd, extent.clu,
+					      extent.iblk, extent.len);
 		if (rc) {
 			ploop_fb_fini(fbd, rc);
-			goto free_extents;
+			goto exit;
 		}
 	}
 
@@ -4232,8 +4233,7 @@ static int ploop_freeblks_ioc(struct ploop_device *plo, unsigned long arg)
 	}
 
 	ploop_relax(plo);
-free_extents:
-	kfree(extents);
+exit:
 	return rc;
 }
 
@@ -4324,13 +4324,8 @@ static void ploop_relocblks_process(struct ploop_device *plo)
 		spin_unlock_irq(&plo->lock);
 }
 
-static int release_fbd(struct ploop_device *plo,
-		       struct ploop_relocblks_ctl_extent *e,
-		       int err)
+static int release_fbd(struct ploop_device *plo, int err)
 {
-	if (e)
-		kfree(e);
-
 	clear_bit(PLOOP_S_DISCARD, &plo->state);
 
 	ploop_quiesce(plo);
@@ -4378,7 +4373,6 @@ static int ploop_relocblks_ioc(struct ploop_device *plo, unsigned long arg)
 {
 	struct ploop_delta *delta = ploop_top_delta(plo);
 	struct ploop_relocblks_ctl ctl;
-	struct ploop_relocblks_ctl_extent *extents = NULL;
 	struct ploop_freeblks_desc *fbd = plo->fbd;
 	int i;
 	int err = 0;
@@ -4406,26 +4400,23 @@ static int ploop_relocblks_ioc(struct ploop_device *plo, unsigned long arg)
 		goto already;
 
 	if (ctl.n_extents) {
-		extents = kzalloc(sizeof(*extents) * ctl.n_extents,
-				  GFP_KERNEL);
-		if (!extents)
-			return release_fbd(plo, extents, -ENOMEM);
+		struct ploop_relocblks_ctl_extent __user *extents;
 
-		if (copy_from_user(extents, (u8*)arg +
[Devel] [PATCH RHEL7 COMMIT] fs/aio: show real number of aio events in fs.aio-nr sysctl
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.5
--> commit 280363c520967aab9be356c0e6b37ca02109ebea
Author: Andrey Ryabinin
Date: Fri May 20 19:17:05 2016 +0400

    fs/aio: show real number of aio events in fs.aio-nr sysctl

    fs.aio-nr accounts the number of aio events requested by user via the
    io_setup() syscall. The kernel usually creates more events than were
    requested. CRIU doesn't care about the number of requested events, it
    cares only about created events. So while restoring the process, CRIU
    requests in io_setup() the number of actually created events. This
    leads to an inconsistent value of fs.aio-nr after the restore.

    Let's show in fs.aio-nr the number of created events, not requested.

    https://jira.sw.ru/browse/PSBM-47209

    Signed-off-by: Andrey Ryabinin
    Acked-by: Kirill Tkhai
---
 fs/aio.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index ee72178..0f00aa0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -636,12 +636,12 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 
 	/* limit the number of system wide aios */
 	spin_lock(&ve->aio_nr_lock);
-	if (ve->aio_nr + nr_events > ve->aio_max_nr ||
-	    ve->aio_nr + nr_events < ve->aio_nr) {
+	if (ve->aio_nr + ctx->nr_events > ve->aio_max_nr ||
+	    ve->aio_nr + ctx->nr_events < ve->aio_nr) {
 		spin_unlock(&ve->aio_nr_lock);
 		goto out_cleanup;
 	}
-	ve->aio_nr += ctx->max_reqs;
+	ve->aio_nr += ctx->nr_events;
 	spin_unlock(&ve->aio_nr_lock);
 
 	/* now link into global list. */
[Devel] [PATCH RHEL7 COMMIT] ms/net: sock: move ->sk_shutdown out of bitfields.
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.5
--> commit 0e9bd23e9447385f71ed9b8e0ad7bbe0e026c068
Author: Andrey Ryabinin
Date: Fri May 20 18:58:08 2016 +0400

    ms/net: sock: move ->sk_shutdown out of bitfields.

    The ->sk_shutdown bits share one bitfield with some other bits in
    struct sock, such as ->sk_no_check_[r,t]x, ->sk_userlocks, etc.
    sock_setsockopt() may write to these bits while holding the socket
    lock. In the case of AF_UNIX sockets, we change the ->sk_shutdown
    bits while holding only unix_state_lock(). So concurrent setsockopt()
    and shutdown() may corrupt these bits.

    Fix this by moving the ->sk_shutdown bits out of the bitfield into a
    separate byte. This does not change the size of struct sock, since
    ->sk_shutdown moves into a previously unused 16-bit hole.

    https://jira.sw.ru/browse/PSBM-47023
    Link: http://lkml.kernel.org/g/1463588367-16310-1-git-send-email-aryabi...@virtuozzo.com

    Signed-off-by: Andrey Ryabinin
    Suggested-by: Hannes Frederic Sowa
    Acked-by: Eric Dumazet
---
 include/net/sock.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b940aa5..ccaacb6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -370,8 +370,13 @@ struct sock {
 	atomic_t		sk_omem_alloc;
 	int			sk_sndbuf;
 	struct sk_buff_head	sk_write_queue;
+
+	/*
+	 * Because of non atomicity rules, all
+	 * changes are protected by socket lock.
+	 */
 	kmemcheck_bitfield_begin(flags);
-	unsigned int		sk_shutdown  : 2,
+	unsigned int		sk_padding   : 2,
 #ifdef __GENKSYMS__
 				sk_no_check  : 2,
 #else
@@ -382,6 +387,7 @@ struct sock {
 				sk_protocol  : 8,
 				sk_type      : 16;
 	kmemcheck_bitfield_end(flags);
+
 	int			sk_wmem_queued;
 	gfp_t			sk_allocation;
 	u32			sk_pacing_rate; /* bytes per second */
@@ -390,6 +396,7 @@ struct sock {
 	int			sk_gso_type;
 	unsigned int		sk_gso_max_size;
 	u16			sk_gso_max_segs;
+	u8			sk_shutdown;
 	int			sk_rcvlowat;
 	unsigned long		sk_lingertime;
 	struct sk_buff_head	sk_error_queue;
Re: [Devel] [PATCH rh7] fs/aio: show real number of aio events in fs.aio-nr sysctl
On 20.05.2016 16:30, Andrey Ryabinin wrote:
> fs.aio-nr accounts the number of aio events requested by user via the
> io_setup() syscall. The kernel usually creates more events than was
> requested. CRIU doesn't care about the number of requested events, it
> cares only about created events. So while restoring the process, CRIU
> requests in io_setup() the number of actually created events. This leads
> to an inconsistent value of fs.aio-nr after the restore.
>
> Let's account in fs.aio-nr the number of created events, not requested.
>
> https://jira.sw.ru/browse/PSBM-47209
>
> Signed-off-by: Andrey Ryabinin

Acked-by: Kirill Tkhai

> ---
>  fs/aio.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 29d02ab..19ebcd1 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -633,12 +633,12 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
>
>  	/* limit the number of system wide aios */
>  	spin_lock(&ve->aio_nr_lock);
> -	if (ve->aio_nr + nr_events > ve->aio_max_nr ||
> -	    ve->aio_nr + nr_events < ve->aio_nr) {
> +	if (ve->aio_nr + ctx->nr_events > ve->aio_max_nr ||
> +	    ve->aio_nr + ctx->nr_events < ve->aio_nr) {
>  		spin_unlock(&ve->aio_nr_lock);
>  		goto out_cleanup;
>  	}
> -	ve->aio_nr += ctx->max_reqs;
> +	ve->aio_nr += ctx->nr_events;
>  	spin_unlock(&ve->aio_nr_lock);
>
>  	/* now link into global list. */
Re: [Devel] [PATCH rhel7] procfs: always expose /proc/<pid>/map_files/ and make it readable
This patch is a step for CRIU to work under non-root user.

The idea is good, but there is a long, long way ahead, and it may become
possible only after the next "Virtuozzo 8" kernel appears.

=> let's not include preparation patches until the feature of running CRIU
under non-root works on the mainstream kernel at least.

Surely this is valid only until such patches become required for some other
reason as well.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 05/16/2016 10:42 PM, Andrey Vagin wrote:

Acked-by: Andrey Vagin

On Mon, May 16, 2016 at 11:28:51AM +0300, Cyrill Gorcunov wrote:

This is a backport of commit

ML: bdb4d100afe9818aebd1d98ced575c5ef143456c

From: Calvin Owens

Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
only exposed if CONFIG_CHECKPOINT_RESTORE is set.

Each mapped file region gets a symlink in /proc/<pid>/map_files/
corresponding to the virtual address range at which it is mapped. The
symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them
to the backing file even if that backing file has been unlinked.

Currently, files which are mapped, unlinked, and closed are impossible to
stat() from userspace. Exposing /proc/<pid>/map_files/ closes this
functionality "hole".

Not being able to stat() such files makes noticing and explicitly
accounting for the space they use on the filesystem impossible. You can
work around this by summing up the space used by every file in the
filesystem and subtracting that total from what statfs() tells you, but
that obviously isn't great, and it becomes unworkable once your filesystem
becomes large enough.

This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
adjusts the permissions enforced on it as follows:

* proc_map_files_lookup()
* proc_map_files_readdir()
* map_files_d_revalidate()

  Remove the CAP_SYS_ADMIN restriction, leaving only the current
  restriction requiring PTRACE_MODE_READ. The information made available
  to userspace by these three functions is already available in
  /proc/PID/maps with MODE_READ, so I don't see any reason to limit them
  any further (see below for more detail).

* proc_map_files_follow_link()

  This stub has been added, and requires that the user have CAP_SYS_ADMIN
  in order to follow the links in map_files/, since there was concern on
  LKML both about the potential for bypassing permissions on ancestor
  directories in the path to files pointed to, and about what happens with
  more exotic memory mappings created by some drivers (i.e. dma-buf).

In older versions of this patch, I changed every permission check in the
four functions above to enforce MODE_ATTACH instead of MODE_READ. This was
an oversight on my part, and after revisiting the discussion it seems that
nobody was concerned about anything outside of what is made possible by
->follow_link(). So in this version, I've left the checks for
PTRACE_MODE_READ as-is.

[a...@linux-foundation.org: catch up with concurrent
 proc_pid_follow_link() changes]
Signed-off-by: Calvin Owens
Reviewed-by: Kees Cook
Cc: Andy Lutomirski
Cc: Cyrill Gorcunov
Cc: Joe Perches
Cc: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Cyrill Gorcunov
---

Kostya, please wait for Ack from Andrew. The patch on its own is not bound
to some of the bugs we're working on now, but it is useful in general and
will probably help us with renaming of memfd-restored memory in criu (we
use memfd to be able to restore anonymous shared memory in the userns
case, but memfd mangles the backend name; we haven't found any problem
with it yet, but having talked to Andrew, he agreed that we might need to
do something about this, and this patch is the first step).

 fs/proc/base.c | 44 +++-
 1 file changed, 23 insertions(+), 21 deletions(-)

Index: linux-pcs7.git/fs/proc/base.c
===================================================================
--- linux-pcs7.git.orig/fs/proc/base.c
+++ linux-pcs7.git/fs/proc/base.c
@@ -1925,8 +1925,6 @@ end_instantiate:
 	return filldir(dirent, name, len, filp->f_pos, ino, type);
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
-
 /*
  * dname_to_vma_addr - maps a dentry name into two unsigned longs
  * which represent vma start and end addresses.
  */
@@ -1953,11 +1951,6 @@ static int map_files_d_revalidate(struct
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
 
-	if (!capable(CAP_SYS_ADMIN)) {
-		status = -EPERM;
-		goto out_notask;
-	}
-
 	inode = dentry->d_inode;
 	task = get_proc_task(inode);
 	if (!task)
@@ -2048,6 +2041,28 @@ struct
Re: [Devel] [vzlin-dev] [PATCH rh7 4/4] ploop: get rid of direct calls to file->f_op->fsync()
Maxim Patlasov writes:

> The patch hides file->f_op->fsync() in dio_sync. The only exception is
> dio_truncate, where "file" may come from a userspace fd.
>
> Signed-off-by: Maxim Patlasov

Acked-by: dmonak...@openvz.org

> ---
>  drivers/block/ploop/io_direct.c | 13 +++++--------
>  1 file changed, 5 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 1ff848c..8096110 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -405,8 +405,7 @@ try_again:
>  	}
>
>  	/* flush new i_size to disk */
> -	err = io->files.file->f_op->fsync(io->files.file, 0,
> -					  LLONG_MAX, 0);
> +	err = io->ops->sync(io);
>  	if (err)
>  		goto end_write;
>
> @@ -524,8 +523,8 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
>  				FALLOC_FL_CONVERT_UNWRITTEN,
>  				(loff_t)sec << 9, clu_siz);
>  	if (!err)
> -		err = io->files.file->f_op->fsync(io->files.file, 0,
> -						  LLONG_MAX, 0);
> +		err = io->ops->sync(io);
> +
>  	file_end_write(io->files.file);
>  	if (err) {
>  		PLOOP_REQ_SET_ERROR(preq, err);
> @@ -814,8 +813,7 @@ static int dio_fsync_thread(void * data)
>  		/* filemap_fdatawrite() has been made already */
>  		filemap_fdatawait(io->files.mapping);
>
> -		err = io->files.file->f_op->fsync(io->files.file, 0,
> -						  LLONG_MAX, 0);
> +		err = io->ops->sync(io);
>
>  		/* Do we need to invalidate page cache? Not really,
>  		 * because we use it only to create full new pages,
> @@ -1367,8 +1365,7 @@ static int dio_alloc_sync(struct ploop_io * io, loff_t pos, loff_t len)
>  	if (err)
>  		goto fail;
>
> -	err = io->files.file->f_op->fsync(io->files.file, 0,
> -					  LLONG_MAX, 0);
> +	err = io->ops->sync(io);
>  	if (err)
>  		goto fail;
[Devel] [PATCH RHEL7 COMMIT] ploop: push_backup: optimize ploop_pb_get_pending()
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.5
--> commit 6130b647da6a3f732194155a34a0a77632614f3d
Author: Maxim Patlasov
Date: Fri May 20 16:48:27 2016 +0400

    ploop: push_backup: optimize ploop_pb_get_pending()

    Pack as many adjacent preq-s into each extent as possible.

    https://jira.sw.ru/browse/PSBM-45000

    Signed-off-by: Maxim Patlasov
---
 drivers/block/ploop/push_backup.c | 46 ++++++++++++++++++++++++++-------
 1 file changed, 39 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ploop/push_backup.c b/drivers/block/ploop/push_backup.c
index 46ea2da..ba64468 100644
--- a/drivers/block/ploop/push_backup.c
+++ b/drivers/block/ploop/push_backup.c
@@ -351,7 +351,8 @@ static struct ploop_request *ploop_pb_get_req_from_tree(struct rb_root *tree,
 }
 
 static struct ploop_request *
-ploop_pb_get_first_req_from_tree(struct rb_root *tree)
+ploop_pb_get_first_req_from_tree(struct rb_root *tree,
+				 struct ploop_request **npreq)
 {
 	static struct ploop_request *p;
 	struct rb_node *n = rb_first(tree);
@@ -359,6 +360,15 @@ ploop_pb_get_first_req_from_tree(struct rb_root *tree)
 	if (!n)
 		return NULL;
 
+	if (npreq) {
+		struct rb_node *next = rb_next(n);
+		if (next)
+			*npreq = rb_entry(next, struct ploop_request,
+					  reloc_link);
+		else
+			*npreq = NULL;
+	}
+
 	p = rb_entry(n, struct ploop_request, reloc_link);
 	rb_erase(&p->reloc_link, tree);
 	return p;
@@ -367,13 +377,20 @@ ploop_pb_get_first_req_from_tree(struct rb_root *tree)
 static struct ploop_request *
 ploop_pb_get_first_req_from_pending(struct ploop_pushbackup_desc *pbd)
 {
-	return ploop_pb_get_first_req_from_tree(&pbd->pending_tree);
+	return ploop_pb_get_first_req_from_tree(&pbd->pending_tree, NULL);
+}
+
+static struct ploop_request *
+ploop_pb_get_first_reqs_from_pending(struct ploop_pushbackup_desc *pbd,
+				     struct ploop_request **npreq)
+{
+	return ploop_pb_get_first_req_from_tree(&pbd->pending_tree, npreq);
 }
 
 static struct ploop_request *
 ploop_pb_get_first_req_from_reported(struct ploop_pushbackup_desc *pbd)
 {
-	return ploop_pb_get_first_req_from_tree(&pbd->reported_tree);
+	return ploop_pb_get_first_req_from_tree(&pbd->reported_tree, NULL);
 }
 
 static struct ploop_request *
@@ -463,13 +480,12 @@ int ploop_pb_get_pending(struct ploop_pushbackup_desc *pbd,
 			 cluster_t *clu_p, cluster_t *len_p, unsigned n_done)
 {
 	bool blocking = !n_done;
-	struct ploop_request *preq;
+	struct ploop_request *preq, *npreq;
 	int err = 0;
 
 	spin_lock(&pbd->ppb_lock);
 
-	/* OPTIMIZE ME LATER: rb_first() once, then rb_next() */
-	preq = ploop_pb_get_first_req_from_pending(pbd);
+	preq = ploop_pb_get_first_reqs_from_pending(pbd, &npreq);
 	if (!preq) {
 		struct ploop_device *plo = pbd->plo;
 
@@ -498,7 +514,7 @@ int ploop_pb_get_pending(struct ploop_pushbackup_desc *pbd,
 		pbd->ppb_waiting = false;
 		init_completion(&pbd->ppb_comp);
 
-		preq = ploop_pb_get_first_req_from_pending(pbd);
+		preq = ploop_pb_get_first_reqs_from_pending(pbd, &npreq);
 		if (!preq) {
 			if (!test_bit(PLOOP_S_PUSH_BACKUP, &plo->state))
 				err = -EINTR;
@@ -515,6 +531,22 @@ int ploop_pb_get_pending(struct ploop_pushbackup_desc *pbd,
 	*clu_p = preq->req_cluster;
 	*len_p = 1;
 
+	while (npreq && npreq->req_cluster == *clu_p + *len_p) {
+		struct rb_node *next = rb_next(&npreq->reloc_link);
+
+		preq = npreq;
+		if (next)
+			npreq = rb_entry(next, struct ploop_request,
+					 reloc_link);
+		else
+			npreq = NULL;
+
+		rb_erase(&preq->reloc_link, &pbd->pending_tree);
+		ploop_pb_add_req_to_reported(pbd, preq);
+
+		(*len_p)++;
+	}
+
 get_pending_unlock:
 	spin_unlock(&pbd->ppb_lock);
 	return err;
[Devel] [PATCH rh7] fs/aio: show real number of aio events in fs.aio-nr sysctl
fs.aio-nr accounts the number of aio events requested by user via the
io_setup() syscall. The kernel usually creates more events than was
requested. CRIU doesn't care about the number of requested events, it
cares only about created events. So while restoring the process, CRIU
requests in io_setup() the number of actually created events. This leads
to an inconsistent value of fs.aio-nr after the restore.

Let's account in fs.aio-nr the number of created events, not requested.

https://jira.sw.ru/browse/PSBM-47209

Signed-off-by: Andrey Ryabinin
---
 fs/aio.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 29d02ab..19ebcd1 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -633,12 +633,12 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 
 	/* limit the number of system wide aios */
 	spin_lock(&ve->aio_nr_lock);
-	if (ve->aio_nr + nr_events > ve->aio_max_nr ||
-	    ve->aio_nr + nr_events < ve->aio_nr) {
+	if (ve->aio_nr + ctx->nr_events > ve->aio_max_nr ||
+	    ve->aio_nr + ctx->nr_events < ve->aio_nr) {
 		spin_unlock(&ve->aio_nr_lock);
 		goto out_cleanup;
 	}
-	ve->aio_nr += ctx->max_reqs;
+	ve->aio_nr += ctx->nr_events;
 	spin_unlock(&ve->aio_nr_lock);
 
 	/* now link into global list. */
-- 
2.7.3
Re: [Devel] [vzlin-dev] [PATCH rh7 1/4] ploop: get rid of FOP_FSYNC
Maxim Patlasov writes:

> We keep the ploop sources as an in-tree module of the rhel7-based kernel,
> so we know for sure how the fsync fop prototype looks.
>
> Signed-off-by: Maxim Patlasov

Acked-by: dmonak...@openvz.org

> ---
>  drivers/block/ploop/io_direct.c | 15 ++++++++-------
>  include/linux/ploop/compat.h    |  6 ------
>  2 files changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 5a2e12a..583b110 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -406,7 +406,8 @@ try_again:
>  	}
>
>  	/* flush new i_size to disk */
> -	err = io->files.file->f_op->FOP_FSYNC(io->files.file, 0);
> +	err = io->files.file->f_op->fsync(io->files.file, 0,
> +					  LLONG_MAX, 0);
>  	if (err)
>  		goto end_write;
>
> @@ -524,7 +525,8 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
>  				FALLOC_FL_CONVERT_UNWRITTEN,
>  				(loff_t)sec << 9, clu_siz);
>  	if (!err)
> -		err = io->files.file->f_op->FOP_FSYNC(io->files.file, 0);
> +		err = io->files.file->f_op->fsync(io->files.file, 0,
> +						  LLONG_MAX, 0);
>  	file_end_write(io->files.file);
>  	if (err) {
>  		PLOOP_REQ_SET_ERROR(preq, err);
> @@ -815,8 +817,8 @@ static int dio_fsync_thread(void * data)
>
>  		err = 0;
>  		if (io->files.file->f_op->fsync)
> -			err = io->files.file->f_op->FOP_FSYNC(io->files.file,
> -							      0);
> +			err = io->files.file->f_op->fsync(io->files.file, 0,
> +							  LLONG_MAX, 0);
>
>  		/* Do we need to invalidate page cache? Not really,
>  		 * because we use it only to create full new pages,
> @@ -853,7 +855,7 @@ static int dio_fsync(struct file * file)
>  	ret = filemap_write_and_wait(mapping);
>  	err = 0;
>  	if (file->f_op && file->f_op->fsync) {
> -		err = file->f_op->FOP_FSYNC(file, 0);
> +		err = file->f_op->fsync(file, 0, LLONG_MAX, 0);
>  		if (!ret)
>  			ret = err;
>  	}
> @@ -1385,7 +1387,8 @@ static int dio_alloc_sync(struct ploop_io * io, loff_t pos, loff_t len)
>  		goto fail;
>
>  	if (io->files.file->f_op && io->files.file->f_op->fsync) {
> -		err = io->files.file->f_op->FOP_FSYNC(io->files.file, 0);
> +		err = io->files.file->f_op->fsync(io->files.file, 0,
> +						  LLONG_MAX, 0);
>  		if (err)
>  			goto fail;
>  	}
> diff --git a/include/linux/ploop/compat.h b/include/linux/ploop/compat.h
> index 03c3ae3..8a36d81 100644
> --- a/include/linux/ploop/compat.h
> +++ b/include/linux/ploop/compat.h
> @@ -58,10 +58,4 @@ static void func(struct bio *bio, int err) {
>
>  #endif
>
> -#if LINUX_VERSION_CODE > KERNEL_VERSION(2,6,32)
> -#define FOP_FSYNC(file, datasync) fsync(file, 0, LLONG_MAX, datasync)
> -#else
> -#define FOP_FSYNC(file, datasync) fsync(file, F_DENTRY(file), datasync)
> -#endif
> -
>  #endif
[Devel] [PATCH] aio: Multiply passed nr_events in io_setup()
CRIU supports AIO in kernel versions >= 3.19. It expects the logic
introduced in kernel commit e1bdd5f27a5b, which started to multiply the
passed nr_events by two. The vz7 kernel doesn't do that, and this leads to
a wrong calculation of the length of the AIO ring in CRIU.

This is a port of a small part of commit e1bdd5f27a5b, to make CRIU work
on the Virtuozzo 7 kernel.

https://jira.sw.ru/browse/PSBM-47075

Signed-off-by: Kirill Tkhai
---
 fs/aio.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index 29d02ab..ee72178 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -596,6 +596,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 	struct ve_struct *ve = get_exec_env();
 	int err = -ENOMEM;
 
+	/* Kernel since e1bdd5f27a5b do this, and criu is tuned on that */
+	nr_events *= 2;
+
 	/* Prevent overflows */
 	if ((nr_events > (0x10000000U / sizeof(struct io_event))) ||
 	    (nr_events > (0x10000000U / sizeof(struct kiocb)))) {
[Devel] [PATCH RHEL7 COMMIT] ploop: push_backup: optimize ploop_pb_put_reported()
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.5
--> commit 0fb4dff848a098b16c93028aed261d4f5aa11a33
Author: Maxim Patlasov
Date: Fri May 20 16:57:47 2016 +0400

    ploop: push_backup: optimize ploop_pb_put_reported()

    To process an extent [clu, clu + len) coming from userspace, let's
    first find the leftmost preq in the tree matching this interval. Then
    we can iterate to the right by rb_next() until we meet a preq that
    doesn't match. That's all, because any preq further to the right
    would have an even higher req_cluster, and hence doesn't match
    either.

    https://jira.sw.ru/browse/PSBM-45000

    Signed-off-by: Maxim Patlasov
---
 drivers/block/ploop/dev.c         |   9 +---
 drivers/block/ploop/push_backup.c | 107 +++++++++++++++++++-----------
 2 files changed, 73 insertions(+), 43 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index f3bb092..15e34e5 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4626,13 +4626,8 @@ static int ploop_push_backup_io_write(struct ploop_device *plo, unsigned long arg)
 		goto io_write_done;
 
 	rc = 0;
-	for (i = 0; i < ctl->n_extents; i++) {
-		cluster_t j;
-		for (j = e[i].clu; j < e[i].clu + e[i].len; j++)
-			ploop_pb_put_reported(plo->pbd, j, 1);
-		/* OPTIMIZE ME LATER: like this:
-		 * ploop_pb_put_reported(plo->pbd, e[i].clu, e[i].len); */
-	}
+	for (i = 0; i < ctl->n_extents; i++)
+		ploop_pb_put_reported(plo->pbd, e[i].clu, e[i].len);
 
 io_write_done:
 	kfree(e);
diff --git a/drivers/block/ploop/push_backup.c b/drivers/block/ploop/push_backup.c
index ba64468..90172e4 100644
--- a/drivers/block/ploop/push_backup.c
+++ b/drivers/block/ploop/push_backup.c
@@ -329,11 +329,23 @@ static void ploop_pb_add_req_to_reported(struct ploop_pushbackup_desc *pbd,
 	ploop_pb_add_req_to_tree(preq, &pbd->reported_tree);
 }
 
+static inline bool preq_match(struct ploop_request *preq, cluster_t clu,
+			      cluster_t len)
+{
+	return preq &&
+	       clu <= preq->req_cluster &&
+	       preq->req_cluster < clu + len;
+}
+
+/* returns leftmost preq which req_cluster >= clu */
 static struct ploop_request *ploop_pb_get_req_from_tree(struct rb_root *tree,
-							cluster_t clu)
+							cluster_t clu, cluster_t len,
+							struct ploop_request **npreq)
 {
 	struct rb_node *n = tree->rb_node;
-	struct ploop_request *p;
+	struct ploop_request *p = NULL;
+
+	*npreq = NULL;
 
 	while (n) {
 		p = rb_entry(n, struct ploop_request, reloc_link);
@@ -342,11 +354,31 @@ static struct ploop_request *ploop_pb_get_req_from_tree(struct rb_root *tree,
 			n = n->rb_left;
 		else if (clu > p->req_cluster)
 			n = n->rb_right;
-		else {
+		else { /* perfect match */
+			n = rb_next(n);
+			if (n)
+				*npreq = rb_entry(n, struct ploop_request,
+						  reloc_link);
 			rb_erase(&p->reloc_link, tree);
 			return p;
 		}
 	}
 
+	/* here p is not perfect, but it's closest */
+
+	if (p && p->req_cluster < clu) {
+		n = rb_next(&p->reloc_link);
+		if (n)
+			p = rb_entry(n, struct ploop_request, reloc_link);
+	}
+
+	if (preq_match(p, clu, len)) {
+		n = rb_next(&p->reloc_link);
+		if (n)
+			*npreq = rb_entry(n, struct ploop_request, reloc_link);
+		rb_erase(&p->reloc_link, tree);
+		return p;
+	}
+
 	return NULL;
 }
 
@@ -393,20 +425,6 @@ ploop_pb_get_first_req_from_reported(struct ploop_pushbackup_desc *pbd)
 	return ploop_pb_get_first_req_from_tree(&pbd->reported_tree, NULL);
 }
 
-static struct ploop_request *
-ploop_pb_get_req_from_pending(struct ploop_pushbackup_desc *pbd,
-			      cluster_t clu)
-{
-	return ploop_pb_get_req_from_tree(&pbd->pending_tree, clu);
-}
-
-static struct ploop_request *
-ploop_pb_get_req_from_reported(struct ploop_pushbackup_desc *pbd,
-			       cluster_t clu)
-{
-	return ploop_pb_get_req_from_tree(&pbd->reported_tree, clu);
-}
-
 int ploop_pb_preq_add_pending(struct ploop_pushbackup_desc *pbd,
 			      struct ploop_request *preq)
 {
@@ -552,27 +570,46 @@ get_pending_unlock:
 	return err;
 }
 
+static void
Re: [Devel] [vzlin-dev] [PATCH rh7 3/4] ploop: get rid of dio_fsync()
Maxim Patlasovwrites: > For ext4 dio_fsync() is actually equivalent to direct call to fsync fop: > > 1) file->f_op cannot be NULL; > 2) file->f_op->fsync is always equal to ext4_sync_file; > 3) ext4_sync_file() does filemap_write_and_wait() internally, >no need to call it explicitly. > > The patch also fixes a potential problem: if fsync() fails, it's better > to pass error code up to the stack of callers. > > Signed-off-by: Maxim Patlasov Acked-by:dmonak...@openvz.org > --- > drivers/block/ploop/io_direct.c | 52 > ++- > 1 file changed, 24 insertions(+), 28 deletions(-) > > diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c > index a37f296..1ff848c 100644 > --- a/drivers/block/ploop/io_direct.c > +++ b/drivers/block/ploop/io_direct.c > @@ -844,21 +844,6 @@ static int dio_fsync_thread(void * data) > return 0; > } > > -static int dio_fsync(struct file * file) > -{ > - int err, ret; > - struct address_space *mapping = file->f_mapping; > - > - ret = filemap_write_and_wait(mapping); > - err = 0; > - if (file->f_op && file->f_op->fsync) { > - err = file->f_op->fsync(file, 0, LLONG_MAX, 0); > - if (!ret) > - ret = err; > - } > - return ret; > -} > - > /* Invalidate page cache. It is called with inode mutex taken > * and mapping mapping must be synced. If some dirty pages remained, > * it will fail. 
> @@ -949,20 +934,17 @@ static void dio_destroy(struct ploop_io * io) > static int dio_sync(struct ploop_io * io) > { > struct file * file = io->files.file; > + int err = 0; > > if (file) > - dio_fsync(file); > - return 0; > + err = file->f_op->fsync(file, 0, LLONG_MAX, 0); > + > + return err; > } > > static int dio_stop(struct ploop_io * io) > { > - struct file * file = io->files.file; > - > - if (file) { > - dio_fsync(file); > - } > - return 0; > + return io->ops->sync(io); > } > > static int dio_open(struct ploop_io * io) > @@ -979,7 +961,9 @@ static int dio_open(struct ploop_io * io) > io->files.inode = io->files.mapping->host; > io->files.bdev = io->files.inode->i_sb->s_bdev; > > - dio_fsync(file); > + err = io->ops->sync(io); > + if (err) > + return err; > > mutex_lock(>files.inode->i_mutex); > em_tree = ploop_dio_open(io, (delta->flags & PLOOP_FMT_RDONLY)); > @@ -1646,7 +1630,11 @@ static int dio_prepare_snapshot(struct ploop_io * io, > struct ploop_snapdata *sd) > return -EINVAL; > } > > - dio_fsync(file); > + err = io->ops->sync(io); > + if (err) { > + fput(file); > + return err; > + } > > mutex_lock(>files.inode->i_mutex); > err = dio_invalidate_cache(io->files.mapping, io->files.bdev); > @@ -1713,7 +1701,11 @@ static int dio_prepare_merge(struct ploop_io * io, > struct ploop_snapdata *sd) > return -EINVAL; > } > > - dio_fsync(file); > + err = io->ops->sync(io); > + if (err) { > + fput(file); > + return err; > + } > > mutex_lock(>files.inode->i_mutex); > > @@ -1772,8 +1764,12 @@ static int dio_truncate(struct ploop_io * io, struct > file * file, > atomic_long_sub(*io->size_ptr - new_size, _io_images_size); > *io->size_ptr = new_size; > > - if (!err) > - err = dio_fsync(file); > + if (!err) { > + if (io->files.file == file) > + err = io->ops->sync(io); > + else > + err = file->f_op->fsync(file, 0, LLONG_MAX, 0); > + } > > return err; > } signature.asc Description: PGP signature ___ Devel mailing list Devel@openvz.org 
https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [vzlin-dev] [PATCH rh7 2/4] ploop: io_direct: check for fsync fop on startup
Maxim Patlasovwrites: > We don't support host file systems without fsync fop. The patch refuses > to start ploop if fsync is absent. > > Signed-off-by: Maxim Patlasov Acked-by:dmonak...@openvz.org > --- > drivers/block/ploop/io_direct.c | 23 --- > 1 file changed, 12 insertions(+), 11 deletions(-) > > diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c > index 583b110..a37f296 100644 > --- a/drivers/block/ploop/io_direct.c > +++ b/drivers/block/ploop/io_direct.c > @@ -376,7 +376,6 @@ cached_submit(struct ploop_io *io, iblock_t iblk, struct > ploop_request * preq, > loff_t new_size; > loff_t used_pos; > bool may_fallocate = io->files.file->f_op->fallocate && > - io->files.file->f_op->fsync && > io->files.flags & EXT4_EXTENTS_FL; > > trace_cached_submit(preq); > @@ -815,10 +814,8 @@ static int dio_fsync_thread(void * data) > /* filemap_fdatawrite() has been made already */ > filemap_fdatawait(io->files.mapping); > > - err = 0; > - if (io->files.file->f_op->fsync) > - err = io->files.file->f_op->fsync(io->files.file, 0, > - LLONG_MAX, 0); > + err = io->files.file->f_op->fsync(io->files.file, 0, > + LLONG_MAX, 0); > > /* Do we need to invalidate page cache? 
Not really, >* because we use it only to create full new pages, > @@ -1386,12 +1383,11 @@ static int dio_alloc_sync(struct ploop_io * io, > loff_t pos, loff_t len) > if (err) > goto fail; > > - if (io->files.file->f_op && io->files.file->f_op->fsync) { > - err = io->files.file->f_op->fsync(io->files.file, 0, > - LLONG_MAX, 0); > - if (err) > - goto fail; > - } > + err = io->files.file->f_op->fsync(io->files.file, 0, > + LLONG_MAX, 0); > + if (err) > + goto fail; > + > err = filemap_fdatawait(io->files.mapping); > > fail: > @@ -1878,6 +1874,11 @@ static int dio_autodetect(struct ploop_io * io) > return -1; > } > > + if (!file->f_op->fsync) { > + printk("Cannot run on EXT4(%s): no fsync\n", s_id); > + return -1; > + } > + > fs = get_fs(); > set_fs(KERNEL_DS); > flags = 0; signature.asc Description: PGP signature ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/KVM: x86: move steal time initialization to vcpu entry time
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.5 --> commit 4a055bb845eda81847c8e2b2233ba8ac3a889ab6 Author: Marcelo TosattiDate: Fri May 20 16:38:02 2016 +0400 ms/KVM: x86: move steal time initialization to vcpu entry time As reported at https://bugs.launchpad.net/qemu/+bug/1494350, it is possible to have vcpu->arch.st.last_steal initialized from a thread other than vcpu thread, say the iothread, via KVM_SET_MSRS. Which can cause an overflow later (when subtracting from vcpu threads sched_info.run_delay). To avoid that, move steal time accumulation to vcpu entry time, before copying steal time data to guest. Signed-off-by: Marcelo Tosatti Reviewed-by: David Matlack Signed-off-by: Paolo Bonzini (cherry picked from commit 7cae2bedcbd4680b155999655e49c27b9cf020fa) https://jira.sw.ru/browse/PSBM-46737 Signed-off-by: Roman Kagan --- arch/x86/kvm/x86.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ca78e5e..146c34a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2057,6 +2057,8 @@ static void accumulate_steal_time(struct kvm_vcpu *vcpu) static void record_steal_time(struct kvm_vcpu *vcpu) { + accumulate_steal_time(vcpu); + if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) return; @@ -2207,12 +2209,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) if (!(data & KVM_MSR_ENABLED)) break; - vcpu->arch.st.last_steal = current->sched_info.run_delay; - - preempt_disable(); - accumulate_steal_time(vcpu); - preempt_enable(); - kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); break; @@ -2826,7 +2822,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) vcpu->cpu = cpu; } - accumulate_steal_time(vcpu); kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 03/10] dcache: don't need rcu in shrink_dentry_list()
From: Miklos SzerediSince now the shrink list is private and nobody can free the dentry while it is on the shrink list, we can remove RCU protection from this. Signed-off-by: Miklos Szeredi Signed-off-by: Al Viro (cherry picked from commit 60942f2f235ce7b817166cdf355eed729094834d) Signed-off-by: Vladimir Davydov --- fs/dcache.c | 27 --- 1 file changed, 4 insertions(+), 23 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 1961048478cf..c584d23a194a 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -826,23 +826,9 @@ static void shrink_dentry_list(struct list_head *list) { struct dentry *dentry, *parent; - rcu_read_lock(); - for (;;) { - dentry = list_entry_rcu(list->prev, struct dentry, d_lru); - if (>d_lru == list) - break; /* empty */ - - /* -* Get the dentry lock, and re-verify that the dentry is -* this on the shrinking list. If it is, we know that -* DCACHE_SHRINK_LIST and DCACHE_LRU_LIST are set. -*/ + while (!list_empty(list)) { + dentry = list_entry(list->prev, struct dentry, d_lru); spin_lock(>d_lock); - if (dentry != list_entry(list->prev, struct dentry, d_lru)) { - spin_unlock(>d_lock); - continue; - } - /* * The dispose list is isolated and dentries are not accounted * to the LRU here, so we can simply remove it from the list @@ -858,23 +844,20 @@ static void shrink_dentry_list(struct list_head *list) spin_unlock(>d_lock); continue; } - rcu_read_unlock(); parent = dentry_kill(dentry, 0); /* * If dentry_kill returns NULL, we have nothing more to do. */ - if (!parent) { - rcu_read_lock(); + if (!parent) continue; - } + if (unlikely(parent == dentry)) { /* * trylocks have failed and d_lock has been held the * whole time, so it could not have been added to any * other lists. Just add it back to the shrink list. 
*/ - rcu_read_lock(); d_shrink_add(dentry, list); spin_unlock(>d_lock); continue; @@ -888,9 +871,7 @@ static void shrink_dentry_list(struct list_head *list) dentry = parent; while (dentry && !lockref_put_or_lock(>d_lockref)) dentry = dentry_kill(dentry, 1); - rcu_read_lock(); } - rcu_read_unlock(); } static enum lru_status dentry_lru_isolate(struct list_head *item, -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 10/10] dcache: add missing lockdep annotation
From: Linus Torvalds

lock_parent() very much on purpose does nested locking of dentries, and is careful to maintain the right order (lock parent first). But because it didn't annotate the nested locking order, lockdep thought it might be a deadlock on d_lock, and complained.

Add the proper annotation for the inner locking of the child dentry to make lockdep happy.

Introduced by commit 046b961b45f9 ("shrink_dentry_list(): take parent's ->d_lock earlier").

Reported-and-tested-by: Josh Boyer
Cc: Al Viro
Signed-off-by: Linus Torvalds
(cherry picked from commit 9f12600fe425bc28f0ccba034a77783c09c15af4)
Signed-off-by: Vladimir Davydov
---
 fs/dcache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ab6859764a6a..09ed486c9f1d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -582,7 +582,7 @@ again:
 	}
 	rcu_read_unlock();
 	if (parent != dentry)
-		spin_lock(&dentry->d_lock);
+		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
 	else
 		parent = NULL;
 	return parent;
-- 
2.1.4
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 08/10] dealing with the rest of shrink_dentry_list() livelock
From: Al ViroWe have the same problem with ->d_lock order in the inner loop, where we are dropping references to ancestors. Same solution, basically - instead of using dentry_kill() we use lock_parent() (introduced in the previous commit) to get that lock in a safe way, recheck ->d_count (in case if lock_parent() has ended up dropping and retaking ->d_lock and somebody managed to grab a reference during that window), trylock the inode->i_lock and use __dentry_kill() to do the rest. Signed-off-by: Al Viro (cherry picked from commit b2b80195d8829921506880f6dccd21cabd163d0d) Signed-off-by: Vladimir Davydov --- fs/dcache.c | 22 -- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index a287c5a63cde..7b75b482bec2 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -913,8 +913,26 @@ static void shrink_dentry_list(struct list_head *list) * fragmentation. */ dentry = parent; - while (dentry && !lockref_put_or_lock(>d_lockref)) - dentry = dentry_kill(dentry, 1); + while (dentry && !lockref_put_or_lock(>d_lockref)) { + parent = lock_parent(dentry); + if (dentry->d_lockref.count != 1) { + dentry->d_lockref.count--; + spin_unlock(>d_lock); + if (parent) + spin_unlock(>d_lock); + break; + } + inode = dentry->d_inode;/* can't be NULL */ + if (unlikely(!spin_trylock(>i_lock))) { + spin_unlock(>d_lock); + if (parent) + spin_unlock(>d_lock); + cpu_relax(); + continue; + } + __dentry_kill(dentry); + dentry = parent; + } } } -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 05/10] split dentry_kill()
From: Al Viro... into trylocks and everything else. The latter (actual killing) is __dentry_kill(). Signed-off-by: Al Viro (cherry picked from commit e55fd011549eae01a230e3cace6f4d031b6a3453) Signed-off-by: Vladimir Davydov Conflicts: fs/dcache.c --- fs/dcache.c | 66 +++-- 1 file changed, 38 insertions(+), 28 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 37420f461ca2..70b34bd63e4e 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -482,37 +482,12 @@ void d_drop(struct dentry *dentry) } EXPORT_SYMBOL(d_drop); -/* - * Finish off a dentry we've decided to kill. - * dentry->d_lock must be held, returns with it unlocked. - * If ref is non-zero, then decrement the refcount too. - * Returns dentry requiring refcount drop, or NULL if we're done. - */ -static inline struct dentry * -dentry_kill(struct dentry *dentry, int unlock_on_failure) - __releases(dentry->d_lock) +static void __dentry_kill(struct dentry *dentry) { - struct inode *inode; - struct dentry *parent; + struct dentry *parent = NULL; - inode = dentry->d_inode; - if (inode && !spin_trylock(>i_lock)) { -relock: - if (unlock_on_failure) { - spin_unlock(>d_lock); - cpu_relax(); - } - return dentry; /* try again with same dentry */ - } - if (IS_ROOT(dentry)) - parent = NULL; - else + if (!IS_ROOT(dentry)) parent = dentry->d_parent; - if (parent && !spin_trylock(>d_lock)) { - if (inode) - spin_unlock(>i_lock); - goto relock; - } /* * The dentry is now unrecoverably dead to the world. @@ -546,7 +521,42 @@ relock: * transient RCU lookups) can reach this dentry. */ d_free(dentry); +} + +/* + * Finish off a dentry we've decided to kill. + * dentry->d_lock must be held, returns with it unlocked. + * If ref is non-zero, then decrement the refcount too. + * Returns dentry requiring refcount drop, or NULL if we're done. 
+ */ +static struct dentry * +dentry_kill(struct dentry *dentry, int unlock_on_failure) + __releases(dentry->d_lock) +{ + struct inode *inode = dentry->d_inode; + struct dentry *parent = NULL; + + if (inode && unlikely(!spin_trylock(>i_lock))) + goto failed; + + if (!IS_ROOT(dentry)) { + parent = dentry->d_parent; + if (unlikely(!spin_trylock(>d_lock))) { + if (inode) + spin_unlock(>i_lock); + goto failed; + } + } + + __dentry_kill(dentry); return parent; + +failed: + if (unlock_on_failure) { + spin_unlock(>d_lock); + cpu_relax(); + } + return dentry; /* try again with same dentry */ } /* -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 06/10] expand dentry_kill(dentry, 0) in shrink_dentry_list()
From: Al ViroResult will be massaged to saner shape in the next commits. It is ugly, no questions - the point of that one is to be a provably equivalent transformation (and it might be worth splitting a bit more). Signed-off-by: Al Viro (cherry picked from commit ff2fde9929feb2aef45377ce56b8b12df85dda69) Signed-off-by: Vladimir Davydov --- fs/dcache.c | 30 +- 1 file changed, 17 insertions(+), 13 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 70b34bd63e4e..2e4a7192a877 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -829,6 +829,7 @@ static void shrink_dentry_list(struct list_head *list) struct dentry *dentry, *parent; while (!list_empty(list)) { + struct inode *inode; dentry = list_entry(list->prev, struct dentry, d_lru); spin_lock(>d_lock); /* @@ -856,23 +857,26 @@ static void shrink_dentry_list(struct list_head *list) continue; } - parent = dentry_kill(dentry, 0); - /* -* If dentry_kill returns NULL, we have nothing more to do. -*/ - if (!parent) - continue; - - if (unlikely(parent == dentry)) { - /* -* trylocks have failed and d_lock has been held the -* whole time, so it could not have been added to any -* other lists. Just add it back to the shrink list. -*/ + inode = dentry->d_inode; + if (inode && unlikely(!spin_trylock(>i_lock))) { d_shrink_add(dentry, list); spin_unlock(>d_lock); continue; } + + parent = NULL; + if (!IS_ROOT(dentry)) { + parent = dentry->d_parent; + if (unlikely(!spin_trylock(>d_lock))) { + if (inode) + spin_unlock(>i_lock); + d_shrink_add(dentry, list); + spin_unlock(>d_lock); + continue; + } + } + + __dentry_kill(dentry); /* * We need to prune ancestors too. This is necessary to prevent * quadratic behavior of shrink_dcache_parent(), but is also -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 00/10] Backport fix for shrink_dentry_list livelock
https://jira.sw.ru/browse/PSBM-47312

Al Viro (7):
  fold try_prune_one_dentry()
  lift the "already marked killed" case into shrink_dentry_list()
  split dentry_kill()
  expand dentry_kill(dentry, 0) in shrink_dentry_list()
  shrink_dentry_list(): take parent's ->d_lock earlier
  dealing with the rest of shrink_dentry_list() livelock
  dentry_kill() doesn't need the second argument now

Linus Torvalds (1):
  dcache: add missing lockdep annotation

Miklos Szeredi (1):
  dcache: don't need rcu in shrink_dentry_list()

Vladimir Davydov (1):
  fold d_kill()

 fs/dcache.c | 275 +---
 1 file changed, 135 insertions(+), 140 deletions(-)

-- 
2.1.4
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 04/10] lift the "already marked killed" case into shrink_dentry_list()
From: Al ViroIt can happen only when dentry_kill() is called with unlock_on_failure equal to 0 - other callers had dentry pinned until the moment they've got ->d_lock and DCACHE_DENTRY_KILLED is set only after lockref_mark_dead(). IOW, only one of three call sites of dentry_kill() might end up reaching that code. Just move it there. Signed-off-by: Al Viro (cherry picked from commit 64fd72e0a44bdd62c5ca277cb24d0d02b2d8e9dc) Signed-off-by: Vladimir Davydov Conflicts: fs/dcache.c --- fs/dcache.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index c584d23a194a..37420f461ca2 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -495,14 +495,6 @@ dentry_kill(struct dentry *dentry, int unlock_on_failure) struct inode *inode; struct dentry *parent; - if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) { - bool can_free = dentry->d_flags & DCACHE_MAY_FREE; - spin_unlock(>d_lock); - if (likely(can_free)) - dentry_free(dentry); - return NULL; - } - inode = dentry->d_inode; if (inode && !spin_trylock(>i_lock)) { relock: @@ -845,6 +837,15 @@ static void shrink_dentry_list(struct list_head *list) continue; } + + if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) { + bool can_free = dentry->d_flags & DCACHE_MAY_FREE; + spin_unlock(>d_lock); + if (can_free) + dentry_free(dentry); + continue; + } + parent = dentry_kill(dentry, 0); /* * If dentry_kill returns NULL, we have nothing more to do. -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 07/10] shrink_dentry_list(): take parent's ->d_lock earlier
From: Al Viro

The cause of livelocks there is that we are taking ->d_lock on dentry and its parent in the wrong order, forcing us to use trylock on the parent's one. d_walk() takes them in the right order, and unfortunately it's not hard to create a situation when shrink_dentry_list() can't make progress since trylock keeps failing, and shrink_dcache_parent() or check_submounts_and_drop() keeps calling d_walk() disrupting the very shrink_dentry_list() it's waiting for.

Solution is straightforward - if that trylock fails, let's unlock the dentry itself and take locks in the right order. We need to stabilize ->d_parent without holding ->d_lock, but that's doable using RCU. And we'd better do that in the very beginning of the loop in shrink_dentry_list(), since the checks on refcount, etc. would need to be redone anyway.

That deals with a half of the problem - killing dentries on the shrink list itself. Another one (dropping their parents) is in the next commit.

Locking the parent is interesting - it would be easy to do rcu_read_lock(), lock whatever we think is a parent, lock dentry itself and check if the parent is still the right one. Except that we need to check that *before* locking the dentry, or we are risking taking ->d_lock out of order. Fortunately, once D1 is locked, we can check if D2->d_parent is equal to D1 without the need to lock D2; D2->d_parent can start or stop pointing to D1 only under D1->d_lock, so taking D1->d_lock is enough. In other words, the right solution is: rcu_read_lock / lock what looks like the parent right now / check if it's still our parent / rcu_read_unlock / lock the child.
Signed-off-by: Al Viro (cherry picked from commit 046b961b45f93a92e4c70525a12f3d378bced130) Signed-off-by: Vladimir Davydov --- fs/dcache.c | 53 + 1 file changed, 41 insertions(+), 12 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 2e4a7192a877..a287c5a63cde 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -559,6 +559,38 @@ failed: return dentry; /* try again with same dentry */ } +static inline struct dentry *lock_parent(struct dentry *dentry) +{ + struct dentry *parent = dentry->d_parent; + if (IS_ROOT(dentry)) + return NULL; + if (likely(spin_trylock(>d_lock))) + return parent; + spin_unlock(>d_lock); + rcu_read_lock(); +again: + parent = ACCESS_ONCE(dentry->d_parent); + spin_lock(>d_lock); + /* +* We can't blindly lock dentry until we are sure +* that we won't violate the locking order. +* Any changes of dentry->d_parent must have +* been done with parent->d_lock held, so +* spin_lock() above is enough of a barrier +* for checking if it's still our child. +*/ + if (unlikely(parent != dentry->d_parent)) { + spin_unlock(>d_lock); + goto again; + } + rcu_read_unlock(); + if (parent != dentry) + spin_lock(>d_lock); + else + parent = NULL; + return parent; +} + /* * This is dput * @@ -832,6 +864,8 @@ static void shrink_dentry_list(struct list_head *list) struct inode *inode; dentry = list_entry(list->prev, struct dentry, d_lru); spin_lock(>d_lock); + parent = lock_parent(dentry); + /* * The dispose list is isolated and dentries are not accounted * to the LRU here, so we can simply remove it from the list @@ -845,6 +879,8 @@ static void shrink_dentry_list(struct list_head *list) */ if (dentry->d_lockref.count) { spin_unlock(>d_lock); + if (parent) + spin_unlock(>d_lock); continue; } @@ -852,6 +888,8 @@ static void shrink_dentry_list(struct list_head *list) if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) { bool can_free = dentry->d_flags & DCACHE_MAY_FREE; spin_unlock(>d_lock); + if (parent) + spin_unlock(>d_lock); if (can_free) dentry_free(dentry); 
continue; @@ -861,22 +899,13 @@ static void shrink_dentry_list(struct list_head *list) if (inode && unlikely(!spin_trylock(>i_lock))) { d_shrink_add(dentry, list); spin_unlock(>d_lock); + if (parent) + spin_unlock(>d_lock); continue; } - parent = NULL; - if (!IS_ROOT(dentry)) { - parent = dentry->d_parent; -
[Devel] [PATCH 09/10] dentry_kill() doesn't need the second argument now
From: Al Viro

it's 1 in the only remaining caller.

Signed-off-by: Al Viro
(cherry picked from commit 8cbf74da435d1bd13dbb790f94c7ff67b2fb6af4)
Signed-off-by: Vladimir Davydov
---
 fs/dcache.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 7b75b482bec2..ab6859764a6a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -529,8 +529,7 @@ static void __dentry_kill(struct dentry *dentry)
  * If ref is non-zero, then decrement the refcount too.
  * Returns dentry requiring refcount drop, or NULL if we're done.
  */
-static struct dentry *
-dentry_kill(struct dentry *dentry, int unlock_on_failure)
+static struct dentry *dentry_kill(struct dentry *dentry)
 	__releases(dentry->d_lock)
 {
 	struct inode *inode = dentry->d_inode;
@@ -552,10 +551,8 @@ dentry_kill(struct dentry *dentry, int unlock_on_failure)
 	return parent;
 
 failed:
-	if (unlock_on_failure) {
-		spin_unlock(&dentry->d_lock);
-		cpu_relax();
-	}
+	spin_unlock(&dentry->d_lock);
+	cpu_relax();
 	return dentry;	/* try again with same dentry */
 }
 
@@ -643,7 +640,7 @@ repeat:
 		return;
 
 kill_it:
-	dentry = dentry_kill(dentry, 1);
+	dentry = dentry_kill(dentry);
 	if (dentry)
 		goto repeat;
 }
-- 
2.1.4
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 01/10] fold d_kill()
This partially backports commit 03b3b889e79c ("fold d_kill() and d_free()"). I can't backport the whole commit, because d_free() is still used elsewhere. Signed-off-by: Vladimir Davydov--- fs/dcache.c | 50 +++--- 1 file changed, 15 insertions(+), 35 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 0c2826d94c26..bfc284d797d1 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -429,40 +429,6 @@ static void dentry_lru_add(struct dentry *dentry) d_lru_add(dentry); } -/** - * d_kill - kill dentry and return parent - * @dentry: dentry to kill - * @parent: parent dentry - * - * The dentry must already be unhashed and removed from the LRU. - * - * If this is the root of the dentry tree, return NULL. - * - * dentry->d_lock and parent->d_lock must be held by caller, and are dropped by - * d_kill. - */ -static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent) - __releases(dentry->d_lock) - __releases(parent->d_lock) - __releases(dentry->d_inode->i_lock) -{ - __list_del_entry(>d_u.d_child); - /* -* Inform d_walk() that we are no longer attached to the -* dentry tree -*/ - dentry->d_flags |= DCACHE_DENTRY_KILLED; - if (parent) - spin_unlock(>d_lock); - dentry_iput(dentry); - /* -* dentry_iput drops the locks, at which point nobody (except -* transient RCU lookups) can reach this dentry. -*/ - d_free(dentry); - return parent; -} - /* * Unhash a dentry without inserting an RCU walk barrier or checking that * dentry->d_lock is locked. The caller must take care of that, if @@ -574,7 +540,21 @@ relock: } /* if it was on the hash then remove it */ __d_drop(dentry); - return d_kill(dentry, parent); + __list_del_entry(>d_u.d_child); + /* +* Inform d_walk() that we are no longer attached to the +* dentry tree +*/ + dentry->d_flags |= DCACHE_DENTRY_KILLED; + if (parent) + spin_unlock(>d_lock); + dentry_iput(dentry); + /* +* dentry_iput drops the locks, at which point nobody (except +* transient RCU lookups) can reach this dentry. 
+*/ + d_free(dentry); + return parent; } /* -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] mm: memcontrol: drop useless VM_BUG_ON's in memcg_{start, stop}_kmem_account
No point in them, really - if these are called from a kernel thread or by a dying process nothing bad is going to happen. Quite the contrary, having them will result in KP if any of those is called after exit_mm. An example goes below: kernel BUG at mm/memcontrol.c:3358! invalid opcode: [#1] SMP CPU: 13 PID: 35647 Comm: mailsrvanalog ve: 1000 Not tainted 3.10.0-327.10.1.ovz.12.17+ #863 12.17 task: 8800a0188cb0 ti: 88005634c000 task.ti: 88005634c000 RIP: 0010:[] [] memcg_stop_kmem_account+0x25/0x30 RSP: 0018:88005634fd00 EFLAGS: 00010246 RAX: 8800a0188cb0 RBX: 880118b580b0 RCX: 88005634fd39 RDX: 0033 RSI: 88005634fd36 RDI: 0035 RBP: 88005634fd00 R08: 880118b58000 R09: bed7ff49 R10: 7c640f5b R11: b6f8c81e R12: 880036aa4000 R13: 88005634fdc4 R14: 880118b58130 R15: 88009f440020 FS: () GS:88014b14() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7fcbdb0977c0 CR3: 0194e000 CR4: 06e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Stack: 88005634fd80 a01ab5f6 880144cf1f20 8800364a38c0 2078696674736f50 737973206c69616d 336300800a6d6574 643332343339342f 3839336165356264 6236373636343735 3232626364373932 0033353639653465 Call Trace: [] ext4_open_pfcache+0xc6/0x1a0 [ext4] [] ext4_save_data_csum+0xc9/0x150 [ext4] [] ext4_commit_data_csum+0xac/0xd0 [ext4] [] ext4_release_file+0xdc/0xf0 [ext4] [] __fput+0xe1/0x250 [] fput+0xe/0x10 [] task_work_run+0xc4/0xe0 [] do_exit+0x2eb/0xb10 [] ? __do_page_fault+0x164/0x450 [] do_group_exit+0x3f/0xa0 [] SyS_exit_group+0x14/0x20 [] system_call_fastpath+0x16/0x1b Also, let's move memcg_{start,stop}_kmem_account to the header as they both are tiny. 
Fixes: 6812907ce7c3 ("pfcache: do not account peer files to memcg") Signed-off-by: Vladimir Davydov--- include/linux/memcontrol.h | 11 +-- mm/memcontrol.c| 13 - 2 files changed, 9 insertions(+), 15 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1c5f916054be..37ce43221f3c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -517,8 +517,15 @@ extern int memcg_nr_cache_ids; extern void memcg_get_cache_ids(void); extern void memcg_put_cache_ids(void); -extern void memcg_stop_kmem_account(void); -extern void memcg_resume_kmem_account(void); +static inline void memcg_stop_kmem_account(void) +{ + current->memcg_kmem_skip_account++; +} + +static inline void memcg_resume_kmem_account(void) +{ + current->memcg_kmem_skip_account--; +} /* * Helper macro to loop through all memcg-specific caches. Callers must still diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 89c9edfe6ce1..af2c14b88060 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3353,19 +3353,6 @@ static void memcg_free_cache_id(int id) * memcg_kmem_skip_account. So we enclose anything that might allocate memory * inside the following two functions. */ -void memcg_stop_kmem_account(void) -{ - VM_BUG_ON(!current->mm); - current->memcg_kmem_skip_account++; -} -EXPORT_SYMBOL(memcg_stop_kmem_account); - -void memcg_resume_kmem_account(void) -{ - VM_BUG_ON(!current->mm); - current->memcg_kmem_skip_account--; -} -EXPORT_SYMBOL(memcg_resume_kmem_account); struct memcg_kmem_cache_create_work { struct mem_cgroup *memcg; -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH RH7 1/2] fs/overlay: set flag FS_VIRTUALIZED
On 05/20/2016 01:45 PM, Vladimir Davydov wrote:
> On Fri, May 20, 2016 at 12:49:35PM +0300, Pavel Tikhomirov wrote:
>> I remember I tried to add "fs-overlay" to this list and docker failed
>> to start, as it first checks /proc/filesystems to know that overlay is
>> supported, but overlay is not in this file until the module is loaded.
>
> I see. But it tries modprobe first, right? Let's then allow modprobe to
> load modules that are on the whitelist. I don't think it would affect
> security, because these modules can be easily auto-loaded anyway.

That is reasonable, OK.

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
Re: [Devel] [PATCH RH7 1/2] fs/overlay: set flag FS_VIRTUALIZED
On 05/20/2016 11:59 AM, Vladimir Davydov wrote:
> On Thu, May 19, 2016 at 06:32:00PM +0300, Pavel Tikhomirov wrote:
>> this is a temporary decision to make docker in CT work with the
>> overlayfs storage driver; it can be unsafe to give access to the
>> fs-overlay module from a container.
>
> Why? (just curious)

As overlayfs works on the file/folder level and not on the block device
level (unlike dm-thin), it seems safe enough to give access to it in a CT.
For dm-thin, access from inside a CT to a block device and its raw data in
an image file was the stumbling block. But maybe the overlay driver needs
to be checked harder to really be safe. From docker.com: "As promising as
OverlayFS is, it is still relatively young. Therefore caution should be
taken before using it in production Docker environments."

>> *need to modprobe overlay module on host
>
> May be, we should add overlayfs module to the whitelist? (see
> ve0_allowed_mod)

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
Re: [Devel] [PATCH RH7 1/2] fs/overlay: set flag FS_VIRTUALIZED
On Fri, May 20, 2016 at 12:49:35PM +0300, Pavel Tikhomirov wrote:
> I remember I tried to add "fs-overlay" to these list and docker failed
> to start, as it first checks /proc/filesystems to know that overlay is
> supported, but overlay is not in these file until the module is
> loaded.

I see. But it tries modprobe first, right? Let's then allow modprobe to
load modules that are on the whitelist. I don't think it would affect
security, because these modules can be easily auto-loaded anyway.
Re: [Devel] [PATCH RH7 1/2] fs/overlay: set flag FS_VIRTUALIZED
On Fri, May 20, 2016 at 12:17:18PM +0300, Pavel Tikhomirov wrote:
> >> *need to modprobe overlay module on host
> >
> > May be, we should add overlayfs module to the whitelist? (see
> > ve0_allowed_mod)
>
> I thought that whitelist will work only for auto-loading iptables
> modules on special rules creation? But not for manual load:
>
> [root@pcs7 ~]# vzctl exec 109 modprobe binfmt_misc
> [root@pcs7 ~]# lsmod | grep binfmt_misc
> [root@pcs7 ~]#
>
> 3.10.0-327.18.2.ovz.14.4

AFAIK the fs module should be auto-loaded on mount - see get_fs_type().
Re: [Devel] [PATCH RH7 1/2] fs/overlay: set flag FS_VIRTUALIZED
Vladimir Davydov wrote:
> On Fri, May 20, 2016 at 12:17:18PM +0300, Pavel Tikhomirov wrote:
> > >> *need to modprobe overlay module on host
> > >
> > > May be, we should add overlayfs module to the whitelist? (see
> > > ve0_allowed_mod)
> >
> > I thought that whitelist will work only for auto-loading iptables
> > modules on special rules creation? But not for manual load:
> >
> > [root@pcs7 ~]# vzctl exec 109 modprobe binfmt_misc
> > [root@pcs7 ~]# lsmod | grep binfmt_misc
> > [root@pcs7 ~]#
> >
> > 3.10.0-327.18.2.ovz.14.4
>
> AFAIK the fs module should be auto-loaded on mount - see get_fs_type().

I remember I tried to add "fs-overlay" to this list and docker failed to
start, as it first checks /proc/filesystems to know that overlay is
supported, but overlay is not in this file until the module is loaded.

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
Re: [Devel] [PATCH RH7 1/2] fs/overlay: set flag FS_VIRTUALIZED
> > *need to modprobe overlay module on host
>
> May be, we should add overlayfs module to the whitelist? (see
> ve0_allowed_mod)

I thought that whitelist will work only for auto-loading iptables modules
on special rules creation? But not for manual load:

[root@pcs7 ~]# vzctl exec 109 modprobe binfmt_misc
[root@pcs7 ~]# lsmod | grep binfmt_misc
[root@pcs7 ~]#

3.10.0-327.18.2.ovz.14.4

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
Re: [Devel] [PATCH RH7 2/2] proc/cpuset: do not show cpuset in CT
On Thu, May 19, 2016 at 06:32:01PM +0300, Pavel Tikhomirov wrote:
> After commit da53619c5d49 ("ve/cpuset: revert changes allowing to
> attach to empty cpusets") one can not create a non-empty cpuset
> cgroup in CT. And docker, which tries to create a cgroup for every
> visible controller, creates a cpuset cgroup for the docker-ct and
> fails to add processes to it.
>
> Cgroup files cpuset.cpus are by design not valid to use in our CTs,
> as they pin processes in a cgroup to a defined range of processors,
> but we don't want processes in a container to be able to pin
> themselves to the cpus they want. We have another mechanism to
> restrict a CT's cpu usage - the cpu.nr_cpus cgroup file, which allows
> balancing containers between cpus. So we faked cpuset.cpus in CT so
> one can not really pin processes in CT. But that makes all cpuset
> cgroups non-initialized and we also can't attach processes to
> cgroups. Same is valid for cpuset.mems, except we do not have
> ~nr_mems.
>
> We can just hide the cpuset cgroup from /proc/self/cgroup and
> /proc/cgroups to protect it from being used in CT (and also not mount
> it in libvzctl, which seems to happen automatically).

Yeah, libvzctl is smart enough to check /proc/cgroups when adding cgroup
bindmounts, which happens after attaching to the ve cgroup, so it should
just work.

> Docker not seeing cpuset will almost silently skip it and work as
> usual.

I hope nobody but docker has intention to use cpuset inside container.

> *Another option is to make the fake cpuset.{cpus,mems} semi-fake,
> copying values from the root cgroup on write. But that can lock us
> from changing these files on the host root cgroup, as they are
> hierarchical.

Nah, that's bad. I bet we would end up reintroducing our ugly hacks
allowing to attach to empty cpusets then.

BTW, I think we can now revert commit 7c57b078d025b ("sched: Port
diff-fairsched-cpuset-add-fake-cpuset-for-containers"). Please do.
> https://jira.sw.ru/browse/PSBM-47280
>
> Signed-off-by: Pavel Tikhomirov
> ---
>  kernel/cgroup.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 5afeb59b..0a284f2 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -4953,6 +4953,11 @@ int proc_cgroup_show(struct seq_file *m, void *v)
>  		struct cgroup *cgrp;
>  		int count = 0;
>  
> +		if (!ve_is_super(get_exec_env()) &&
> +		    (root->subsys_mask & (1UL << cpuset_subsys_id)))
> +			/* do not show cpuset in CT */
> +			continue;
> +
>  		seq_printf(m, "%d:", root->hierarchy_id);
>  		for_each_subsys(root, ss)
>  			seq_printf(m, "%s%s", count++ ? "," : "", ss->name);
> @@ -4997,6 +5002,10 @@ static int proc_cgroupstats_show(struct seq_file *m, void *v)
>  
>  		if (ss == NULL)
>  			continue;
> +		if (!ve_is_super(get_exec_env()) &&
> +		    (ss->root->subsys_mask & (1UL << cpuset_subsys_id)))
> +			/* do not show cpuset in CT */
> +			continue;

This check is better to be in a function. More flexible that way - you
can extend the black list by patching just one place instead of two then.

>  		num = _cg_virtualized(ss->root->number_of_cgroups);
>  		seq_printf(m, "%s\t%d\t%d\t%d\n",
>  			   ss->name, ss->root->hierarchy_id,
Re: [Devel] [PATCH RH7 1/2] fs/overlay: set flag FS_VIRTUALIZED
On Thu, May 19, 2016 at 06:32:00PM +0300, Pavel Tikhomirov wrote:
> these is temporary decision to make docker in CT work with overlayfs
> storage driver, it can be unsafe to give access to fs-overlay module
> from container.

Why? (just curious)

> *need to modprobe overlay module on host

May be, we should add overlayfs module to the whitelist? (see
ve0_allowed_mod)
[Devel] [PATCH rh7] ms/net: sock: move ->sk_shutdown out of bitfields.
->sk_shutdown bits share one bitfield with some other bits in sock struct,
such as ->sk_no_check_[r,t]x, ->sk_userlocks ...

sock_setsockopt() may write to these bits, while holding the socket lock.
In case of AF_UNIX sockets, we change ->sk_shutdown bits while holding
only unix_state_lock(). So concurrent setsockopt() and shutdown() may
lead to corrupting these bits.

Fix this by moving ->sk_shutdown bits out of bitfield into a separate
byte. This will not change the 'struct sock' size since ->sk_shutdown
moved into previously unused 16-bit hole.

https://jira.sw.ru/browse/PSBM-47023
Link: http://lkml.kernel.org/g/1463588367-16310-1-git-send-email-aryabi...@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Suggested-by: Hannes Frederic Sowa
Acked-by: Eric Dumazet
---
 include/net/sock.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b940aa5..ccaacb6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -370,8 +370,13 @@ struct sock {
 	atomic_t		sk_omem_alloc;
 	int			sk_sndbuf;
 	struct sk_buff_head	sk_write_queue;
+
+	/*
+	 * Because of non atomicity rules, all
+	 * changes are protected by socket lock.
+	 */
 	kmemcheck_bitfield_begin(flags);
-	unsigned int		sk_shutdown  : 2,
+	unsigned int		sk_padding   : 2,
 #ifdef __GENKSYMS__
 				sk_no_check  : 2,
 #else
@@ -382,6 +387,7 @@ struct sock {
 				sk_protocol  : 8,
 				sk_type      : 16;
 	kmemcheck_bitfield_end(flags);
+
 	int			sk_wmem_queued;
 	gfp_t			sk_allocation;
 	u32			sk_pacing_rate;	/* bytes per second */
@@ -390,6 +396,7 @@ struct sock {
 	int			sk_gso_type;
 	unsigned int		sk_gso_max_size;
 	u16			sk_gso_max_segs;
+	u8			sk_shutdown;
 	int			sk_rcvlowat;
 	unsigned long		sk_lingertime;
 	struct sk_buff_head	sk_error_queue;
-- 
2.7.3

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel