Re: [Devel] [PATCH rhel7] procfs: always expose /proc/<pid>/map_files/ and make it readable

2016-05-20 Thread Andrey Vagin
On Fri, May 20, 2016 at 05:19:27PM +0300, Konstantin Khorenko wrote:
> This patch is a step for CRIU to work under a non-root user.

It allows us to restore memory mappings without using memfd in userns

> The idea is good, but there is a long, long way ahead, and it may finally
> become possible only after the next "Virtuozzo 8" kernel appears.
> 
> => let's not include preparation patches until the feature of running CRIU
> under a non-root user works on the mainstream kernel at least.
> 
> Surely this holds only as long as such patches are not required for some
> other reasons as well.
> 
> --
> Best regards,
> 
> Konstantin Khorenko,
> Virtuozzo Linux Kernel Team
> 
> On 05/16/2016 10:42 PM, Andrey Vagin wrote:
> > Acked-by: Andrey Vagin 
> > 
> > On Mon, May 16, 2016 at 11:28:51AM +0300, Cyrill Gorcunov wrote:
> > > This is a backport of commit
> > > 
> > > ML: bdb4d100afe9818aebd1d98ced575c5ef143456c
> > > 
> > > From: Calvin Owens 
> > > 
> > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
> > > only exposed if CONFIG_CHECKPOINT_RESTORE is set.
> > > 
> > > Each mapped file region gets a symlink in /proc/<pid>/map_files/
> > > corresponding to the virtual address range at which it is mapped.  The
> > > symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them
> > > to the backing file even if that backing file has been unlinked.
> > > 
> > > Currently, files which are mapped, unlinked, and closed are impossible to
> > > stat() from userspace.  Exposing /proc/<pid>/map_files/ closes this
> > > functionality "hole".
> > > 
> > > Not being able to stat() such files makes noticing and explicitly
> > > accounting for the space they use on the filesystem impossible.  You can
> > > work around this by summing up the space used by every file in the
> > > filesystem and subtracting that total from what statfs() tells you, but
> > > that obviously isn't great, and it becomes unworkable once your filesystem
> > > becomes large enough.
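The map_files entry names themselves encode the mapped range as two hex numbers, "start-end", which is what the kernel's dname_to_vma_addr() parses back out. A minimal userspace sketch of that parsing (the function name is ours; the kernel uses sscanf in essentially the same way):

```c
#include <assert.h>
#include <stdio.h>

/* Userspace sketch of the kernel's dname_to_vma_addr(): a map_files
 * entry name like "400000-452000" encodes the vma range [start, end). */
static int parse_map_files_name(const char *name,
                                unsigned long *start, unsigned long *end)
{
	char tail;

	/* Exactly two hex numbers separated by '-', nothing trailing. */
	if (sscanf(name, "%lx-%lx%c", start, end, &tail) != 2)
		return -1;
	return 0;
}
```

Given a readable /proc/<pid>/map_files/, readdir() plus this parse is enough to enumerate every mapped range and stat() the entries, which is the "hole" the patch closes.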
> > > 
> > > This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
> > > adjusts the permissions enforced on it as follows:
> > > 
> > > * proc_map_files_lookup()
> > > * proc_map_files_readdir()
> > > * map_files_d_revalidate()
> > > 
> > >   Remove the CAP_SYS_ADMIN restriction, leaving only the current
> > >   restriction requiring PTRACE_MODE_READ. The information made
> > >   available to userspace by these three functions is already
> > >   available in /proc/PID/maps with MODE_READ, so I don't see any
> > >   reason to limit them any further (see below for more detail).
> > > 
> > > * proc_map_files_follow_link()
> > > 
> > >   This stub has been added, and requires that the user have
> > >   CAP_SYS_ADMIN in order to follow the links in map_files/,
> > >   since there was concern on LKML both about the potential for
> > >   bypassing permissions on ancestor directories in the path to
> > >   files pointed to, and about what happens with more exotic
> > >   memory mappings created by some drivers (ie dma-buf).
> > > 
> > > In older versions of this patch, I changed every permission check in
> > > the four functions above to enforce MODE_ATTACH instead of MODE_READ.
> > > This was an oversight on my part, and after revisiting the discussion
> > > it seems that nobody was concerned about anything outside of what is
> > > made possible by ->follow_link(). So in this version, I've left the
> > > checks for PTRACE_MODE_READ as-is.
> > > 
> > > [a...@linux-foundation.org: catch up with concurrent 
> > > proc_pid_follow_link() changes]
> > > Signed-off-by: Calvin Owens 
> > > Reviewed-by: Kees Cook 
> > > Cc: Andy Lutomirski 
> > > Cc: Cyrill Gorcunov 
> > > Cc: Joe Perches 
> > > Cc: Kirill A. Shutemov 
> > > Signed-off-by: Andrew Morton 
> > > Signed-off-by: Linus Torvalds 
> > > Signed-off-by: Cyrill Gorcunov 
> > > ---
> > > 
> > > Kostya, please wait for an Ack from Andrew. The patch on its own is not
> > > tied to any of the bugs we're working on now but is useful in general,
> > > and will probably help us with renaming of memfd-restored memory
> > > in criu (we use memfd to be able to restore anonymous shared memory
> > > in the userns case, but memfd mangles the backend name; we haven't found
> > > any problem with it yet, but I've been talking to Andrew and he agreed
> > > that we might need to do something about this, and this patch
> > > is a first step).
> > > 
> > >   fs/proc/base.c |   44 +++-
> > >   1 file changed, 23 insertions(+), 21 deletions(-)
> > > 
> > > Index: linux-pcs7.git/fs/proc/base.c
> > > ===
> > > --- linux-pcs7.git.orig/fs/proc/base.c
> > > +++ linux-pcs7.git/fs/proc/base.c
> > > @@ -1925,8 

[Devel] [PATCH rh7] kmod: Allow netfilter conntrack inside VE

2016-05-20 Thread Cyrill Gorcunov
The netfilter conntrack modules are used during checkpoint (which
is done on the node), so they get autoloaded; but in case of
migration the restore starts inside veX, so we need to allow
conntrack to be requested from the VE context. Thus add the
modules to the whitelist.
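The whitelist entries are the module alias strings the kernel passes to request_module() when a netlink socket or nfnetlink subsystem needs a module; the numbers come from PF_NETLINK (16), NETLINK_NETFILTER (12) and the NFNL_SUBSYS_CTNETLINK{,_EXP} ids (1, 2). A sketch of how those alias strings are formed (constants written out by hand here rather than pulled from the uapi headers):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Values as defined in the uapi headers, spelled out for the sketch. */
enum {
	PF_NETLINK_NUM            = 16, /* PF_NETLINK */
	NETLINK_NETFILTER_NUM     = 12, /* NETLINK_NETFILTER */
	NFNL_SUBSYS_CTNETLINK_ID     = 1,
	NFNL_SUBSYS_CTNETLINK_EXP_ID = 2,
};

/* Alias requested when socket(PF_NETLINK, ..., proto) needs a module. */
static void netlink_alias(int proto, char *buf, size_t len)
{
	snprintf(buf, len, "net-pf-%d-proto-%d", PF_NETLINK_NUM, proto);
}

/* Alias requested the first time an nfnetlink subsystem is used. */
static void nfnl_alias(int subsys, char *buf, size_t len)
{
	snprintf(buf, len, "nfnetlink-subsys-%d", subsys);
}
```

The two strings added by the patch, "nfnetlink-subsys-1" and "nfnetlink-subsys-2", fall out of the second helper.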

They were initially missed in ebc70d73717f592c89ad992f77587d9e118bbee6.

https://jira.sw.ru/browse/PSBM-47359

CC: Vladimir Davydov 
CC: Konstantin Khorenko 
CC: Andrey Vagin 
CC: Pavel Emelyanov 
Signed-off-by: Cyrill Gorcunov 
---
 kernel/kmod.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-pcs7.git/kernel/kmod.c
===
--- linux-pcs7.git.orig/kernel/kmod.c
+++ linux-pcs7.git/kernel/kmod.c
@@ -392,6 +392,8 @@ static const char * const ve0_allowed_mo
 
/* nfnetlink  */
"net-pf-16-proto-12",   /* PF_NETLINK, NETLINK_NETFILTER */
+   "nfnetlink-subsys-1",   /* NFNL_SUBSYS_CTNETLINK */
+   "nfnetlink-subsys-2",   /* NFNL_SUBSYS_CTNETLINK_EXP */
 
/* unix_diag */
"net-pf-16-proto-4-type-1", /* PF_NETLINK, NETLINK_SOCK_DIAG, AF_LOCAL */
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] ploop: avoid costly user-controllable kmalloc() calls

2016-05-20 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.5
-->
commit 1f61933478251e603117125316771469f9284788
Author: Andrey Ryabinin 
Date:   Fri May 20 18:27:26 2016 +0400

ploop: avoid costly user-controllable kmalloc() calls

Certain ploop ioctls copy large arrays from user space into
kernel memory allocated via a kmalloc() call.
A user-controllable kmalloc() size is bad as it could taint the kernel
(if size > MAX_ORDER). Also, allocation of large physically contiguous
memory is slow and unreliable in low-memory situations.

So instead of preallocating memory and copying a big chunk of user memory
into the kernel, we copy only the one element that we need at the moment.
Such an approach is much more reliable, may save us several MBs of kernel
memory, and might even be faster.
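The pattern can be sketched in userspace, with memcpy() standing in for copy_from_user() and hypothetical extent/handler names; the point is that only sizeof(extent) bytes are buffered at a time, no matter how large n_extents is:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for ploop_freeblks_ctl_extent. */
struct ext { unsigned int clu, iblk, len; };

/* memcpy() stands in for copy_from_user() in this userspace sketch. */
static int fetch_ext(struct ext *dst, const struct ext *src_arr, size_t i)
{
	memcpy(dst, &src_arr[i], sizeof(*dst));
	return 0;
}

/* Process n extents one at a time: O(1) buffering on the kernel side
 * instead of one kmalloc(sizeof(*extents) * n) whose size the user picks. */
static long sum_extent_lens(const struct ext *user_arr, size_t n)
{
	long total = 0;

	for (size_t i = 0; i < n; i++) {
		struct ext e; /* one element on the stack */

		if (fetch_ext(&e, user_arr, i))
			return -1;
		total += e.len;
	}
	return total;
}
```

The trade-off the commit message names is visible here: one copy per element instead of one bulk copy, in exchange for bounded kernel memory.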

https://jira.sw.ru/browse/PSBM-47303

Signed-off-by: Andrey Ryabinin 
Acked-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c | 73 +--
 1 file changed, 32 insertions(+), 41 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 15e34e5..da35c5c 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4167,7 +4167,7 @@ static int ploop_freeblks_ioc(struct ploop_device *plo, unsigned long arg)
 {
struct ploop_delta *delta;
struct ploop_freeblks_ctl ctl;
-   struct ploop_freeblks_ctl_extent *extents;
+   struct ploop_freeblks_ctl_extent __user *extents;
struct ploop_freeblks_desc *fbd;
int i;
int rc = 0;
@@ -4184,34 +4184,35 @@ static int ploop_freeblks_ioc(struct ploop_device *plo, unsigned long arg)
if (copy_from_user(&ctl, (void*)arg, sizeof(ctl)))
return -EFAULT;
 
-   extents = kzalloc(sizeof(*extents) * ctl.n_extents, GFP_KERNEL);
-   if (!extents)
-   return -ENOMEM;
-
delta = ploop_top_delta(plo);
if (delta->level != ctl.level) {
rc = -EINVAL;
-   goto free_extents;
-   }
-
-   if (copy_from_user(extents, (u8*)arg + sizeof(ctl),
-  sizeof(*extents) * ctl.n_extents)) {
-   rc = -EINVAL;
-   goto free_extents;
+   goto exit;
}
 
fbd = ploop_fb_init(plo);
if (!fbd) {
rc = -ENOMEM;
-   goto free_extents;
+   goto exit;
}
 
+   extents = (void __user *)(arg + sizeof(ctl));
+
for (i = 0; i < ctl.n_extents; i++) {
-   rc = ploop_fb_add_free_extent(fbd, extents[i].clu,
- extents[i].iblk, extents[i].len);
+   struct ploop_freeblks_ctl_extent extent;
+
+   if (copy_from_user(&extent, &extents[i],
+   sizeof(extent))) {
+   rc = -EFAULT;
+   ploop_fb_fini(fbd, rc);
+   goto exit;
+   }
+
+   rc = ploop_fb_add_free_extent(fbd, extent.clu,
+   extent.iblk, extent.len);
if (rc) {
ploop_fb_fini(fbd, rc);
-   goto free_extents;
+   goto exit;
}
}
 
@@ -4232,8 +4233,7 @@ static int ploop_freeblks_ioc(struct ploop_device *plo, unsigned long arg)
}
 
ploop_relax(plo);
-free_extents:
-   kfree(extents);
+exit:
return rc;
 }
 
@@ -4324,13 +4324,8 @@ static void ploop_relocblks_process(struct ploop_device *plo)
spin_unlock_irq(&plo->lock);
 }
 
-static int release_fbd(struct ploop_device *plo,
-  struct ploop_relocblks_ctl_extent *e,
-  int err)
+static int release_fbd(struct ploop_device *plo, int err)
 {
-   if (e)
-   kfree(e);
-
clear_bit(PLOOP_S_DISCARD, &plo->state);
 
ploop_quiesce(plo);
@@ -4378,7 +4373,6 @@ static int ploop_relocblks_ioc(struct ploop_device *plo, unsigned long arg)
 {
struct ploop_delta *delta = ploop_top_delta(plo);
struct ploop_relocblks_ctl ctl;
-   struct ploop_relocblks_ctl_extent *extents = NULL;
struct ploop_freeblks_desc *fbd = plo->fbd;
int i;
int err = 0;
@@ -4406,26 +4400,23 @@ static int ploop_relocblks_ioc(struct ploop_device *plo, unsigned long arg)
goto already;
 
if (ctl.n_extents) {
-   extents = kzalloc(sizeof(*extents) * ctl.n_extents,
-   GFP_KERNEL);
-   if (!extents)
-   return release_fbd(plo, extents, -ENOMEM);
+   struct ploop_relocblks_ctl_extent __user *extents;
 
-   if (copy_from_user(extents, (u8*)arg + 

[Devel] [PATCH RHEL7 COMMIT] fs/aio: show real number of aio events in fs.aio-nr sysctl

2016-05-20 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.5
-->
commit 280363c520967aab9be356c0e6b37ca02109ebea
Author: Andrey Ryabinin 
Date:   Fri May 20 19:17:05 2016 +0400

fs/aio: show real number of aio events in fs.aio-nr sysctl

fs.aio-nr accounts the number of aio events requested by the user via the
io_setup() syscall. The kernel usually creates more events than were requested.
CRIU doesn't care about the number of requested events; it cares only
about created events. So while restoring the process, CRIU requests
in io_setup() the number of actually created events. This leads
to an inconsistent value of fs.aio-nr after the restore.

Let's show in fs.aio-nr the number of created events, not requested ones.
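The check guarding the counter does two things at once: it enforces the fs.aio-max-nr limit and it rejects unsigned wrap-around, which the second comparison catches. A standalone sketch of that check (types assumed to be unsigned long, as in the quoted hunk):

```c
#include <assert.h>
#include <limits.h>

/* Mirrors the ioctx_alloc() check: reject the request if adding nr
 * events would exceed the limit or wrap the unsigned counter. */
static int aio_nr_exceeds(unsigned long aio_nr, unsigned long nr,
                          unsigned long aio_max_nr)
{
	return aio_nr + nr > aio_max_nr || aio_nr + nr < aio_nr;
}
```

The `aio_nr + nr < aio_nr` half is only ever true when the addition overflowed, so the limit check stays correct even near ULONG_MAX.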

https://jira.sw.ru/browse/PSBM-47209

Signed-off-by: Andrey Ryabinin 
Acked-by: Kirill Tkhai 
---
 fs/aio.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index ee72178..0f00aa0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -636,12 +636,12 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 
/* limit the number of system wide aios */
spin_lock(&ve->aio_nr_lock);
-   if (ve->aio_nr + nr_events > ve->aio_max_nr ||
-   ve->aio_nr + nr_events < ve->aio_nr) {
+   if (ve->aio_nr + ctx->nr_events > ve->aio_max_nr ||
+   ve->aio_nr + ctx->nr_events < ve->aio_nr) {
spin_unlock(&ve->aio_nr_lock);
goto out_cleanup;
}
-   ve->aio_nr += ctx->max_reqs;
+   ve->aio_nr += ctx->nr_events;
spin_unlock(&ve->aio_nr_lock);
 
/* now link into global list. */


[Devel] [PATCH RHEL7 COMMIT] ms/net: sock: move ->sk_shutdown out of bitfields.

2016-05-20 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.5
-->
commit 0e9bd23e9447385f71ed9b8e0ad7bbe0e026c068
Author: Andrey Ryabinin 
Date:   Fri May 20 18:58:08 2016 +0400

ms/net: sock: move ->sk_shutdown out of bitfields.

->sk_shutdown bits share one bitfield with some other bits in sock struct,
such as ->sk_no_check_[r,t]x, ->sk_userlocks ...
sock_setsockopt() may write to these bits, while holding the socket lock.

In case of AF_UNIX sockets, we change ->sk_shutdown bits while holding only
unix_state_lock(). So concurrent setsockopt() and shutdown() may lead
to corrupting these bits.

Fix this by moving ->sk_shutdown bits out of bitfield into a separate byte.
This will not change the 'struct sock' size since ->sk_shutdown moved into
previously unused 16-bit hole.
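The size claim can be checked with a userspace model of the layout: moving a 2-bit field out of the shared bitfield word into a u8 placed in the 2-byte hole after a u16 leaves sizeof() unchanged. Field names are shortened and this models only the relevant slice of the layout, not the real struct sock:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the old layout: sk_shutdown shares one word with other bits,
 * so writing it is a read-modify-write on the whole word. */
struct sock_old_model {
	unsigned int shutdown : 2,   /* shared word: racy vs setsockopt() */
	             no_check : 2,
	             userlocks : 4,
	             protocol : 8,
	             type : 16;
	uint16_t gso_max_segs;
	/* 2-byte hole here */
	int rcvlowat;
};

/* Model of the new layout: the 2 bits become padding, and shutdown
 * moves into the previously unused hole as its own byte. */
struct sock_new_model {
	unsigned int padding : 2,
	             no_check : 2,
	             userlocks : 4,
	             protocol : 8,
	             type : 16;
	uint16_t gso_max_segs;
	uint8_t  shutdown;           /* own byte: safe under its own lock */
	/* 1-byte hole here */
	int rcvlowat;
};
```

Because the new byte lands in padding that the compiler was already emitting, the structure size (and hence the kABI footprint the __GENKSYMS__ guard cares about) does not change.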

https://jira.sw.ru/browse/PSBM-47023

Link: 
http://lkml.kernel.org/g/1463588367-16310-1-git-send-email-aryabi...@virtuozzo.com
Signed-off-by: Andrey Ryabinin 

Suggested-by: Hannes Frederic Sowa 
Acked-by: Eric Dumazet 
---
 include/net/sock.h | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b940aa5..ccaacb6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -370,8 +370,13 @@ struct sock {
atomic_tsk_omem_alloc;
int sk_sndbuf;
struct sk_buff_head sk_write_queue;
+
+   /*
+* Because of non atomicity rules, all
+* changes are protected by socket lock.
+*/
kmemcheck_bitfield_begin(flags);
-   unsigned intsk_shutdown  : 2,
+   unsigned intsk_padding  : 2,
 #ifdef __GENKSYMS__
sk_no_check : 2,
 #else
@@ -382,6 +387,7 @@ struct sock {
sk_protocol  : 8,
sk_type  : 16;
kmemcheck_bitfield_end(flags);
+
int sk_wmem_queued;
gfp_t   sk_allocation;
u32 sk_pacing_rate; /* bytes per second */
@@ -390,6 +396,7 @@ struct sock {
int sk_gso_type;
unsigned intsk_gso_max_size;
u16 sk_gso_max_segs;
+   u8  sk_shutdown;
int sk_rcvlowat;
unsigned long   sk_lingertime;
struct sk_buff_head sk_error_queue;


Re: [Devel] [PATCH rh7] fs/aio: show real number of aio events in fs.aio-nr sysctl

2016-05-20 Thread Kirill Tkhai


On 20.05.2016 16:30, Andrey Ryabinin wrote:
> fs.aio-nr accounts the number of aio events requested by the user via the
> io_setup() syscall. The kernel usually creates more events than were requested.
> CRIU doesn't care about the number of requested events; it cares only
> about created events. So while restoring the process, CRIU requests
> in io_setup() the number of actually created events. This leads
> to an inconsistent value of fs.aio-nr after the restore.
> 
> Let's account in fs.aio-nr the number of created events, not requested ones.
> 
> https://jira.sw.ru/browse/PSBM-47209
> 
> Signed-off-by: Andrey Ryabinin 

Acked-by: Kirill Tkhai 

> ---
>  fs/aio.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 29d02ab..19ebcd1 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -633,12 +633,12 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
>  
>   /* limit the number of system wide aios */
>   spin_lock(&ve->aio_nr_lock);
> - if (ve->aio_nr + nr_events > ve->aio_max_nr ||
> - ve->aio_nr + nr_events < ve->aio_nr) {
> + if (ve->aio_nr + ctx->nr_events > ve->aio_max_nr ||
> + ve->aio_nr + ctx->nr_events < ve->aio_nr) {
>   spin_unlock(&ve->aio_nr_lock);
>   goto out_cleanup;
>   }
> - ve->aio_nr += ctx->max_reqs;
> + ve->aio_nr += ctx->nr_events;
>   spin_unlock(&ve->aio_nr_lock);
>  
>   /* now link into global list. */
> 


Re: [Devel] [PATCH rhel7] procfs: always expose /proc/<pid>/map_files/ and make it readable

2016-05-20 Thread Konstantin Khorenko

This patch is a step for CRIU to work under a non-root user.
The idea is good, but there is a long, long way ahead, and it may finally become
possible only after the next "Virtuozzo 8" kernel appears.

=> let's not include preparation patches until the feature of running CRIU
under a non-root user works on the mainstream kernel at least.

Surely this holds only as long as such patches are not required for some
other reasons as well.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 05/16/2016 10:42 PM, Andrey Vagin wrote:

Acked-by: Andrey Vagin 

On Mon, May 16, 2016 at 11:28:51AM +0300, Cyrill Gorcunov wrote:

This is a backport of commit

ML: bdb4d100afe9818aebd1d98ced575c5ef143456c

From: Calvin Owens 

Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
only exposed if CONFIG_CHECKPOINT_RESTORE is set.

Each mapped file region gets a symlink in /proc/<pid>/map_files/
corresponding to the virtual address range at which it is mapped.  The
symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them
to the backing file even if that backing file has been unlinked.

Currently, files which are mapped, unlinked, and closed are impossible to
stat() from userspace.  Exposing /proc/<pid>/map_files/ closes this
functionality "hole".

Not being able to stat() such files makes noticing and explicitly
accounting for the space they use on the filesystem impossible.  You can
work around this by summing up the space used by every file in the
filesystem and subtracting that total from what statfs() tells you, but
that obviously isn't great, and it becomes unworkable once your filesystem
becomes large enough.

This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
adjusts the permissions enforced on it as follows:

* proc_map_files_lookup()
* proc_map_files_readdir()
* map_files_d_revalidate()

Remove the CAP_SYS_ADMIN restriction, leaving only the current
restriction requiring PTRACE_MODE_READ. The information made
available to userspace by these three functions is already
available in /proc/PID/maps with MODE_READ, so I don't see any
reason to limit them any further (see below for more detail).

* proc_map_files_follow_link()

This stub has been added, and requires that the user have
CAP_SYS_ADMIN in order to follow the links in map_files/,
since there was concern on LKML both about the potential for
bypassing permissions on ancestor directories in the path to
files pointed to, and about what happens with more exotic
memory mappings created by some drivers (ie dma-buf).

In older versions of this patch, I changed every permission check in
the four functions above to enforce MODE_ATTACH instead of MODE_READ.
This was an oversight on my part, and after revisiting the discussion
it seems that nobody was concerned about anything outside of what is
made possible by ->follow_link(). So in this version, I've left the
checks for PTRACE_MODE_READ as-is.

[a...@linux-foundation.org: catch up with concurrent proc_pid_follow_link() 
changes]
Signed-off-by: Calvin Owens 
Reviewed-by: Kees Cook 
Cc: Andy Lutomirski 
Cc: Cyrill Gorcunov 
Cc: Joe Perches 
Cc: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Cyrill Gorcunov 
---

Kostya, please wait for an Ack from Andrew. The patch on its own is not
tied to any of the bugs we're working on now but is useful in general,
and will probably help us with renaming of memfd-restored memory
in criu (we use memfd to be able to restore anonymous shared memory
in the userns case, but memfd mangles the backend name; we haven't found
any problem with it yet, but I've been talking to Andrew and he agreed
that we might need to do something about this, and this patch
is a first step).

  fs/proc/base.c |   44 +++-
  1 file changed, 23 insertions(+), 21 deletions(-)

Index: linux-pcs7.git/fs/proc/base.c
===
--- linux-pcs7.git.orig/fs/proc/base.c
+++ linux-pcs7.git/fs/proc/base.c
@@ -1925,8 +1925,6 @@ end_instantiate:
return filldir(dirent, name, len, filp->f_pos, ino, type);
  }

-#ifdef CONFIG_CHECKPOINT_RESTORE
-
  /*
   * dname_to_vma_addr - maps a dentry name into two unsigned longs
   * which represent vma start and end addresses.
@@ -1953,11 +1951,6 @@ static int map_files_d_revalidate(struct
if (flags & LOOKUP_RCU)
return -ECHILD;

-   if (!capable(CAP_SYS_ADMIN)) {
-   status = -EPERM;
-   goto out_notask;
-   }
-
inode = dentry->d_inode;
task = get_proc_task(inode);
if (!task)
@@ -2048,6 +2041,28 @@ struct 

Re: [Devel] [vzlin-dev] [PATCH rh7 4/4] ploop: get rid of direct calls to file->f_op->fsync()

2016-05-20 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> The patch hides file->f_op->fsync() in dio_sync. The only exception is
> dio_truncate, where "file" may come from a userspace fd.
>
> Signed-off-by: Maxim Patlasov 
Acked-by: dmonak...@openvz.org
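Routing every fsync through an io->ops->sync() hook is a standard ops-table indirection: callers stop caring which backend (direct io, kaio, ...) is underneath. A minimal sketch with hypothetical names, not ploop's real types; a real sync implementation would call something like vfs_fsync_range() instead of bumping a counter:

```c
#include <assert.h>
#include <stddef.h>

struct pio; /* forward declaration */

/* Hypothetical ops table: one sync entry point per io backend. */
struct pio_ops {
	int (*sync)(struct pio *io);
};

struct pio {
	const struct pio_ops *ops;
	int sync_calls; /* bookkeeping for the sketch only */
};

/* The direct-io backend's sync implementation. */
static int dio_sync_impl(struct pio *io)
{
	io->sync_calls++;
	return 0;
}

static const struct pio_ops dio_ops = { .sync = dio_sync_impl };

/* Callers no longer touch file->f_op->fsync() directly. */
static int pio_sync(struct pio *io)
{
	return io->ops->sync(io);
}
```

The payoff is exactly what the patch series exploits: changing how a backend syncs (prototype changes included) touches one ops implementation, not every call site.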
> ---
>  drivers/block/ploop/io_direct.c |   13 +
>  1 file changed, 5 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 1ff848c..8096110 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -405,8 +405,7 @@ try_again:
>   }
>  
>   /* flush new i_size to disk */
> - err = io->files.file->f_op->fsync(io->files.file, 0,
> -   LLONG_MAX, 0);
> + err = io->ops->sync(io);
>   if (err)
>   goto end_write;
>  
> @@ -524,8 +523,8 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
> FALLOC_FL_CONVERT_UNWRITTEN,
> (loff_t)sec << 9, clu_siz);
>   if (!err)
> - err = io->files.file->f_op->fsync(io->files.file, 0,
> -   LLONG_MAX, 0);
> + err = io->ops->sync(io);
> +
>   file_end_write(io->files.file);
>   if (err) {
>   PLOOP_REQ_SET_ERROR(preq, err);
> @@ -814,8 +813,7 @@ static int dio_fsync_thread(void * data)
>   /* filemap_fdatawrite() has been made already */
>   filemap_fdatawait(io->files.mapping);
>  
> - err = io->files.file->f_op->fsync(io->files.file, 0,
> -   LLONG_MAX, 0);
> + err = io->ops->sync(io);
>  
>   /* Do we need to invalidate page cache? Not really,
>* because we use it only to create full new pages,
> @@ -1367,8 +1365,7 @@ static int dio_alloc_sync(struct ploop_io * io, loff_t pos, loff_t len)
>   if (err)
>   goto fail;
>  
> - err = io->files.file->f_op->fsync(io->files.file, 0,
> -   LLONG_MAX, 0);
> + err = io->ops->sync(io);
>   if (err)
>   goto fail;
>  




[Devel] [PATCH RHEL7 COMMIT] ploop: push_backup: optimize ploop_pb_get_pending()

2016-05-20 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.5
-->
commit 6130b647da6a3f732194155a34a0a77632614f3d
Author: Maxim Patlasov 
Date:   Fri May 20 16:48:27 2016 +0400

ploop: push_backup: optimize ploop_pb_get_pending()

Pack as many adjacent preq-s to each extent as possible.

https://jira.sw.ru/browse/PSBM-45000
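Packing adjacent preq-s into one extent is run-length encoding of consecutive cluster numbers into (start, len) pairs. A sketch over a sorted array (the patch does the same walk on an rbtree with rb_next(); names here are ours):

```c
#include <assert.h>
#include <stddef.h>

struct extent { unsigned int clu, len; };

/* Run-length encode a sorted list of cluster numbers into extents.
 * Returns the number of extents written; out[] must have room for n. */
static size_t pack_extents(const unsigned int *clus, size_t n,
                           struct extent *out)
{
	size_t m = 0;

	for (size_t i = 0; i < n; i++) {
		if (m && out[m - 1].clu + out[m - 1].len == clus[i])
			out[m - 1].len++;   /* adjacent: extend current extent */
		else
			out[m++] = (struct extent){ clus[i], 1 };
	}
	return m;
}
```

For the push-backup path this means one userspace round trip can report a whole run of adjacent clusters instead of one cluster per call.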

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/push_backup.c | 46 +--
 1 file changed, 39 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ploop/push_backup.c b/drivers/block/ploop/push_backup.c
index 46ea2da..ba64468 100644
--- a/drivers/block/ploop/push_backup.c
+++ b/drivers/block/ploop/push_backup.c
@@ -351,7 +351,8 @@ static struct ploop_request *ploop_pb_get_req_from_tree(struct rb_root *tree,
 }
 
 static struct ploop_request *
-ploop_pb_get_first_req_from_tree(struct rb_root *tree)
+ploop_pb_get_first_req_from_tree(struct rb_root *tree,
+struct ploop_request **npreq)
 {
static struct ploop_request *p;
struct rb_node *n = rb_first(tree);
@@ -359,6 +360,15 @@ ploop_pb_get_first_req_from_tree(struct rb_root *tree)
if (!n)
return NULL;
 
+   if (npreq) {
+   struct rb_node *next = rb_next(n);
+   if (next)
+   *npreq = rb_entry(next, struct ploop_request,
+ reloc_link);
+   else
+   *npreq = NULL;
+   }
+
p = rb_entry(n, struct ploop_request, reloc_link);
rb_erase(&p->reloc_link, tree);
return p;
@@ -367,13 +377,20 @@ ploop_pb_get_first_req_from_tree(struct rb_root *tree)
 static struct ploop_request *
 ploop_pb_get_first_req_from_pending(struct ploop_pushbackup_desc *pbd)
 {
-   return ploop_pb_get_first_req_from_tree(&pbd->pending_tree);
+   return ploop_pb_get_first_req_from_tree(&pbd->pending_tree, NULL);
+}
+
+static struct ploop_request *
+ploop_pb_get_first_reqs_from_pending(struct ploop_pushbackup_desc *pbd,
+struct ploop_request **npreq)
+{
+   return ploop_pb_get_first_req_from_tree(&pbd->pending_tree, npreq);
 }
 
 static struct ploop_request *
 ploop_pb_get_first_req_from_reported(struct ploop_pushbackup_desc *pbd)
 {
-   return ploop_pb_get_first_req_from_tree(&pbd->reported_tree);
+   return ploop_pb_get_first_req_from_tree(&pbd->reported_tree, NULL);
 }
 
 static struct ploop_request *
@@ -463,13 +480,12 @@ int ploop_pb_get_pending(struct ploop_pushbackup_desc *pbd,
 cluster_t *clu_p, cluster_t *len_p, unsigned n_done)
 {
bool blocking  = !n_done;
-   struct ploop_request *preq;
+   struct ploop_request *preq, *npreq;
int err = 0;
 
spin_lock(&pbd->ppb_lock);
 
-   /* OPTIMIZE ME LATER: rb_first() once, then rb_next() */
-   preq = ploop_pb_get_first_req_from_pending(pbd);
+   preq = ploop_pb_get_first_reqs_from_pending(pbd, &npreq);
if (!preq) {
struct ploop_device *plo = pbd->plo;
 
@@ -498,7 +514,7 @@ int ploop_pb_get_pending(struct ploop_pushbackup_desc *pbd,
pbd->ppb_waiting = false;
init_completion(&pbd->ppb_comp);
 
-   preq = ploop_pb_get_first_req_from_pending(pbd);
+   preq = ploop_pb_get_first_reqs_from_pending(pbd, &npreq);
if (!preq) {
if (!test_bit(PLOOP_S_PUSH_BACKUP, >state))
err = -EINTR;
@@ -515,6 +531,22 @@ int ploop_pb_get_pending(struct ploop_pushbackup_desc *pbd,
*clu_p = preq->req_cluster;
*len_p = 1;
 
+   while (npreq && npreq->req_cluster == *clu_p + *len_p) {
+   struct rb_node *next = rb_next(&npreq->reloc_link);
+
+   preq = npreq;
+   if (next)
+   npreq = rb_entry(next, struct ploop_request,
+reloc_link);
+   else
+   npreq = NULL;
+
+   rb_erase(&preq->reloc_link, &pbd->pending_tree);
+   ploop_pb_add_req_to_reported(pbd, preq);
+
+   (*len_p)++;
+   }
+
 get_pending_unlock:
spin_unlock(&pbd->ppb_lock);
return err;


[Devel] [PATCH rh7] fs/aio: show real number of aio events in fs.aio-nr sysctl

2016-05-20 Thread Andrey Ryabinin
fs.aio-nr accounts the number of aio events requested by the user via the
io_setup() syscall. The kernel usually creates more events than were requested.
CRIU doesn't care about the number of requested events; it cares only
about created events. So while restoring the process, CRIU requests
in io_setup() the number of actually created events. This leads
to an inconsistent value of fs.aio-nr after the restore.

Let's account in fs.aio-nr the number of created events, not requested ones.

https://jira.sw.ru/browse/PSBM-47209

Signed-off-by: Andrey Ryabinin 
---
 fs/aio.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 29d02ab..19ebcd1 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -633,12 +633,12 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 
/* limit the number of system wide aios */
spin_lock(&ve->aio_nr_lock);
-   if (ve->aio_nr + nr_events > ve->aio_max_nr ||
-   ve->aio_nr + nr_events < ve->aio_nr) {
+   if (ve->aio_nr + ctx->nr_events > ve->aio_max_nr ||
+   ve->aio_nr + ctx->nr_events < ve->aio_nr) {
spin_unlock(&ve->aio_nr_lock);
goto out_cleanup;
}
-   ve->aio_nr += ctx->max_reqs;
+   ve->aio_nr += ctx->nr_events;
spin_unlock(&ve->aio_nr_lock);
 
/* now link into global list. */
-- 
2.7.3



Re: [Devel] [vzlin-dev] [PATCH rh7 1/4] ploop: get rid of FOP_FSYNC

2016-05-20 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> We keep the ploop sources as an in-tree module of the rhel7-based kernel, so we
> know for sure what the fsync fop prototype looks like.
>
> Signed-off-by: Maxim Patlasov 
Acked-by: dmonak...@openvz.org
> ---
>  drivers/block/ploop/io_direct.c |   15 +--
>  include/linux/ploop/compat.h|6 --
>  2 files changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 5a2e12a..583b110 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -406,7 +406,8 @@ try_again:
>   }
>  
>   /* flush new i_size to disk */
> - err = io->files.file->f_op->FOP_FSYNC(io->files.file, 0);
> + err = io->files.file->f_op->fsync(io->files.file, 0,
> +   LLONG_MAX, 0);
>   if (err)
>   goto end_write;
>  
> @@ -524,7 +525,8 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
> FALLOC_FL_CONVERT_UNWRITTEN,
> (loff_t)sec << 9, clu_siz);
>   if (!err)
> - err = io->files.file->f_op->FOP_FSYNC(io->files.file, 0);
> + err = io->files.file->f_op->fsync(io->files.file, 0,
> +   LLONG_MAX, 0);
>   file_end_write(io->files.file);
>   if (err) {
>   PLOOP_REQ_SET_ERROR(preq, err);
> @@ -815,8 +817,8 @@ static int dio_fsync_thread(void * data)
>  
>   err = 0;
>   if (io->files.file->f_op->fsync)
> - err = io->files.file->f_op->FOP_FSYNC(io->files.file,
> -   0);
> + err = io->files.file->f_op->fsync(io->files.file, 0,
> +   LLONG_MAX, 0);
>  
>   /* Do we need to invalidate page cache? Not really,
>* because we use it only to create full new pages,
> @@ -853,7 +855,7 @@ static int dio_fsync(struct file * file)
>   ret = filemap_write_and_wait(mapping);
>   err = 0;
>   if (file->f_op && file->f_op->fsync) {
> - err = file->f_op->FOP_FSYNC(file, 0);
> + err = file->f_op->fsync(file, 0, LLONG_MAX, 0);
>   if (!ret)
>   ret = err;
>   }
> @@ -1385,7 +1387,8 @@ static int dio_alloc_sync(struct ploop_io * io, loff_t pos, loff_t len)
>   goto fail;
>  
>   if (io->files.file->f_op && io->files.file->f_op->fsync) {
> - err = io->files.file->f_op->FOP_FSYNC(io->files.file, 0);
> + err = io->files.file->f_op->fsync(io->files.file, 0,
> +   LLONG_MAX, 0);
>   if (err)
>   goto fail;
>   }
> diff --git a/include/linux/ploop/compat.h b/include/linux/ploop/compat.h
> index 03c3ae3..8a36d81 100644
> --- a/include/linux/ploop/compat.h
> +++ b/include/linux/ploop/compat.h
> @@ -58,10 +58,4 @@ static void func(struct bio *bio, int err) {
>  
>  #endif
>  
> -#if LINUX_VERSION_CODE > KERNEL_VERSION(2,6,32)
> -#define FOP_FSYNC(file, datasync) fsync(file, 0, LLONG_MAX, datasync)
> -#else
> -#define FOP_FSYNC(file, datasync) fsync(file, F_DENTRY(file), datasync)
> -#endif
> -
>  #endif




[Devel] [PATCH] aio: Multiply passed nr_events in io_setup()

2016-05-20 Thread Kirill Tkhai
CRIU supports AIO in kernel versions >= 3.19.
It expects the logic introduced in kernel commit
e1bdd5f27a5b, which multiplies the passed nr_events
by two. The Vz7 kernel doesn't do that, and this leads
to a wrong calculation of the length of the AIO ring in CRIU.

This is a port of a small part of commit e1bdd5f27a5b,
to make CRIU work on the Virtuozzo 7 kernel.

https://jira.sw.ru/browse/PSBM-47075
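Why CRIU notices the doubling can be sketched numerically: with a 32-byte ring header and 32-byte io_event entries (sizes as in the 64-bit ABI) and page-granular rounding as in the kernel's ring setup, multiplying nr_events by two changes how many pages the ring occupies. This is a sizing sketch, not the exact kernel code:

```c
#include <assert.h>
#include <stddef.h>

#define AIO_RING_HDR  32u   /* aio_ring header size, 64-bit ABI */
#define IO_EVENT_SIZE 32u   /* sizeof(struct io_event), 64-bit ABI */
#define PAGE_SZ       4096u

/* Pages the ring occupies once the kernel doubles the request,
 * as commit e1bdd5f27a5b does. */
static unsigned int ring_pages(unsigned int requested)
{
	unsigned int nr = requested * 2;    /* the doubling CRIU relies on */
	size_t bytes = AIO_RING_HDR + (size_t)nr * IO_EVENT_SIZE;

	return (bytes + PAGE_SZ - 1) / PAGE_SZ; /* round up to whole pages */
}
```

A kernel that skips the doubling maps a differently sized ring for the same io_setup() argument, which is exactly the mismatch CRIU tripped over.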

Signed-off-by: Kirill Tkhai 
---
 fs/aio.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index 29d02ab..ee72178 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -596,6 +596,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
struct ve_struct *ve = get_exec_env();
int err = -ENOMEM;
 
+   /* Kernel since e1bdd5f27a5b do this, and criu is tuned on that */
+   nr_events *= 2;
+
/* Prevent overflows */
if ((nr_events > (0x10000000U / sizeof(struct io_event))) ||
    (nr_events > (0x10000000U / sizeof(struct kiocb)))) {



[Devel] [PATCH RHEL7 COMMIT] ploop: push_backup: optimize ploop_pb_put_reported()

2016-05-20 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.5
-->
commit 0fb4dff848a098b16c93028aed261d4f5aa11a33
Author: Maxim Patlasov 
Date:   Fri May 20 16:57:47 2016 +0400

ploop: push_backup: optimize ploop_pb_put_reported()

To process an extent [clu, clu + len) coming from userspace, let's first
find the leftmost preq in the tree matching this interval.

Then, we can iterate to the right by rb_next() until we meet a preq that
doesn't match.

That is sufficient because any preq further to the right would have an even
higher req_cluster, and hence wouldn't match either.

https://jira.sw.ru/browse/PSBM-45000

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c |   9 +---
 drivers/block/ploop/push_backup.c | 107 +-
 2 files changed, 73 insertions(+), 43 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index f3bb092..15e34e5 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4626,13 +4626,8 @@ static int ploop_push_backup_io_write(struct 
ploop_device *plo, unsigned long ar
goto io_write_done;
 
rc = 0;
-   for (i = 0; i < ctl->n_extents; i++) {
-   cluster_t j;
-   for (j = e[i].clu; j < e[i].clu + e[i].len; j++)
-   ploop_pb_put_reported(plo->pbd, j, 1);
-/* OPTIMIZE ME LATER: like this:
-* ploop_pb_put_reported(plo->pbd, e[i].clu, e[i].len); */
-   }
+   for (i = 0; i < ctl->n_extents; i++)
+   ploop_pb_put_reported(plo->pbd, e[i].clu, e[i].len);
 
 io_write_done:
kfree(e);
diff --git a/drivers/block/ploop/push_backup.c 
b/drivers/block/ploop/push_backup.c
index ba64468..90172e4 100644
--- a/drivers/block/ploop/push_backup.c
+++ b/drivers/block/ploop/push_backup.c
@@ -329,11 +329,23 @@ static void ploop_pb_add_req_to_reported(struct 
ploop_pushbackup_desc *pbd,
ploop_pb_add_req_to_tree(preq, &pbd->reported_tree);
 }
 
+static inline bool preq_match(struct ploop_request *preq, cluster_t clu,
+ cluster_t len)
+{
+   return preq &&
+   clu <= preq->req_cluster &&
+   preq->req_cluster < clu + len;
+}
+
+/* returns the leftmost preq whose req_cluster >= clu */
 static struct ploop_request *ploop_pb_get_req_from_tree(struct rb_root *tree,
-   cluster_t clu)
+   cluster_t clu, cluster_t len,
+   struct ploop_request **npreq)
 {
struct rb_node *n = tree->rb_node;
-   struct ploop_request *p;
+   struct ploop_request *p = NULL;
+
+   *npreq = NULL;
 
while (n) {
p = rb_entry(n, struct ploop_request, reloc_link);
@@ -342,11 +354,31 @@ static struct ploop_request 
*ploop_pb_get_req_from_tree(struct rb_root *tree,
n = n->rb_left;
else if (clu > p->req_cluster)
n = n->rb_right;
-   else {
+   else { /* perfect match */
+   n = rb_next(n);
+   if (n)
+   *npreq = rb_entry(n, struct ploop_request,
+ reloc_link);
rb_erase(&p->reloc_link, tree);
return p;
}
}
+   /* here p is not perfect, but it's closest */
+
+   if (p && p->req_cluster < clu) {
+   n = rb_next(&p->reloc_link);
+   if (n)
+   p = rb_entry(n, struct ploop_request, reloc_link);
+   }
+
+   if (preq_match(p, clu, len)) {
+   n = rb_next(&p->reloc_link);
+   if (n)
+   *npreq = rb_entry(n, struct ploop_request, reloc_link);
+   rb_erase(&p->reloc_link, tree);
+   return p;
+   }
+
return NULL;
 }
 
@@ -393,20 +425,6 @@ ploop_pb_get_first_req_from_reported(struct 
ploop_pushbackup_desc *pbd)
return ploop_pb_get_first_req_from_tree(&pbd->reported_tree, NULL);
 }
 
-static struct ploop_request *
-ploop_pb_get_req_from_pending(struct ploop_pushbackup_desc *pbd,
- cluster_t clu)
-{
-   return ploop_pb_get_req_from_tree(&pbd->pending_tree, clu);
-}
-
-static struct ploop_request *
-ploop_pb_get_req_from_reported(struct ploop_pushbackup_desc *pbd,
-  cluster_t clu)
-{
-   return ploop_pb_get_req_from_tree(&pbd->reported_tree, clu);
-}
-
 int ploop_pb_preq_add_pending(struct ploop_pushbackup_desc *pbd,
   struct ploop_request *preq)
 {
@@ -552,27 +570,46 @@ get_pending_unlock:
return err;
 }
 
+static void 

Re: [Devel] [vzlin-dev] [PATCH rh7 3/4] ploop: get rid of dio_fsync()

2016-05-20 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> For ext4 dio_fsync() is actually equivalent to direct call to fsync fop:
>
> 1) file->f_op cannot be NULL;
> 2) file->f_op->fsync is always equal to ext4_sync_file;
> 3) ext4_sync_file() does filemap_write_and_wait() internally,
>no need to call it explicitly.
>
> The patch also fixes a potential problem: if fsync() fails, it's better
> to pass error code up to the stack of callers.
>
> Signed-off-by: Maxim Patlasov 
Acked-by: dmonak...@openvz.org
> ---
>  drivers/block/ploop/io_direct.c |   52 
> ++-
>  1 file changed, 24 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index a37f296..1ff848c 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -844,21 +844,6 @@ static int dio_fsync_thread(void * data)
>   return 0;
>  }
>  
> -static int dio_fsync(struct file * file)
> -{
> - int err, ret;
> - struct address_space *mapping = file->f_mapping;
> -
> - ret = filemap_write_and_wait(mapping);
> - err = 0;
> - if (file->f_op && file->f_op->fsync) {
> - err = file->f_op->fsync(file, 0, LLONG_MAX, 0);
> - if (!ret)
> - ret = err;
> - }
> - return ret;
> -}
> -
>  /* Invalidate page cache. It is called with inode mutex taken
>   * and mapping must be synced. If some dirty pages remained,
>   * it will fail.
> @@ -949,20 +934,17 @@ static void dio_destroy(struct ploop_io * io)
>  static int dio_sync(struct ploop_io * io)
>  {
>   struct file * file = io->files.file;
> + int err = 0;
>  
>   if (file)
> - dio_fsync(file);
> - return 0;
> + err = file->f_op->fsync(file, 0, LLONG_MAX, 0);
> +
> + return err;
>  }
>  
>  static int dio_stop(struct ploop_io * io)
>  {
> - struct file * file = io->files.file;
> -
> - if (file) {
> - dio_fsync(file);
> - }
> - return 0;
> + return io->ops->sync(io);
>  }
>  
>  static int dio_open(struct ploop_io * io)
> @@ -979,7 +961,9 @@ static int dio_open(struct ploop_io * io)
>   io->files.inode = io->files.mapping->host;
>   io->files.bdev = io->files.inode->i_sb->s_bdev;
>  
> - dio_fsync(file);
> + err = io->ops->sync(io);
> + if (err)
> + return err;
>  
>   mutex_lock(&io->files.inode->i_mutex);
>   em_tree = ploop_dio_open(io, (delta->flags & PLOOP_FMT_RDONLY));
> @@ -1646,7 +1630,11 @@ static int dio_prepare_snapshot(struct ploop_io * io, 
> struct ploop_snapdata *sd)
>   return -EINVAL;
>   }
>  
> - dio_fsync(file);
> + err = io->ops->sync(io);
> + if (err) {
> + fput(file);
> + return err;
> + }
>  
>   mutex_lock(&io->files.inode->i_mutex);
>   err = dio_invalidate_cache(io->files.mapping, io->files.bdev);
> @@ -1713,7 +1701,11 @@ static int dio_prepare_merge(struct ploop_io * io, 
> struct ploop_snapdata *sd)
>   return -EINVAL;
>   }
>  
> - dio_fsync(file);
> + err = io->ops->sync(io);
> + if (err) {
> + fput(file);
> + return err;
> + }
>  
>   mutex_lock(&io->files.inode->i_mutex);
>  
> @@ -1772,8 +1764,12 @@ static int dio_truncate(struct ploop_io * io, struct 
> file * file,
>   atomic_long_sub(*io->size_ptr - new_size, &ploop_io_images_size);
>   *io->size_ptr = new_size;
>  
> - if (!err)
> - err = dio_fsync(file);
> + if (!err) {
> + if (io->files.file == file)
> + err = io->ops->sync(io);
> + else
> + err = file->f_op->fsync(file, 0, LLONG_MAX, 0);
> + }
>  
>   return err;
>  }




Re: [Devel] [vzlin-dev] [PATCH rh7 2/4] ploop: io_direct: check for fsync fop on startup

2016-05-20 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> We don't support host file systems without fsync fop. The patch refuses
> to start ploop if fsync is absent.
>
> Signed-off-by: Maxim Patlasov 
Acked-by: dmonak...@openvz.org
> ---
>  drivers/block/ploop/io_direct.c |   23 ---
>  1 file changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 583b110..a37f296 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -376,7 +376,6 @@ cached_submit(struct ploop_io *io, iblock_t iblk, struct 
> ploop_request * preq,
>   loff_t new_size;
>   loff_t used_pos;
>   bool may_fallocate = io->files.file->f_op->fallocate &&
> - io->files.file->f_op->fsync &&
>   io->files.flags & EXT4_EXTENTS_FL;
>  
>   trace_cached_submit(preq);
> @@ -815,10 +814,8 @@ static int dio_fsync_thread(void * data)
>   /* filemap_fdatawrite() has been made already */
>   filemap_fdatawait(io->files.mapping);
>  
> - err = 0;
> - if (io->files.file->f_op->fsync)
> - err = io->files.file->f_op->fsync(io->files.file, 0,
> -   LLONG_MAX, 0);
> + err = io->files.file->f_op->fsync(io->files.file, 0,
> +   LLONG_MAX, 0);
>  
>   /* Do we need to invalidate page cache? Not really,
>* because we use it only to create full new pages,
> @@ -1386,12 +1383,11 @@ static int dio_alloc_sync(struct ploop_io * io, 
> loff_t pos, loff_t len)
>   if (err)
>   goto fail;
>  
> - if (io->files.file->f_op && io->files.file->f_op->fsync) {
> - err = io->files.file->f_op->fsync(io->files.file, 0,
> -   LLONG_MAX, 0);
> - if (err)
> - goto fail;
> - }
> + err = io->files.file->f_op->fsync(io->files.file, 0,
> +   LLONG_MAX, 0);
> + if (err)
> + goto fail;
> +
>   err = filemap_fdatawait(io->files.mapping);
>  
>  fail:
> @@ -1878,6 +1874,11 @@ static int dio_autodetect(struct ploop_io * io)
>   return -1;
>   }
>  
> + if (!file->f_op->fsync) {
> + printk("Cannot run on EXT4(%s): no fsync\n", s_id);
> + return -1;
> + }
> +
>   fs = get_fs();
>   set_fs(KERNEL_DS);
>   flags = 0;




[Devel] [PATCH RHEL7 COMMIT] ms/KVM: x86: move steal time initialization to vcpu entry time

2016-05-20 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.5
-->
commit 4a055bb845eda81847c8e2b2233ba8ac3a889ab6
Author: Marcelo Tosatti 
Date:   Fri May 20 16:38:02 2016 +0400

ms/KVM: x86: move steal time initialization to vcpu entry time

As reported at https://bugs.launchpad.net/qemu/+bug/1494350,
it is possible to have vcpu->arch.st.last_steal initialized
from a thread other than vcpu thread, say the iothread, via
KVM_SET_MSRS.

Which can cause an overflow later (when subtracting from vcpu threads
sched_info.run_delay).

To avoid that, move steal time accumulation to vcpu entry time,
before copying steal time data to guest.

Signed-off-by: Marcelo Tosatti 
Reviewed-by: David Matlack 
Signed-off-by: Paolo Bonzini 
(cherry picked from commit 7cae2bedcbd4680b155999655e49c27b9cf020fa)

https://jira.sw.ru/browse/PSBM-46737

Signed-off-by: Roman Kagan 
---
 arch/x86/kvm/x86.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ca78e5e..146c34a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2057,6 +2057,8 @@ static void accumulate_steal_time(struct kvm_vcpu *vcpu)
 
 static void record_steal_time(struct kvm_vcpu *vcpu)
 {
+   accumulate_steal_time(vcpu);
+
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;
 
@@ -2207,12 +2209,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
if (!(data & KVM_MSR_ENABLED))
break;
 
-   vcpu->arch.st.last_steal = current->sched_info.run_delay;
-
-   preempt_disable();
-   accumulate_steal_time(vcpu);
-   preempt_enable();
-
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 
break;
@@ -2826,7 +2822,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
vcpu->cpu = cpu;
}
 
-   accumulate_steal_time(vcpu);
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 }
 


[Devel] [PATCH 03/10] dcache: don't need rcu in shrink_dentry_list()

2016-05-20 Thread Vladimir Davydov
From: Miklos Szeredi 

Since now the shrink list is private and nobody can free the dentry while
it is on the shrink list, we can remove RCU protection from this.

Signed-off-by: Miklos Szeredi 
Signed-off-by: Al Viro 
(cherry picked from commit 60942f2f235ce7b817166cdf355eed729094834d)
Signed-off-by: Vladimir Davydov 
---
 fs/dcache.c | 27 ---
 1 file changed, 4 insertions(+), 23 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 1961048478cf..c584d23a194a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -826,23 +826,9 @@ static void shrink_dentry_list(struct list_head *list)
 {
struct dentry *dentry, *parent;
 
-   rcu_read_lock();
-   for (;;) {
-   dentry = list_entry_rcu(list->prev, struct dentry, d_lru);
-   if (&dentry->d_lru == list)
-   break; /* empty */
-
-   /*
-* Get the dentry lock, and re-verify that the dentry is
-* this on the shrinking list. If it is, we know that
-* DCACHE_SHRINK_LIST and DCACHE_LRU_LIST are set.
-*/
+   while (!list_empty(list)) {
+   dentry = list_entry(list->prev, struct dentry, d_lru);
spin_lock(&dentry->d_lock);
-   if (dentry != list_entry(list->prev, struct dentry, d_lru)) {
-   spin_unlock(&dentry->d_lock);
-   continue;
-   }
-
/*
 * The dispose list is isolated and dentries are not accounted
 * to the LRU here, so we can simply remove it from the list
@@ -858,23 +844,20 @@ static void shrink_dentry_list(struct list_head *list)
spin_unlock(&dentry->d_lock);
continue;
}
-   rcu_read_unlock();
 
parent = dentry_kill(dentry, 0);
/*
 * If dentry_kill returns NULL, we have nothing more to do.
 */
-   if (!parent) {
-   rcu_read_lock();
+   if (!parent)
continue;
-   }
+
if (unlikely(parent == dentry)) {
/*
 * trylocks have failed and d_lock has been held the
 * whole time, so it could not have been added to any
 * other lists. Just add it back to the shrink list.
 */
-   rcu_read_lock();
d_shrink_add(dentry, list);
spin_unlock(&dentry->d_lock);
continue;
@@ -888,9 +871,7 @@ static void shrink_dentry_list(struct list_head *list)
dentry = parent;
while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
dentry = dentry_kill(dentry, 1);
-   rcu_read_lock();
}
-   rcu_read_unlock();
 }
 
 static enum lru_status dentry_lru_isolate(struct list_head *item,
-- 
2.1.4



[Devel] [PATCH 10/10] dcache: add missing lockdep annotation

2016-05-20 Thread Vladimir Davydov
From: Linus Torvalds 

lock_parent() very much on purpose does nested locking of dentries, and
is careful to maintain the right order (lock parent first).  But because
it didn't annotate the nested locking order, lockdep thought it might be
a deadlock on d_lock, and complained.

Add the proper annotation for the inner locking of the child dentry to
make lockdep happy.

Introduced by commit 046b961b45f9 ("shrink_dentry_list(): take parent's
->d_lock earlier").

Reported-and-tested-by: Josh Boyer 
Cc: Al Viro 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 9f12600fe425bc28f0ccba034a77783c09c15af4)
Signed-off-by: Vladimir Davydov 
---
 fs/dcache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ab6859764a6a..09ed486c9f1d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -582,7 +582,7 @@ again:
}
rcu_read_unlock();
if (parent != dentry)
-   spin_lock(&dentry->d_lock);
+   spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
else
parent = NULL;
return parent;
-- 
2.1.4



[Devel] [PATCH 08/10] dealing with the rest of shrink_dentry_list() livelock

2016-05-20 Thread Vladimir Davydov
From: Al Viro 

We have the same problem with ->d_lock order in the inner loop, where
we are dropping references to ancestors.  Same solution, basically -
instead of using dentry_kill() we use lock_parent() (introduced in the
previous commit) to get that lock in a safe way, recheck ->d_count
(in case if lock_parent() has ended up dropping and retaking ->d_lock
and somebody managed to grab a reference during that window), trylock
the inode->i_lock and use __dentry_kill() to do the rest.

Signed-off-by: Al Viro 
(cherry picked from commit b2b80195d8829921506880f6dccd21cabd163d0d)
Signed-off-by: Vladimir Davydov 
---
 fs/dcache.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index a287c5a63cde..7b75b482bec2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -913,8 +913,26 @@ static void shrink_dentry_list(struct list_head *list)
 * fragmentation.
 */
dentry = parent;
-   while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
-   dentry = dentry_kill(dentry, 1);
+   while (dentry && !lockref_put_or_lock(&dentry->d_lockref)) {
+   parent = lock_parent(dentry);
+   if (dentry->d_lockref.count != 1) {
+   dentry->d_lockref.count--;
+   spin_unlock(&dentry->d_lock);
+   if (parent)
+   spin_unlock(&parent->d_lock);
+   break;
+   }
+   inode = dentry->d_inode;/* can't be NULL */
+   if (unlikely(!spin_trylock(&inode->i_lock))) {
+   spin_unlock(&dentry->d_lock);
+   if (parent)
+   spin_unlock(&parent->d_lock);
+   cpu_relax();
+   continue;
+   }
+   __dentry_kill(dentry);
+   dentry = parent;
+   }
}
 }
 
-- 
2.1.4



[Devel] [PATCH 05/10] split dentry_kill()

2016-05-20 Thread Vladimir Davydov
From: Al Viro 

... into trylocks and everything else.  The latter (actual killing)
is __dentry_kill().

Signed-off-by: Al Viro 
(cherry picked from commit e55fd011549eae01a230e3cace6f4d031b6a3453)
Signed-off-by: Vladimir Davydov 

Conflicts:
fs/dcache.c
---
 fs/dcache.c | 66 +++--
 1 file changed, 38 insertions(+), 28 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 37420f461ca2..70b34bd63e4e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -482,37 +482,12 @@ void d_drop(struct dentry *dentry)
 }
 EXPORT_SYMBOL(d_drop);
 
-/*
- * Finish off a dentry we've decided to kill.
- * dentry->d_lock must be held, returns with it unlocked.
- * If ref is non-zero, then decrement the refcount too.
- * Returns dentry requiring refcount drop, or NULL if we're done.
- */
-static inline struct dentry *
-dentry_kill(struct dentry *dentry, int unlock_on_failure)
-   __releases(dentry->d_lock)
+static void __dentry_kill(struct dentry *dentry)
 {
-   struct inode *inode;
-   struct dentry *parent;
+   struct dentry *parent = NULL;
 
-   inode = dentry->d_inode;
-   if (inode && !spin_trylock(>i_lock)) {
-relock:
-   if (unlock_on_failure) {
-   spin_unlock(>d_lock);
-   cpu_relax();
-   }
-   return dentry; /* try again with same dentry */
-   }
-   if (IS_ROOT(dentry))
-   parent = NULL;
-   else
+   if (!IS_ROOT(dentry))
parent = dentry->d_parent;
-   if (parent && !spin_trylock(&parent->d_lock)) {
-   if (inode)
-   spin_unlock(&inode->i_lock);
-   goto relock;
-   }
 
/*
 * The dentry is now unrecoverably dead to the world.
@@ -546,7 +521,42 @@ relock:
 * transient RCU lookups) can reach this dentry.
 */
d_free(dentry);
+}
+
+/*
+ * Finish off a dentry we've decided to kill.
+ * dentry->d_lock must be held, returns with it unlocked.
+ * If ref is non-zero, then decrement the refcount too.
+ * Returns dentry requiring refcount drop, or NULL if we're done.
+ */
+static struct dentry *
+dentry_kill(struct dentry *dentry, int unlock_on_failure)
+   __releases(dentry->d_lock)
+{
+   struct inode *inode = dentry->d_inode;
+   struct dentry *parent = NULL;
+
+   if (inode && unlikely(!spin_trylock(&inode->i_lock)))
+   goto failed;
+
+   if (!IS_ROOT(dentry)) {
+   parent = dentry->d_parent;
+   if (unlikely(!spin_trylock(&parent->d_lock))) {
+   if (inode)
+   spin_unlock(&inode->i_lock);
+   goto failed;
+   }
+   }
+
+   __dentry_kill(dentry);
return parent;
+
+failed:
+   if (unlock_on_failure) {
+   spin_unlock(&dentry->d_lock);
+   cpu_relax();
+   }
+   return dentry; /* try again with same dentry */
 }
 
 /* 
-- 
2.1.4



[Devel] [PATCH 06/10] expand dentry_kill(dentry, 0) in shrink_dentry_list()

2016-05-20 Thread Vladimir Davydov
From: Al Viro 

Result will be massaged to saner shape in the next commits.  It is
ugly, no questions - the point of that one is to be a provably
equivalent transformation (and it might be worth splitting a bit
more).

Signed-off-by: Al Viro 
(cherry picked from commit ff2fde9929feb2aef45377ce56b8b12df85dda69)
Signed-off-by: Vladimir Davydov 
---
 fs/dcache.c | 30 +-
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 70b34bd63e4e..2e4a7192a877 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -829,6 +829,7 @@ static void shrink_dentry_list(struct list_head *list)
struct dentry *dentry, *parent;
 
while (!list_empty(list)) {
+   struct inode *inode;
dentry = list_entry(list->prev, struct dentry, d_lru);
spin_lock(&dentry->d_lock);
/*
@@ -856,23 +857,26 @@ static void shrink_dentry_list(struct list_head *list)
continue;
}
 
-   parent = dentry_kill(dentry, 0);
-   /*
-* If dentry_kill returns NULL, we have nothing more to do.
-*/
-   if (!parent)
-   continue;
-
-   if (unlikely(parent == dentry)) {
-   /*
-* trylocks have failed and d_lock has been held the
-* whole time, so it could not have been added to any
-* other lists. Just add it back to the shrink list.
-*/
+   inode = dentry->d_inode;
+   if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
d_shrink_add(dentry, list);
spin_unlock(&dentry->d_lock);
continue;
}
+
+   parent = NULL;
+   if (!IS_ROOT(dentry)) {
+   parent = dentry->d_parent;
+   if (unlikely(!spin_trylock(&parent->d_lock))) {
+   if (inode)
+   spin_unlock(&inode->i_lock);
+   d_shrink_add(dentry, list);
+   spin_unlock(&dentry->d_lock);
+   continue;
+   }
+   }
+
+   __dentry_kill(dentry);
/*
 * We need to prune ancestors too. This is necessary to prevent
 * quadratic behavior of shrink_dcache_parent(), but is also
-- 
2.1.4



[Devel] [PATCH 00/10] Backport fix for shrink_dentry_list livelock

2016-05-20 Thread Vladimir Davydov
https://jira.sw.ru/browse/PSBM-47312

Al Viro (7):
  fold try_prune_one_dentry()
  lift the "already marked killed" case into shrink_dentry_list()
  split dentry_kill()
  expand dentry_kill(dentry, 0) in shrink_dentry_list()
  shrink_dentry_list(): take parent's ->d_lock earlier
  dealing with the rest of shrink_dentry_list() livelock
  dentry_kill() doesn't need the second argument now

Linus Torvalds (1):
  dcache: add missing lockdep annotation

Miklos Szeredi (1):
  dcache: don't need rcu in shrink_dentry_list()

Vladimir Davydov (1):
  fold d_kill()

 fs/dcache.c | 275 +---
 1 file changed, 135 insertions(+), 140 deletions(-)

-- 
2.1.4



[Devel] [PATCH 04/10] lift the "already marked killed" case into shrink_dentry_list()

2016-05-20 Thread Vladimir Davydov
From: Al Viro 

It can happen only when dentry_kill() is called with unlock_on_failure
equal to 0 - other callers had dentry pinned until the moment they've
got ->d_lock and DCACHE_DENTRY_KILLED is set only after lockref_mark_dead().

IOW, only one of three call sites of dentry_kill() might end up reaching
that code.  Just move it there.

Signed-off-by: Al Viro 
(cherry picked from commit 64fd72e0a44bdd62c5ca277cb24d0d02b2d8e9dc)
Signed-off-by: Vladimir Davydov 

Conflicts:
fs/dcache.c
---
 fs/dcache.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c584d23a194a..37420f461ca2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -495,14 +495,6 @@ dentry_kill(struct dentry *dentry, int unlock_on_failure)
struct inode *inode;
struct dentry *parent;
 
-   if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
-   bool can_free = dentry->d_flags & DCACHE_MAY_FREE;
-   spin_unlock(>d_lock);
-   if (likely(can_free))
-   dentry_free(dentry);
-   return NULL;
-   }
-
inode = dentry->d_inode;
if (inode && !spin_trylock(&inode->i_lock)) {
 relock:
@@ -845,6 +837,15 @@ static void shrink_dentry_list(struct list_head *list)
continue;
}
 
+
+   if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+   bool can_free = dentry->d_flags & DCACHE_MAY_FREE;
+   spin_unlock(&dentry->d_lock);
+   if (can_free)
+   dentry_free(dentry);
+   continue;
+   }
+
parent = dentry_kill(dentry, 0);
/*
 * If dentry_kill returns NULL, we have nothing more to do.
-- 
2.1.4



[Devel] [PATCH 07/10] shrink_dentry_list(): take parent's ->d_lock earlier

2016-05-20 Thread Vladimir Davydov
From: Al Viro 

The cause of livelocks there is that we are taking ->d_lock on
dentry and its parent in the wrong order, forcing us to use
trylock on the parent's one.  d_walk() takes them in the right
order, and unfortunately it's not hard to create a situation
when shrink_dentry_list() can't make progress since trylock
keeps failing, and shrink_dcache_parent() or check_submounts_and_drop()
keeps calling d_walk() disrupting the very shrink_dentry_list() it's
waiting for.

Solution is straightforward - if that trylock fails, let's unlock
the dentry itself and take locks in the right order.  We need to
stabilize ->d_parent without holding ->d_lock, but that's doable
using RCU.  And we'd better do that in the very beginning of the
loop in shrink_dentry_list(), since the checks on refcount, etc.
would need to be redone anyway.

That deals with a half of the problem - killing dentries on the
shrink list itself.  Another one (dropping their parents) is
in the next commit.

locking parent is interesting - it would be easy to do rcu_read_lock(),
lock whatever we think is a parent, lock dentry itself and check
if the parent is still the right one.  Except that we need to check
that *before* locking the dentry, or we are risking taking ->d_lock
out of order.  Fortunately, once the D1 is locked, we can check if
D2->d_parent is equal to D1 without the need to lock D2; D2->d_parent
can start or stop pointing to D1 only under D1->d_lock, so taking
D1->d_lock is enough.  In other words, the right solution is
rcu_read_lock/lock what looks like parent right now/check if it's
still our parent/rcu_read_unlock/lock the child.

Signed-off-by: Al Viro 
(cherry picked from commit 046b961b45f93a92e4c70525a12f3d378bced130)
Signed-off-by: Vladimir Davydov 
---
 fs/dcache.c | 53 +
 1 file changed, 41 insertions(+), 12 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 2e4a7192a877..a287c5a63cde 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -559,6 +559,38 @@ failed:
return dentry; /* try again with same dentry */
 }
 
+static inline struct dentry *lock_parent(struct dentry *dentry)
+{
+   struct dentry *parent = dentry->d_parent;
+   if (IS_ROOT(dentry))
+   return NULL;
+   if (likely(spin_trylock(&parent->d_lock)))
+   return parent;
+   spin_unlock(&dentry->d_lock);
+   rcu_read_lock();
+again:
+   parent = ACCESS_ONCE(dentry->d_parent);
+   spin_lock(&parent->d_lock);
+   /*
+* We can't blindly lock dentry until we are sure
+* that we won't violate the locking order.
+* Any changes of dentry->d_parent must have
+* been done with parent->d_lock held, so
+* spin_lock() above is enough of a barrier
+* for checking if it's still our child.
+*/
+   if (unlikely(parent != dentry->d_parent)) {
+   spin_unlock(&parent->d_lock);
+   goto again;
+   }
+   rcu_read_unlock();
+   if (parent != dentry)
+   spin_lock(&dentry->d_lock);
+   else
+   parent = NULL;
+   return parent;
+}
+
 /* 
  * This is dput
  *
@@ -832,6 +864,8 @@ static void shrink_dentry_list(struct list_head *list)
struct inode *inode;
dentry = list_entry(list->prev, struct dentry, d_lru);
spin_lock(&dentry->d_lock);
+   parent = lock_parent(dentry);
+
/*
 * The dispose list is isolated and dentries are not accounted
 * to the LRU here, so we can simply remove it from the list
@@ -845,6 +879,8 @@ static void shrink_dentry_list(struct list_head *list)
 */
if (dentry->d_lockref.count) {
spin_unlock(&dentry->d_lock);
+   if (parent)
+   spin_unlock(&parent->d_lock);
continue;
}
 
@@ -852,6 +888,8 @@ static void shrink_dentry_list(struct list_head *list)
if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
bool can_free = dentry->d_flags & DCACHE_MAY_FREE;
spin_unlock(&dentry->d_lock);
+   if (parent)
+   spin_unlock(&parent->d_lock);
if (can_free)
dentry_free(dentry);
continue;
@@ -861,22 +899,13 @@ static void shrink_dentry_list(struct list_head *list)
if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
d_shrink_add(dentry, list);
spin_unlock(&dentry->d_lock);
+   if (parent)
+   spin_unlock(&parent->d_lock);
continue;
}
 
-   parent = NULL;
-   if (!IS_ROOT(dentry)) {
-   parent = dentry->d_parent;
- 

[Devel] [PATCH 09/10] dentry_kill() doesn't need the second argument now

2016-05-20 Thread Vladimir Davydov
From: Al Viro 

it's 1 in the only remaining caller.

Signed-off-by: Al Viro 
(cherry picked from commit 8cbf74da435d1bd13dbb790f94c7ff67b2fb6af4)
Signed-off-by: Vladimir Davydov 
---
 fs/dcache.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 7b75b482bec2..ab6859764a6a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -529,8 +529,7 @@ static void __dentry_kill(struct dentry *dentry)
  * If ref is non-zero, then decrement the refcount too.
  * Returns dentry requiring refcount drop, or NULL if we're done.
  */
-static struct dentry *
-dentry_kill(struct dentry *dentry, int unlock_on_failure)
+static struct dentry *dentry_kill(struct dentry *dentry)
__releases(dentry->d_lock)
 {
struct inode *inode = dentry->d_inode;
@@ -552,10 +551,8 @@ dentry_kill(struct dentry *dentry, int unlock_on_failure)
return parent;
 
 failed:
-   if (unlock_on_failure) {
-   spin_unlock(&dentry->d_lock);
-   cpu_relax();
-   }
+   spin_unlock(&dentry->d_lock);
+   cpu_relax();
return dentry; /* try again with same dentry */
 }
 
@@ -643,7 +640,7 @@ repeat:
return;
 
 kill_it:
-   dentry = dentry_kill(dentry, 1);
+   dentry = dentry_kill(dentry);
if (dentry)
goto repeat;
 }
-- 
2.1.4



[Devel] [PATCH 01/10] fold d_kill()

2016-05-20 Thread Vladimir Davydov
This partially backports commit 03b3b889e79c ("fold d_kill() and
d_free()"). I can't backport the whole commit, because d_free() is still
used elsewhere.

Signed-off-by: Vladimir Davydov 
---
 fs/dcache.c | 50 +++---
 1 file changed, 15 insertions(+), 35 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 0c2826d94c26..bfc284d797d1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -429,40 +429,6 @@ static void dentry_lru_add(struct dentry *dentry)
d_lru_add(dentry);
 }
 
-/**
- * d_kill - kill dentry and return parent
- * @dentry: dentry to kill
- * @parent: parent dentry
- *
- * The dentry must already be unhashed and removed from the LRU.
- *
- * If this is the root of the dentry tree, return NULL.
- *
- * dentry->d_lock and parent->d_lock must be held by caller, and are dropped by
- * d_kill.
- */
-static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
-   __releases(dentry->d_lock)
-   __releases(parent->d_lock)
-   __releases(dentry->d_inode->i_lock)
-{
-   __list_del_entry(&dentry->d_u.d_child);
-   /*
-* Inform d_walk() that we are no longer attached to the
-* dentry tree
-*/
-   dentry->d_flags |= DCACHE_DENTRY_KILLED;
-   if (parent)
-   spin_unlock(&parent->d_lock);
-   dentry_iput(dentry);
-   /*
-* dentry_iput drops the locks, at which point nobody (except
-* transient RCU lookups) can reach this dentry.
-*/
-   d_free(dentry);
-   return parent;
-}
-
 /*
  * Unhash a dentry without inserting an RCU walk barrier or checking that
  * dentry->d_lock is locked.  The caller must take care of that, if
@@ -574,7 +540,21 @@ relock:
}
/* if it was on the hash then remove it */
__d_drop(dentry);
-   return d_kill(dentry, parent);
+   __list_del_entry(&dentry->d_u.d_child);
+   /*
+* Inform d_walk() that we are no longer attached to the
+* dentry tree
+*/
+   dentry->d_flags |= DCACHE_DENTRY_KILLED;
+   if (parent)
+   spin_unlock(&parent->d_lock);
+   dentry_iput(dentry);
+   /*
+* dentry_iput drops the locks, at which point nobody (except
+* transient RCU lookups) can reach this dentry.
+*/
+   d_free(dentry);
+   return parent;
 }
 
 /* 
-- 
2.1.4



[Devel] [PATCH rh7] mm: memcontrol: drop useless VM_BUG_ON's in memcg_{start, stop}_kmem_account

2016-05-20 Thread Vladimir Davydov
No point in them, really - if these are called from a kernel thread or
by a dying process nothing bad is going to happen.

Quite the contrary, having them will result in a kernel panic if any of
those is called after exit_mm. An example goes below:

  kernel BUG at mm/memcontrol.c:3358!
  invalid opcode:  [#1] SMP
  CPU: 13 PID: 35647 Comm: mailsrvanalog ve: 1000 Not tainted 
3.10.0-327.10.1.ovz.12.17+ #863 12.17
  task: 8800a0188cb0 ti: 88005634c000 task.ti: 88005634c000
  RIP: 0010:[]  [] 
memcg_stop_kmem_account+0x25/0x30
  RSP: 0018:88005634fd00  EFLAGS: 00010246
  RAX: 8800a0188cb0 RBX: 880118b580b0 RCX: 88005634fd39
  RDX: 0033 RSI: 88005634fd36 RDI: 0035
  RBP: 88005634fd00 R08: 880118b58000 R09: bed7ff49
  R10: 7c640f5b R11: b6f8c81e R12: 880036aa4000
  R13: 88005634fdc4 R14: 880118b58130 R15: 88009f440020
  FS:  () GS:88014b14() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 7fcbdb0977c0 CR3: 0194e000 CR4: 06e0
  DR0:  DR1:  DR2: 
  DR3:  DR6: 0ff0 DR7: 0400
  Stack:
   88005634fd80 a01ab5f6 880144cf1f20 8800364a38c0
   2078696674736f50 737973206c69616d 336300800a6d6574 643332343339342f
   3839336165356264 6236373636343735 3232626364373932 0033353639653465
  Call Trace:
   [] ext4_open_pfcache+0xc6/0x1a0 [ext4]
   [] ext4_save_data_csum+0xc9/0x150 [ext4]
   [] ext4_commit_data_csum+0xac/0xd0 [ext4]
   [] ext4_release_file+0xdc/0xf0 [ext4]
   [] __fput+0xe1/0x250
   [] fput+0xe/0x10
   [] task_work_run+0xc4/0xe0
   [] do_exit+0x2eb/0xb10
   [] ? __do_page_fault+0x164/0x450
   [] do_group_exit+0x3f/0xa0
   [] SyS_exit_group+0x14/0x20
   [] system_call_fastpath+0x16/0x1b

Also, let's move memcg_{start,stop}_kmem_account to the header as they
both are tiny.
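
The calling convention is unchanged by the move. A kernel-style sketch of the intended usage (the caller name alloc_for_peer is made up for illustration):

```c
/* Hypothetical caller: bracket any allocation done on behalf of
 * another cgroup with the pair. The counter nests, and with the
 * VM_BUG_ON's gone it is safe even from a kernel thread or a dying
 * task whose current->mm is already NULL. */
static void *alloc_for_peer(size_t size)
{
	void *p;

	memcg_stop_kmem_account();	/* skip kmem charging from here */
	p = kmalloc(size, GFP_KERNEL);
	memcg_resume_kmem_account();	/* charging back in effect */
	return p;
}
```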

Fixes: 6812907ce7c3 ("pfcache: do not account peer files to memcg")
Signed-off-by: Vladimir Davydov 
---
 include/linux/memcontrol.h | 11 +++++++++--
 mm/memcontrol.c            | 13 -------------
 2 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1c5f916054be..37ce43221f3c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -517,8 +517,15 @@ extern int memcg_nr_cache_ids;
 extern void memcg_get_cache_ids(void);
 extern void memcg_put_cache_ids(void);
 
-extern void memcg_stop_kmem_account(void);
-extern void memcg_resume_kmem_account(void);
+static inline void memcg_stop_kmem_account(void)
+{
+   current->memcg_kmem_skip_account++;
+}
+
+static inline void memcg_resume_kmem_account(void)
+{
+   current->memcg_kmem_skip_account--;
+}
 
 /*
  * Helper macro to loop through all memcg-specific caches. Callers must still
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 89c9edfe6ce1..af2c14b88060 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3353,19 +3353,6 @@ static void memcg_free_cache_id(int id)
  * memcg_kmem_skip_account. So we enclose anything that might allocate memory
  * inside the following two functions.
  */
-void memcg_stop_kmem_account(void)
-{
-   VM_BUG_ON(!current->mm);
-   current->memcg_kmem_skip_account++;
-}
-EXPORT_SYMBOL(memcg_stop_kmem_account);
-
-void memcg_resume_kmem_account(void)
-{
-   VM_BUG_ON(!current->mm);
-   current->memcg_kmem_skip_account--;
-}
-EXPORT_SYMBOL(memcg_resume_kmem_account);
 
 struct memcg_kmem_cache_create_work {
struct mem_cgroup *memcg;
-- 
2.1.4



Re: [Devel] [PATCH RH7 1/2] fs/overlay: set flag FS_VIRTUALIZED

2016-05-20 Thread Pavel Tikhomirov



On 05/20/2016 01:45 PM, Vladimir Davydov wrote:

On Fri, May 20, 2016 at 12:49:35PM +0300, Pavel Tikhomirov wrote:

I remember I tried to add "fs-overlay" to this list and docker failed
to start, as it first checks /proc/filesystems to know that overlay is
supported, but overlay is not in this file until the module is
loaded.


I see. But it tries modprobe first, right? Let's then allow modprobe
to load modules that are on the whitelist. I don't think it would affect
security, because these modules can be easily auto-loaded anyway.


That is reasonable, OK.





--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [Devel] [PATCH RH7 1/2] fs/overlay: set flag FS_VIRTUALIZED

2016-05-20 Thread Pavel Tikhomirov



On 05/20/2016 11:59 AM, Vladimir Davydov wrote:

On Thu, May 19, 2016 at 06:32:00PM +0300, Pavel Tikhomirov wrote:

this is a temporary decision to make docker in CT work with the overlayfs
storage driver; it can be unsafe to give access to the fs-overlay module
from a container.


Why? (just curious)


As overlayfs works on the file/folder level and not on the block device 
level, unlike dm-thin, it seems to be safe enough to give access to it in 
CT. For dm-thin, the access to a block device and its raw data in an image 
file in CT was the stumbling block.


But maybe the overlay driver needs to be checked harder to really be 
safe. From docker.com: "As promising as OverlayFS is, it is still 
relatively young. Therefore caution should be taken before using it in 
production Docker environments."






*need to modprobe overlay module on host


May be, we should add overlayfs module to the whitelist? (see
ve0_allowed_mod)



--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [Devel] [PATCH RH7 1/2] fs/overlay: set flag FS_VIRTUALIZED

2016-05-20 Thread Vladimir Davydov
On Fri, May 20, 2016 at 12:49:35PM +0300, Pavel Tikhomirov wrote:
> I remember I tried to add "fs-overlay" to this list and docker failed
> to start, as it first checks /proc/filesystems to know that overlay is
> supported, but overlay is not in this file until the module is
> loaded.

I see. But it tries modprobe first, right? Let's then allow modprobe
to load modules that are on the whitelist. I don't think it would affect
security, because these modules can be easily auto-loaded anyway.


Re: [Devel] [PATCH RH7 1/2] fs/overlay: set flag FS_VIRTUALIZED

2016-05-20 Thread Vladimir Davydov
On Fri, May 20, 2016 at 12:17:18PM +0300, Pavel Tikhomirov wrote:
> >
> >>
> >>*need to modprobe overlay module on host
> >
> >May be, we should add overlayfs module to the whitelist? (see
> >ve0_allowed_mod)
> 
> I thought that whitelist will work only for auto-loading iptables modules on
> special rules creation? But not for manual load:
> 
> [root@pcs7 ~]# vzctl exec 109 modprobe binfmt_misc
> [root@pcs7 ~]# lsmod | grep binfmt_misc
> [root@pcs7 ~]#
> 
> 3.10.0-327.18.2.ovz.14.4

AFAIK the fs module should be auto-loaded on mount - see get_fs_type().
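
For reference, the auto-load path in get_fs_type() looks roughly like this (a simplified sketch of fs/filesystems.c from memory, not verbatim):

```c
struct file_system_type *get_fs_type(const char *name)
{
	struct file_system_type *fs;
	const char *dot = strchr(name, '.');
	int len = dot ? dot - name : strlen(name);

	fs = __get_fs_type(name, len);
	/* Not registered yet: ask modprobe for the module alias
	 * ("fs-overlay" for overlayfs), then look it up again. */
	if (!fs && (request_module("fs-%.*s", len, name) == 0))
		fs = __get_fs_type(name, len);
	return fs;
}
```

This is why mounting the filesystem triggers the module load even though a bare modprobe from inside the CT is refused.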


Re: [Devel] [PATCH RH7 1/2] fs/overlay: set flag FS_VIRTUALIZED

2016-05-20 Thread Pavel Tikhomirov


Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.

Vladimir Davydov wrote:

> On Fri, May 20, 2016 at 12:17:18PM +0300, Pavel Tikhomirov wrote:
> > >
> > >>
> > >>*need to modprobe overlay module on host
> > >
> > >May be, we should add overlayfs module to the whitelist? (see
> > >ve0_allowed_mod)
> > 
> > I thought that whitelist will work only for auto-loading iptables modules on
> > special rules creation? But not for manual load:
> > 
> > [root@pcs7 ~]# vzctl exec 109 modprobe binfmt_misc
> > [root@pcs7 ~]# lsmod | grep binfmt_misc
> > [root@pcs7 ~]#
> > 
> > 3.10.0-327.18.2.ovz.14.4
> 
> AFAIK the fs module should be auto-loaded on mount - see get_fs_type().

I remember I tried to add "fs-overlay" to this list and docker failed to 
start, as it first 
checks /proc/filesystems to know that overlay is supported, but overlay is not 
in this file until the module is loaded.


Re: [Devel] [PATCH RH7 1/2] fs/overlay: set flag FS_VIRTUALIZED

2016-05-20 Thread Pavel Tikhomirov




*need to modprobe overlay module on host


May be, we should add overlayfs module to the whitelist? (see
ve0_allowed_mod)


I thought that whitelist will work only for auto-loading iptables 
modules on special rules creation? But not for manual load:


[root@pcs7 ~]# vzctl exec 109 modprobe binfmt_misc
[root@pcs7 ~]# lsmod | grep binfmt_misc
[root@pcs7 ~]#

3.10.0-327.18.2.ovz.14.4





--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [Devel] [PATCH RH7 2/2] proc/cpuset: do not show cpuset in CT

2016-05-20 Thread Vladimir Davydov
On Thu, May 19, 2016 at 06:32:01PM +0300, Pavel Tikhomirov wrote:
> After commit da53619c5d49 ("ve/cpuset: revert changes allowing to
> attach to empty cpusets") one can not create non-empty cpuset
> cgroup in CT. And docker which tries to create cgroup for every
> visible controller creates cpuset cgroup for docker-ct and fails
> to add processes to it.
> 
> Cgroup files cpuset.cpus are by design not valid to use
> in our CTs as they pin processes in cgroup to defined range of
> processors, but we don't want processes in container to be able
> to pin itself to cpus they want. We have another mechanism to restrict
> CT's cpus usage - the cpu.nr_cpus cgroup file, which allows balancing
> containers between cpus. So we faked cpuset.cpus in CT so one can
> not really pin processes in CT. But that makes all cpuset cgroups
> non-initialized and we also can't attach processes to cgroups. Same
> is valid for cpuset.mems except we do not have ~nr_mems.
> 
> We can just hide cpuset cgroup from /proc/self/cgroup and /proc/cgroups
> to protect it from being used in CT(and also do not mount it in
> libvzctl, which seem to automaticly happen).

Yeah, libvzctl is smart enough to check /proc/cgroups when adding cgroup
bindmounts, which happens after attaching to ve cgroup, so it should
just work.

> Docker, not seeing
> cpuset, will almost silently skip it and work as usual.

I hope nobody but docker has intention to use cpuset inside container.

> 
> *Another option is to make the fake cpuset.{cpus,mems} semi-fake, copying
> values from the root cgroup on write. But that could lock us from changing
> these files on the host root cgroup as they are hierarchical.

Nah, that's bad. I bet we would end up reintroducing our ugly hacks
allowing to attach to empty cpusets then.

BTW, I think we can now revert commit 7c57b078d025b ("sched: Port
diff-fairsched-cpuset-add-fake-cpuset-for-containers"). Please do.

> 
> https://jira.sw.ru/browse/PSBM-47280
> 
> Signed-off-by: Pavel Tikhomirov 
> ---
>  kernel/cgroup.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 5afeb59b..0a284f2 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -4953,6 +4953,11 @@ int proc_cgroup_show(struct seq_file *m, void *v)
>   struct cgroup *cgrp;
>   int count = 0;
>  
> + if (!ve_is_super(get_exec_env()) &&
> + (root->subsys_mask & (1UL << cpuset_subsys_id)))
> + /* do not show cpuset in CT */
> + continue;
> +
>   seq_printf(m, "%d:", root->hierarchy_id);
>   for_each_subsys(root, ss)
>   seq_printf(m, "%s%s", count++ ? "," : "", ss->name);
> @@ -4997,6 +5002,10 @@ static int proc_cgroupstats_show(struct seq_file *m, 
> void *v)
>  
>   if (ss == NULL)
>   continue;
> + if (!ve_is_super(get_exec_env()) &&
> + (ss->root->subsys_mask & (1UL << cpuset_subsys_id)))
> + /* do not show cpuset in CT */
> + continue;

This check is better to be in a function. More flexible that way - you
can extend the black list by patching just one place instead of two
then.

>   num = _cg_virtualized(ss->root->number_of_cgroups);
>   seq_printf(m, "%s\t%d\t%d\t%d\n",
>  ss->name, ss->root->hierarchy_id,
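
Such a helper might look like this (a sketch reusing the names from the patch; cgroup_hidden_in_ct is a made-up name):

```c
/* Hypothetical helper: true if this hierarchy must be hidden from
 * processes inside a container. Both proc_cgroup_show() and
 * proc_cgroupstats_show() would call it instead of open-coding the
 * check, so the controller blacklist lives in one place. */
static bool cgroup_hidden_in_ct(struct cgroupfs_root *root)
{
	if (ve_is_super(get_exec_env()))
		return false;
	/* extend this mask to hide more controllers in CT */
	return root->subsys_mask & (1UL << cpuset_subsys_id);
}
```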


Re: [Devel] [PATCH RH7 1/2] fs/overlay: set flag FS_VIRTUALIZED

2016-05-20 Thread Vladimir Davydov
On Thu, May 19, 2016 at 06:32:00PM +0300, Pavel Tikhomirov wrote:
> this is a temporary decision to make docker in CT work with the overlayfs
> storage driver; it can be unsafe to give access to the fs-overlay module
> from a container.

Why? (just curious)

> 
> *need to modprobe overlay module on host

May be, we should add overlayfs module to the whitelist? (see
ve0_allowed_mod)


[Devel] [PATCH rh7] ms/net: sock: move ->sk_shutdown out of bitfields.

2016-05-20 Thread Andrey Ryabinin
->sk_shutdown bits share one bitfield with some other bits in sock struct,
such as ->sk_no_check_[r,t]x, ->sk_userlocks ...
sock_setsockopt() may write to these bits, while holding the socket lock.

In case of AF_UNIX sockets, we change ->sk_shutdown bits while holding only
unix_state_lock(). So concurrent setsockopt() and shutdown() may lead
to corrupting these bits.

Fix this by moving ->sk_shutdown bits out of bitfield into a separate byte.
This will not change the 'struct sock' size since ->sk_shutdown moved into
previously unused 16-bit hole.

https://jira.sw.ru/browse/PSBM-47023

Link: 
http://lkml.kernel.org/g/1463588367-16310-1-git-send-email-aryabi...@virtuozzo.com
Signed-off-by: Andrey Ryabinin 
Suggested-by: Hannes Frederic Sowa 
Acked-by: Eric Dumazet 
---
 include/net/sock.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b940aa5..ccaacb6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -370,8 +370,13 @@ struct sock {
atomic_t        sk_omem_alloc;
int sk_sndbuf;
struct sk_buff_head sk_write_queue;
+
+   /*
+* Because of non atomicity rules, all
+* changes are protected by socket lock.
+*/
kmemcheck_bitfield_begin(flags);
-   unsigned int    sk_shutdown  : 2,
+   unsigned int    sk_padding   : 2,
 #ifdef __GENKSYMS__
sk_no_check : 2,
 #else
@@ -382,6 +387,7 @@ struct sock {
sk_protocol  : 8,
sk_type  : 16;
kmemcheck_bitfield_end(flags);
+
int sk_wmem_queued;
gfp_t   sk_allocation;
u32 sk_pacing_rate; /* bytes per second */
@@ -390,6 +396,7 @@ struct sock {
int sk_gso_type;
unsigned intsk_gso_max_size;
u16 sk_gso_max_segs;
+   u8  sk_shutdown;
int sk_rcvlowat;
unsigned long   sk_lingertime;
struct sk_buff_head sk_error_queue;
-- 
2.7.3
