[Devel] [PATCH RHEL7 COMMIT] ve/fs/locks: Make CAP_LEASE work in containers

2016-04-29 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.10.1.vz7.12.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.10.1.vz7.12.15
-->
commit 0944de0f22af4201224b2469647808352330a2a0
Author: Evgenii Shatokhin 
Date:   Fri Apr 29 17:39:25 2016 +0400

ve/fs/locks: Make CAP_LEASE work in containers

Allowing privileged processes in containers to set leases on
arbitrary files seems to do no harm. Let us make CAP_LEASE work there.

https://jira.sw.ru/browse/PSBM-46199

Signed-off-by: Evgenii Shatokhin 
Acked-by: Cyrill Gorcunov 
---
 fs/locks.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/locks.c b/fs/locks.c
index 93c097b..82e9bc3 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1693,7 +1693,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp,
struct inode *inode = dentry->d_inode;
int error;
 
-   if ((!uid_eq(current_fsuid(), inode->i_uid)) && !capable(CAP_LEASE))
+   if ((!uid_eq(current_fsuid(), inode->i_uid)) && !ve_capable(CAP_LEASE))
return -EACCES;
if (!S_ISREG(inode->i_mode))
return -EINVAL;
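For reference, the permission rule the one-line change above relaxes can be sketched as a plain userspace predicate. All names and the `-13` (`-EACCES`) value below are illustrative, not kernel API:

```c
#include <assert.h>

/* Sketch of the lease permission rule: setting a lease is allowed when the
 * caller owns the inode, or when it holds CAP_LEASE. After the patch,
 * CAP_LEASE granted inside the container (ve_capable) counts, not only a
 * capability held in the init user namespace. Illustrative names only. */
static int may_set_lease(unsigned int fsuid, unsigned int inode_uid,
                         int has_cap_lease)
{
    if (fsuid != inode_uid && !has_cap_lease)
        return -13; /* -EACCES: neither owner nor capable */
    return 0;
}
```

In this shape, the kernel change corresponds to computing `has_cap_lease` against the container's top-level user namespace instead of the host's.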
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH] fs/locks: Make CAP_LEASE work in containers

2016-04-29 Thread Konstantin Khorenko

On 04/26/2016 12:36 PM, Cyrill Gorcunov wrote:

On Mon, Apr 25, 2016 at 06:22:10PM +0300, Evgenii Shatokhin wrote:

https://jira.sw.ru/browse/PSBM-46199

Allowing privileged processes in containers to set leases on
arbitrary files seems to do no harm. Let us make CAP_LEASE work there.

Signed-off-by: Evgenii Shatokhin 

Acked-by: Cyrill Gorcunov 

There is one point which worries me a bit, actually: ve_capable is
rather a check for creds in the user namespace we created for the
container during its startup. Do we prohibit creating new user
namespaces inside a container? If not -- we had better do so.


After commit 59d3d058b80bf976126ff7cd4c6b429e3d7f6557
we do allow creating user namespaces inside Containers.
Why should we prohibit them?


Re: [Devel] [PATCH] fs/locks: Make CAP_LEASE work in containers

2016-04-29 Thread Cyrill Gorcunov
On Fri, Apr 29, 2016 at 04:48:09PM +0300, Konstantin Khorenko wrote:
> >>
> >>Allowing the privileged processes in the containers to set leases on
> >>arbitrary files seems to make no harm. Let us make CAP_LEASE work there.
> >>
> >>Signed-off-by: Evgenii Shatokhin 
> >Acked-by: Cyrill Gorcunov 
> >
> >There is one point which worries me a bit actually: ve_capable is
> >rather a check for creds in user-ns we created for container during
> >its startup. Do we prohibit creating new user-namespaces inside
> >container? If not -- we better should.
> 
> After commit 59d3d058b80bf976126ff7cd4c6b429e3d7f6557
> we do allow to create user namespaces inside Containers.
> Why we better prohibit them?

ve_capable() tests for creds in the userns, while vanilla uses plain
capable() here, which tests for the init namespace only. That is a
difference, and I would like to make sure it is safe to use
ve_capable() here. Can one create a nested userns inside with the
same caps and drop the lease on this file? Or am I missing something?


[Devel] [PATCH RHEL7 COMMIT] cbt: add uuid arg to blk_cbt_map_copy_once()

2016-04-29 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.10.1.vz7.12.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.10.1.vz7.12.15
-->
commit ccd5d5a157cee360300b67914bd0198671b238d0
Author: Maxim Patlasov 
Date:   Fri Apr 29 18:28:19 2016 +0400

cbt: add uuid arg to blk_cbt_map_copy_once()

The caller must explicitly specify the uuid of the CBT mask to copy. This
helps to avoid collisions between applications performing backup: the second
one will get -EINVAL if its uuid does not match the mask set up by the first.

https://jira.sw.ru/browse/PSBM-45000

Signed-off-by: Maxim Patlasov 
---
 block/blk-cbt.c| 10 --
 include/linux/blkdev.h |  5 +++--
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/block/blk-cbt.c b/block/blk-cbt.c
index 850fd7e..8c52bd8 100644
--- a/block/blk-cbt.c
+++ b/block/blk-cbt.c
@@ -278,8 +278,9 @@ err_cbt:
return ERR_PTR(-ENOMEM);
 }
 
-int blk_cbt_map_copy_once(struct request_queue *q, struct page ***map_ptr,
- blkcnt_t *block_max, blkcnt_t *block_bits)
+int blk_cbt_map_copy_once(struct request_queue *q, __u8 *uuid,
+ struct page ***map_ptr, blkcnt_t *block_max,
+ blkcnt_t *block_bits)
 {
struct cbt_info *cbt;
struct page **map;
@@ -293,6 +294,11 @@ int blk_cbt_map_copy_once(struct request_queue *q, struct page ***map_ptr,
BUG_ON(!cbt->map);
BUG_ON(!cbt->block_max);
 
+   if (!uuid || memcmp(uuid, cbt->uuid, sizeof(cbt->uuid))) {
+   mutex_unlock(&cbt_mutex);
+   return -EINVAL;
+   }
+
cbt_flush_cache(cbt);
 
npages = NR_PAGES(cbt->block_max);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7ae962a..56c3a08 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1651,8 +1651,9 @@ extern void blk_cbt_update_size(struct block_device *bdev);
 extern void blk_cbt_release(struct request_queue *q);
 extern void blk_cbt_bio_queue(struct request_queue *q, struct bio *bio);
 extern int blk_cbt_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg);
-extern int blk_cbt_map_copy_once(struct request_queue *q, struct page ***map_ptr,
-blkcnt_t *block_max, blkcnt_t *block_bits);
+extern int blk_cbt_map_copy_once(struct request_queue *q, __u8 *uuid,
+struct page ***map_ptr, blkcnt_t *block_max,
+blkcnt_t *block_bits);
 #else /* CONFIG_BLK_DEV_CBT */
 static inline void blk_cbt_update_size(struct block_device *bdev)
 {
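The uuid handshake added by the patch above can be sketched in userspace terms. The mask size and names here are assumptions for illustration (`-22` stands in for `-EINVAL`):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CBT_UUID_LEN 16 /* assumed mask uuid size, for illustration only */

/* Sketch of the collision check blk_cbt_map_copy_once() now performs: the
 * caller must pass the uuid of the CBT mask it expects; a NULL uuid or a
 * mismatch yields -EINVAL, so a second backup application cannot silently
 * consume a mask created by the first one. */
static int cbt_check_uuid(const uint8_t *caller_uuid, const uint8_t *mask_uuid)
{
    if (!caller_uuid || memcmp(caller_uuid, mask_uuid, CBT_UUID_LEN))
        return -22; /* -EINVAL: wrong or missing uuid */
    return 0;
}
```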


[Devel] [PATCH RHEL7 COMMIT] fuse: never skip writeback if there are too many background requests

2016-04-29 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.10.1.vz7.12.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.10.1.vz7.12.15
-->
commit d74eeb9445184dda1f97a718f85743cdd9ebfb42
Author: Vladimir Davydov 
Date:   Fri Apr 29 18:34:36 2016 +0400

fuse: never skip writeback if there are too many background requests

This patch removes two hunks introduced while porting
diff-fuse-some-fairness-in-handling-writeback from PCS6, see commit
9525f63ba3151 ("fuse: some fairness in handling writeback").

In the original patch, these two hunks skip page writeback only if
writeback_control->nonblocking flag is set, while in the ported version
writeback may be skipped if writeback_control->sync_mode = WB_SYNC_NONE.
This is not equivalent, because in RH6 based kernels wbc->nonblocking is
set only from the reclaim and compaction paths, while WB_SYNC_NONE is
used in all paths except fsync. As a result, in Vz7 the possibility of
reordering writeback sequence is higher than in PCS6.

Since in RH7 based kernels wbc->nonblocking is absent, let's simply zap
these two hunks in order to be closer to conventional PCS6 behavior.

Note, the possibility of reordering writeback sequence still exists (as
it does in PCS6), because writeback can be performed by kswapd on memory
shortage - see shrink_page_list -> pageout.

https://jira.sw.ru/browse/PSBM-45497

Signed-off-by: Vladimir Davydov 
Acked-by: Maxim Patlasov 
---
 fs/fuse/file.c | 11 ---
 1 file changed, 11 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 11c5959..9889e26 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2109,12 +2109,6 @@ static int fuse_writepages_fill(struct page *page,
 req->pages[req->num_pages - 1]->index + 1 != page->index)) {
int err;
 
-   if (wbc->sync_mode == WB_SYNC_NONE && fc->blocked) {
-   redirty_page_for_writepage(wbc, page);
-   unlock_page(page);
-   return 0;
-   }
-
err = fuse_send_writepages(data);
if (err) {
unlock_page(page);
@@ -2168,11 +2162,6 @@ static int fuse_writepages(struct address_space *mapping,
if (is_bad_inode(inode))
goto out;
 
-   if (wbc->sync_mode == WB_SYNC_NONE) {
-   if (fc->blocked)
-   return 0;
-   }
-
/*
 * We use fuse_blocked_for_wb() instead of just fc->blocked to avoid
 * deadlock when we are called from fuse_invalidate_files() in case


[Devel] [PATCH RHEL7 COMMIT] fuse: increase min/max_dirty_pages up to 256/512 MB

2016-04-29 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.10.1.vz7.12.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.10.1.vz7.12.15
-->
commit 851b61670a39e50d87a6ee1120e0f7d4b7a45bc0
Author: Vladimir Davydov 
Date:   Fri Apr 29 18:30:33 2016 +0400

fuse: increase min/max_dirty_pages up to 256/512 MB

According to Alexey, the pstorage chunk size was increased from 64 MB up
to 256 MB, which made the default min/max fuse dirty pages limits (64
and 256 MB) too low to yield maximal rate while doing sequential writes
over pstorage. Let's increase the default limits up to 256 and 512 MB
respectively to restore pstorage performance.

Suggested-by: Alexey Kuznetsov 
Signed-off-by: Vladimir Davydov 

Acked-by: Maxim Patlasov 
---
 fs/fuse/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index bb010cb..3772b62 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1100,8 +1100,8 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
/*
 * These values have precedence over max_ratio
 */
-   bdi_set_max_dirty(&fc->bdi, (256 * 1024 * 1024) / PAGE_SIZE);
-   bdi_set_min_dirty(&fc->bdi, (64 * 1024 * 1024) / PAGE_SIZE);
+   bdi_set_max_dirty(&fc->bdi, (512 * 1024 * 1024) / PAGE_SIZE);
+   bdi_set_min_dirty(&fc->bdi, (256 * 1024 * 1024) / PAGE_SIZE);
 
return 0;
 }
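The bdi limits above are given in pages, so the MB figures from the commit message translate as below, assuming the usual 4 KiB x86 page size (PAGE_SIZE is arch-dependent):

```c
#include <assert.h>

/* Convert a dirty-limit expressed in MB to the page count passed to
 * bdi_set_{min,max}_dirty(), mirroring (N * 1024 * 1024) / PAGE_SIZE
 * in the hunk above. */
static unsigned long mb_to_pages(unsigned long mb, unsigned long page_size)
{
    return mb * 1024UL * 1024UL / page_size;
}
```

With 4 KiB pages the new defaults are 65536 and 131072 pages respectively.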


[Devel] [PATCH rh7 10/9] arch: x86: fix accounting page tables to kmemcg

2016-04-29 Thread Vladimir Davydov
After we switched to the whitelist kmem accounting design, using
alloc_kmem_pages is not enough for the new page to be accounted -
one must also pass __GFP_ACCOUNT in gfp flags.

Signed-off-by: Vladimir Davydov 
---
 arch/x86/include/asm/pgalloc.h | 4 ++--
 arch/x86/mm/pgtable.c  | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index 758a6a7c527a..f5897582b88c 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -81,7 +81,7 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
struct page *page;
-   page = alloc_kmem_pages(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO, 0);
+   page = alloc_kmem_pages(GFP_KERNEL_ACCOUNT | __GFP_REPEAT | __GFP_ZERO, 0);
if (!page)
return NULL;
if (!pgtable_pmd_page_ctor(page)) {
@@ -125,7 +125,7 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return (pud_t *)__get_free_kmem_pages(GFP_KERNEL|__GFP_REPEAT|
+   return (pud_t *)__get_free_kmem_pages(GFP_KERNEL_ACCOUNT|__GFP_REPEAT|
  __GFP_ZERO, 0);
 }
 
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3715dda0c41b..02ec6243372d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -6,7 +6,7 @@
 #include 
 #include 
 
-#define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO
+#define PGALLOC_GFP GFP_KERNEL_ACCOUNT | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO
 
 #ifdef CONFIG_HIGHPTE
 #define PGALLOC_USER_GFP __GFP_HIGHMEM
-- 
2.1.4
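The whitelist rule the two patches above rely on can be illustrated with a toy flag mask. The bit values below are made up for illustration; real kernel gfp flag values differ:

```c
#include <assert.h>

/* Under the whitelist kmem accounting design, a page is charged to kmemcg
 * only when __GFP_ACCOUNT is present in the gfp mask; GFP_KERNEL alone is
 * no longer enough, hence GFP_KERNEL -> GFP_KERNEL_ACCOUNT in the hunks. */
#define GFP_KERNEL_X         0x01u /* illustrative value */
#define GFP_ACCOUNT_X        0x02u /* illustrative value */
#define GFP_KERNEL_ACCOUNT_X (GFP_KERNEL_X | GFP_ACCOUNT_X)

static int page_would_be_accounted(unsigned int gfp)
{
    return (gfp & GFP_ACCOUNT_X) != 0;
}
```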



[Devel] [PATCH rh7 11/9] fs: fix accounting pipe buffers to kmemcg

2016-04-29 Thread Vladimir Davydov
After we switched to the whitelist kmem accounting design, using
alloc_kmem_pages is not enough for the new page to be accounted -
one must also pass __GFP_ACCOUNT in gfp flags.

Signed-off-by: Vladimir Davydov 
---
 fs/pipe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 57aa78e7c63e..d038ff89a43f 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -590,7 +590,7 @@ redo1:
size_t remaining;
 
if (!page) {
-   page = alloc_kmem_pages(GFP_HIGHUSER, 0);
+   page = alloc_kmem_pages(GFP_HIGHUSER | __GFP_ACCOUNT, 0);
if (unlikely(!page)) {
ret = ret ? : -ENOMEM;
break;
-- 
2.1.4



Re: [Devel] [PATCH] fs/locks: Make CAP_LEASE work in containers

2016-04-29 Thread Cyrill Gorcunov
On Fri, Apr 29, 2016 at 05:17:48PM +0300, Cyrill Gorcunov wrote:
> > After commit 59d3d058b80bf976126ff7cd4c6b429e3d7f6557
> > we do allow to create user namespaces inside Containers.
> > Why we better prohibit them?
> 
> ve-capable tests for creds in userns, while vanilla
> uses plain capable() here which test for init namespace
> only. which is a difference and i would like to make sure
> it's safe here to use ve-capable. can one create nested
> userns inside with same caps and drop the lease on this
> file? Or I miss somehting?

Sorry, false alarm, we check for toplevel userns in ve_capable,
so everything is safe.

Cyrill


[Devel] [PATCH RHEL7 COMMIT] mm: memcontrol: check more carefully if current is oom killed

2016-04-29 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.10.1.vz7.12.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.10.1.vz7.12.15
-->
commit 5948747c2342fb62d91d5a7dd7140e7922634a6b
Author: Vladimir Davydov 
Date:   Fri Apr 29 19:28:58 2016 +0400

mm: memcontrol: check more carefully if current is oom killed

Currently, we abort memcg reclaim only if fatal_signal_pending returns
true on current. That's not enough, because a process can be killed
after it entered do_exit, in which case signal_pending may not be set
and therefore an OOM killed process may be looping in mem_cgroup_reclaim
for quite some time (the latter retries reclaim 100 times!), resulting
in OOM timeout at best or soft lockup panic at worst.

Let's elaborate the abort condition by adding TIF_MEMDIE check.

Signed-off-by: Vladimir Davydov 
Reviewed-by: Kirill Tkhai 
---
 mm/memcontrol.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df78ffd..a2ac582 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2092,7 +2092,8 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
drain_all_stock_async(memcg);
total += try_to_free_mem_cgroup_pages(memcg, SWAP_CLUSTER_MAX,
  gfp_mask, noswap);
-   if (fatal_signal_pending(current))
+   if (test_thread_flag(TIF_MEMDIE) ||
+   fatal_signal_pending(current))
return 1;
/*
 * Allow limit shrinkers, which are triggered directly
@@ -2943,7 +2944,8 @@ again:
bool invoke_oom = oom && !nr_oom_retries;
 
/* If killed, bypass charge */
-   if (fatal_signal_pending(current)) {
+   if (test_thread_flag(TIF_MEMDIE) ||
+   fatal_signal_pending(current)) {
css_put(&memcg->css);
goto bypass;
}
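The strengthened abort condition can be sketched as a pure predicate. The structure and field names below are illustrative, not the kernel's:

```c
#include <assert.h>

/* Memcg reclaim/charge now bails out if the task is either an OOM victim
 * (TIF_MEMDIE set) or has a fatal signal pending. Checking only the
 * pending signal misses tasks killed after they entered do_exit, where
 * signal_pending may already be clear. */
struct task_state {
    int tif_memdie;           /* task was chosen by the OOM killer */
    int fatal_signal_pending; /* SIGKILL queued and not yet handled */
};

static int should_abort_reclaim(const struct task_state *t)
{
    return t->tif_memdie || t->fatal_signal_pending;
}
```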


[Devel] [PATCH RHEL7 COMMIT] oom: zap unused oom_scan_process_thread arguments

2016-04-29 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.10.1.vz7.12.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.10.1.vz7.12.15
-->
commit e7c4b8f6b315c9e74e86da4fe5e7b981916efe07
Author: Vladimir Davydov 
Date:   Fri Apr 29 19:28:59 2016 +0400

oom: zap unused oom_scan_process_thread arguments

totalpages hasn't been used for ages. force_kill doesn't make sense in
our case, because we skip TIF_MEMDIE tasks anyway.

Signed-off-by: Vladimir Davydov 
Reviewed-by: Kirill Tkhai 
---
 drivers/tty/sysrq.c |  2 +-
 include/linux/oom.h |  5 ++---
 mm/memcontrol.c |  3 +--
 mm/oom_kill.c   | 24 +---
 mm/page_alloc.c |  2 +-
 5 files changed, 14 insertions(+), 22 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index de29f8e..e7d334c 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -356,7 +356,7 @@ static struct sysrq_key_op sysrq_term_op = {
 static void moom_callback(struct work_struct *ignored)
 {
out_of_memory(node_zonelist(first_online_node, GFP_KERNEL), GFP_KERNEL,
- 0, NULL, true);
+ 0, NULL);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index acf58fc..5c2a7a4 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -99,11 +99,10 @@ extern void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
   int order, const nodemask_t *nodemask);
 
 extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
-   unsigned long totalpages, const nodemask_t *nodemask,
-   bool force_kill);
+  const nodemask_t *nodemask);
 
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
-   int order, nodemask_t *mask, bool force_kill);
+ int order, nodemask_t *mask);
 
 extern void exit_oom_victim(void);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a2ac582..82204b3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2039,8 +2039,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
cgroup_iter_start(cgroup, &it);
while ((task = cgroup_iter_next(cgroup, &it))) {
-   switch (oom_scan_process_thread(task, totalpages, NULL,
-   false)) {
+   switch (oom_scan_process_thread(task, NULL)) {
case OOM_SCAN_SELECT:
if (chosen)
put_task_struct(chosen);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8d1843f..5bc4ccf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -317,8 +317,7 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 #endif
 
 enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
-   unsigned long totalpages, const nodemask_t *nodemask,
-   bool force_kill)
+   const nodemask_t *nodemask)
 {
if (oom_unkillable_task(task, NULL, nodemask))
return OOM_SCAN_CONTINUE;
@@ -330,10 +329,9 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 * This can only happen if oom_trylock timeout-ed, which most probably
 * means that the victim had dead-locked.
 */
-   if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-   if (!force_kill)
-   return OOM_SCAN_CONTINUE;
-   }
+   if (test_tsk_thread_flag(task, TIF_MEMDIE))
+   return OOM_SCAN_CONTINUE;
+
if (!task->mm)
return OOM_SCAN_CONTINUE;
 
@@ -355,8 +353,7 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
unsigned long *poverdraft,
-   unsigned long totalpages, const nodemask_t *nodemask,
-   bool force_kill)
+   unsigned long totalpages, const nodemask_t *nodemask)
 {
struct task_struct *g, *p;
struct task_struct *chosen = NULL;
@@ -368,8 +365,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
unsigned int points;
unsigned long overdraft;
 
-   switch (oom_scan_process_thread(p, totalpages, nodemask,
-   force_kill)) {
+   switch (oom_scan_process_thread(p, nodemask)) {
case OOM_SCAN_SELECT:
chosen = p;
chosen_points = ULONG_MAX;
@@ -954,7 +950,6 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
  * @nodemask: nodemask passed to page 

[Devel] [PATCH RHEL7 COMMIT] oom: do not select child that has already been killed

2016-04-29 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.10.1.vz7.12.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.10.1.vz7.12.15
-->
commit 1ed898800a3069f582ea3f82e0e920ecb4f0740c
Author: Vladimir Davydov 
Date:   Fri Apr 29 19:28:57 2016 +0400

oom: do not select child that has already been killed

Patchset description:
A few fixes for oom killer

This is for https://jira.sw.ru/browse/PSBM-40842

Vladimir Davydov (5):
  oom: do not select child that has already been killed
  mm: memcontrol: check more carefully if current is oom killed
  oom: do not ignore score of exiting tasks
  oom: zap unused oom_scan_process_thread arguments
  oom: make berserker more aggressive

==
This patch description:

Otherwise, we might end up selecting the same process over and over
again in case it got stuck somewhere in exit path for some reason.

Signed-off-by: Vladimir Davydov 
Reviewed-by: Kirill Tkhai 
---
 mm/oom_kill.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2402fcc..b21641f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -847,6 +847,9 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
if (child->mm == p->mm)
continue;
+   if (!child->mm ||
+   test_tsk_thread_flag(child, TIF_MEMDIE))
+   continue;
/*
 * oom_badness() returns 0 if the thread is unkillable
 */
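The child-selection skip added above can be sketched as a predicate over an illustrative task model (the field names are ours, not the kernel's):

```c
#include <assert.h>

/* A child is no longer a candidate victim if it shares the parent's mm,
 * has no mm at all (already past exit_mm), or has already been marked an
 * OOM victim (TIF_MEMDIE) -- selecting it again would just spin on a task
 * that is already dying. */
struct child_state {
    int shares_parent_mm;
    int has_mm;
    int tif_memdie;
};

static int skip_child(const struct child_state *c)
{
    return c->shares_parent_mm || !c->has_mm || c->tif_memdie;
}
```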


[Devel] [PATCH RHEL7 COMMIT] oom: do not ignore score of exiting tasks

2016-04-29 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.10.1.vz7.12.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.10.1.vz7.12.15
-->
commit 4439834255824766734137c5090acd2fca6c0614
Author: Vladimir Davydov 
Date:   Fri Apr 29 19:28:59 2016 +0400

oom: do not ignore score of exiting tasks

This patch effectively backports commit 6a618957ad17 ("mm: oom_kill:
don't ignore oom score on exiting tasks").

The reason I'm doing this is that it hurts the oom berserker logic. When the
oom berserker enrages, it may kill quite a few tasks. All of them will soon
end up in do_exit, where they can get stuck trying to acquire some mutex
(quite likely if a sort of fork bomb is running inside a container). Due
to the check removed by this patch the next oom kill will surely select
one such task, and hence is likely to time out, and so will the one
after the next and so on until all such tasks have been marked with
TIF_MEMDIE or exited. I.e. instead of killing more tasks we will be
wasting our time marking those that have already been killed and waiting
for them to exit. This may slow down oom berserker significantly. Since
this PF_EXITING check was removed from oom_scan_process_thread upstream
for a similar reason, let's zap it in our kernel too.

Signed-off-by: Vladimir Davydov 
Reviewed-by: Kirill Tkhai 
---
 mm/oom_kill.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b21641f..8d1843f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -344,14 +344,6 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
if (oom_task_origin(task))
return OOM_SCAN_SELECT;
 
-   if (task->flags & PF_EXITING && !force_kill) {
-   /*
-* If this task is not being ptraced on exit, then wait for it
-* to finish before killing some other task unnecessarily.
-*/
-   if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
-   return OOM_SCAN_SELECT;
-   }
return OOM_SCAN_OK;
 }
 


[Devel] [PATCH RHEL7 COMMIT] oom: make berserker more aggressive

2016-04-29 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.10.1.vz7.12.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.10.1.vz7.12.15
-->
commit 4ddb655454d7893e6f9f5c3fc202694f6078219f
Author: Vladimir Davydov 
Date:   Fri Apr 29 19:29:00 2016 +0400

oom: make berserker more aggressive

In the berserker mode we kill a bunch of tasks that are as bad as the
selected victim. We assume two tasks to be equally bad if they consume
the same permille of memory. With such a strict check, the oom berserker
might not kill any tasks at all when a fork bomb is running inside a
container, even though killing a single task eating <=1/1000th of memory
is not enough to cope with the memory shortage. Let's loosen
this check and use percentage instead of permille. In this case, it
might still happen that berserker won't kill anyone, but in this case
the regular oom should free at least 1/100th of memory, which should be
enough even for small containers.

Also, check berserker mode even if the victim has already exited by the
time we are about to send SIGKILL to it. Rationale: when the berserker
is in rage, it might kill hundreds of tasks so that the next oom kill is
likely to select an exiting task. Not triggering berserker in this case
will result in oom stalls.

Signed-off-by: Vladimir Davydov 
Reviewed-by: Kirill Tkhai 
---
 mm/oom_kill.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5bc4ccf..2d0fcac 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -757,11 +757,11 @@ static void oom_berserker(unsigned long points, unsigned long overdraft,
continue;
 
/*
-* Consider tasks as equally bad if they have equal
-* normalized scores.
+* Consider tasks as equally bad if they occupy equal
+* percentage of available memory.
 */
-   if (tsk_points * 1000 / totalpages <
-   points * 1000 / totalpages)
+   if (tsk_points * 100 / totalpages <
+   points * 100 / totalpages)
continue;
 
if (__ratelimit(&berserker_rs)) {
@@ -809,8 +809,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
if (p->mm && p->flags & PF_EXITING) {
mark_oom_victim(p);
task_unlock(p);
-   put_task_struct(p);
-   return;
+   goto out;
}
task_unlock(p);
 
@@ -855,8 +854,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
p = find_lock_task_mm(victim);
if (!p) {
-   put_task_struct(victim);
-   return;
+   goto out;
} else if (victim != p) {
get_task_struct(p);
put_task_struct(victim);
@@ -902,8 +900,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
mem_cgroup_note_oom_kill(memcg, victim);
+out:
put_task_struct(victim);
-
oom_berserker(points, overdraft, totalpages, memcg, nodemask);
 }
 #undef K
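The effect of switching the bucketing from permille to percent can be seen with integer division. This is a userspace restatement of the comparison in the hunk above, not the kernel code itself:

```c
#include <assert.h>

/* Two tasks are "equally bad" when integer division of their badness by
 * totalpages, at the given scale, yields the same (or a larger) quotient.
 * Moving from a x1000 (permille) to a x100 (percent) scale makes the
 * buckets 10x wider, so small fork-bomb children are far more likely to
 * land in the same bucket as the selected victim. */
static int equally_bad(unsigned long tsk_points, unsigned long points,
                       unsigned long totalpages, unsigned long scale)
{
    return tsk_points * scale / totalpages >= points * scale / totalpages;
}
```

For example, with totalpages = 100000, a task at 300 points (0.3%) is not in the victim's bucket at 900 points (0.9%) under permille, but both round down to 0% under percent.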


Re: [Devel] [PATCH rh7] oom: fix NULL ptr deref on oom if memory cgroup is disabled

2016-04-29 Thread Konstantin Khorenko

Going to send it to mainstream?

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 04/27/2016 04:09 PM, Vladimir Davydov wrote:

mem_cgroup_iter and try_get_mem_cgroup_from_mm return NULL in this case,
handle this properly.

https://jira.sw.ru/browse/PSBM-43328

Signed-off-by: Vladimir Davydov 
---
  include/linux/memcontrol.h |  5 +++--
  mm/memcontrol.c|  4 +++-
  mm/oom_kill.c  | 20 +++-
  3 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d90f6c77dc69..743fb0b6f621 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -31,6 +31,8 @@ struct mm_struct;
  struct kmem_cache;
  struct oom_context;

+extern struct oom_context global_oom_ctx;
+
  /* Stats that can be updated by kernel. */
  enum mem_cgroup_page_stat_item {
MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
@@ -392,8 +394,7 @@ mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
  static inline struct oom_context *
  mem_cgroup_oom_context(struct mem_cgroup *memcg)
  {
-   extern struct oom_context oom_ctx;
-   return &oom_ctx;
+   return &global_oom_ctx;
  }

  static inline unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 61c395b7c4ed..fa66d1128cfb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1699,6 +1699,8 @@ void mem_cgroup_note_oom_kill(struct mem_cgroup *root_memcg,

  struct oom_context *mem_cgroup_oom_context(struct mem_cgroup *memcg)
  {
+   if (mem_cgroup_disabled())
+   return &global_oom_ctx;
if (!memcg)
memcg = root_mem_cgroup;
return &memcg->oom_ctx;
@@ -1708,7 +1710,7 @@ unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg)
  {
unsigned long long guarantee, usage;

-   if (mem_cgroup_is_root(memcg))
+   if (mem_cgroup_disabled() || mem_cgroup_is_root(memcg))
return 0;

guarantee = ACCESS_ONCE(memcg->oom_guarantee);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2402fcceda6e..7a328e8c3204 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -51,12 +51,10 @@ static DEFINE_SPINLOCK(oom_context_lock);
  #define OOM_BASE_RAGE -10
  #define OOM_MAX_RAGE  20

-#ifndef CONFIG_MEMCG
-struct oom_context oom_ctx = {
+struct oom_context global_oom_ctx = {
.rage   = OOM_BASE_RAGE,
-   .waitq  = __WAIT_QUEUE_HEAD_INITIALIZER(oom_ctx.waitq),
+   .waitq  = __WAIT_QUEUE_HEAD_INITIALIZER(global_oom_ctx.waitq),
  };
-#endif

  void init_oom_context(struct oom_context *ctx)
  {
@@ -187,7 +185,8 @@ static unsigned long mm_overdraft(struct mm_struct *mm)
memcg = try_get_mem_cgroup_from_mm(mm);
ctx = mem_cgroup_oom_context(memcg);
overdraft = ctx->overdraft;
-   mem_cgroup_put(memcg);
+   if (memcg)
+   mem_cgroup_put(memcg);

return overdraft;
  }
@@ -497,7 +496,8 @@ void mark_oom_victim(struct task_struct *tsk)
ctx->marked = true;
}
spin_unlock(&oom_context_lock);
-   mem_cgroup_put(memcg);
+   if (memcg)
+   mem_cgroup_put(memcg);
  }

  /**
@@ -608,7 +608,7 @@ bool oom_trylock(struct mem_cgroup *memcg)
 * information will be used in oom_badness.
 */
ctx->overdraft = mem_cgroup_overdraft(iter);
-   parent = parent_mem_cgroup(iter);
+   parent = iter ? parent_mem_cgroup(iter) : NULL;
if (parent && iter != memcg)
ctx->overdraft = max(ctx->overdraft,
mem_cgroup_oom_context(parent)->overdraft);
@@ -645,7 +645,8 @@ void oom_unlock(struct mem_cgroup *memcg)
 * on it for the victim to exit below.
 */
victim_memcg = iter;
-   mem_cgroup_get(iter);
+   if (iter)
+   mem_cgroup_get(iter);

mem_cgroup_iter_break(memcg, iter);
break;
@@ -695,7 +696,8 @@ void oom_unlock(struct mem_cgroup *memcg)
 */
ctx = mem_cgroup_oom_context(victim_memcg);
__wait_oom_context(ctx);
-   mem_cgroup_put(victim_memcg);
+   if (victim_memcg)
+   mem_cgroup_put(victim_memcg);
  }

  /*




Re: [Devel] [PATCH] ploop: force journal commit after dio_post_submit

2016-04-29 Thread Konstantin Khorenko

https://jira.sw.ru/browse/PSBM-45730

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 04/27/2016 05:42 PM, Dmitry Monakhov wrote:

Once we have converted an extent to initialized, it can be part of an
uncommitted journal transaction, so we have to force a transaction commit at
some point. The easiest way to do that is to perform an unconditional fsync.
https://jira.sw.ru/browse/PSBM-45326

TODO: This case and others can be optimized by deferring the fsync. But this
   is the subject of another patch.

Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/io_direct.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 8032999..5a2e12a 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -523,6 +523,8 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
err = io->files.file->f_op->fallocate(io->files.file,
  FALLOC_FL_CONVERT_UNWRITTEN,
  (loff_t)sec << 9, clu_siz);
+   if (!err)
+   err = io->files.file->f_op->FOP_FSYNC(io->files.file, 0);
file_end_write(io->files.file);
if (err) {
PLOOP_REQ_SET_ERROR(preq, err);
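The allocate-then-fsync pattern can be sketched in userspace. The vz-specific FALLOC_FL_CONVERT_UNWRITTEN flag is replaced here by a plain posix_fallocate() so the sketch runs on any Linux box; it only illustrates the ordering, not the ploop internals:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate blocks for the file, then force the (possibly journaled)
 * metadata change to stable storage with an unconditional fsync,
 * mirroring the ordering added in the hunk above. */
static int allocate_and_commit(int fd, off_t len)
{
    int err = posix_fallocate(fd, 0, len); /* returns 0 or an errno value */
    if (!err)
        err = fsync(fd) ? -1 : 0; /* the unconditional fsync */
    return err;
}

static int demo(void)
{
    char path[] = "/tmp/ploop_demoXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    int err = allocate_and_commit(fd, 4096);
    close(fd);
    unlink(path);
    return err;
}
```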




[Devel] [PATCH RHEL7 COMMIT] oom: fix NULL ptr deref on oom if memory cgroup is disabled

2016-04-29 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-327.10.1.vz7.12.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.10.1.vz7.12.15
-->
commit d5e7fba013768ed994b7af329f466b40a72ffe4e
Author: Vladimir Davydov 
Date:   Fri Apr 29 20:28:22 2016 +0400

oom: fix NULL ptr deref on oom if memory cgroup is disabled

mem_cgroup_iter and try_get_mem_cgroup_from_mm return NULL in this case;
handle this properly.

https://jira.sw.ru/browse/PSBM-43328

Signed-off-by: Vladimir Davydov 
---
 include/linux/memcontrol.h |  5 +++--
 mm/memcontrol.c|  4 +++-
 mm/oom_kill.c  | 20 +++-
 3 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 27b3c56..1427692 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -31,6 +31,8 @@ struct mm_struct;
 struct kmem_cache;
 struct oom_context;
 
+extern struct oom_context global_oom_ctx;
+
 /* Stats that can be updated by kernel. */
 enum mem_cgroup_page_stat_item {
MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
@@ -392,8 +394,7 @@ mem_cgroup_update_lru_size(struct lruvec *lruvec, enum 
lru_list lru,
 static inline struct oom_context *
 mem_cgroup_oom_context(struct mem_cgroup *memcg)
 {
-   extern struct oom_context oom_ctx;
-   return &oom_ctx;
+   return &global_oom_ctx;
 }
 
 static inline unsigned long mem_cgroup_overdraft(struct mem_cgroup *memcg)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 82204b3..7061864 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1699,6 +1699,8 @@ void mem_cgroup_note_oom_kill(struct mem_cgroup 
*root_memcg,
 
 struct oom_context *mem_cgroup_oom_context(struct mem_cgroup *memcg)
 {
+   if (mem_cgroup_disabled())
+   return &global_oom_ctx;
if (!memcg)
memcg = root_mem_cgroup;
return &memcg->oom_ctx;
@@ -1708,7 +1710,7 @@ unsigned long mem_cgroup_overdraft(struct mem_cgroup 
*memcg)
 {
unsigned long long guarantee, usage;
 
-   if (mem_cgroup_is_root(memcg))
+   if (mem_cgroup_disabled() || mem_cgroup_is_root(memcg))
return 0;
 
guarantee = ACCESS_ONCE(memcg->oom_guarantee);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2d0fcac..f9a8e62 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -51,12 +51,10 @@ static DEFINE_SPINLOCK(oom_context_lock);
 #define OOM_BASE_RAGE  -10
 #define OOM_MAX_RAGE   20
 
-#ifndef CONFIG_MEMCG
-struct oom_context oom_ctx = {
+struct oom_context global_oom_ctx = {
.rage   = OOM_BASE_RAGE,
-   .waitq  = __WAIT_QUEUE_HEAD_INITIALIZER(oom_ctx.waitq),
+   .waitq  = __WAIT_QUEUE_HEAD_INITIALIZER(global_oom_ctx.waitq),
 };
-#endif
 
 void init_oom_context(struct oom_context *ctx)
 {
@@ -187,7 +185,8 @@ static unsigned long mm_overdraft(struct mm_struct *mm)
memcg = try_get_mem_cgroup_from_mm(mm);
ctx = mem_cgroup_oom_context(memcg);
overdraft = ctx->overdraft;
-   mem_cgroup_put(memcg);
+   if (memcg)
+   mem_cgroup_put(memcg);
 
return overdraft;
 }
@@ -485,7 +484,8 @@ void mark_oom_victim(struct task_struct *tsk)
ctx->marked = true;
}
spin_unlock(&oom_context_lock);
-   mem_cgroup_put(memcg);
+   if (memcg)
+   mem_cgroup_put(memcg);
 }
 
 /**
@@ -596,7 +596,7 @@ bool oom_trylock(struct mem_cgroup *memcg)
 * information will be used in oom_badness.
 */
ctx->overdraft = mem_cgroup_overdraft(iter);
-   parent = parent_mem_cgroup(iter);
+   parent = iter ? parent_mem_cgroup(iter) : NULL;
if (parent && iter != memcg)
ctx->overdraft = max(ctx->overdraft,
mem_cgroup_oom_context(parent)->overdraft);
@@ -633,7 +633,8 @@ void oom_unlock(struct mem_cgroup *memcg)
 * on it for the victim to exit below.
 */
victim_memcg = iter;
-   mem_cgroup_get(iter);
+   if (iter)
+   mem_cgroup_get(iter);
 
mem_cgroup_iter_break(memcg, iter);
break;
@@ -683,7 +684,8 @@ void oom_unlock(struct mem_cgroup *memcg)
 */
ctx = mem_cgroup_oom_context(victim_memcg);
__wait_oom_context(ctx);
-   mem_cgroup_put(victim_memcg);
+   if (victim_memcg)
+   mem_cgroup_put(victim_memcg);
 }
 
 /*


[Devel] [NEW KERNEL] 3.10.0-327.10.1.vz7.12.16 (rhel7)

2016-04-29 Thread builder
Changelog:

OpenVZ kernel rh7-3.10.0-327.10.1.vz7.12.16

* allow using file leases inside Containers
* cbt: enhance API so the caller must specify explicitly the uuid of CBT mask
  to copy
* fuse: increase min/max_dirty_pages up to 256/512 MB
* fuse: fix periodic writeback speed drop under high load
* OOM: improve berserk mode to fight fork bombs more effectively in case
  the Node has a big swap
* config: enable panic-on-oops in debug kernel by default


Generated changelog:

* Sat Apr 30 2016 Konstantin Khorenko  
[3.10.0-327.10.1.vz7.12.16]
- config.OpenVZ.debug: enable panic-on-oops in debug kernel (Konstantin 
Khorenko)
- oom: fix NULL ptr deref on oom if memory cgroup is disabled (Vladimir 
Davydov) [PSBM-43328]
- oom: make berserker more aggressive (Vladimir Davydov)
- oom: zap unused oom_scan_process_thread arguments (Vladimir Davydov)
- oom: do not ignore score of exiting tasks (Vladimir Davydov)
- mm: memcontrol: check more carefully if current is oom killed (Vladimir 
Davydov)
- oom: do not select child that has already been killed (Vladimir Davydov) 
[PSBM-40842]
- fuse: never skip writeback if there are too many background requests 
(Vladimir Davydov) [PSBM-45497]
- fuse: increase min/max_dirty_pages up to 256/512 MB (Vladimir Davydov)
- cbt: add uuid arg to blk_cbt_map_copy_once() (Maxim Patlasov) [PSBM-45000]
- ve/fs/locks: Make CAP_LEASE work in containers (Evgenii Shatokhin) 
[PSBM-46199]


Built packages: 
http://kojistorage.eng.sw.ru/packages/vzkernel/3.10.0/327.10.1.vz7.12.16/


[Devel] [PATCH rh7 1/4] ploop: introduce pbd

2016-04-29 Thread Maxim Patlasov
This patch introduces the push_backup descriptor ("pbd") and a few simple
functions to create and release it.

Userspace can control it via new ioctls: PLOOP_IOC_PUSH_BACKUP_INIT and
PLOOP_IOC_PUSH_BACKUP_STOP.

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/Makefile  |2 
 drivers/block/ploop/dev.c |   89 
 drivers/block/ploop/push_backup.c |  271 +
 drivers/block/ploop/push_backup.h |8 +
 include/linux/ploop/ploop.h   |3 
 include/linux/ploop/ploop_if.h|   19 +++
 6 files changed, 391 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/ploop/push_backup.c
 create mode 100644 drivers/block/ploop/push_backup.h

diff --git a/drivers/block/ploop/Makefile b/drivers/block/ploop/Makefile
index e36a027..0fecf16 100644
--- a/drivers/block/ploop/Makefile
+++ b/drivers/block/ploop/Makefile
@@ -5,7 +5,7 @@ CFLAGS_io_direct.o = -I$(src)
 CFLAGS_ploop_events.o = -I$(src)
 
 obj-$(CONFIG_BLK_DEV_PLOOP)+= ploop.o
-ploop-objs := dev.o map.o io.o sysfs.o tracker.o freeblks.o ploop_events.o 
discard.o
+ploop-objs := dev.o map.o io.o sysfs.o tracker.o freeblks.o ploop_events.o 
discard.o push_backup.o
 
 obj-$(CONFIG_BLK_DEV_PLOOP)+= pfmt_ploop1.o
 pfmt_ploop1-objs := fmt_ploop1.o
diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 1da073c..23da9f5 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -19,6 +19,7 @@
 #include "ploop_events.h"
 #include "freeblks.h"
 #include "discard.h"
+#include "push_backup.h"
 
 /* Structures and terms:
  *
@@ -3766,6 +3767,9 @@ static int ploop_stop(struct ploop_device * plo, struct 
block_device *bdev)
return -EBUSY;
}
 
+   clear_bit(PLOOP_S_PUSH_BACKUP, &plo->state);
+   ploop_pb_stop(plo->pbd);
+
for (p = plo->disk->minors - 1; p > 0; p--)
invalidate_partition(plo->disk, p);
invalidate_partition(plo->disk, 0);
@@ -3892,6 +3896,7 @@ static int ploop_clear(struct ploop_device * plo, struct 
block_device * bdev)
}
 
ploop_fb_fini(plo->fbd, 0);
+   ploop_pb_fini(plo->pbd);
 
plo->maintenance_type = PLOOP_MNTN_OFF;
plo->bd_size = 0;
@@ -4477,6 +4482,84 @@ static int ploop_getdevice_ioc(unsigned long arg)
return err;
 }
 
+static int ploop_push_backup_init(struct ploop_device *plo, unsigned long arg)
+{
+   struct ploop_push_backup_init_ctl ctl;
+   struct ploop_pushbackup_desc *pbd = NULL;
+   int rc = 0;
+
+   if (list_empty(&plo->map.delta_list))
+   return -ENOENT;
+
+   if (plo->maintenance_type != PLOOP_MNTN_OFF)
+   return -EINVAL;
+
+   BUG_ON(plo->pbd);
+
+   if (copy_from_user(&ctl, (void*)arg, sizeof(ctl)))
+   return -EFAULT;
+
+   pbd = ploop_pb_alloc(plo);
+   if (!pbd) {
+   rc = -ENOMEM;
+   goto pb_init_done;
+   }
+
+   ploop_quiesce(plo);
+
+   rc = ploop_pb_init(pbd, ctl.cbt_uuid, !ctl.cbt_mask_addr);
+   if (rc) {
+   ploop_relax(plo);
+   goto pb_init_done;
+   }
+
+   plo->pbd = pbd;
+
+   atomic_set(&plo->maintenance_cnt, 0);
+   plo->maintenance_type = PLOOP_MNTN_PUSH_BACKUP;
+   set_bit(PLOOP_S_PUSH_BACKUP, &plo->state);
+
+   ploop_relax(plo);
+
+   if (ctl.cbt_mask_addr)
+   rc = ploop_pb_copy_cbt_to_user(pbd, (char *)ctl.cbt_mask_addr);
+pb_init_done:
+   if (rc)
+   ploop_pb_fini(pbd);
+   return rc;
+}
+
+static int ploop_push_backup_stop(struct ploop_device *plo, unsigned long arg)
+{
+   struct ploop_pushbackup_desc *pbd = plo->pbd;
+   struct ploop_push_backup_stop_ctl ctl;
+
+   if (plo->maintenance_type != PLOOP_MNTN_PUSH_BACKUP)
+   return -EINVAL;
+
+   if (copy_from_user(&ctl, (void*)arg, sizeof(ctl)))
+   return -EFAULT;
+
+   if (pbd && ploop_pb_check_uuid(pbd, ctl.cbt_uuid)) {
+   printk("ploop(%d): PUSH_BACKUP_STOP uuid mismatch\n",
+  plo->index);
+   return -EINVAL;
+   }
+
+   if (!test_and_clear_bit(PLOOP_S_PUSH_BACKUP, &plo->state))
+   return -EINVAL;
+
+   BUG_ON (!pbd);
+   ctl.status = ploop_pb_stop(pbd);
+
+   ploop_quiesce(plo);
+   ploop_pb_fini(plo->pbd);
+   plo->maintenance_type = PLOOP_MNTN_OFF;
+   ploop_relax(plo);
+
+   return 0;
+}
+
 static int ploop_ioctl(struct block_device *bdev, fmode_t fmode, unsigned int 
cmd,
   unsigned long arg)
 {
@@ -4581,6 +4664,12 @@ static int ploop_ioctl(struct block_device *bdev, 
fmode_t fmode, unsigned int cm
case PLOOP_IOC_MAX_DELTA_SIZE:
err = ploop_set_max_delta_size(plo, arg);
break;
+   case PLOOP_IOC_PUSH_BACKUP_INIT:
+   err = ploop_push_backup_init(plo, arg);
+   break;
+   case PLOOP_IOC_PUSH_BACKUP_STOP:
+   er

[Devel] [PATCH rh7 0/4] ploop: push_backup

2016-04-29 Thread Maxim Patlasov
The following series implements a new feature: ploop push_backup.

The idea is to suspend all incoming WRITE-requests until the userspace
backup application explicitly reports that the corresponding parts of
the ploop block device have been "pushed" -- i.e. stored in the backup.

To improve latency, the kernel ploop tells userspace about suspended
requests. This lets userspace "push" the corresponding parts of the
device out-of-band. After that, userspace may tell the kernel to
re-schedule those requests.

https://jira.sw.ru/browse/PSBM-45000

---

Maxim Patlasov (4):
  ploop: introduce pbd
  ploop: implement PLOOP_IOC_PUSH_BACKUP_IO
  ploop: wire push_backup into state-machine
  ploop: push_backup cleanup


 drivers/block/ploop/Makefile  |2 
 drivers/block/ploop/dev.c |  226 +++
 drivers/block/ploop/push_backup.c |  564 +
 drivers/block/ploop/push_backup.h |   19 +
 include/linux/ploop/ploop.h   |4 
 include/linux/ploop/ploop_if.h|   42 +++
 6 files changed, 856 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/ploop/push_backup.c
 create mode 100644 drivers/block/ploop/push_backup.h



[Devel] [PATCH rh7 2/4] ploop: implement PLOOP_IOC_PUSH_BACKUP_IO

2016-04-29 Thread Maxim Patlasov
The ioctl(PLOOP_IOC_PUSH_BACKUP_IO) has two modes of operation:

1) ctl.direction=PLOOP_READ tells userspace which cluster-blocks to
push out-of-band; it moves the processed preq-s from pending_tree to
reported_tree.

2) ctl.direction=PLOOP_WRITE tells the kernel which cluster-blocks were
pushed -- they are either ordinarily processed preq-s or out-of-band ones;
the kernel matches the blocks to preq-s in reported_tree and re-schedules them.

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c |  105 
 drivers/block/ploop/push_backup.c |  197 +
 drivers/block/ploop/push_backup.h |5 +
 include/linux/ploop/ploop_if.h|   23 
 4 files changed, 330 insertions(+)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 23da9f5..2a77d2e 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4529,6 +4529,108 @@ pb_init_done:
return rc;
 }
 
+static int ploop_push_backup_io_read(struct ploop_device *plo, unsigned long 
arg,
+struct ploop_push_backup_io_ctl *ctl)
+{
+   struct ploop_push_backup_ctl_extent *e;
+   unsigned n_extents = 0;
+   int rc = 0;
+
+   e = kmalloc(sizeof(*e) * ctl->n_extents, GFP_KERNEL);
+   if (!e)
+   return -ENOMEM;
+
+   while (n_extents < ctl->n_extents) {
+   cluster_t clu, len;
+   rc = ploop_pb_get_pending(plo->pbd, &clu, &len, n_extents);
+   if (rc)
+   goto io_read_done;
+
+   e[n_extents].clu = clu;
+   e[n_extents].len = len;
+   n_extents++;
+   }
+
+   rc = -EFAULT;
+   ctl->n_extents = n_extents;
+   if (copy_to_user((void*)arg, ctl, sizeof(*ctl)))
+   goto io_read_done;
+   if (n_extents &&
+   copy_to_user((void*)(arg + sizeof(*ctl)), e,
+n_extents * sizeof(*e)))
+   goto io_read_done;
+   rc = 0;
+
+io_read_done:
+   kfree(e);
+   return rc;
+}
+
+static int ploop_push_backup_io_write(struct ploop_device *plo, unsigned long 
arg,
+ struct ploop_push_backup_io_ctl *ctl)
+{
+   struct ploop_push_backup_ctl_extent *e;
+   unsigned i;
+   int rc = 0;
+
+   e = kmalloc(sizeof(*e) * ctl->n_extents, GFP_KERNEL);
+   if (!e)
+   return -ENOMEM;
+
+   rc = -EFAULT;
+   if (copy_from_user(e, (void*)(arg + sizeof(*ctl)),
+  ctl->n_extents * sizeof(*e)))
+   goto io_write_done;
+
+   rc = 0;
+   for (i = 0; i < ctl->n_extents; i++) {
+   cluster_t j;
+   for (j = e[i].clu; j < e[i].clu + e[i].len; j++)
+   ploop_pb_put_reported(plo->pbd, j, 1);
+/* OPTIMIZE ME LATER: like this:
+* ploop_pb_put_reported(plo->pbd, e[i].clu, e[i].len); */
+   }
+
+io_write_done:
+   kfree(e);
+   return rc;
+}
+
+static int ploop_push_backup_io(struct ploop_device *plo, unsigned long arg)
+{
+   struct ploop_push_backup_io_ctl ctl;
+   struct ploop_pushbackup_desc *pbd = plo->pbd;
+
+   if (list_empty(&plo->map.delta_list))
+   return -ENOENT;
+
+   if (plo->maintenance_type != PLOOP_MNTN_PUSH_BACKUP)
+   return -EINVAL;
+
+   BUG_ON (!pbd);
+
+   if (copy_from_user(&ctl, (void*)arg, sizeof(ctl)))
+   return -EFAULT;
+
+   if (!ctl.n_extents)
+   return -EINVAL;
+
+   if (ploop_pb_check_uuid(pbd, ctl.cbt_uuid)) {
+   printk("ploop(%d): PUSH_BACKUP_IO uuid mismatch\n",
+  plo->index);
+   return -EINVAL;
+   }
+
+   switch(ctl.direction) {
+   case PLOOP_READ:
+   return ploop_push_backup_io_read(plo, arg, &ctl);
+   case PLOOP_WRITE:
+   return ploop_push_backup_io_write(plo, arg, &ctl);
+   }
+
+   return -EINVAL;
+}
+
 static int ploop_push_backup_stop(struct ploop_device *plo, unsigned long arg)
 {
struct ploop_pushbackup_desc *pbd = plo->pbd;
@@ -4667,6 +4769,9 @@ static int ploop_ioctl(struct block_device *bdev, fmode_t 
fmode, unsigned int cm
case PLOOP_IOC_PUSH_BACKUP_INIT:
err = ploop_push_backup_init(plo, arg);
break;
+   case PLOOP_IOC_PUSH_BACKUP_IO:
+   err = ploop_push_backup_io(plo, arg);
+   break;
case PLOOP_IOC_PUSH_BACKUP_STOP:
err = ploop_push_backup_stop(plo, arg);
break;
diff --git a/drivers/block/ploop/push_backup.c 
b/drivers/block/ploop/push_backup.c
index ecc9862..477caf7 100644
--- a/drivers/block/ploop/push_backup.c
+++ b/drivers/block/ploop/push_backup.c
@@ -256,6 +256,89 @@ int ploop_pb_copy_cbt_to_user(struct ploop_pushbackup_desc 
*pbd, char *user_addr
return 0;
 }
 
+static void ploop_pb_add_req_to_tree(struct p

[Devel] [PATCH rh7 3/4] ploop: wire push_backup into state-machine

2016-04-29 Thread Maxim Patlasov
When the ploop state machine looks at a preq for the first time, it suspends
the preq if its cluster-block matches pbd->ppb_map (initially a copy of the
CBT mask). To suspend a preq we simply put it into pbd->pending_tree and
plo->lockout_tree.

Later, when userspace reports that the out-of-band processing is done, we
set the PLOOP_REQ_PUSH_BACKUP bit in preq->state, re-schedule the preq and
wake up the ploop state machine. This PLOOP_REQ_PUSH_BACKUP bit lets the
state machine know that the given preq is OK and that we shouldn't suspend
further preq-s for this cluster-block anymore.

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c |   32 +++
 drivers/block/ploop/push_backup.c |   62 +
 drivers/block/ploop/push_backup.h |6 
 include/linux/ploop/ploop.h   |1 +
 4 files changed, 101 insertions(+)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 2a77d2e..c7cc385 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2021,6 +2021,38 @@ restart:
return;
}
 
+   /* push_backup special processing */
+   if (!test_bit(PLOOP_REQ_LOCKOUT, &preq->state) &&
+   (preq->req_rw & REQ_WRITE) && preq->req_size &&
+   ploop_pb_check_bit(plo->pbd, preq->req_cluster)) {
+   if (ploop_pb_preq_add_pending(plo->pbd, preq)) {
+   /* already reported by userspace push_backup */
+   ploop_pb_clear_bit(plo->pbd, preq->req_cluster);
+   } else {
+   spin_lock_irq(&plo->lock);
+   ploop_add_lockout(preq, 0);
+   spin_unlock_irq(&plo->lock);
+   /*
+* preq IN: preq is in ppb_pending tree waiting for
+* out-of-band push_backup processing by userspace ...
+*/
+   return;
+   }
+   } else if (test_bit(PLOOP_REQ_LOCKOUT, &preq->state) &&
+  test_and_clear_bit(PLOOP_REQ_PUSH_BACKUP, &preq->state)) {
+   /*
+* preq OUT: out-of-band push_backup processing by
+* userspace done; preq was re-scheduled
+*/
+   ploop_pb_clear_bit(plo->pbd, preq->req_cluster);
+
+   spin_lock_irq(&plo->lock);
+   del_lockout(preq);
+   if (!list_empty(&preq->delay_list))
+   list_splice_init(&preq->delay_list, 
plo->ready_queue.prev);
+   spin_unlock_irq(&plo->lock);
+   }
+
if (plo->trans_map) {
err = ploop_find_trans_map(plo->trans_map, preq);
if (err) {
diff --git a/drivers/block/ploop/push_backup.c 
b/drivers/block/ploop/push_backup.c
index 477caf7..488b8fb 100644
--- a/drivers/block/ploop/push_backup.c
+++ b/drivers/block/ploop/push_backup.c
@@ -146,6 +146,32 @@ static void set_bit_in_map(struct page **map, u64 map_max, 
u64 blk)
do_bit_in_map(map, map_max, blk, SET_BIT);
 }
 
+static void clear_bit_in_map(struct page **map, u64 map_max, u64 blk)
+{
+   do_bit_in_map(map, map_max, blk, CLEAR_BIT);
+}
+
+static bool check_bit_in_map(struct page **map, u64 map_max, u64 blk)
+{
+   return do_bit_in_map(map, map_max, blk, CHECK_BIT);
+}
+
+/* intentionally lockless */
+void ploop_pb_clear_bit(struct ploop_pushbackup_desc *pbd, cluster_t clu)
+{
+   BUG_ON(!pbd);
+   clear_bit_in_map(pbd->ppb_map, pbd->ppb_block_max, clu);
+}
+
+/* intentionally lockless */
+bool ploop_pb_check_bit(struct ploop_pushbackup_desc *pbd, cluster_t clu)
+{
+   if (!pbd)
+   return false;
+
+   return check_bit_in_map(pbd->ppb_map, pbd->ppb_block_max, clu);
+}
+
 static int convert_map_to_map(struct ploop_pushbackup_desc *pbd)
 {
struct page **from_map = pbd->cbt_map;
@@ -278,6 +304,12 @@ static void ploop_pb_add_req_to_tree(struct ploop_request 
*preq,
rb_insert_color(&preq->reloc_link, tree);
 }
 
+static void ploop_pb_add_req_to_pending(struct ploop_pushbackup_desc *pbd,
+   struct ploop_request *preq)
+{
+   ploop_pb_add_req_to_tree(preq, &pbd->pending_tree);
+}
+
 static void ploop_pb_add_req_to_reported(struct ploop_pushbackup_desc *pbd,
 struct ploop_request *preq)
 {
@@ -339,6 +371,33 @@ ploop_pb_get_req_from_reported(struct 
ploop_pushbackup_desc *pbd,
return ploop_pb_get_req_from_tree(&pbd->reported_tree, clu);
 }
 
+int ploop_pb_preq_add_pending(struct ploop_pushbackup_desc *pbd,
+  struct ploop_request *preq)
+{
+   BUG_ON(!pbd);
+
+   spin_lock(&pbd->ppb_lock);
+
+   if (!test_bit(PLOOP_S_PUSH_BACKUP, &pbd->plo->state)) {
+   spin_unlock(&pbd->ppb_lock);
+   return -EINTR;
+   }
+
+   /* if (preq matches pbd->reported_map) return -EALREADY; */
+   if (preq->req_cluster < pbd->p

[Devel] [PATCH rh7 4/4] ploop: push_backup cleanup

2016-04-29 Thread Maxim Patlasov
ploop_pb_stop() is called either explicitly, when userspace makes
ioctl(PLOOP_IOC_PUSH_BACKUP_STOP), or implicitly on ploop shutdown,
when userspace stops the ploop device by ioctl(PLOOP_IOC_STOP).

In both cases, it is useful to re-schedule all suspended preq-s. Otherwise
we would not be able to destroy the ploop device, because some preq-s would
still not have completed.

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/push_backup.c |   36 +++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ploop/push_backup.c 
b/drivers/block/ploop/push_backup.c
index 488b8fb..05af67c 100644
--- a/drivers/block/ploop/push_backup.c
+++ b/drivers/block/ploop/push_backup.c
@@ -358,6 +358,12 @@ ploop_pb_get_first_req_from_pending(struct 
ploop_pushbackup_desc *pbd)
 }
 
 static struct ploop_request *
+ploop_pb_get_first_req_from_reported(struct ploop_pushbackup_desc *pbd)
+{
+   return ploop_pb_get_first_req_from_tree(&pbd->reported_tree);
+}
+
+static struct ploop_request *
 ploop_pb_get_req_from_pending(struct ploop_pushbackup_desc *pbd,
  cluster_t clu)
 {
@@ -400,16 +406,44 @@ int ploop_pb_preq_add_pending(struct 
ploop_pushbackup_desc *pbd,
 
 unsigned long ploop_pb_stop(struct ploop_pushbackup_desc *pbd)
 {
+   unsigned long ret = 0;
+   LIST_HEAD(drop_list);
+
if (pbd == NULL)
return 0;
 
spin_lock(&pbd->ppb_lock);
 
+   while (!RB_EMPTY_ROOT(&pbd->pending_tree)) {
+   struct ploop_request *preq =
+   ploop_pb_get_first_req_from_pending(pbd);
+   list_add(&preq->list, &drop_list);
+   ret++;
+   }
+
+   while (!RB_EMPTY_ROOT(&pbd->reported_tree)) {
+   struct ploop_request *preq =
+   ploop_pb_get_first_req_from_reported(pbd);
+   list_add(&preq->list, &drop_list);
+   ret++;
+   }
+
if (pbd->ppb_waiting)
complete(&pbd->ppb_comp);
spin_unlock(&pbd->ppb_lock);
 
-   return 0;
+   if (!list_empty(&drop_list)) {
+   struct ploop_device *plo = pbd->plo;
+
+   BUG_ON(!plo);
+   spin_lock_irq(&plo->lock);
+   list_splice_init(&drop_list, plo->ready_queue.prev);
+   if (test_bit(PLOOP_S_WAIT_PROCESS, &plo->state))
+   wake_up_interruptible(&plo->waitq);
+   spin_unlock_irq(&plo->lock);
+   }
+
+   return ret;
 }
 
 int ploop_pb_get_pending(struct ploop_pushbackup_desc *pbd,
