[Devel] [PATCH vz8] kernel/cgroup: remove unnecessary cgroup_mutex lock.

2020-10-09 Thread Andrey Ryabinin
Stopping a container causes lockdep to complain (see the report below).
We can avoid this simply by removing the cgroup_mutex lock from
cgroup_mark_ve_root(). I believe it is not needed there; it seems to
have been added just in case.
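
With the patch applied, cgroup_mark_ve_root() relies on css_set_lock (plus RCU)
alone to protect the css_set walk. A rough sketch of the resulting shape,
reconstructed from the hunks below (the body of the walk is elided in the diff,
so it is only summarised in a comment here):

	void cgroup_mark_ve_root(struct ve_struct *ve)
	{
		struct css_set *cset;
		struct cgroup *cgrp;

		spin_lock_irq(&css_set_lock);
		rcu_read_lock();
		/* ... iterate the ve's css_sets and mark their root cgroups ... */
		rcu_read_unlock();
		spin_unlock_irq(&css_set_lock);
	}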

 WARNING: possible circular locking dependency detected
 4.18.0-193.6.3.vz8.4.6+debug #1 Not tainted
 --
 vzctl/36606 is trying to acquire lock:
 88814b195ca0 (kn->count#338){}, at: kernfs_remove_by_name_ns+0x40/0x80

 but task is already holding lock:
 9cf75a90 (cgroup_mutex){+.+.}, at: cgroup_kn_lock_live+0x106/0x390

 which lock already depends on the new lock.
 the existing dependency chain (in reverse order) is:

 -> #2 (cgroup_mutex){+.+.}:
__mutex_lock+0x163/0x13d0
cgroup_mark_ve_root+0x1d/0x2e0
ve_state_write+0xb81/0xdc0
cgroup_file_write+0x2da/0x7a0
kernfs_fop_write+0x255/0x410
vfs_write+0x157/0x460
ksys_write+0xb8/0x170
do_syscall_64+0xa5/0x4d0
entry_SYSCALL_64_after_hwframe+0x6a/0xdf

 -> #1 (&ve->op_sem){}:
down_write+0xa0/0x3d0
ve_state_write+0x6b/0xdc0
cgroup_file_write+0x2da/0x7a0
kernfs_fop_write+0x255/0x410
vfs_write+0x157/0x460
ksys_write+0xb8/0x170
do_syscall_64+0xa5/0x4d0
entry_SYSCALL_64_after_hwframe+0x6a/0xdf

 -> #0 (kn->count#338){}:
__lock_acquire+0x22cb/0x48c0
lock_acquire+0x14f/0x3b0
__kernfs_remove+0x61e/0x810
kernfs_remove_by_name_ns+0x40/0x80
cgroup_addrm_files+0x531/0x940
css_clear_dir+0xfb/0x200
kill_css+0x8f/0x120
cgroup_destroy_locked+0x246/0x5e0
cgroup_rmdir+0x2f/0x2c0
kernfs_iop_rmdir+0x131/0x1b0
vfs_rmdir+0x142/0x3c0
do_rmdir+0x2b2/0x340
do_syscall_64+0xa5/0x4d0
entry_SYSCALL_64_after_hwframe+0x6a/0xdf

 other info that might help us debug this:

 Chain exists of:
   kn->count#338 --> &ve->op_sem --> cgroup_mutex

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(cgroup_mutex);
                                lock(&ve->op_sem);
                                lock(cgroup_mutex);
   lock(kn->count#338);

*** DEADLOCK ***

 4 locks held by vzctl/36606:
  #0: 88813c02c890 (sb_writers#7){.+.+}, at: mnt_want_write+0x3c/0xa0
  #1: 88814414ad48 (&type->i_mutex_dir_key#5/1){+.+.}, at: do_rmdir+0x23c/0x340
  #2: 88811d3054e8 (&type->i_mutex_dir_key#5){}, at: vfs_rmdir+0xb6/0x3c0
  #3: 9cf75a90 (cgroup_mutex){+.+.}, at: cgroup_kn_lock_live+0x106/0x390

 Call Trace:
  dump_stack+0x9a/0xf0
  check_noncircular+0x317/0x3c0
  __lock_acquire+0x22cb/0x48c0
  lock_acquire+0x14f/0x3b0
  __kernfs_remove+0x61e/0x810
  kernfs_remove_by_name_ns+0x40/0x80
  cgroup_addrm_files+0x531/0x940
  css_clear_dir+0xfb/0x200
  kill_css+0x8f/0x120
  cgroup_destroy_locked+0x246/0x5e0
  cgroup_rmdir+0x2f/0x2c0
  kernfs_iop_rmdir+0x131/0x1b0
  vfs_rmdir+0x142/0x3c0
  do_rmdir+0x2b2/0x340
  do_syscall_64+0xa5/0x4d0
  entry_SYSCALL_64_after_hwframe+0x6a/0xdf

https://jira.sw.ru/browse/PSBM-120670
Signed-off-by: Andrey Ryabinin 
---
 kernel/cgroup/cgroup.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8420f3547f1a..08137d43f3ab 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1883,7 +1883,6 @@ void cgroup_mark_ve_root(struct ve_struct *ve)
 	struct css_set *cset;
 	struct cgroup *cgrp;
 
-	mutex_lock(&cgroup_mutex);
 	spin_lock_irq(&css_set_lock);
 
 	rcu_read_lock();
@@ -1899,7 +1898,6 @@ void cgroup_mark_ve_root(struct ve_struct *ve)
 	rcu_read_unlock();
 
 	spin_unlock_irq(&css_set_lock);
-	mutex_unlock(&cgroup_mutex);
 }
 
 static struct cgroup *cgroup_get_ve_root1(struct cgroup *cgrp)
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7] bcache: fix NULL pointer deref in blk_add_request_payload

2020-10-09 Thread Evgenii Shatokhin
From: Lars Ellenberg 

[https://lkml.org/lkml/2014/2/19/264]

bch_generic_make_request_hack() tries to be smart
and fakes bi_max_vecs = bi_vcnt.

If such a bio is a REQ_DISCARD one and gets submitted to a driver
(md raid) that uses bio_clone, the clone ends up with bi_io_vec == NULL,
is passed down the stack, and reaches sd_prep_fn and blk_add_request_payload,
which then tries to use bio->bi_io_vec->page.

Fix: try to be even smarter in bch_generic_make_request_hack(),
and always pretend to have a bi_max_vecs of at least 1,
unless the incoming bio was already created without a single bvec.

Signed-off-by: Lars Ellenberg 

https://jira.sw.ru/browse/PSBM-121142

The fix did not make it into the mainline or stable kernels, but it was not
rejected either, just forgotten.

In mainline, the problem was fixed in kernel 3.14 by commit
e90abc8ec323 ("block: Remove bi_idx hacks") and its prerequisites, which are
rather invasive.
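
For readers not used to the GCC "?:" shorthand in the one-line fix below, the
new assignment is equivalent to the following expanded logic (illustration only,
not part of the patch):

	if (bio->bi_vcnt)
		bio->bi_max_vecs = bio->bi_vcnt;  /* normal case: keep the real bvec count */
	else if (bio->bi_io_vec)
		bio->bi_max_vecs = 1;             /* e.g. REQ_DISCARD: bvec array present, zero entries used */
	else
		bio->bi_max_vecs = 0;             /* bio created without any bvecs at all */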

Signed-off-by: Evgenii Shatokhin 
---
 drivers/md/bcache/io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index d285cd49104c..4482c0982e8f 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -45,7 +45,7 @@ static void bch_generic_make_request_hack(struct bio *bio)
 *
 * To be taken out once immutable bvec stuff is in.
 */
-   bio->bi_max_vecs = bio->bi_vcnt;
+   bio->bi_max_vecs = bio->bi_vcnt ?: (bio->bi_io_vec ? 1 : 0);
 
generic_make_request(bio);
 }
-- 
2.27.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh8] ve/posix-timers: reference ve monotonic clock from ve start (v2)

2020-10-09 Thread Konstantin Khorenko
From: Kirill Tkhai 

So that CLOCK_MONOTONIC will stay monotonic even if a ve is migrated to
another hw node.

Note: translating ve <-> abs time in clock_settime and timer_settime is
not necessary because (1) clock_settime won't set the monotonic clock and
(2) timer_gettime always returns relative time.

https://jira.sw.ru/browse/PSBM-13860

diff-posix_timers-reference-ct-monotonic-clock-from-ct-start

Signed-off-by: Vladimir Davydov 

Acked-by: Pavel Emelyanov 
Signed-off-by: Kirill Tkhai 

+++
ve/posix-timers: reference ve monotonic clock from start in clock_nanosleep

This is an addition to 
diff-posix_timers-reference-ct-monotonic-clock-from-ct-start

Otherwise, apps that use sys_clock_nanosleep() to suspend their
execution can hang after ve migration.

diff-posix-timers-reference-ve-monotonic-clock-from-ve-start-in-clock_nanosleep

Signed-off-by: Vladimir Davydov 

Acked-by: Konstantin Khlebnikov 
Acked-by: Pavel Emelyanov 
Signed-off-by: Kirill Tkhai 

+++
timers: Port 
diff-ve-timers-convert-ve-monotonic-to-abs-time-when-setting-timerfd-2

Need this for Docker: sometimes systemd-tmpfiles-clean.timer inside
a PCS7 CT spams dbus with requests to start the corresponding service,
while at the same time Docker tries to create a cgroup for a container and
attach it to hierarchies like memory and blkio.

That happens because the systemd timer was triggered via a non-virtualized
timerfd using the plain host clock, while the check that the timer has fired
uses the virtualized clock_gettime() and does not pass until the proper
(in-container) activation time. The timer therefore fires again and again,
the service keeps being started, and systemd ends up in a busy loop.

https://jira.sw.ru/browse/PSBM-34017

v2: move the stubs to ve.h

Port the following RH6 commit:

  Author: Vladimir Davydov
  Email: vdavy...@parallels.com
  Subject: fs: convert ve monotonic to abs time when setting timerfd
  Date: Fri, 15 Feb 2013 11:57:09 +0400

  * [timers] corrected TFD_TIMER_ABSTIME timer handling,
the issue led to high cpu usage inside a Fedora 18 CT
by 'init' process (PSBM-18284)

  Monotonic time inside a container, as obtained via various system calls such
  as clock_gettime, is reported since the start of the container, not since the
  start of the whole system. This was done in order to avoid time issues while a
  container is migrated between different physical hosts, but it also introduced
  a lot of problems in time-related system calls, because absolute monotonic
  time, which is in fact relative to the container, passed to those system calls
  must be converted to the system-wide monotonic time used by kernel hrtimers.

  One of those buggy system calls is timerfd_settime, which accepts an absolute
  time as an argument if the TFD_TIMER_ABSTIME flag is specified.

  The patch fixes it by converting container monotonic time to system-
  wide monotonic time using the monotonic_ve_to_abs() function, which was
  introduced earlier and is now exported for that reason.

  https://jira.sw.ru/browse/PSBM-18284

  Signed-off-by: Vladimir Davydov 

Signed-off-by: Pavel Tikhomirov 
Signed-off-by: Kirill Tkhai 
Reviewed-by: Vladimir Davydov 

(cherry picked from vz7 commit 869542c24c41c0578b47d2ef83cfa63427e0e5e1)
Signed-off-by: Konstantin Khorenko 
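
For reference, monotonic_ve_to_abs() mentioned above conceptually boils down to
adding the container's start offset back. This is a simplified sketch only; the
helper's real definition is not part of this mail, and the ve_struct field name
used here is an assumption:

	/* ve-relative monotonic time counts from container start, so turning it
	 * into the system-wide monotonic time used by kernel hrtimers means
	 * adding the container's start offset back. */
	static inline void monotonic_ve_to_abs(clockid_t clockid,
					       struct timespec64 *val)
	{
		struct ve_struct *ve = get_exec_env();

		if (clockid == CLOCK_MONOTONIC || clockid == CLOCK_MONOTONIC_RAW)
			*val = timespec64_add(*val, ve->start_timespec);  /* assumed field */
	}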

+++
timers should not get a negative argument

This patch fixes a 25-second delay on login into systemd-based containers.

A userspace application can set a timer for the past
and expect the timer to expire immediately.

This may not work as expected inside migrated containers:
the translated argument provided to the timer can become negative,
and the corresponding timer will then sleep for a very long time.
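
The idea of the fix is to clamp the translated expiry time so that a timer armed
for a point in the container's past still fires immediately. A sketch of the
idea only (the actual posix-timers.c hunk is not included in this truncated
message, and the variable names here are illustrative):

	/* After translating a ve-relative absolute expiry time to the host
	 * monotonic clock, the result may already lie in the past.  Clamp it
	 * so the timer expires at once instead of sleeping for ages. */
	if (timespec64_compare(&new->it_value, &now) <= 0) {
		new->it_value.tv_sec = 0;
		new->it_value.tv_nsec = 1;	/* smallest future value: fire immediately */
	}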

https://jira.sw.ru/browse/PSBM-48475

CC: Vladimir Davydov 
CC: Konstantin Khorenko 
Signed-off-by: Vasily Averin 
Acked-by: Cyrill Gorcunov 

(cherry picked from vz7 commit a71fa19facb00472e47760255ab2e6fa16885732)
Signed-off-by: Konstantin Khorenko 
---
 fs/timerfd.c   |  8 --
 include/linux/ve.h |  8 ++
 kernel/time/posix-timers.c | 55 +-
 3 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/fs/timerfd.c b/fs/timerfd.c
index cdad49da3ff7..59ed38c29941 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct timerfd_ctx {
union {
@@ -432,8 +433,8 @@ SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
return ufd;
 }
 
-static int do_timerfd_settime(int ufd, int flags, 
-   const struct itimerspec64 *new,
+static int do_timerfd_settime(int ufd, int flags,
+   struct itimerspec64 *new,
struct itimerspec64 *old)
 {
struct fd f;
@@ -493,6 +494,9 @@ static int do_timerfd_settime(int ufd, int flags,
/*
 * Re-program the timer to the new value ...
 */
+	if ((flags & TFD_TIMER_ABSTIME) &&
+	    (new->it_value.tv_sec || new->it_value.tv_nsec))
+		monotonic_ve_to_abs(ctx->clockid, &new->it_value);
ret = timerfd_setup(ctx, 

[Devel] [PATCH vz8] memcg: Fix missing memcg->cache charges during page migration

2020-10-09 Thread Andrey Ryabinin
Since 44b7a8d33d66 ("mm: memcontrol: do not uncharge old page in
page cache replacement") mem_cgroup_migrate() charges newpage,
but the ->cache charge is missing there. Add it to fix negative ->cache
values, which lead to warnings like the one below and to softlockups.
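
For reference, the check added in the hunk below treats the new page as plain
page cache when it is neither anonymous nor swap-backed (shmem/tmpfs pages are
SwapBacked), which is the class of pages accounted to memcg->cache. A standalone
illustration of the predicate:

	/* Illustrative helper: pages charged to memcg->cache are those that are
	 * neither anonymous memory nor swap-backed (i.e. not shmem/tmpfs). */
	static inline bool page_is_plain_cache(struct page *page)
	{
		return !PageAnon(page) && !PageSwapBacked(page);
	}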

 WARNING: CPU: 14 PID: 1372 at mm/page_counter.c:62 page_counter_cancel+0x26/0x30

 Call Trace:
  page_counter_uncharge+0x1d/0x30
  uncharge_batch+0x25c/0x2e0
  mem_cgroup_uncharge_list+0x64/0x90
  release_pages+0x33e/0x3c0
  __pagevec_release+0x1b/0x40
  truncate_inode_pages_range+0x358/0x8b0
  ext4_evict_inode+0x167/0x580 [ext4]
  evict+0xd2/0x1a0
  do_unlinkat+0x250/0x2e0
  do_syscall_64+0x5b/0x1a0
  entry_SYSCALL_64_after_hwframe+0x65/0xca

https://jira.sw.ru/browse/PSBM-120653
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df70c3bdd444..134cb27307f2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6867,6 +6867,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
+	if (!PageAnon(newpage) && !PageSwapBacked(newpage))
+		page_counter_charge(&memcg->cache, nr_pages);
 	css_get_many(&memcg->css, nr_pages);
 
commit_charge(newpage, memcg, false);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] vmscan: don't report reclaim progress if there was no progress.

2020-10-09 Thread Andrey Ryabinin



On 10/9/20 10:22 AM, Vasily Averin wrote:
> Andrey,
> could you please clarify, is it required for vz8 too?
> 

vz8 doesn't need this. This part was removed by commit 0a0337e0d1 ("mm, oom:
rework oom detection").

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] mm/filemap: fix potential memcg->cache charge leak

2020-10-09 Thread Andrey Ryabinin



On 10/9/20 10:14 AM, Vasily Averin wrote:
> vz8 is affected too, please cherry-pick 
> vz7 commit 79a5642e9d9a6bdbb56d9e0ee990fd96b7c8625c
> 

vz8 is not affected
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] vmscan: don't report reclaim progress if there was no progress.

2020-10-09 Thread Vasily Averin
Andrey,
could you please clarify, is it required for vz8 too?

On 10/5/20 4:21 PM, Andrey Ryabinin wrote:
> __alloc_pages_slowpath relies on the direct reclaim and did_some_progress
> as an indicator that it makes sense to retry allocation rather than
> declaring OOM. shrink_zones checks whether all zones are reclaimable and,
> if shrink_zone didn't make any progress, it prevents a premature OOM
> killer invocation by reporting progress.
> This might happen if the LRU is full of dirty or writeback pages
> and direct reclaim cannot clean those up.
> 
> zone_reclaimable allows to rescan the reclaimable lists several times
> and restart if a page is freed.  This is really subtle behavior and it
> might lead to a livelock when a single freed page keeps allocator
> looping but the current task will not be able to allocate that single
> page.  OOM killer would be more appropriate than looping without any
> progress for unbounded amount of time.
> 
> Report no progress even if zones are reclaimable, as OOM is more appropriate
> in that case.
> 
> https://jira.sw.ru/browse/PSBM-104900
> Signed-off-by: Andrey Ryabinin 
> ---
>  mm/vmscan.c | 24 
>  1 file changed, 24 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 13ae9bd1e92e..85622f235e78 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2952,26 +2952,6 @@ static void snapshot_refaults(struct mem_cgroup 
> *root_memcg, struct zone *zone)
> } while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
>  }
>  
> -/* All zones in zonelist are unreclaimable? */
> -static bool all_unreclaimable(struct zonelist *zonelist,
> - struct scan_control *sc)
> -{
> - struct zoneref *z;
> - struct zone *zone;
> -
> - for_each_zone_zonelist_nodemask(zone, z, zonelist,
> - gfp_zone(sc->gfp_mask), sc->nodemask) {
> - if (!populated_zone(zone))
> - continue;
> - if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> - continue;
> - if (zone_reclaimable(zone))
> - return false;
> - }
> -
> - return true;
> -}
> -
>  static void shrink_tcrutches(struct scan_control *scan_ctrl)
>  {
>   int nid;
> @@ -3097,10 +3077,6 @@ out:
>   goto retry;
>   }
>  
> - /* top priority shrink_zones still had more to do? don't OOM, then */
> - if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
> - return 1;
> -
>   return 0;
>  }
>  
> 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] tun: Silence allocation failure if user asked for too big header

2020-10-09 Thread Vasily Averin
it looks like vz8 is affected too,
please cherry-pick vz7 commit 1e0ad3477bddaf5621b7cc620e6ed64e405ec8cd

On 10/5/20 4:42 PM, Andrey Ryabinin wrote:
> Userspace may ask the tun device to send a packet with a ridiculously
> big header and trigger this:
> 
>  [ cut here ]
>  WARNING: CPU: 1 PID: 15366 at mm/page_alloc.c:3548 __alloc_pages_nodemask+0x537/0x1200
>  order 19 >= 11, gfp 0x2044d0
>  Call Trace:
>dump_stack+0x19/0x1b
>__warn+0x17f/0x1c0
>warn_slowpath_fmt+0xad/0xe0
>__alloc_pages_nodemask+0x537/0x1200
>kmalloc_large_node+0x5f/0xd0
>__kmalloc_node_track_caller+0x425/0x630
>__kmalloc_reserve.isra.33+0x47/0xd0
>__alloc_skb+0xdd/0x5f0
>alloc_skb_with_frags+0x8f/0x540
>sock_alloc_send_pskb+0x5e5/0x940
>tun_get_user+0x38b/0x24a0 [tun]
>tun_chr_aio_write+0x13a/0x250 [tun]
>do_sync_readv_writev+0xdf/0x1c0
>do_readv_writev+0x1a5/0x850
>vfs_writev+0xba/0x190
>SyS_writev+0x17c/0x340
>system_call_fastpath+0x25/0x2a
> 
> Just add __GFP_NOWARN and silently return -ENOMEM to fix this.
> 
> https://jira.sw.ru/browse/PSBM-103639
> Signed-off-by: Andrey Ryabinin 
> ---
>  drivers/net/tun.c  | 4 ++--
>  include/net/sock.h | 7 +++
>  net/core/sock.c| 9 +
>  3 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index e95a89ba48b7..c0879c6a9703 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1142,8 +1142,8 @@ static struct sk_buff *tun_alloc_skb(struct tun_file *tfile,
>   if (prepad + len < PAGE_SIZE || !linear)
>   linear = len;
>  
> -	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
> -				   &err, 0);
> +	skb = sock_alloc_send_pskb_flags(sk, prepad + linear, len - linear, noblock,
> +					 &err, 0, __GFP_NOWARN);
>   if (!skb)
>   return ERR_PTR(err);
>  
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 4136d2c3080c..1912d85ecc4d 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1626,6 +1626,13 @@ extern struct sk_buff	*sock_alloc_send_pskb(struct sock *sk,
>  						      int noblock,
>  						      int *errcode,
>  						      int max_page_order);
> +extern struct sk_buff	*sock_alloc_send_pskb_flags(struct sock *sk,
> +						    unsigned long header_len,
> +						    unsigned long data_len,
> +						    int noblock,
> +						    int *errcode,
> +						    int max_page_order,
> +						    gfp_t extra_flags);
>  extern void *sock_kmalloc(struct sock *sk, int size,
> gfp_t priority);
>  extern void sock_kfree_s(struct sock *sk, void *mem, int size);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 508fc6093a26..07ea42f976cf 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1964,6 +1964,15 @@ struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
>  }
>  EXPORT_SYMBOL(sock_alloc_send_pskb);
>  
> +struct sk_buff *sock_alloc_send_pskb_flags(struct sock *sk, unsigned long header_len,
> +					    unsigned long data_len, int noblock,
> +					    int *errcode, int max_page_order, gfp_t extra_flags)
> +{
> +	return __sock_alloc_send_pskb(sk, header_len, data_len, noblock,
> +				      errcode, max_page_order, extra_flags);
> +}
> +EXPORT_SYMBOL(sock_alloc_send_pskb_flags);
> +
>  struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
>   int noblock, int *errcode)
>  {
> 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] mm/filemap: fix potential memcg->cache charge leak

2020-10-09 Thread Vasily Averin
vz8 is affected too, please cherry-pick 
vz7 commit 79a5642e9d9a6bdbb56d9e0ee990fd96b7c8625c

On 10/8/20 1:10 PM, Andrey Ryabinin wrote:
> __add_to_page_cache_locked() after mem_cgroup_try_charge_cache()
> uses mem_cgroup_cancel_charge() in one of the error paths.
> This may lead to leaking a few memcg->cache charges.
> 
> Use mem_cgroup_cancel_cache_charge() to fix this.
> 
> https://jira.sw.ru/browse/PSBM-121046
> Signed-off-by: Andrey Ryabinin 
> ---
>  mm/filemap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 53db13f236da..2bd5ca4e7528 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -732,7 +732,7 @@ static int __add_to_page_cache_locked(struct page *page,
>   error = radix_tree_maybe_preload(gfp_mask & GFP_RECLAIM_MASK);
>   if (error) {
>   if (!huge)
> - mem_cgroup_cancel_charge(page, memcg);
> + mem_cgroup_cancel_cache_charge(page, memcg);
>   return error;
>   }
>  
> 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel