Re: [Devel] [PATCH rh7 1/2] ms/netfilter: nf_tables: fix out-of-bounds in nft_chain_commit_update

2022-08-31 Thread Pavel Tikhomirov

Reviewed-by: Pavel Tikhomirov 

On 31.08.2022 19:24, Konstantin Khorenko wrote:

From: Taehee Yoo 

When the chain name is changed, nft_chain_commit_update is called.
In nft_chain_commit_update, trans->ctx.chain->name holds the old chain
name and nft_trans_chain_name(trans) holds the new chain name.
If the new chain name is longer than the old one, KASAN warns about a
slab-out-of-bounds write.
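
The mechanics, in miniature (a sketch with hypothetical names, not the
verbatim kernel code):

    /* chain->name was sized for the old, shorter name */
    chain->name = kstrdup("old", GFP_KERNEL);       /* 4-byte allocation */

    /* buggy commit path: byte-copy the new name into the old buffer */
    strcpy(chain->name, "a_much_longer_name");      /* out-of-bounds write */

    /* fixed commit path: exchange the pointers instead of the bytes;
     * the old name is released when the transaction is destroyed */
    swap(trans->ctx.chain->name, nft_trans_chain_name(trans));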

[  175.015012] BUG: KASAN: slab-out-of-bounds in strcpy+0x9e/0xb0
[  175.022735] Write of size 1 at addr 880114e022da by task iptables-compat/1458

[  175.031353] CPU: 0 PID: 1458 Comm: iptables-compat Not tainted 4.16.0-rc7+ #146
[  175.031353] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[  175.031353] Call Trace:
[  175.031353]  dump_stack+0x68/0xa0
[  175.031353]  print_address_description+0xd0/0x260
[  175.031353]  ? strcpy+0x9e/0xb0
[  175.031353]  kasan_report+0x234/0x350
[  175.031353]  __asan_report_store1_noabort+0x1c/0x20
[  175.031353]  strcpy+0x9e/0xb0
[  175.031353]  nf_tables_commit+0x1ccc/0x2990
[  175.031353]  nfnetlink_rcv+0x141e/0x16c0
[  175.031353]  ? nfnetlink_net_init+0x150/0x150
[  175.031353]  ? lock_acquire+0x370/0x370
[  175.031353]  ? lock_acquire+0x370/0x370
[  175.031353]  netlink_unicast+0x444/0x640
[  175.031353]  ? netlink_attachskb+0x700/0x700
[  175.031353]  ? _copy_from_iter_full+0x180/0x740
[  175.031353]  ? kasan_check_write+0x14/0x20
[  175.031353]  ? _copy_from_user+0x9b/0xd0
[  175.031353]  netlink_sendmsg+0x845/0xc70
[ ... ]

Steps to reproduce:
iptables-compat -N 1
iptables-compat -E 1 a

Signed-off-by: Taehee Yoo 
Signed-off-by: Pablo Neira Ayuso 

(cherry picked from ms commit d71efb599ad42ef1e564c652d8084252bdc85edf)
Signed-off-by: Konstantin Khorenko 
---
  net/netfilter/nf_tables_api.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index b3840f07b82ca..a231a67d62c07 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -4068,7 +4068,7 @@ static void nft_chain_commit_update(struct nft_trans *trans)
struct nft_base_chain *basechain;
  
  	if (nft_trans_chain_name(trans))

-   strcpy(trans->ctx.chain->name, nft_trans_chain_name(trans));
+   swap(trans->ctx.chain->name, nft_trans_chain_name(trans));
  
  	if (!(trans->ctx.chain->flags & NFT_BASE_CHAIN))

return;


--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


Re: [Devel] [PATCH RH9 01/10] drivers/vhost: vhost-blk accelerator for virtio-blk guests

2022-09-05 Thread Pavel Tikhomirov
General comment: it would be nice to have the changes relative to Asias'
version (https://lkml.org/lkml/2012/12/1/174) listed in the commit
message, so that we not only see the resulting patch but also have an
idea of why it was reworked this way.

--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


[Devel] [PATCH RH7 07/12] blk-wbt: improve waking of tasks

2022-09-29 Thread Pavel Tikhomirov
From: Jens Axboe 

We have two potential issues:

1) After commit 2887e41b910b, we only wake one process at a time when
   we finish an IO. We really want to wake up as many tasks as can
   queue IO. Before this commit, we woke up everyone, which could cause
   a thundering herd issue.

2) A task can potentially consume two wakeups, causing us to (in
   practice) miss a wakeup.

Fix both by providing our own wakeup function, which stops
__wake_up_common() from waking up more tasks if we fail to get a
queueing token. With the strict ordering we have on the wait list, this
wakes the right tasks and the right number of tasks.
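
Condensed, the new wake function works like this (the same code as in
the hunk below, annotated):

    static int wbt_wake_function(struct __wait_queue *curr, unsigned int mode,
                                 int wake_flags, void *key)
    {
        struct wbt_wait_data *data = container_of(curr, struct wbt_wait_data, wq);

        /* No queueing token left: returning -1 makes __wake_up_common()
         * stop scanning the (strictly ordered) wait list, so later
         * waiters are not woken without a token. */
        if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
            return -1;

        /* Token acquired on the sleeper's behalf. Dequeue it before
         * waking it, so it cannot consume a second wakeup. */
        data->got_token = true;
        list_del_init(&curr->task_list);
        wake_up_process(data->task);
        return 1;
    }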

Based on a patch from Jianchao Wang .

Tested-by: Agarwal, Anchal 
Signed-off-by: Jens Axboe 

Changes when porting to vz7:
- s/wait_queue_entry/__wait_queue/
- s/entry/task_list/
- add wb_acct argument to __wbt_wait

https://jira.sw.ru/browse/PSBM-141883
(cherry picked from commit 38cfb5a45ee013bfab5d1ae4c4738815e744b440)
Signed-off-by: Pavel Tikhomirov 
---
 block/blk-wbt.c | 73 +
 1 file changed, 61 insertions(+), 12 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 8719a820bfb7..49d11e089c97 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -159,7 +159,7 @@ static void wbt_rqw_done(struct rq_wb *rwb, struct rq_wait *rqw,
int diff = limit - inflight;
 
if (!inflight || diff >= rwb->wb_background / 2)
-   wake_up(&rqw->wait);
+   wake_up_all(&rqw->wait);
}
 }
 
@@ -518,26 +518,76 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
return limit;
 }
 
+struct wbt_wait_data {
+   struct __wait_queue wq;
+   struct task_struct *task;
+   struct rq_wb *rwb;
+   struct rq_wait *rqw;
+   unsigned long rw;
+   bool got_token;
+};
+
+static int wbt_wake_function(struct __wait_queue *curr, unsigned int mode,
+int wake_flags, void *key)
+{
+   struct wbt_wait_data *data = container_of(curr, struct wbt_wait_data,
+   wq);
+
+   /*
+* If we fail to get a budget, return -1 to interrupt the wake up
+* loop in __wake_up_common.
+*/
+   if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
+   return -1;
+
+   data->got_token = true;
+   list_del_init(&curr->task_list);
+   wake_up_process(data->task);
+   return 1;
+}
+
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
  */
-static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
+static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
+  unsigned long rw, spinlock_t *lock)
 {
struct rq_wait *rqw = get_rq_wait(rwb, current_is_kswapd());
-   DECLARE_WAITQUEUE(wait, current);
+   struct wbt_wait_data data = {
+   .wq = {
+   .func = wbt_wake_function,
+   .task_list = LIST_HEAD_INIT(data.wq.task_list),
+   },
+   .task = current,
+   .rwb = rwb,
+   .rqw = rqw,
+   .rw = rw,
+   };
bool has_sleeper;
 
has_sleeper = wq_has_sleeper(&rqw->wait);
if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
return;
 
-   add_wait_queue_exclusive(&rqw->wait, &wait);
+   prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
do {
-   set_current_state(TASK_UNINTERRUPTIBLE);
+   if (data.got_token)
+   break;
 
-   if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+   if (!has_sleeper &&
+   rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
+   finish_wait(&rqw->wait, &data.wq);
+
+   /*
+* We raced with wbt_wake_function() getting a token,
+* which means we now have two. Put our local token
+* and wake anyone else potentially waiting for one.
+*/
+   if (data.got_token)
+   wbt_rqw_done(rwb, rqw, wb_acct);
break;
+   }
 
if (lock)
spin_unlock_irq(lock);
@@ -550,8 +600,7 @@ static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
has_sleeper = false;
} while (1);
 
-   __set_current_state(TASK_RUNNING);
-   remove_wait_queue(&rqw->wait, &wait);
+   finish_wait(&rqw->wait, &data.wq);
 }
 
 static inline bool wbt_should_throttle(struc

[Devel] [PATCH RH7 06/12] blk-wbt: abstract out end IO completion handler

2022-09-29 Thread Pavel Tikhomirov
From: Jens Axboe 

Prep patch for calling the handler from a different context,
no functional changes in this patch.

Tested-by: Agarwal, Anchal 
Signed-off-by: Jens Axboe 

Changes when porting to vz7:
- second argument of get_rq_wait is a 'is_kswapd' boolean

https://jira.sw.ru/browse/PSBM-141883
(cherry picked from commit 061a5427530633de93ace4ef001b99961984af62)
Signed-off-by: Pavel Tikhomirov 
---
 block/blk-wbt.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 97adb724df09..8719a820bfb7 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -124,15 +124,11 @@ static void rwb_wake_all(struct rq_wb *rwb)
}
 }
 
-void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
+static void wbt_rqw_done(struct rq_wb *rwb, struct rq_wait *rqw,
+enum wbt_flags wb_acct)
 {
-   struct rq_wait *rqw;
int inflight, limit;
 
-   if (!(wb_acct & WBT_TRACKED))
-   return;
-
-   rqw = get_rq_wait(rwb, wb_acct & WBT_KSWAPD);
inflight = atomic_dec_return(&rqw->inflight);
 
/*
@@ -167,6 +163,17 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
}
 }
 
+void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
+{
+   struct rq_wait *rqw;
+
+   if (!(wb_acct & WBT_TRACKED))
+   return;
+
+   rqw = get_rq_wait(rwb, wb_acct & WBT_KSWAPD);
+   wbt_rqw_done(rwb, rqw, wb_acct);
+}
+
 /*
  * Called on completion of a request. Note that it's also called when
  * a request is merged, when the request gets freed.
-- 
2.37.1



[Devel] [PATCH RH7 03/12] net: Generalise wq_has_sleeper helper

2022-09-29 Thread Pavel Tikhomirov
From: Herbert Xu 

The memory barrier in the helper wq_has_sleeper is needed by just
about every user of waitqueue_active.  This patch generalises it
by making it take a wait_queue_head_t directly.  The existing
helper is renamed to skwq_has_sleeper.
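
In use, the helper pairs with the waiter side roughly like this (a
minimal sketch, assuming a simple condition flag):

    /* waiter */
    DEFINE_WAIT(wait);
    prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE); /* barrier via set_current_state() */
    if (!condition)
        schedule();
    finish_wait(&wq, &wait);

    /* waker */
    condition = true;
    if (wq_has_sleeper(&wq))    /* smp_mb(), then the lockless list check */
        wake_up(&wq);

Without the barrier, the waker may read an empty wait list before the
waiter's enqueue becomes visible, and the wakeup is lost.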

Signed-off-by: Herbert Xu 
Signed-off-by: David S. Miller 

Changes when porting to vz7:
- skip crypto/algif_aead.c hunks

https://jira.sw.ru/browse/PSBM-141883
(cherry picked from commit 1ce0bf50ae2233c7115a18c0c623662d177b434c)
Signed-off-by: Pavel Tikhomirov 
---
 crypto/algif_skcipher.c |  4 ++--
 include/linux/wait.h| 21 +
 include/net/sock.h  | 15 +--
 net/atm/common.c|  4 ++--
 net/core/sock.c |  8 
 net/core/stream.c   |  2 +-
 net/dccp/output.c   |  2 +-
 net/iucv/af_iucv.c  |  2 +-
 net/rxrpc/af_rxrpc.c|  2 +-
 net/sctp/socket.c   |  2 +-
 net/tipc/socket.c   |  4 ++--
 net/unix/af_unix.c  |  2 +-
 12 files changed, 42 insertions(+), 26 deletions(-)

diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index 9a62fa9e02ec..ad4a88628f95 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -186,7 +186,7 @@ static void skcipher_wmem_wakeup(struct sock *sk)
 
rcu_read_lock();
wq = rcu_dereference(sk->sk_wq);
-   if (wq_has_sleeper(wq))
+   if (skwq_has_sleeper(wq))
wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
   POLLRDNORM |
   POLLRDBAND);
@@ -236,7 +236,7 @@ static void skcipher_data_wakeup(struct sock *sk)
 
rcu_read_lock();
wq = rcu_dereference(sk->sk_wq);
-   if (wq_has_sleeper(wq))
+   if (skwq_has_sleeper(wq))
wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
   POLLRDNORM |
   POLLRDBAND);
diff --git a/include/linux/wait.h b/include/linux/wait.h
index b52741b1775a..2cd2201fc1e4 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -123,6 +123,27 @@ static inline int waitqueue_active(wait_queue_head_t *q)
return !list_empty(&q->task_list);
 }
 
+/**
+ * wq_has_sleeper - check if there are any waiting processes
+ * @wq: wait queue head
+ *
+ * Returns true if wq has waiting processes
+ *
+ * Please refer to the comment for waitqueue_active.
+ */
+static inline bool wq_has_sleeper(wait_queue_head_t *wq)
+{
+   /*
+* We need to be sure we are in sync with the
+* add_wait_queue modifications to the wait queue.
+*
+* This memory barrier should be paired with one on the
+* waiting side.
+*/
+   smp_mb();
+   return waitqueue_active(wq);
+}
+
 extern void add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait);
 extern void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait);
 extern void remove_wait_queue(wait_queue_head_t *q, wait_queue_t *wait);
diff --git a/include/net/sock.h b/include/net/sock.h
index e67f4de07c6b..a8609a8c04a1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -59,6 +59,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2054,12 +2055,12 @@ static inline bool sk_has_allocations(const struct sock *sk)
 }
 
 /**
- * wq_has_sleeper - check if there are any waiting processes
+ * skwq_has_sleeper - check if there are any waiting processes
  * @wq: struct socket_wq
  *
  * Returns true if socket_wq has waiting processes
  *
- * The purpose of the wq_has_sleeper and sock_poll_wait is to wrap the memory
+ * The purpose of the skwq_has_sleeper and sock_poll_wait is to wrap the memory
  * barrier call. They were added due to the race found within the tcp code.
  *
  * Consider following tcp code paths:
@@ -2085,15 +2086,9 @@ static inline bool sk_has_allocations(const struct sock *sk)
  * data on the socket.
  *
  */
-static inline bool wq_has_sleeper(struct socket_wq *wq)
+static inline bool skwq_has_sleeper(struct socket_wq *wq)
 {
-   /* We need to be sure we are in sync with the
-* add_wait_queue modifications to the wait queue.
-*
-* This memory barrier is paired in the sock_poll_wait.
-*/
-   smp_mb();
-   return wq && waitqueue_active(&wq->wait);
+   return wq && wq_has_sleeper(&wq->wait);
 }
 
 /**
diff --git a/net/atm/common.c b/net/atm/common.c
index ecaface2878d..6325ab578401 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -96,7 +96,7 @@ static void vcc_def_wakeup(struct sock *sk)
 
rcu_read_lock();
wq = rcu_dereference(sk->sk_wq);
-   if (wq_has_sleeper(wq))
+   if (skwq_has_sleeper(wq))
wake_up(&wq->wait);
rcu_read_unlock();
 }
@@ -117,7 +117,7 @@ static void vcc_write_space(struct sock *sk)
 
if 

[Devel] [PATCH RH7 04/12] blk-wbt: use wq_has_sleeper() for wq active check

2022-09-29 Thread Pavel Tikhomirov
From: Jens Axboe 

We need the memory barrier before checking the list head,
use the appropriate helper for this. The matching queue
side memory barrier is provided by set_current_state().

Tested-by: Anchal Agarwal 
Signed-off-by: Jens Axboe 

https://jira.sw.ru/browse/PSBM-141883
(cherry picked from commit b78820937b4762b7d30b807d7156bec1d89e4dd3)
Signed-off-by: Pavel Tikhomirov 
---
 block/blk-wbt.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index c178c06c3276..26986435d969 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -119,7 +119,7 @@ static void rwb_wake_all(struct rq_wb *rwb)
for (i = 0; i < WBT_NUM_RWQ; i++) {
struct rq_wait *rqw = &rwb->rq_wait[i];
 
-   if (waitqueue_active(&rqw->wait))
+   if (wq_has_sleeper(&rqw->wait))
wake_up_all(&rqw->wait);
}
 }
@@ -159,7 +159,7 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
if (inflight && inflight >= limit)
return;
 
-   if (waitqueue_active(&rqw->wait)) {
+   if (wq_has_sleeper(&rqw->wait)) {
int diff = limit - inflight;
 
if (!inflight || diff >= rwb->wb_background / 2)
@@ -520,8 +520,8 @@ static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
struct rq_wait *rqw = get_rq_wait(rwb, current_is_kswapd());
DECLARE_WAITQUEUE(wait, current);
 
-   if (!waitqueue_active(&rqw->wait)
-   && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+   if (!wq_has_sleeper(&rqw->wait) &&
+   rq_wait_inc_below(rqw, get_limit(rwb, rw)))
return;
 
add_wait_queue_exclusive(&rqw->wait, &wait);
-- 
2.37.1



[Devel] [PATCH RH7 11/12] rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule

2022-09-29 Thread Pavel Tikhomirov
From: Josef Bacik 

In case we get a spurious wakeup we need to make sure to re-set
ourselves to TASK_UNINTERRUPTIBLE so we don't busy wait.
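
The wait loop this lands in has the following shape (sketch):

    prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
    do {
        if (data.got_token)
            break;
        io_schedule();          /* may return on a spurious wakeup */
        set_current_state(TASK_UNINTERRUPTIBLE);  /* re-arm: without this,
                                                   * the next io_schedule()
                                                   * returns immediately and
                                                   * the loop busy-waits */
    } while (1);
    finish_wait(&rqw->wait, &data.wq);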

Reviewed-by: Oleg Nesterov 
Signed-off-by: Josef Bacik 
Signed-off-by: Jens Axboe 

Changes when porting to vz7:
- original patch is patching block/blk-rq-qos.c:rq_qos_wait, but in vz7
  similar hunk is in block/blk-wbt.c:__wbt_wait

https://jira.sw.ru/browse/PSBM-141883
(cherry picked from commit d14a9b389a86a5154b704bc88ce8dd37c701456a)
Signed-off-by: Pavel Tikhomirov 
---
 block/blk-wbt.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index f3c0841f009a..4c5b6899db71 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -599,6 +599,7 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
spin_lock_irq(lock);
 
has_sleeper = true;
+   set_current_state(TASK_UNINTERRUPTIBLE);
} while (1);
 
finish_wait(&rqw->wait, &data.wq);
-- 
2.37.1



[Devel] [PATCH RH7 02/12] blk-wbt: move disable check into get_limit()

2022-09-29 Thread Pavel Tikhomirov
From: Jens Axboe 

Check it in one place, instead of in multiple places.

Tested-by: Anchal Agarwal 
Signed-off-by: Jens Axboe 

https://jira.sw.ru/browse/PSBM-141883
(cherry picked from commit ffa358dcaae1f2f00926484e712e06daa8953cb4)
Signed-off-by: Pavel Tikhomirov 
---
 block/blk-wbt.c | 22 +++---
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 600046a47ed8..c178c06c3276 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -482,6 +482,13 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
 {
unsigned int limit;
 
+   /*
+* If we got disabled, just return UINT_MAX. This ensures that
+* we'll properly inc a new IO, and dec+wakeup at the end.
+*/
+   if (!rwb_enabled(rwb))
+   return UINT_MAX;
+
/*
 * At this point we know it's a buffered write. If this is
	 * kswapd trying to free memory, or REQ_SYNC is set, then
@@ -513,16 +520,6 @@ static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
struct rq_wait *rqw = get_rq_wait(rwb, current_is_kswapd());
DECLARE_WAITQUEUE(wait, current);
 
-   /*
-   * inc it here even if disabled, since we'll dec it at completion.
-   * this only happens if the task was sleeping in __wbt_wait(),
-   * and someone turned it off at the same time.
-   */
-   if (!rwb_enabled(rwb)) {
-   atomic_inc(&rqw->inflight);
-   return;
-   }
-
if (!waitqueue_active(&rqw->wait)
&& rq_wait_inc_below(rqw, get_limit(rwb, rw)))
return;
@@ -531,11 +528,6 @@ static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
do {
set_current_state(TASK_UNINTERRUPTIBLE);
 
-   if (!rwb_enabled(rwb)) {
-   atomic_inc(&rqw->inflight);
-   break;
-   }
-
if (rq_wait_inc_below(rqw, get_limit(rwb, rw)))
break;
 
-- 
2.37.1



[Devel] [PATCH RH7 10/12] rq-qos: don't reset has_sleepers on spurious wakeups

2022-09-29 Thread Pavel Tikhomirov
From: Josef Bacik 

If we raced with somebody else getting an inflight counter we could fail
to get an inflight counter with no sleepers on the list, and thus need
to go to sleep.  In this case has_sleepers should be true because we are
now relying on the waker to get our inflight counter for us.  And in the
case of spurious wakeups we'd still want this to be the case.  So set
has_sleepers to true if we went to sleep to make sure we're woken up the
proper way.
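
In loop form (sketch of the resulting logic):

    do {
        if (data.got_token)     /* a waker handed us a token */
            break;
        if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
            break;              /* grabbed a token ourselves */
        io_schedule();
        has_sleeper = true;     /* woken (perhaps spuriously): from now on
                                 * rely on the waker, so we never steal a
                                 * token from an earlier waiter */
    } while (1);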

Reviewed-by: Oleg Nesterov 
Signed-off-by: Josef Bacik 
Signed-off-by: Jens Axboe 

Changes when porting to vz7:
- original patch is patching block/blk-rq-qos.c:rq_qos_wait, but in vz7
  similar hunk is in block/blk-wbt.c:__wbt_wait

https://jira.sw.ru/browse/PSBM-141883
(cherry picked from commit 64e7ea875ef63b2801be7954cf7257d1bfccc266)
Signed-off-by: Pavel Tikhomirov 
---
 block/blk-wbt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 5477c3ffe7a7..f3c0841f009a 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -598,7 +598,7 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
if (lock)
spin_lock_irq(lock);
 
-   has_sleeper = false;
+   has_sleeper = true;
} while (1);
 
finish_wait(&rqw->wait, &data.wq);
-- 
2.37.1



[Devel] [PATCH RH7 00/12] fix hardlockup in wbt_done

2022-09-29 Thread Pavel Tikhomirov
We have a hard lockup detected in this stack:

#13 [9103fe603af8] __enqueue_entity at b8ce64c5
#14 [9103fe603b00] enqueue_entity at b8cee27a
#15 [9103fe603b50] enqueue_task_fair at b8ceea9c
#16 [9103fe603ba0] activate_task at b8cdd029
#17 [9103fe603bc8] ttwu_do_activate at b8cdd491
#18 [9103fe603bf0] try_to_wake_up at b8ce124a
#19 [9103fe603c40] default_wake_function at b8ce1552
#20 [9103fe603c50] autoremove_wake_function at b8ccb178
#21 [9103fe603c78] __wake_up_common at b8cd7752
#22 [9103fe603cd0] __wake_up_common_lock at b8cd7873
#23 [9103fe603d40] __wake_up at b8cd78c3
#24 [9103fe603d50] __wbt_done at b8fb6573
#25 [9103fe603d60] wbt_done at b8fb65f2
#26 [9103fe603d80] __blk_mq_finish_request at b8f8daa1
#27 [9103fe603db8] blk_mq_finish_request at b8f8db6a
#28 [9103fe603dc8] blk_mq_sched_put_request at b8f93ee0
#29 [9103fe603de8] blk_mq_end_request at b8f8d1a4
#30 [9103fe603e08] nvme_complete_rq at c033dcfc [nvme_core]
#31 [9103fe603e18] nvme_pci_complete_rq at c038be70 [nvme]
#32 [9103fe603e40] __blk_mq_complete_request at b8f8d316
#33 [9103fe603e68] blk_mq_complete_request at b8f8d3c7
#34 [9103fe603e78] nvme_irq at c038c0b2 [nvme]
#35 [9103fe603eb0] __handle_irq_event_percpu at b8d66bb4
#36 [9103fe603ef8] handle_irq_event_percpu at b8d66d62
#37 [9103fe603f28] handle_irq_event at b8d66dec
#38 [9103fe603f50] handle_edge_irq at b8d69c0f
#39 [9103fe603f70] handle_irq at b8c30524
#40 [9103fe603fb8] do_IRQ at b93d898d

which is exactly the same as the Ubuntu problem here:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1810998

This is because we have writeback throttling ported, which does not
work well in some cases.

In launchpad bug it helped to port these patches from mainstream:

  * CPU hard lockup with rigorous writes to NVMe drive (LP: #1810998)
- blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
- blk-wbt: move disable check into get_limit()
- blk-wbt: use wq_has_sleeper() for wq active check
- blk-wbt: fix has-sleeper queueing check
- blk-wbt: abstract out end IO completion handler
- blk-wbt: improve waking of tasks

which fixes similar lockup issues in wbt.

Moreover, I've found some more small and useful patches which fix races
(missed wakeups) in this code, so I've also put them in the patchset.

Anchal Agarwal (1):
  blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait

Herbert Xu (1):
  net: Generalise wq_has_sleeper helper

Jens Axboe (5):
  blk-wbt: move disable check into get_limit()
  blk-wbt: use wq_has_sleeper() for wq active check
  blk-wbt: fix has-sleeper queueing check
  blk-wbt: abstract out end IO completion handler
  blk-wbt: improve waking of tasks

Josef Bacik (5):
  wait: add wq_has_single_sleeper helper
  rq-qos: fix missed wake-ups in rq_qos_throttle
  rq-qos: don't reset has_sleepers on spurious wakeups
  rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
  rq-qos: use a mb for got_token

 block/blk-wbt.c | 130 
 crypto/algif_skcipher.c |   4 +-
 include/linux/wait.h|  34 +++
 include/net/sock.h  |  15 ++---
 net/atm/common.c|   4 +-
 net/core/sock.c |   8 +--
 net/core/stream.c   |   2 +-
 net/dccp/output.c   |   2 +-
 net/iucv/af_iucv.c  |   2 +-
 net/rxrpc/af_rxrpc.c|   2 +-
 net/sctp/socket.c   |   2 +-
 net/tipc/socket.c   |   4 +-
 net/unix/af_unix.c  |   2 +-
 13 files changed, 147 insertions(+), 64 deletions(-)

-- 
2.37.1



[Devel] [PATCH RH7 01/12] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait

2022-09-29 Thread Pavel Tikhomirov
[0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

In the original patch, wbt_done is waking up all the exclusive
processes in the wait queue, which can cause a thundering herd
if there is a large number of writer threads in the queue. The
original intention of the code seems to be to wake up one thread
only however, it uses wake_up_all() in __wbt_done(), and then
uses the following check in __wbt_wait to have only one thread
actually get out of the wait loop:

if (waitqueue_active(&rqw->wait) &&
rqw->wait.head.next != &wait->entry)
return false;

The problem with this is that the wait entry in wbt_wait is
define with DEFINE_WAIT, which uses the autoremove wakeup function.
That means that the above check is invalid - the wait entry will
have been removed from the queue already by the time we hit the
check in the loop.

Secondly, auto-removing the wait entries also means that the wait
queue essentially gets reordered "randomly" (e.g. threads re-add
themselves in the order they got to run after being woken up).
Additionally, new requests entering wbt_wait might overtake requests
that were queued earlier, because the wait queue will be
(temporarily) empty after the wake_up_all, so the waitqueue_active
check will not stop them. This can cause certain threads to starve
under high load.

The fix is to leave the woken up requests in the queue and remove
them in finish_wait() once the current thread breaks out of the
wait loop in __wbt_wait. This will ensure new requests always
end up at the back of the queue, and they won't overtake requests
that are already in the wait queue. With that change, the loop
in wbt_wait is also in line with many other wait loops in the kernel.
Waking up just one thread drastically reduces lock contention, as
does moving the wait queue add/remove out of the loop.

A significant drop in lockdep's lock contention numbers is seen when
running the test application on the patched kernel.
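
The resulting wait loop, condensed (a sketch that omits the lockless
fast path; the full, partially quoted hunk follows below):

    static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
    {
        struct rq_wait *rqw = get_rq_wait(rwb, current_is_kswapd());
        DECLARE_WAITQUEUE(wait, current);   /* default wake function: the
                                             * entry stays queued when woken */

        add_wait_queue_exclusive(&rqw->wait, &wait); /* enqueue once, FIFO kept */
        do {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (rq_wait_inc_below(rqw, get_limit(rwb, rw)))
                break;
            if (lock)
                spin_unlock_irq(lock);
            io_schedule();
            if (lock)
                spin_lock_irq(lock);
        } while (1);
        __set_current_state(TASK_RUNNING);
        remove_wait_queue(&rqw->wait, &wait);   /* dequeue only on the way out */
    }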

Signed-off-by: Anchal Agarwal 
Signed-off-by: Frank van der Linden 
Signed-off-by: Jens Axboe 

Changes porting to vz7:
- add rq_wait_inc_below helper

https://jira.sw.ru/browse/PSBM-141883
(cherry picked from commit 2887e41b910bb14fd847cf01ab7a5993db989d88)
Signed-off-by: Pavel Tikhomirov 
---
 block/blk-wbt.c | 60 -
 1 file changed, 29 insertions(+), 31 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 67f0c9e6451f..600046a47ed8 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -81,6 +81,11 @@ static bool atomic_inc_below(atomic_t *v, int below)
return true;
 }
 
+bool rq_wait_inc_below(struct rq_wait *rq_wait, unsigned int limit)
+{
+   return atomic_inc_below(&rq_wait->inflight, limit);
+}
+
 static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
 {
if (rwb_enabled(rwb)) {
@@ -158,7 +163,7 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
int diff = limit - inflight;
 
if (!inflight || diff >= rwb->wb_background / 2)
-   wake_up_all(&rqw->wait);
+   wake_up(&rqw->wait);
}
 }
 
@@ -499,30 +504,6 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
return limit;
 }
 
-static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
-wait_queue_t *wait, unsigned long rw)
-{
-   /*
-* inc it here even if disabled, since we'll dec it at completion.
-* this only happens if the task was sleeping in __wbt_wait(),
-* and someone turned it off at the same time.
-*/
-   if (!rwb_enabled(rwb)) {
-   atomic_inc(&rqw->inflight);
-   return true;
-   }
-
-   /*
-* If the waitqueue is already active and we are not the next
-* in line to be woken up, wait for our turn.
-*/
-   if (waitqueue_active(&rqw->wait) &&
-   rqw->wait.task_list.next != &wait->task_list)
-   return false;
-
-   return atomic_inc_below(&rqw->inflight, get_limit(rwb, rw));
-}
-
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -530,16 +511,32 @@ static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
 static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
 {
struct rq_wait *rqw = get_rq_wait(rwb, current_is_kswapd());
-   DEFINE_WAIT(wait);
+   DECLARE_WAITQUEUE(wait, current);
 
-   if (may_queue(rwb, rqw, &wait, rw))
+   /*
+   * inc it here even if disabled, since we'll dec it at completion.
+   * this only happens if the task was sleeping in __wbt_wait(),
+   * and someone turned it off at the same time.
+   */
+   if (!rwb_enabled(rwb

[Devel] [PATCH RH7 12/12] rq-qos: use a mb for got_token

2022-09-29 Thread Pavel Tikhomirov
From: Josef Bacik 

Oleg noticed that our checking of data.got_token is unsafe in the
cleanup case, and should really use a memory barrier.  Use a wmb on the
write side, and a rmb() on the read side.  We don't need one in the main
loop since we're saved by set_current_state().
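
The pairing, in isolation (sketch; both sides appear in the hunks
below):

    /* waker, in wbt_wake_function() */
    data->got_token = true;
    smp_wmb();                  /* publish got_token before the wakeup */
    wake_up_process(data->task);

    /* cleanup path in __wbt_wait(), after racing for a token ourselves */
    smp_rmb();                  /* observe got_token from a concurrent waker */
    if (data.got_token)         /* we hold two tokens: return one */
        wbt_rqw_done(rwb, rqw, wb_acct);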

Reviewed-by: Oleg Nesterov 
Signed-off-by: Josef Bacik 
Signed-off-by: Jens Axboe 

Changes when porting to vz7:
- original patch is patching block/blk-rq-qos.c:rq_qos_wait, but in vz7
  similar hunk is in block/blk-wbt.c:__wbt_wait
- also original patch is patching rq_qos_wake_function, but in vz7
  similar hunk is in wbt_wake_function

https://jira.sw.ru/browse/PSBM-141883
(cherry picked from commit ac38297f7038cd5b80d66f8809c7bbf5b70031f3)
Signed-off-by: Pavel Tikhomirov 
---
 block/blk-wbt.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 4c5b6899db71..099678cd0d04 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -541,6 +541,7 @@ static int wbt_wake_function(struct __wait_queue *curr, unsigned int mode,
return -1;
 
data->got_token = true;
+   smp_wmb();
list_del_init(&curr->task_list);
wake_up_process(data->task);
return 1;
@@ -573,6 +574,7 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
has_sleeper = !wq_has_single_sleeper(&rqw->wait);
do {
+   /* The memory barrier in set_task_state saves us here. */
if (data.got_token)
break;
 
@@ -585,6 +587,7 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
 * which means we now have two. Put our local token
 * and wake anyone else potentially waiting for one.
 */
+   smp_rmb();
if (data.got_token)
wbt_rqw_done(rwb, rqw, wb_acct);
break;
-- 
2.37.1



[Devel] [PATCH RH7 09/12] rq-qos: fix missed wake-ups in rq_qos_throttle

2022-09-29 Thread Pavel Tikhomirov
From: Josef Bacik 

We saw a hang in production with WBT where there was only one waiter in
the throttle path and no outstanding IO.  This is because of the
has_sleepers optimization that is used to make sure we don't steal an
inflight counter for new submitters when there are people already on the
list.

We can race with our check to see if the waitqueue has any waiters (this
is done locklessly) and the time we actually add ourselves to the
waitqueue.  If this happens we'll go to sleep and never be woken up
because nobody is doing IO to wake us up.

Fix this by checking if the waitqueue has a single sleeper on the list
after we add ourselves; that way we have an up-to-date view of the list.
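
The race and the fix, condensed (sketch):

    /* lockless fast path: can race with a concurrent submitter */
    has_sleeper = wq_has_sleeper(&rqw->wait);
    if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
        return;

    prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
    /* we are on the list now, so this check is against an up-to-date view;
     * if we are the only sleeper, nobody will do IO to wake us, and we
     * must keep trying to take an inflight counter ourselves */
    has_sleeper = !wq_has_single_sleeper(&rqw->wait);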

Reviewed-by: Oleg Nesterov 
Signed-off-by: Josef Bacik 
Signed-off-by: Jens Axboe 

Changes when porting to vz7:
- original patch is patching block/blk-rq-qos.c:rq_qos_wait, but in vz7
  similar hunk is in block/blk-wbt.c:__wbt_wait

https://jira.sw.ru/browse/PSBM-141883
(cherry picked from commit 545fbd0775bafcefc8f7bc844291bd13c44b7fdc)
Signed-off-by: Pavel Tikhomirov 
---
 block/blk-wbt.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 49d11e089c97..5477c3ffe7a7 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -571,6 +571,7 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
return;
 
prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
+   has_sleeper = !wq_has_single_sleeper(&rqw->wait);
do {
if (data.got_token)
break;
-- 
2.37.1



[Devel] [PATCH RH7 05/12] blk-wbt: fix has-sleeper queueing check

2022-09-29 Thread Pavel Tikhomirov
From: Jens Axboe 

We need to do this inside the loop as well, or we can allow new
IO to supersede previous IO.

Tested-by: Anchal Agarwal 
Signed-off-by: Jens Axboe 

https://jira.sw.ru/browse/PSBM-141883
(cherry picked from commit c45e6a037a536530bd25781ac7c989e52deb2a63)
Signed-off-by: Pavel Tikhomirov 
---
 block/blk-wbt.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 26986435d969..97adb724df09 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -519,16 +519,17 @@ static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
 {
struct rq_wait *rqw = get_rq_wait(rwb, current_is_kswapd());
DECLARE_WAITQUEUE(wait, current);
+   bool has_sleeper;
 
-   if (!wq_has_sleeper(&rqw->wait) &&
-   rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+   has_sleeper = wq_has_sleeper(&rqw->wait);
+   if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
return;
 
add_wait_queue_exclusive(&rqw->wait, &wait);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
 
-   if (rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+   if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
break;
 
if (lock)
@@ -538,6 +539,8 @@ static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
 
if (lock)
spin_lock_irq(lock);
+
+   has_sleeper = false;
} while (1);
 
__set_current_state(TASK_RUNNING);
-- 
2.37.1



[Devel] [PATCH RH7 08/12] wait: add wq_has_single_sleeper helper

2022-09-29 Thread Pavel Tikhomirov
From: Josef Bacik 

rq-qos sits in the io path so we want to take locks as sparingly as
possible.  To accomplish this we try not to take the waitqueue head lock
unless we are sure we need to go to sleep, and we have an optimization
to make sure that we don't starve out existing waiters.  Since we check
if there are existing waiters locklessly we need to be able to update
our view of the waitqueue list after we've added ourselves to the
waitqueue.  Accomplish this by adding this helper to see if there is
more than just ourselves on the list.

Reviewed-by: Oleg Nesterov 
Signed-off-by: Josef Bacik 
Signed-off-by: Jens Axboe 

Changes porting to vz7:
- s/wait_queue_head/wait_queue_head_t/

https://jira.sw.ru/browse/PSBM-141883
(cherry picked from commit a6d81d30d3cd87f85bfd922358eb18b8146c4925)
Signed-off-by: Pavel Tikhomirov 

---
 include/linux/wait.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2cd2201fc1e4..12075edebfd6 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -123,6 +123,19 @@ static inline int waitqueue_active(wait_queue_head_t *q)
return !list_empty(&q->task_list);
 }
 
+/**
+ * wq_has_single_sleeper - check if there is only one sleeper
+ * @wq_head: wait queue head
+ *
+ * Returns true if wq_head has only one sleeper on the list.
+ *
+ * Please refer to the comment for waitqueue_active.
+ */
+static inline bool wq_has_single_sleeper(wait_queue_head_t *q)
+{
+   return list_is_singular(&q->task_list);
+}
+
 /**
  * wq_has_sleeper - check if there are any waiting processes
  * @wq: wait queue head
-- 
2.37.1



Re: [Devel] [PATCH RH9 v3 02/10] drivers/vhost: use array to store workers

2022-10-13 Thread Pavel Tikhomirov




On 10.10.2022 17:56, Andrey Zhadchenko wrote:

We want to support several vhost workers. The first step is to
rework vhost to use array of workers rather than single pointer.
Update creation and cleanup routines.

https://jira.sw.ru/browse/PSBM-139414
Signed-off-by: Andrey Zhadchenko 
---
v3:
set vq->worker to NULL in vhost_vq_reset()


The vhost_virtqueue->worker field is only added in [PATCH RH9 v3 07/10]
"drivers/vhost: assign workers to virtqueues", so at this point you
don't have it yet; this would break compilation on this patch.


=> please re-check compilation on each patch



  drivers/vhost/vhost.c | 76 +++
  drivers/vhost/vhost.h | 10 +-
  2 files changed, 64 insertions(+), 22 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a0bfc77c6a43..321967322285 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -231,11 +231,24 @@ void vhost_poll_stop(struct vhost_poll *poll)
  }
  EXPORT_SYMBOL_GPL(vhost_poll_stop);
  
+static void vhost_work_queue_at_worker(struct vhost_worker *w,
+				       struct vhost_work *work)
+{
+   if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
+   /* We can only add the work to the list after we're
+* sure it was not in the list.
+* test_and_set_bit() implies a memory barrier.
+*/
+   llist_add(&work->node, &w->work_list);
+   wake_up_process(w->worker);
+   }
+}
+
  void vhost_work_dev_flush(struct vhost_dev *dev)
  {
struct vhost_flush_struct flush;
  
-	if (dev->worker) {
+	if (dev->workers[0].worker) {
init_completion(&flush.wait_event);
vhost_work_init(&flush.work, vhost_flush_work);
  
@@ -255,17 +268,12 @@ EXPORT_SYMBOL_GPL(vhost_poll_flush);
  
  void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
  {
-   if (!dev->worker)
+   struct vhost_worker *w = &dev->workers[0];
+
+   if (!w->worker)
return;
  
-	if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
-		/* We can only add the work to the list after we're
-* sure it was not in the list.
-* test_and_set_bit() implies a memory barrier.
-*/
-   llist_add(&work->node, &dev->work_list);
-   wake_up_process(dev->worker);
-   }
+   vhost_work_queue_at_worker(w, work);
  }
  EXPORT_SYMBOL_GPL(vhost_work_queue);
  
@@ -339,11 +347,32 @@ static void vhost_vq_reset(struct vhost_dev *dev,

vq->iotlb = NULL;
vhost_vring_call_reset(&vq->call_ctx);
__vhost_vq_meta_reset(vq);
+   vq->worker = NULL;
+}
+
+static void vhost_worker_reset(struct vhost_worker *w)
+{
+   init_llist_head(&w->work_list);
+   w->worker = NULL;
+}
+
+void vhost_cleanup_workers(struct vhost_dev *dev)
+{
+   int i;
+
+   for (i = 0; i < dev->nworkers; ++i) {
+   WARN_ON(!llist_empty(&dev->workers[i].work_list));
+   kthread_stop(dev->workers[i].worker);
+   vhost_worker_reset(&dev->workers[i]);
+   }
+
+   dev->nworkers = 0;
  }
  
  static int vhost_worker(void *data)
  {
-   struct vhost_dev *dev = data;
+   struct vhost_worker *w = data;
+   struct vhost_dev *dev = w->dev;
struct vhost_work *work, *work_next;
struct llist_node *node;
  
@@ -358,7 +387,7 @@ static int vhost_worker(void *data)
 			break;
}
  
-		node = llist_del_all(&dev->work_list);
+		node = llist_del_all(&w->work_list);
if (!node)
schedule();
  
@@ -481,7 +510,6 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->umem = NULL;
dev->iotlb = NULL;
dev->mm = NULL;
-   dev->worker = NULL;
dev->iov_limit = iov_limit;
dev->weight = weight;
dev->byte_weight = byte_weight;
@@ -493,6 +521,11 @@ void vhost_dev_init(struct vhost_dev *dev,
INIT_LIST_HEAD(&dev->pending_list);
spin_lock_init(&dev->iotlb_lock);
  
+	dev->nworkers = 0;
+	for (i = 0; i < VHOST_MAX_WORKERS; ++i) {
+   dev->workers[i].dev = dev;
+   vhost_worker_reset(&dev->workers[i]);
+   }
  
  	for (i = 0; i < dev->nvqs; ++i) {

vq = dev->vqs[i];
@@ -602,7 +635,8 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
goto err_worker;
}
  
-		dev->worker = worker;
+		dev->workers[0].worker = worker;
+   dev->nworkers = 1;
wake_up_process(worker); /* avoid contributing to loadavg */
  
  		err = vhost_attach_cgroups(dev);

@@ -616,9 +650,10 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
  
  	return 0;

  err_cgroup:
-   if (dev->worker) {
-   kthread_stop(dev->worker);
-   dev->worker = NULL;
+   dev->nworke

Re: [Devel] [PATCH RH9 v3 01/10] drivers/vhost: vhost-blk accelerator for virtio-blk guests

2022-10-13 Thread Pavel Tikhomirov




On 10.10.2022 17:56, Andrey Zhadchenko wrote:

Although QEMU virtio is quite fast, there is still some room for
improvement. Disk latency can be reduced if we handle virtio-blk
requests in the host kernel instead of passing them to QEMU. The patch
adds a vhost-blk kernel module to do so.

Some test setups:
fio --direct=1 --rw=randread  --bs=4k  --ioengine=libaio --iodepth=128
QEMU drive options: cache=none
filesystem: xfs

SSD:
               | randread, IOPS | randwrite, IOPS |
Host           | 95.8k          | 85.3k           |
QEMU virtio    | 57.5k          | 79.4k           |
QEMU vhost-blk | 95.6k          | 84.3k           |

RAMDISK (vq == vcpu):
                 | randread, IOPS | randwrite, IOPS |
virtio, 1vcpu    | 123k           | 129k            |
virtio, 2vcpu    | 253k (??)      | 250k (??)       |
virtio, 4vcpu    | 158k           | 154k            |
vhost-blk, 1vcpu | 110k           | 113k            |
vhost-blk, 2vcpu | 247k           | 252k            |
vhost-blk, 4vcpu | 576k           | 567k            |

https://jira.sw.ru/browse/PSBM-139414
Signed-off-by: Andrey Zhadchenko 
---
v2:
  - removed unused VHOST_BLK_VQ
  - reworked bio handling a bit: now add all pages from a single iov into
a single bio instead of allocating one bio per page
  - changed how to calculate sector incrementation
  - check move_iovec() in vhost_blk_req_handle()
  - remove snprintf check and better check ret from copy_to_iter for
VIRTIO_BLK_ID_BYTES requests
  - discard vq request if vhost_blk_req_handle() returned negative code
  - forbid changing a nonzero backend in vhost_blk_set_backend(). First of
all, QEMU sets the backend only once. Also, if we want to change the
backend while we are already running requests, we need to be much more
careful in vhost_blk_handle_guest_kick(), as it is not taking any
references. If userspace wants to change the backend that badly, it can
always reset the device.
  - removed EXPERIMENTAL from Kconfig

v3:
  - reworked bio handling a bit: allocate a new bio only if the previous
one is full

  drivers/vhost/Kconfig  |  12 +
  drivers/vhost/Makefile |   3 +
  drivers/vhost/blk.c| 828 +
  include/uapi/linux/vhost.h |   5 +
  4 files changed, 848 insertions(+)
  create mode 100644 drivers/vhost/blk.c

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 587fbae06182..e1389bf0c10b 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -89,4 +89,16 @@ config VHOST_CROSS_ENDIAN_LEGACY
  
  	  If unsure, say "N".
  
+config VHOST_BLK

+   tristate "Host kernel accelerator for virtio-blk"
+   depends on BLOCK && EVENTFD
+   select VHOST
+   default n
+   help
+ This kernel module can be loaded in host kernel to accelerate
+ guest vm with virtio-blk driver.
+
+ To compile this driver as a module, choose M here: the module will
+ be called vhost_blk.
+
  endif
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index f3e1897cce85..c76cc4f5fcd8 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -17,3 +17,6 @@ obj-$(CONFIG_VHOST)   += vhost.o
  
  obj-$(CONFIG_VHOST_IOTLB) += vhost_iotlb.o

  vhost_iotlb-y := iotlb.o
+
+obj-$(CONFIG_VHOST_BLK) += vhost_blk.o
+vhost_blk-y := blk.o
diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
new file mode 100644
index ..933c9c50b0a6
--- /dev/null
+++ b/drivers/vhost/blk.c
@@ -0,0 +1,828 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2011 Taobao, Inc.
+ * Author: Liu Yuan 
+ *
+ * Copyright (C) 2012 Red Hat, Inc.
+ * Author: Asias He 
+ *
+ * Copyright (c) 2022 Virtuozzo International GmbH.
+ * Author: Andrey Zhadchenko 
+ *
+ * virtio-blk host kernel accelerator.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vhost.h"
+
+enum {
+   VHOST_BLK_FEATURES = VHOST_FEATURES |
+(1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+(1ULL << VIRTIO_RING_F_EVENT_IDX) |
+(1ULL << VIRTIO_BLK_F_MQ) |
+(1ULL << VIRTIO_BLK_F_FLUSH),
+};
+
+/*
+ * Max number of bytes transferred before requeueing the job.
+ * Using this limit prevents one virtqueue from starving others.
+ */
+#define VHOST_DEV_WEIGHT 0x8
+
+/*
+ * Max number of packets transferred before requeueing the job.
+ * Using this limit prevents one virtqueue from starving others with
+ * pkts.
+ */
+#define VHOST_DEV_PKT_WEIGHT 256
+
+#define VHOST_BLK_VQ_MAX 8
+
+#define VHOST_MAX_METADATA_IOV 1
+
+#define VHOST_BLK_SECTOR_BITS 9
+#define VHOST_BLK_SECTOR_SIZE (1 << VHOST_BLK_SECTOR_BITS)
+#define VHOST_BLK_SECTOR_MASK (VHOST_BLK_SECTOR_SIZE - 1)
+
+struct req_page_list {
+   struct page **pages;
+   int pages_nr;
+};
+
+#define NR_INLINE 16
+
+struct vhost_blk_req {
+   struct req_page_list inline_pl[NR_INLINE];
+   struct page *inline_page[NR_INLINE];
+   str


Re: [Devel] [PATCH RH9 v2 01/10] drivers/vhost: vhost-blk accelerator for virtio-blk guests

2022-10-13 Thread Pavel Tikhomirov




On 08.09.2022 18:32, Andrey Zhadchenko wrote:

Although QEMU virtio is quite fast, there is still some room for
improvements. Disk latency can be reduced if we handle virtio-blk requests
in the host kernel instead of passing them to QEMU. The patch adds a
vhost-blk kernel module to do so.

Some test setups:
fio --direct=1 --rw=randread  --bs=4k  --ioengine=libaio --iodepth=128
QEMU drive options: cache=none
filesystem: xfs

SSD:
               | randread, IOPS | randwrite, IOPS |
Host           |  95.8k         |  85.3k          |
QEMU virtio    |  57.5k         |  79.4k          |
QEMU vhost-blk |  95.6k         |  84.3k          |

RAMDISK (vq == vcpu):
                 | randread, IOPS | randwrite, IOPS |
virtio, 1vcpu    |  123k          |  129k           |
virtio, 2vcpu    |  253k (??)     |  250k (??)      |
virtio, 4vcpu    |  158k          |  154k           |
vhost-blk, 1vcpu |  110k          |  113k           |
vhost-blk, 2vcpu |  247k          |  252k           |
vhost-blk, 4vcpu |  576k          |  567k           |

https://jira.sw.ru/browse/PSBM-139414
Signed-off-by: Andrey Zhadchenko 
---
v2:
  - removed unused VHOST_BLK_VQ
  - reworked bio handling a bit: now add all pages from a single iov into
a single bio instead of allocating one bio per page
  - changed how to calculate sector incrementation
  - check move_iovec() in vhost_blk_req_handle()
  - remove snprintf check and better check ret from copy_to_iter for
VIRTIO_BLK_ID_BYTES requests
  - discard vq request if vhost_blk_req_handle() returned negative code
  - forbid changing a nonzero backend in vhost_blk_set_backend(). First of
all, QEMU sets the backend only once. Also, if we want to change the backend
while we are already running requests, we need to be much more careful in
vhost_blk_handle_guest_kick() as it is not taking any references. If
userspace wants to change the backend that badly, it can always reset the device.
  - removed EXPERIMENTAL from Kconfig

  drivers/vhost/Kconfig  |  12 +
  drivers/vhost/Makefile |   3 +
  drivers/vhost/blk.c| 829 +
  include/uapi/linux/vhost.h |   5 +
  4 files changed, 849 insertions(+)
  create mode 100644 drivers/vhost/blk.c

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 587fbae06182..e1389bf0c10b 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -89,4 +89,16 @@ config VHOST_CROSS_ENDIAN_LEGACY
  
  	  If unsure, say "N".
  
+config VHOST_BLK

+   tristate "Host kernel accelerator for virtio-blk"
+   depends on BLOCK && EVENTFD
+   select VHOST
+   default n
+   help
+ This kernel module can be loaded in host kernel to accelerate
+ guest vm with virtio-blk driver.
+
+ To compile this driver as a module, choose M here: the module will
+ be called vhost_blk.
+
  endif
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index f3e1897cce85..c76cc4f5fcd8 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -17,3 +17,6 @@ obj-$(CONFIG_VHOST)   += vhost.o
  
  obj-$(CONFIG_VHOST_IOTLB) += vhost_iotlb.o

  vhost_iotlb-y := iotlb.o
+
+obj-$(CONFIG_VHOST_BLK) += vhost_blk.o
+vhost_blk-y := blk.o
diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
new file mode 100644
index ..c62b8ae70716
--- /dev/null
+++ b/drivers/vhost/blk.c
@@ -0,0 +1,829 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2011 Taobao, Inc.
+ * Author: Liu Yuan 
+ *
+ * Copyright (C) 2012 Red Hat, Inc.
+ * Author: Asias He 
+ *
+ * Copyright (c) 2022 Virtuozzo International GmbH.
+ * Author: Andrey Zhadchenko 
+ *
+ * virtio-blk host kernel accelerator.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vhost.h"
+
+enum {
+   VHOST_BLK_FEATURES = VHOST_FEATURES |
+(1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+(1ULL << VIRTIO_RING_F_EVENT_IDX) |
+(1ULL << VIRTIO_BLK_F_MQ) |
+(1ULL << VIRTIO_BLK_F_FLUSH),
+};
+
+/*
+ * Max number of bytes transferred before requeueing the job.
+ * Using this limit prevents one virtqueue from starving others.
+ */
+#define VHOST_DEV_WEIGHT 0x8
+
+/*
+ * Max number of packets transferred before requeueing the job.
+ * Using this limit prevents one virtqueue from starving others with
+ * pkts.
+ */
+#define VHOST_DEV_PKT_WEIGHT 256
+
+#define VHOST_BLK_VQ_MAX 8
+
+#define VHOST_MAX_METADATA_IOV 1
+
+#define VHOST_BLK_SECTOR_BITS 9
+#define VHOST_BLK_SECTOR_SIZE (1 << VHOST_BLK_SECTOR_BITS)
+#define VHOST_BLK_SECTOR_MASK (VHOST_BLK_SECTOR_SIZE - 1)
+
+struct req_page_list {
+   struct page **pages;
+   int pages_nr;
+};
+
+#define NR_INLINE 16
+
+struct vhost_blk_req {
+   struct req_page_list inline_pl[NR_INLINE];
+   struct page *inline_page[NR_INLINE];
+   struct bio *inline_bio[NR_INLINE];
+   struct req_page_list *pl;
+   int during_

Re: [Devel] [PATCH RH9 v2 01/10] drivers/vhost: vhost-blk accelerator for virtio-blk guests

2022-10-13 Thread Pavel Tikhomirov

Please drop this one. I accidentally sent reply to v2 not to v3.


Re: [Devel] [PATCH RH9 v3 05/10] drivers/vhost: rework worker creation

2022-10-17 Thread Pavel Tikhomirov




On 10.10.2022 17:56, Andrey Zhadchenko wrote:

Add a function to create a vhost worker and add it into the device.
Rework vhost_dev_set_owner accordingly.

https://jira.sw.ru/browse/PSBM-139414
Signed-off-by: Andrey Zhadchenko 
---
  drivers/vhost/vhost.c | 68 +--
  1 file changed, 40 insertions(+), 28 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index df9c57c82a52..173a14041678 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -626,53 +626,65 @@ static void vhost_detach_mm(struct vhost_dev *dev)
dev->mm = NULL;
  }
  
-/* Caller should have device mutex */

-long vhost_dev_set_owner(struct vhost_dev *dev)
+static int vhost_add_worker(struct vhost_dev *dev)
  {
+   struct vhost_worker *w = &dev->workers[dev->nworkers];
struct task_struct *worker;
int err;
  
+	if (dev->nworkers == VHOST_MAX_WORKERS)


Can we use ">=" here, just in case?


+   return -E2BIG;
+
+   worker = kthread_create(vhost_worker, w,
+   "vhost-%d-%d", current->pid, dev->nworkers);
+   if (IS_ERR(worker))
+   return PTR_ERR(worker);
+
+   w->worker = worker;
+   wake_up_process(worker); /* avoid contributing to loadavg */
+
+   err = vhost_worker_attach_cgroups(w);
+   if (err)
+   goto cleanup;
+
+   dev->nworkers++;
+   return 0;
+
+cleanup:
+   kthread_stop(worker);
+   w->worker = NULL;
+
+   return err;
+}
+
+/* Caller should have device mutex */
+long vhost_dev_set_owner(struct vhost_dev *dev)
+{
+   int err;
+
/* Is there an owner already? */
-   if (vhost_dev_has_owner(dev)) {
-   err = -EBUSY;
-   goto err_mm;
-   }
+   if (vhost_dev_has_owner(dev))
+   return -EBUSY;
  
  	vhost_attach_mm(dev);
  
  	dev->kcov_handle = kcov_common_handle();

if (dev->use_worker) {
-   worker = kthread_create(vhost_worker, dev,
-   "vhost-%d", current->pid);
-   if (IS_ERR(worker)) {
-   err = PTR_ERR(worker);
-   goto err_worker;
-   }
-
-   dev->workers[0].worker = worker;
-   dev->nworkers = 1;
-   wake_up_process(worker); /* avoid contributing to loadavg */
-
-   err = vhost_worker_attach_cgroups(&dev->workers[0]);
+   err = vhost_add_worker(dev);
if (err)
-   goto err_cgroup;
+   goto err_mm;
}
  
  	err = vhost_dev_alloc_iovecs(dev);

if (err)
-   goto err_cgroup;
+   goto err_worker;
  
  	return 0;

-err_cgroup:
-   dev->nworkers = 0;
-   if (dev->workers[0].worker) {
-   kthread_stop(dev->workers[0].worker);
-   dev->workers[0].worker = NULL;
-   }
  err_worker:
-   vhost_detach_mm(dev);
-   dev->kcov_handle = 0;
+   vhost_cleanup_workers(dev);
  err_mm:
+   vhost_detach_mm(dev);
+   dev->kcov_handle = 0;
return err;
  }
  EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
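
A minimal sketch of the suggested defensive form of the capacity check
(assuming it stays at the top of vhost_add_worker(), as quoted above):

	/* ">=" tolerates an inconsistent nworkers instead of overflowing */
	if (dev->nworkers >= VHOST_MAX_WORKERS)
		return -E2BIG;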


--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH9 v3 07/10] drivers/vhost: assign workers to virtqueues

2022-10-17 Thread Pavel Tikhomirov




On 10.10.2022 17:56, Andrey Zhadchenko wrote:

Add a worker pointer to every virtqueue. Add a routine to assign
workers to virtqueues and call it after any worker creation.

https://jira.sw.ru/browse/PSBM-139414
Signed-off-by: Andrey Zhadchenko 
---
  drivers/vhost/vhost.c | 13 +
  drivers/vhost/vhost.h |  2 ++
  2 files changed, 15 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index be184adcdbe5..1b17d8dd0202 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -676,6 +676,17 @@ static int vhost_set_workers(struct vhost_dev *dev, int n)
return ret;
  }
  
+static void vhost_assign_workers(struct vhost_dev *dev)

+{
+   int i, j = 0;
+
+   for (i = 0; i < dev->nvqs; i++) {
+   dev->vqs[i]->worker = &dev->workers[j];
+   if (++j == dev->nworkers)


I'd rather use ">=" to make it more rebase-safe.


+   j = 0;
+   }
+}
+
  /* Caller should have device mutex */
  long vhost_dev_set_owner(struct vhost_dev *dev)
  {
@@ -698,6 +709,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
if (err)
goto err_worker;
  
+	vhost_assign_workers(dev);

return 0;
  err_worker:
vhost_cleanup_workers(dev);
@@ -1896,6 +1908,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int 
ioctl, void __user *argp)
}
  
  		r = vhost_set_workers(d, n);

+   vhost_assign_workers(d);
break;
default:
r = -ENOIOCTLCMD;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 634ea828cbba..9632f6501617 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -138,6 +138,8 @@ struct vhost_virtqueue {
bool user_be;
  #endif
u32 busyloop_timeout;
+
+   struct vhost_worker *worker;
  };
  
  struct vhost_msg_node {
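
For comparison, the same round-robin assignment can be written without
the wrap-around branch (just a sketch; it assumes dev->nworkers >= 1,
which holds once at least one worker has been created):

static void vhost_assign_workers(struct vhost_dev *dev)
{
	int i;

	/* spread virtqueues over the available workers round-robin */
	for (i = 0; i < dev->nvqs; i++)
		dev->vqs[i]->worker = &dev->workers[i % dev->nworkers];
}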


--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 1/2] blk-cbt: fix count decrement and check in cbt_page_alloc

2022-11-02 Thread Pavel Tikhomirov
Before this line of code cbt->count is always > 0 as it is:
symmetrically incremented/decremented in this function under cbt->lock,
and we are at the point just before decrementing it. This means that
!cbt->count-- (note: postfix decrement returns the value before the operation)
is always false and we never enter the true branch of this condition.

It seems the intent was to call release callbacks on reaching zero
count, let's fix it.
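
A minimal standalone illustration of the difference (plain userspace C,
not the kernel code):

#include <stdio.h>

int main(void)
{
	int count = 1;

	if (!count--)		/* evaluates !1 == 0: branch never taken */
		printf("postfix: taken\n");

	count = 1;
	if (!--count)		/* evaluates !0 == 1: branch taken at zero */
		printf("prefix: taken\n");

	return 0;
}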

We have a cbt->cache percpu allocation leak detected by kmemleak, which
might be caused by this uncalled release callback.

https://jira.sw.ru/browse/PSBM-141114

Signed-off-by: Pavel Tikhomirov 
---
 block/blk-cbt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-cbt.c b/block/blk-cbt.c
index 04a9434b524a..2580ccabaa17 100644
--- a/block/blk-cbt.c
+++ b/block/blk-cbt.c
@@ -124,7 +124,7 @@ static int cbt_page_alloc(struct cbt_info  **cbt_pp, 
unsigned long idx,
if (in_rcu)
rcu_read_lock();
spin_lock_irq(&cbt->lock);
-   if (unlikely(!cbt->count-- && test_bit(CBT_DEAD, &cbt->flags))) {
+   if (unlikely(!--(cbt->count) && test_bit(CBT_DEAD, &cbt->flags))) {
spin_unlock_irq(&cbt->lock);
call_rcu(&cbt->rcu, &cbt_release_callback);
if (page)
-- 
2.37.3

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 0/2] blk-cbt: fix memory leak

2022-11-02 Thread Pavel Tikhomirov
We have a cbt->cache percpu allocation leak detected by kmemleak, which
might be caused by these uncalled release callbacks.

This applies cleanly to vz7 and vz9, we need both.

https://jira.sw.ru/browse/PSBM-141114

Pavel Tikhomirov (2):
  blk-cbt: fix count decrement and check in cbt_page_alloc
  blk-cbt: fix in_use check in blk_cbt_release

 block/blk-cbt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

-- 
2.37.3

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 2/2] blk-cbt: fix in_use check in blk_cbt_release

2022-11-02 Thread Pavel Tikhomirov
Calling cbt_release_callback on cbt->count != 0 here is really strange,
because cbt_page_alloc would do it anyway on decrementing count to zero.
In the opposite case, where cbt->count == 0, we should call the callback
but we do not; let's fix it by reversing the condition.

We have a cbt->cache percpu allocation leak detected by kmemleak, which
might be caused by this uncalled release callback.

https://jira.sw.ru/browse/PSBM-141114

Signed-off-by: Pavel Tikhomirov 
---
 block/blk-cbt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-cbt.c b/block/blk-cbt.c
index 2580ccabaa17..054c73c6ef9f 100644
--- a/block/blk-cbt.c
+++ b/block/blk-cbt.c
@@ -540,7 +540,7 @@ void blk_cbt_release(struct request_queue *q)
rcu_assign_pointer(q->cbt, NULL);
in_use = cbt->count;
spin_unlock(&cbt->lock);
-   if (in_use)
+   if (!in_use)
call_rcu(&cbt->rcu, &cbt_release_callback);
 }
 
-- 
2.37.3

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH9 v4 01/10] drivers/vhost: vhost-blk accelerator for virtio-blk guests

2022-11-04 Thread Pavel Tikhomirov

This coding style is wrong:

On 01.11.2022 10:25, Andrey Zhadchenko wrote:

+static void vhost_blk_req_done(struct bio *bio)
+{
+   struct vhost_blk_req *req = bio->bi_private;
+   struct vhost_blk *blk = req->blk;
+   int err = blk_status_to_errno(bio->bi_status);
+
+   if (err)
+   req->bio_err = err;


should be:

funca()
{
	int err;

	err = funcb();
	if (err)
		do_something();
}

Please fix.

--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH9 v4 01/10] drivers/vhost: vhost-blk accelerator for virtio-blk guests

2022-11-04 Thread Pavel Tikhomirov

On 01.11.2022 10:25, Andrey Zhadchenko wrote:

+   snprintf(blk->serial, VIRTIO_BLK_ID_BYTES, "vhost-blk%d", gen++);


Still no explanatory comment about the gen++ thing, please add one.

--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH9 v4 01/10] drivers/vhost: vhost-blk accelerator for virtio-blk guests

2022-11-04 Thread Pavel Tikhomirov




On 01.11.2022 10:25, Andrey Zhadchenko wrote:

+   bio_len = (total_pages / (UINT_MAX / PAGE_SIZE) + 1)  * sizeof(struct bio *);

excess space between "+ 1)" and "* sizeof(...)"
--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH9 v4 01/10] drivers/vhost: vhost-blk accelerator for virtio-blk guests

2022-11-04 Thread Pavel Tikhomirov




On 01.11.2022 10:25, Andrey Zhadchenko wrote:

+/* It is forbidden to call more than one vhost_blk_flush() simultaneously */


Let's add at least a warning for this case:


+static void vhost_blk_flush(struct vhost_blk *blk)
+{
+   int flush_bin;
+
+   spin_lock(&blk->flush_lock);


WARN_ON(blk->during_flush);


+   blk->during_flush = 1;


Because in the case of two flushes, the first flush can hang infinitely
waiting for its flush_bin to become empty.
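
Put together, the locked prologue with the suggested guard would look
roughly like this (a sketch of the suggestion, not the final code):

	spin_lock(&blk->flush_lock);
	/* catch a second flush racing with an unfinished one */
	WARN_ON(blk->during_flush);
	blk->during_flush = 1;
	flush_bin = blk->new_req_bin;
	blk->new_req_bin = blk->new_req_bin ? 0 : 1;
	spin_unlock(&blk->flush_lock);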


--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH9 v4 01/10] drivers/vhost: vhost-blk accelerator for virtio-blk guests

2022-11-04 Thread Pavel Tikhomirov




On 01.11.2022 10:25, Andrey Zhadchenko wrote:

+/* It is forbidden to call more than one vhost_blk_flush() simultaneously */
+static void vhost_blk_flush(struct vhost_blk *blk)
+{
+   int flush_bin;
+
+   spin_lock(&blk->flush_lock);
+   blk->during_flush = 1;
+   flush_bin = blk->new_req_bin;
+   blk->new_req_bin = (blk->new_req_bin) ? 0 : 1;


We can use a shorter and simpler expression for this:

blk->new_req_bin = !blk->new_req_bin;


+   spin_unlock(&blk->flush_lock);


--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH9 v4 01/10] drivers/vhost: vhost-blk accelerator for virtio-blk guests

2022-11-04 Thread Pavel Tikhomirov



On 04.11.2022 11:54, Pavel Tikhomirov wrote:

On 01.11.2022 10:25, Andrey Zhadchenko wrote:

+    snprintf(blk->serial, VIRTIO_BLK_ID_BYTES, "vhost-blk%d", gen++);


Still no explanatory comment about the gen++ thing, please add one.



Ah sorry, now I see that the serial can be configured from userspace via
VHOST_BLK_SET_SERIAL, so the gen++ overflow is not so important now. Please
disregard.


--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH9 v4 01/10] drivers/vhost: vhost-blk accelerator for virtio-blk guests

2022-11-04 Thread Pavel Tikhomirov



On 04.11.2022 15:28, Andrey Zhadchenko wrote:



On 11/4/22 14:31, Pavel Tikhomirov wrote:



On 01.11.2022 10:25, Andrey Zhadchenko wrote:
+/* It is forbidden to call more than one vhost_blk_flush() 
simultaneously */

+static void vhost_blk_flush(struct vhost_blk *blk)
+{
+    int flush_bin;
+
+    spin_lock(&blk->flush_lock);
+    blk->during_flush = 1;
+    flush_bin = blk->new_req_bin;
+    blk->new_req_bin = (blk->new_req_bin) ? 0 : 1;


We can use a shorter and simpler expression for this:

blk->new_req_bin = !blk->new_req_bin;


Are you sure the C standard actually defines it this way and we are not
relying on unspecified behavior?

I thought the ! operator is only defined for logical expressions.


C99 6.5.3.3.5 says:
"The result of the logical negation operator ! is 0 if the value of its 
operand compares
unequal to 0, 1 if the value of its operand compares equal to 0. The 
result has type int.

The expression !E is equivalent to (0==E)"
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf

So I don't see any problem comparing new_req_bin of type int to 0.
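
So the toggle is well-defined for a plain int; a trivial standalone
check (userspace C, just to demonstrate the semantics):

#include <assert.h>

int main(void)
{
	int new_req_bin = 0;

	new_req_bin = !new_req_bin;	/* 0 -> 1 */
	assert(new_req_bin == 1);

	new_req_bin = !new_req_bin;	/* 1 -> 0 */
	assert(new_req_bin == 0);

	return 0;
}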






+    spin_unlock(&blk->flush_lock);




--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH7] netfilter: core: fix NAT hooks collision check

2022-11-07 Thread Pavel Tikhomirov

Reviewed-by: Pavel Tikhomirov 

On 03.11.2022 21:23, Konstantin Khorenko wrote:

Pasha,

please review this patch as well.

Thank you.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 31.10.2022 10:51, Alexander Mikhalitsyn wrote:
In commit ("nat: allow nft NAT and iptables NAT work on the same node")
we are trying to prevent simultaneous nft and nf NAT hook registration.
But in fact, it affects not only NAT-related hooks but all hooks!

Reproducer:
-#!/usr/sbin/nft -f
flush ruleset

table inet filter {
 chain input {
 type filter hook input priority 0;
 }
}

This simple script should work fine if we run it more than once in a
row, but in fact it breaks with an -EBUSY error.

This is because we are not checking the hook type in our code at all!

But this bug is a little bit deeper. Consider another reproducer:
-#!/usr/sbin/nft -f
flush ruleset

table inet filter {
 chain input {
 type filter hook input priority 0;
 }
}

table ip nat {
chain postrouting {
    type nat hook postrouting priority 0; policy accept;
}
}

In this case we have a nat hook and we have to allow nat hook collision
during nft transaction execution. See the analogous mainstream commit:
ae6153b50f ("netfilter: nf_tables: permit second nat hook if colliding
hook is going away")


Our mainstream colleagues introduced a nat_hook field in struct
nf_hook_ops, but we don't need that, because we only handle hooks from
the nft side and can easily detect whether a hook is NAT-related by
checking the basechain type (basechain->type->type == NFT_CHAIN_T_NAT).

I'm addressing both problems here and adding other small cleanups for
safety's sake.


https://jira.sw.ru/browse/PSBM-142895

Fixes: d3a05a0552d7 ("nat: allow nft NAT and iptables NAT work on the 
same node")

Signed-off-by: Alexander Mikhalitsyn 
---
  net/netfilter/core.c | 83 ++--
  1 file changed, 81 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 74dee8c1623c..6628d73ec5b8 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -66,6 +66,67 @@ EXPORT_SYMBOL(nf_hooks_needed);
  static DEFINE_MUTEX(nf_hook_mutex);
  #include 
  #include 
+
+/* removal requests are queued in the commit_list, but not acted upon
+ * until after all new rules are in place.
+ *
+ * Therefore, nf_register_net_hook(net, &nat_hook) runs before pending
+ * nf_unregister_net_hook().
+ *
+ * nf_register_net_hook thus fails if a nat hook is already in place
+ * even if the conflicting hook is about to be removed.
+ *
+ * If collision is detected, search commit_log for DELCHAIN matching
+ * the new nat hooknum; if we find one collision is temporary:
+ *
+ * Either transaction is aborted (new/colliding hook is removed), or
+ * transaction is committed (old hook is removed).
+ *
+ * -- OpenVZ specific:
+ * - reworked not to use struct nf_hook_ops "nat_hook" field which is 
absent

+ * in our kernels.
+ * - rebased to RHEL7 kernel
+ *
+ * Please refer to original commit:
+ * 
https://github.com/torvalds/linux/commit/ae6153b50f9bf75a4952050f32fe168f68cdd657
+ * ("netfilter: nf_tables: permit second nat hook if colliding hook 
is going away")

+ */
+static bool nf_tables_allow_nat_conflict(const struct net *net,
+ const struct nft_base_chain *basechain)
+{
+    const struct nft_trans *trans;
+    const struct nf_hook_ops *ops = &basechain->ops[0];
+    bool ret = false;
+
+    list_for_each_entry(trans, &net->nft.commit_list, list) {
+    const struct nf_hook_ops *pending_ops;
+    struct nft_base_chain *pending_chain;
+    const struct nft_chain *pending;
+
+    if (trans->msg_type != NFT_MSG_NEWCHAIN &&
+    trans->msg_type != NFT_MSG_DELCHAIN)
+    continue;
+
+    pending = trans->ctx.chain;
+    if (!(pending->flags & NFT_BASE_CHAIN))
+    continue;
+
+    pending_chain = nft_base_chain(pending);
+    pending_ops = &pending_chain->ops[0];
+    if ((pending_chain->type->type == NFT_CHAIN_T_NAT) &&
+    pending_ops->pf == ops->pf &&
+    pending_ops->hooknum == ops->hooknum) {
+    /* other hook registration already pending? */
+    if (trans->msg_type == NFT_MSG_NEWCHAIN)
+    return false;
+
+    ret = true;
+    }
+    }
+
+    return ret;
+}
+
  int nf_register_hook(struct nf_hook_ops *reg)
  {
  struct nf_hook_ops *elem;
@@ -75,11 +136,29 @@ int nf_register_hook(struct nf_hook_ops *reg)
  if (reg->priority < elem->priority)
  break;
  else if ((reg->priority == elem->priority) && 
reg->is_nft_ops) {

-    const struct nft_chain *c = reg->priv;
-    struct net *net = read_pnet

Re: [Devel] [PATCH RH9 v5 00/10] vhost-blk: in-kernel accelerator for virtio-blk guests

2022-11-13 Thread Pavel Tikhomirov

Reviewed-by: Pavel Tikhomirov 

On 11.11.2022 12:55, Andrey Zhadchenko wrote:

Although QEMU virtio-blk is quite fast, there is still some room for
improvements. Disk latency can be reduced if we handle virtio-blk requests
in the host kernel, so we avoid a lot of syscalls and context switches.
The idea is quite simple - QEMU gives us a block device and we translate
any incoming virtio requests into bios and push them into the bdev.
The biggest disadvantage of this vhost-blk flavor is the raw format.
Luckily Kirill Thai proposed a device mapper driver for the QCOW2 format to
attach files as block devices: https://www.spinics.net/lists/kernel/msg4292965.html

Also, by using kernel modules we can bypass the iothread limitation and
finally scale block requests with CPUs for high-performance devices.


There have already been several attempts to write vhost-blk:

Asias' version: https://lkml.org/lkml/2012/12/1/174
Badari's version: https://lwn.net/Articles/379864/
Vitaly's https://lwn.net/Articles/770965/

The main difference between them is the API used to access the backend
file. The fastest one is Asias's version with the bio flavor. It is also
the most reviewed one and has the most features, so the vhost_blk module is
partially based on it. Multiple virtqueue support was added and some places
were reworked. Added support for several vhost workers.

test setup and results:
fio --direct=1 --rw=randread  --bs=4k  --ioengine=libaio --iodepth=128
QEMU drive options: cache=none
filesystem: xfs

SSD:
               | randread, IOPS | randwrite, IOPS |
Host           |  95.8k         |  85.3k          |
QEMU virtio    |  57.5k         |  79.4k          |
QEMU vhost-blk |  95.6k         |  84.3k          |

RAMDISK (vq == vcpu == numjobs):
                 | randread, IOPS | randwrite, IOPS |
virtio, 1vcpu    |  133k          |  133k           |
virtio, 2vcpu    |  305k          |  306k           |
virtio, 4vcpu    |  310k          |  298k           |
virtio, 8vcpu    |  271k          |  252k           |
vhost-blk, 1vcpu |  110k          |  113k           |
vhost-blk, 2vcpu |  247k          |  252k           |
vhost-blk, 4vcpu |  558k          |  556k           |
vhost-blk, 8vcpu |  576k          |  575k           | *single kernel thread
vhost-blk, 8vcpu |  803k          |  779k           | *two kernel threads

v2:
patch 1/10
  - removed unused VHOST_BLK_VQ
  - reworked bio handling a bit: now add all pages from a single iov into
a single bio instead of allocating one bio per page
  - changed how to calculate sector incrementation
  - check move_iovec() in vhost_blk_req_handle()
  - remove snprintf check and better check ret from copy_to_iter for
VIRTIO_BLK_ID_BYTES requests
  - discard vq request if vhost_blk_req_handle() returned negative code
  - forbid changing a nonzero backend in vhost_blk_set_backend(). First of
all, QEMU sets the backend only once. Also, if we want to change the backend
while we are already running requests, we need to be much more careful in
vhost_blk_handle_guest_kick() as it is not taking any references. If
userspace wants to change the backend that badly, it can always reset the device.
  - removed EXPERIMENTAL from Kconfig

patch 3/10
  - don't bother with checking dev->workers[0].worker since dev->nworkers
will always contain 0 in this case

patch 6/10
  - Make code do what docs suggest. Previously ioctl-supplied new number
of workers were treated like an amount that should be added. Use new
number as a ceiling instead and add workers up to that number.


v3:
patch 1/10
  - reworked bio handling a bit - now create a new bio only if the
previous one is full

patch 2/10
  - set vq->worker = NULL in vhost_vq_reset()


v4:
patch 1/10
  - vhost_blk_req_done() now won't hide errors for multi-bio requests
  - vhost_blk_prepare_req() now better estimates bio_len
  - alloc bio for max pages_nr_total pages instead of nr_pages
  - added new ioctl VHOST_BLK_SET_SERIAL to set serial
  - rework flush algorithm a bit - now use two bins "new req" and
"for flush" and swap them at the start of the flush
  - moved backing file dereference to vhost_blk_req_submit() and
after request was added to flush bin to avoid race in
vhost_blk_release(). Now even if we dropped backend and started
flush the request will either be tracked by flush or be rolled back

patch 2/10
  - moved vq->worker = NULL to patch #7 where this field is
introduced.

patch 7/10
  - Set vq->worker = NULL in vhost_vq_reset. This will fix both
https://jira.sw.ru/browse/PSBM-142058
https://jira.sw.ru/browse/PSBM-142852

v5:
patch 1/10
  - several codestyle/spacing fixes
  - added WARN_ON() for vhost_blk_flush

Andrey Zhadchenko (10):
   drivers/vhost: vhost-blk accelerator for virtio-blk guests
   drivers/vhost: use array to store workers
   drivers/vhost: adjust vhost to flush all workers
   drivers/vhost: rework attaching cgroups to be worker aware
   drivers/vhost: rework worker creation
   drivers/vhost: add ioctl to increase the number of workers
   drivers/vhost

[Devel] [PATCH RH7] cgroup_freezer: print information about unfreezable process

2022-11-24 Thread Pavel Tikhomirov
Add a sysctl kernel.freeze_cgroup_timeout (default value 30 * HZ).

If one writes FROZEN to the freezer.state file and after a timeout of
kernel.freeze_cgroup_timeout still reads FREEZING from freezer.state
(meaning that the kernel still does not manage to freeze the cgroup's
processes) - let's print a warning with information about the problem, e.g.:

[ 7196.621368] Freeze of /test took 0 sec, due to unfreezable process 
13732:bash, stack:
[ 7196.621396] [] retint_careful+0x14/0x32
[ 7196.621431] [] 0x

The output includes:
- path to problematic freezer cgroup
- timeout in seconds
- unfeezable process pid, comm and stack

https://jira.sw.ru/browse/PSBM-142970

Signed-off-by: Pavel Tikhomirov 
---
Will send for vz9 separately, it does not apply cleanly.
---
 include/linux/sysctl.h  |  2 ++
 kernel/cgroup_freezer.c | 55 ++---
 kernel/sysctl.c | 10 
 3 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index f28d9fb58c03..798b0465cb93 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -189,6 +189,8 @@ struct ctl_path {
 extern int ve_allow_module_load;
 extern int __read_mostly lazytime_default;
 extern int trusted_exec;
+#define DEFAULT_FREEZE_TIMEOUT (30 * HZ)
+extern int sysctl_freeze_timeout;
 
 #ifdef CONFIG_SYSCTL
 
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index f31d68f55db0..343ebfed05fc 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -21,6 +21,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 
 /*
  * A cgroup is freezing if any FREEZING flags are set.  FREEZING_SELF is
@@ -43,6 +47,7 @@ struct freezer {
struct cgroup_subsys_state  css;
unsigned intstate;
spinlock_t  lock;
+   unsigned long   freeze_jiffies;
 };
 
 static inline struct freezer *cgroup_freezer(struct cgroup *cgroup)
@@ -242,6 +247,44 @@ static void freezer_fork(struct task_struct *task, void 
*private)
rcu_read_unlock();
 }
 
+#define MAX_STACK_TRACE_DEPTH   64
+
+static void check_freezer_timeout(struct cgroup *cgroup, struct task_struct 
*task)
+{
+   static DEFINE_RATELIMIT_STATE(freeze_timeout_rs, 
DEFAULT_FREEZE_TIMEOUT, 1);
+   int __freeze_timeout = READ_ONCE(sysctl_freeze_timeout);
+   struct freezer *freezer = cgroup_freezer(cgroup);
+   unsigned long entries[MAX_STACK_TRACE_DEPTH];
+   static char freezer_cg_name[PATH_MAX];
+   struct stack_trace trace;
+   pid_t tgid;
+   int i;
+
+   if (!freezer->freeze_jiffies ||
+   freezer->freeze_jiffies + __freeze_timeout > get_jiffies_64())
+   return;
+
+   if (!__ratelimit(&freeze_timeout_rs))
+   return;
+
+   if (cgroup_path(cgroup, freezer_cg_name, PATH_MAX) < 0)
+   return;
+
+   tgid = task_pid_nr_ns(task, &init_pid_ns);
+
+   printk(KERN_WARNING "Freeze of %s took %d sec, due to unfreezable 
process %d:%s, stack:\n",
+  freezer_cg_name, __freeze_timeout/HZ, tgid, task->comm);
+
+   memset(&trace, 0, sizeof(trace));
+   trace.max_entries = MAX_STACK_TRACE_DEPTH;
+   trace.entries = entries;
+   save_stack_trace_tsk(task, &trace);
+
+   for (i = 0; i < trace.nr_entries; i++) {
+   printk(KERN_WARNING "[<%pK>] %pS\n", (void *)entries[i], (void 
*)entries[i]);
+   }
+}
+
 /**
  * update_if_frozen - update whether a cgroup finished freezing
  * @cgroup: cgroup of interest
@@ -293,8 +336,10 @@ static void update_if_frozen(struct cgroup *cgroup)
 * completion.  Consider it frozen in addition to
 * the usual frozen condition.
 */
-   if (!frozen(task) && !freezer_should_skip(task))
+   if (!frozen(task) && !freezer_should_skip(task)) {
+   check_freezer_timeout(cgroup, task);
goto out_iter_end;
+   }
}
}
 
@@ -367,8 +412,10 @@ static void freezer_apply_state(struct freezer *freezer, 
bool freeze,
return;
 
if (freeze) {
-   if (!(freezer->state & CGROUP_FREEZING))
+   if (!(freezer->state & CGROUP_FREEZING)) {
atomic_inc(&system_freezing_cnt);
+   freezer->freeze_jiffies = get_jiffies_64();
+   }
freezer->state |= state;
freeze_cgroup(freezer);
} else {
@@ -377,8 +424,10 @@ static void freezer_apply_state(struct freezer *freezer, 
bool freeze,
freezer->state &= ~state;
 
if (!(freezer->state & CGROUP_FREEZING)) {
- 

Re: [Devel] [PATCH RH7] cgroup_freezer: print information about unfreezable process

2022-11-25 Thread Pavel Tikhomirov

I found one improvement which can be done while porting to vz9, see inline:

On 24.11.2022 21:46, Pavel Tikhomirov wrote:

Add a sysctl kernel.freeze_cgroup_timeout (default value 30 * HZ).

If one writes FROZEN to freezer.state file and after a timeout of
kernel.freeze_cgroup_timeout one still reads FREEZING from freezer.state
file (meaning that kernel does not succeed to freeze cgroup processes
still) - let's print a warning with information about the problem, e.g.:

[ 7196.621368] Freeze of /test took 0 sec, due to unfreezable process 13732:bash, stack:
[ 7196.621396] [] retint_careful+0x14/0x32
[ 7196.621431] [] 0x

The output includes:
- path to problematic freezer cgroup
- timeout in seconds
- unfreezable process pid, comm and stack

https://jira.sw.ru/browse/PSBM-142970

Signed-off-by: Pavel Tikhomirov 
---
Will send for vz9 separately, it does not apply cleanly.
---
  include/linux/sysctl.h  |  2 ++
  kernel/cgroup_freezer.c | 55 ++---
  kernel/sysctl.c | 10 
  3 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index f28d9fb58c03..798b0465cb93 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -189,6 +189,8 @@ struct ctl_path {
  extern int ve_allow_module_load;
  extern int __read_mostly lazytime_default;
  extern int trusted_exec;
+#define DEFAULT_FREEZE_TIMEOUT (30 * HZ)
+extern int sysctl_freeze_timeout;
  
  #ifdef CONFIG_SYSCTL
  
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c

index f31d68f55db0..343ebfed05fc 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -21,6 +21,10 @@
  #include 
  #include 
  #include 
+#include 
+#include 
+#include 
+#include 
  
  /*

   * A cgroup is freezing if any FREEZING flags are set.  FREEZING_SELF is
@@ -43,6 +47,7 @@ struct freezer {
struct cgroup_subsys_state  css;
unsigned intstate;
spinlock_t  lock;
+   unsigned long   freeze_jiffies;
  };
  
  static inline struct freezer *cgroup_freezer(struct cgroup *cgroup)

@@ -242,6 +247,44 @@ static void freezer_fork(struct task_struct *task, void *private)
rcu_read_unlock();
  }
  
+#define MAX_STACK_TRACE_DEPTH   64

+
+static void check_freezer_timeout(struct cgroup *cgroup, struct task_struct *task)
+{
+   static DEFINE_RATELIMIT_STATE(freeze_timeout_rs, DEFAULT_FREEZE_TIMEOUT, 1);
+   int __freeze_timeout = READ_ONCE(sysctl_freeze_timeout);
+   struct freezer *freezer = cgroup_freezer(cgroup);
+   unsigned long entries[MAX_STACK_TRACE_DEPTH];
+   static char freezer_cg_name[PATH_MAX];
+   struct stack_trace trace;
+   pid_t tgid;
+   int i;
+
+   if (!freezer->freeze_jiffies ||
+   freezer->freeze_jiffies + __freeze_timeout > get_jiffies_64())
+   return;
+
+   if (!__ratelimit(&freeze_timeout_rs))
+   return;
+
+   if (cgroup_path(cgroup, freezer_cg_name, PATH_MAX) < 0)
+   return;
+
+   tgid = task_pid_nr_ns(task, &init_pid_ns);
+
+   printk(KERN_WARNING "Freeze of %s took %d sec, due to unfreezable process 
%d:%s, stack:\n",
+  freezer_cg_name, __freeze_timeout/HZ, tgid, task->comm);
+
+   memset(&trace, 0, sizeof(trace));
+   trace.max_entries = MAX_STACK_TRACE_DEPTH;
+   trace.entries = entries;
+   save_stack_trace_tsk(task, &trace);
+
+   for (i = 0; i < trace.nr_entries; i++) {
+   printk(KERN_WARNING "[<%pK>] %pS\n", (void *)entries[i], (void 
*)entries[i]);


%pB is better than %pS, see 8b927d734122 ("proc: Fix return address printk conversion specifer in /proc/<pid>/stack")
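
To illustrate the difference (a sketch, not from the patch): both take the
same address, but %pS resolves the symbol containing the address itself,
while %pB looks up address - 1, so a saved return address is attributed to
its call site rather than to the function that happens to start right
after it:

	printk("[<%pK>] %pS\n", (void *)ret_addr, (void *)ret_addr);
	printk("[<%pK>] %pB\n", (void *)ret_addr, (void *)ret_addr);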



+   }
+}
+
  /**
   * update_if_frozen - update whether a cgroup finished freezing
   * @cgroup: cgroup of interest
@@ -293,8 +336,10 @@ static void update_if_frozen(struct cgroup *cgroup)
 * completion.  Consider it frozen in addition to
 * the usual frozen condition.
 */
-   if (!frozen(task) && !freezer_should_skip(task))
+   if (!frozen(task) && !freezer_should_skip(task)) {
+   check_freezer_timeout(cgroup, task);
goto out_iter_end;
+   }
}
}
  
@@ -367,8 +412,10 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze,

return;
  
  	if (freeze) {

-   if (!(freezer->state & CGROUP_FREEZING))
+   if (!(freezer->state & CGROUP_FREEZING)) {
atomic_inc(&system_freezing_cnt);
+   freezer->freeze_jiffies = get_jiffies_64();
+   }
freezer->state |= state;
 

[Devel] [PATCH v2 RH7] cgroup_freezer: print information about unfreezable process

2022-11-25 Thread Pavel Tikhomirov
Add a sysctl kernel.freeze_cgroup_timeout (default value 30 * HZ).

If one writes FROZEN to freezer.state file and after a timeout of
kernel.freeze_cgroup_timeout one still reads FREEZING from freezer.state
file (meaning that kernel does not succeed to freeze cgroup processes
still) - let's print a warning with information about the problem, e.g.:

[ 7196.621368] Freeze of /test took 0 sec, due to unfreezable process 13732:bash, stack:
[ 7196.621396] [] retint_careful+0x14/0x32
[ 7196.621431] [] 0x

The output includes:
- path to problematic freezer cgroup
- timeout in seconds
- unfeezable process pid, comm and stack

https://jira.sw.ru/browse/PSBM-142970
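
Rough reproduction sketch (assumptions: the freezer cgroup is mounted at
/sys/fs/cgroup/freezer and $STUCK_PID is a task that cannot be frozen,
e.g. stuck in an uninterruptible sleep):

  mkdir /sys/fs/cgroup/freezer/test
  echo $STUCK_PID > /sys/fs/cgroup/freezer/test/tasks
  echo FROZEN > /sys/fs/cgroup/freezer/test/freezer.state
  sleep 30
  cat /sys/fs/cgroup/freezer/test/freezer.state   # still reads FREEZING
  dmesg | tail                                    # the warning above is printed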

Signed-off-by: Pavel Tikhomirov 
---
v2: fix pointer print formatting %pS -> %pB
---
 include/linux/sysctl.h  |  2 ++
 kernel/cgroup_freezer.c | 55 ++---
 kernel/sysctl.c | 10 
 3 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index f28d9fb58c03..798b0465cb93 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -189,6 +189,8 @@ struct ctl_path {
 extern int ve_allow_module_load;
 extern int __read_mostly lazytime_default;
 extern int trusted_exec;
+#define DEFAULT_FREEZE_TIMEOUT (30 * HZ)
+extern int sysctl_freeze_timeout;
 
 #ifdef CONFIG_SYSCTL
 
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index f31d68f55db0..bb5380b89d4f 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -21,6 +21,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 
 /*
  * A cgroup is freezing if any FREEZING flags are set.  FREEZING_SELF is
@@ -43,6 +47,7 @@ struct freezer {
struct cgroup_subsys_state  css;
unsigned intstate;
spinlock_t  lock;
+   unsigned long   freeze_jiffies;
 };
 
 static inline struct freezer *cgroup_freezer(struct cgroup *cgroup)
@@ -242,6 +247,44 @@ static void freezer_fork(struct task_struct *task, void *private)
rcu_read_unlock();
 }
 
+#define MAX_STACK_TRACE_DEPTH   64
+
+static void check_freezer_timeout(struct cgroup *cgroup, struct task_struct *task)
+{
+   static DEFINE_RATELIMIT_STATE(freeze_timeout_rs, DEFAULT_FREEZE_TIMEOUT, 1);
+   int __freeze_timeout = READ_ONCE(sysctl_freeze_timeout);
+   struct freezer *freezer = cgroup_freezer(cgroup);
+   unsigned long entries[MAX_STACK_TRACE_DEPTH];
+   static char freezer_cg_name[PATH_MAX];
+   struct stack_trace trace;
+   pid_t tgid;
+   int i;
+
+   if (!freezer->freeze_jiffies ||
+   freezer->freeze_jiffies + __freeze_timeout > get_jiffies_64())
+   return;
+
+   if (!__ratelimit(&freeze_timeout_rs))
+   return;
+
+   if (cgroup_path(cgroup, freezer_cg_name, PATH_MAX) < 0)
+   return;
+
+   tgid = task_pid_nr_ns(task, &init_pid_ns);
+
+   printk(KERN_WARNING "Freeze of %s took %d sec, due to unfreezable 
process %d:%s, stack:\n",
+  freezer_cg_name, __freeze_timeout/HZ, tgid, task->comm);
+
+   memset(&trace, 0, sizeof(trace));
+   trace.max_entries = MAX_STACK_TRACE_DEPTH;
+   trace.entries = entries;
+   save_stack_trace_tsk(task, &trace);
+
+   for (i = 0; i < trace.nr_entries; i++) {
+   printk(KERN_WARNING "[<%pK>] %pB\n", (void *)entries[i], (void 
*)entries[i]);
+   }
+}
+
 /**
  * update_if_frozen - update whether a cgroup finished freezing
  * @cgroup: cgroup of interest
@@ -293,8 +336,10 @@ static void update_if_frozen(struct cgroup *cgroup)
 * completion.  Consider it frozen in addition to
 * the usual frozen condition.
 */
-   if (!frozen(task) && !freezer_should_skip(task))
+   if (!frozen(task) && !freezer_should_skip(task)) {
+   check_freezer_timeout(cgroup, task);
goto out_iter_end;
+   }
}
}
 
@@ -367,8 +412,10 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze,
return;
 
if (freeze) {
-   if (!(freezer->state & CGROUP_FREEZING))
+   if (!(freezer->state & CGROUP_FREEZING)) {
atomic_inc(&system_freezing_cnt);
+   freezer->freeze_jiffies = get_jiffies_64();
+   }
freezer->state |= state;
freeze_cgroup(freezer);
} else {
@@ -377,8 +424,10 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze,
freezer->state &= ~state;
 
if (!(freezer->state & CGROUP_FREEZING)) {
-  

[Devel] [PATCH RH9] cgroup_freezer: print information about unfreezable process

2022-11-25 Thread Pavel Tikhomirov
Add a sysctl kernel.freeze_cgroup_timeout (default value 30 * HZ).

If one writes FROZEN to freezer.state file and after a timeout of
kernel.freeze_cgroup_timeout one still reads FREEZING from freezer.state
file (meaning that kernel does not succeed to freeze cgroup processes
still) - let's print a warning with information about the problem, e.g.:

[ 7196.621368] Freeze of /test took 0 sec, due to unfreezable process 13732:bash, stack:
[ 7196.621396] [] retint_careful+0x14/0x32
[ 7196.621431] [] 0x

The output includes:
- path to problematic freezer cgroup
- timeout in seconds
- unfeezable process pid, comm and stack

https://jira.sw.ru/browse/PSBM-142970

Signed-off-by: Pavel Tikhomirov 
---
vz9: switch to new stack_trace_save_tsk, and also use css
---
 include/linux/sysctl.h |  2 ++
 kernel/cgroup/legacy_freezer.c | 53 --
 kernel/sysctl.c| 10 +++
 3 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 8fd2d3c217c2..b641dd2bba82 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -188,6 +188,8 @@ struct ctl_path {
 };
 
 extern int trusted_exec;
+#define DEFAULT_FREEZE_TIMEOUT (30 * HZ)
+extern int sysctl_freeze_timeout;
 
 extern int ve_allow_module_load;
 
diff --git a/kernel/cgroup/legacy_freezer.c b/kernel/cgroup/legacy_freezer.c
index 08236798d173..fc7ee3ad529b 100644
--- a/kernel/cgroup/legacy_freezer.c
+++ b/kernel/cgroup/legacy_freezer.c
@@ -22,6 +22,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 
 /*
  * A cgroup is freezing if any FREEZING flags are set.  FREEZING_SELF is
@@ -43,6 +47,7 @@ enum freezer_state_flags {
 struct freezer {
struct cgroup_subsys_state  css;
unsigned intstate;
+   unsigned long   freeze_jiffies;
 };
 
 static DEFINE_MUTEX(freezer_mutex);
@@ -225,6 +230,42 @@ static void freezer_fork(struct task_struct *task)
mutex_unlock(&freezer_mutex);
 }
 
+#define MAX_STACK_TRACE_DEPTH   64
+
+static void check_freezer_timeout(struct cgroup_subsys_state *css, struct task_struct *task)
+{
+   static DEFINE_RATELIMIT_STATE(freeze_timeout_rs, DEFAULT_FREEZE_TIMEOUT, 1);
+   int __freeze_timeout = READ_ONCE(sysctl_freeze_timeout);
+   struct freezer *freezer = css_freezer(css);
+   unsigned long entries[MAX_STACK_TRACE_DEPTH];
+   static char freezer_cg_name[PATH_MAX];
+   unsigned long nr_entries;
+   pid_t tgid;
+   int i;
+
+   if (!freezer->freeze_jiffies ||
+   freezer->freeze_jiffies + __freeze_timeout > get_jiffies_64())
+   return;
+
+   if (!__ratelimit(&freeze_timeout_rs))
+   return;
+
+   if (cgroup_path(css->cgroup, freezer_cg_name, PATH_MAX) < 0)
+   return;
+
+   tgid = task_pid_nr_ns(task, &init_pid_ns);
+
+   printk(KERN_WARNING "Freeze of %s took %d sec, due to unfreezable 
process %d:%s, stack:\n",
+  freezer_cg_name, __freeze_timeout/HZ, tgid, task->comm);
+
+   nr_entries = stack_trace_save_tsk(task, entries,
+ MAX_STACK_TRACE_DEPTH, 0);
+
+   for (i = 0; i < nr_entries; i++) {
+   printk(KERN_WARNING "[<%pK>] %pB\n", (void *)entries[i], (void 
*)entries[i]);
+   }
+}
+
 /**
  * update_if_frozen - update whether a cgroup finished freezing
  * @css: css of interest
@@ -278,8 +319,10 @@ static void update_if_frozen(struct cgroup_subsys_state *css)
 * completion.  Consider it frozen in addition to
 * the usual frozen condition.
 */
-   if (!frozen(task) && !freezer_should_skip(task))
+   if (!frozen(task) && !freezer_should_skip(task)) {
+   check_freezer_timeout(css, task);
goto out_iter_end;
+   }
}
}
 
@@ -356,8 +399,10 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze,
return;
 
if (freeze) {
-   if (!(freezer->state & CGROUP_FREEZING))
+   if (!(freezer->state & CGROUP_FREEZING)) {
atomic_inc(&system_freezing_cnt);
+   freezer->freeze_jiffies = get_jiffies_64();
+   }
freezer->state |= state;
freeze_cgroup(freezer);
} else {
@@ -366,8 +411,10 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze,
freezer->state &= ~state;
 
if (!(freezer->state & CGROUP_FREEZING)) {
-   if (was_freezing)
+   if (was_freezing) {
+ 

[Devel] [PATCH v3 RH7] cgroup_freezer: print information about unfreezable process

2022-11-29 Thread Pavel Tikhomirov
Add a sysctl kernel.freeze_cgroup_timeout (default value 30 * HZ).

If one writes FROZEN to freezer.state file and after a timeout of
kernel.freeze_cgroup_timeout one still reads FREEZING from freezer.state
file (meaning that kernel does not succeed to freeze cgroup processes
still) - let's print a warning with information about the problem, e.g.:

[ 7196.621368] Freeze of /test took 0 sec, due to unfreezable process 13732:bash, stack:
[ 7196.621396] [] retint_careful+0x14/0x32
[ 7196.621431] [] 0x

The output includes:
- path to problematic freezer cgroup
- timeout in seconds
- unfeezable process pid, comm and stack

https://jira.sw.ru/browse/PSBM-142970

Signed-off-by: Pavel Tikhomirov 
---
v2: fix pointer print formatting %pS -> %pB
v3: use kmalloc for entries and freezer_cg_name + reformat
---
 include/linux/sysctl.h  |  2 ++
 kernel/cgroup_freezer.c | 72 +++--
 kernel/sysctl.c | 10 ++
 3 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index f28d9fb58c03..798b0465cb93 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -189,6 +189,8 @@ struct ctl_path {
 extern int ve_allow_module_load;
 extern int __read_mostly lazytime_default;
 extern int trusted_exec;
+#define DEFAULT_FREEZE_TIMEOUT (30 * HZ)
+extern int sysctl_freeze_timeout;
 
 #ifdef CONFIG_SYSCTL
 
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index f31d68f55db0..d4747ff98090 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -21,6 +21,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 
 /*
  * A cgroup is freezing if any FREEZING flags are set.  FREEZING_SELF is
@@ -43,6 +47,7 @@ struct freezer {
struct cgroup_subsys_state  css;
unsigned intstate;
spinlock_t  lock;
+   unsigned long   freeze_jiffies;
 };
 
 static inline struct freezer *cgroup_freezer(struct cgroup *cgroup)
@@ -242,6 +247,61 @@ static void freezer_fork(struct task_struct *task, void *private)
rcu_read_unlock();
 }
 
+#define MAX_STACK_TRACE_DEPTH   64
+
+static void check_freezer_timeout(struct cgroup *cgroup,
+ struct task_struct *task)
+{
+   static DEFINE_RATELIMIT_STATE(freeze_timeout_rs,
+ DEFAULT_FREEZE_TIMEOUT, 1);
+   int __freeze_timeout = READ_ONCE(sysctl_freeze_timeout);
+   struct freezer *freezer = cgroup_freezer(cgroup);
+   struct stack_trace trace;
+   unsigned long *entries;
+   char *freezer_cg_name;
+   pid_t tgid;
+   int i;
+
+   if (!freezer->freeze_jiffies ||
+   freezer->freeze_jiffies + __freeze_timeout > get_jiffies_64())
+   return;
+
+   if (!__ratelimit(&freeze_timeout_rs))
+   return;
+
+   freezer_cg_name = kmalloc(PATH_MAX, GFP_KERNEL);
+   if (!freezer_cg_name)
+   return;
+
+   if (cgroup_path(cgroup, freezer_cg_name, PATH_MAX) < 0)
+   goto free_cg_name;
+
+   tgid = task_pid_nr_ns(task, &init_pid_ns);
+
+   printk(KERN_WARNING "Freeze of %s took %d sec, "
+  "due to unfreezable process %d:%s, stack:\n",
+  freezer_cg_name, __freeze_timeout/HZ, tgid, task->comm);
+
+   entries = kmalloc(MAX_STACK_TRACE_DEPTH * sizeof(*entries),
+ GFP_KERNEL);
+   if (!entries)
+   goto free_cg_name;
+
+   memset(&trace, 0, sizeof(trace));
+   trace.max_entries = MAX_STACK_TRACE_DEPTH;
+   trace.entries = entries;
+   save_stack_trace_tsk(task, &trace);
+
+   for (i = 0; i < trace.nr_entries; i++) {
+   printk(KERN_WARNING "[<%pK>] %pB\n",
+  (void *)entries[i], (void *)entries[i]);
+   }
+
+   kfree(entries);
+free_cg_name:
+   kfree(freezer_cg_name);
+}
+
 /**
  * update_if_frozen - update whether a cgroup finished freezing
  * @cgroup: cgroup of interest
@@ -293,8 +353,10 @@ static void update_if_frozen(struct cgroup *cgroup)
 * completion.  Consider it frozen in addition to
 * the usual frozen condition.
 */
-   if (!frozen(task) && !freezer_should_skip(task))
+   if (!frozen(task) && !freezer_should_skip(task)) {
+   check_freezer_timeout(cgroup, task);
goto out_iter_end;
+   }
}
}
 
@@ -367,8 +429,10 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze,
return;
 
if (freeze) {
-   if (!(freezer->state & CGROUP_FREEZING))
+   if (!(freezer->state & C

[Devel] [PATCH v3 RH9] cgroup_freezer: print information about unfreezable process

2022-11-30 Thread Pavel Tikhomirov
Add a sysctl kernel.freeze_cgroup_timeout (default value 30 * HZ).

If one writes FROZEN to freezer.state file and after a timeout of
kernel.freeze_cgroup_timeout one still reads FREEZING from freezer.state
file (meaning that kernel does not succeed to freeze cgroup processes
still) - let's print a warning with information about the problem, e.g.:

[ 7196.621368] Freeze of /test took 0 sec, due to unfreezable process 13732:bash, stack:
[ 7196.621396] [] retint_careful+0x14/0x32
[ 7196.621431] [] 0x

The output includes:
- path to problematic freezer cgroup
- timeout in seconds
- unfreezable process pid, comm and stack

https://jira.sw.ru/browse/PSBM-142970

Signed-off-by: Pavel Tikhomirov 
---
v3: use kmalloc for entries and freezer_cg_name + reformat
---
 include/linux/sysctl.h |  2 +
 kernel/cgroup/legacy_freezer.c | 71 --
 kernel/sysctl.c| 10 +
 3 files changed, 80 insertions(+), 3 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 8fd2d3c217c2..b641dd2bba82 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -188,6 +188,8 @@ struct ctl_path {
 };
 
 extern int trusted_exec;
+#define DEFAULT_FREEZE_TIMEOUT (30 * HZ)
+extern int sysctl_freeze_timeout;
 
 extern int ve_allow_module_load;
 
diff --git a/kernel/cgroup/legacy_freezer.c b/kernel/cgroup/legacy_freezer.c
index 08236798d173..eea2f0b924f1 100644
--- a/kernel/cgroup/legacy_freezer.c
+++ b/kernel/cgroup/legacy_freezer.c
@@ -22,6 +22,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 
 /*
  * A cgroup is freezing if any FREEZING flags are set.  FREEZING_SELF is
@@ -43,6 +47,7 @@ enum freezer_state_flags {
 struct freezer {
struct cgroup_subsys_state  css;
unsigned intstate;
+   unsigned long   freeze_jiffies;
 };
 
 static DEFINE_MUTEX(freezer_mutex);
@@ -225,6 +230,60 @@ static void freezer_fork(struct task_struct *task)
mutex_unlock(&freezer_mutex);
 }
 
+#define MAX_STACK_TRACE_DEPTH   64
+
+static void check_freezer_timeout(struct cgroup_subsys_state *css,
+ struct task_struct *task)
+
+{
+   static DEFINE_RATELIMIT_STATE(freeze_timeout_rs,
+ DEFAULT_FREEZE_TIMEOUT, 1);
+   int __freeze_timeout = READ_ONCE(sysctl_freeze_timeout);
+   struct freezer *freezer = css_freezer(css);
+   unsigned long nr_entries;
+   unsigned long *entries;
+   char *freezer_cg_name;
+   pid_t tgid;
+   int i;
+
+   if (!freezer->freeze_jiffies ||
+   freezer->freeze_jiffies + __freeze_timeout > get_jiffies_64())
+   return;
+
+   if (!__ratelimit(&freeze_timeout_rs))
+   return;
+
+   freezer_cg_name = kmalloc(PATH_MAX, GFP_KERNEL);
+   if (!freezer_cg_name)
+   return;
+
+   if (cgroup_path(css->cgroup, freezer_cg_name, PATH_MAX) < 0)
+   goto free_cg_name;
+
+   tgid = task_pid_nr_ns(task, &init_pid_ns);
+
+   printk(KERN_WARNING "Freeze of %s took %d sec, "
+  "due to unfreezable process %d:%s, stack:\n",
+  freezer_cg_name, __freeze_timeout/HZ, tgid, task->comm);
+
+   entries = kmalloc(MAX_STACK_TRACE_DEPTH * sizeof(*entries),
+ GFP_KERNEL);
+   if (!entries)
+   goto free_cg_name;
+
+   nr_entries = stack_trace_save_tsk(task, entries,
+ MAX_STACK_TRACE_DEPTH, 0);
+
+   for (i = 0; i < nr_entries; i++) {
+   printk(KERN_WARNING "[<%pK>] %pB\n",
+  (void *)entries[i], (void *)entries[i]);
+   }
+
+   kfree(entries);
+free_cg_name:
+   kfree(freezer_cg_name);
+}
+
 /**
  * update_if_frozen - update whether a cgroup finished freezing
  * @css: css of interest
@@ -278,8 +337,10 @@ static void update_if_frozen(struct cgroup_subsys_state *css)
 * completion.  Consider it frozen in addition to
 * the usual frozen condition.
 */
-   if (!frozen(task) && !freezer_should_skip(task))
+   if (!frozen(task) && !freezer_should_skip(task)) {
+   check_freezer_timeout(css, task);
goto out_iter_end;
+   }
}
}
 
@@ -356,8 +417,10 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze,
return;
 
if (freeze) {
-   if (!(freezer->state & CGROUP_FREEZING))
+   if (!(freezer->state & CGROUP_FREEZING)) {
atomic_inc(&system_freezing_cnt);
+   freezer->freeze_jiffies = get_jiffies_64();
+ 

Re: [Devel] [PATCH RH9] cgroup/ve: fix ve_hide_cgroups calling in cgroup_get_tree

2022-12-22 Thread Pavel Tikhomirov
Please merge, as https://jira.sw.ru/browse/PSBM-139100 is fixed and now 
we can have this too.


On 05.03.2022 12:40, Pavel Tikhomirov wrote:

Variable ret was used uninitialized in case of !ve_hide_cgroups() and
also a reference on cgrp_dfl_root.cgrp was leaked in the opposite case.

Fixes: 360077892030 ("ve/cgroup: hide non-virtualized cgroups in container")
Signed-off-by: Pavel Tikhomirov 
---
  kernel/cgroup/cgroup.c | 9 -
  1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index be154b5eed77..f0c844087964 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2475,15 +2475,14 @@ static int cgroup_get_tree(struct fs_context *fc)
struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
int ret;
  
+	if (ve_hide_cgroups(ctx->root))
+   return -EPERM;
+
cgrp_dfl_visible = true;
cgroup_get_live(&cgrp_dfl_root.cgrp);
ctx->root = &cgrp_dfl_root;
  
-	if (ve_hide_cgroups(ctx->root))
-   ret = -EPERM;
-
-   if (!ret)
-   ret = cgroup_do_get_tree(fc);
+   ret = cgroup_do_get_tree(fc);
if (!ret)
apply_cgroup_root_flags(ctx->flags);
return ret;


--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH9 2/2] netfilter: fix compilation error in nf_log_unknown_packet

2023-01-10 Thread Pavel Tikhomirov
net/netfilter/nf_log_syslog.c: In function 'nf_log_unknown_packet':
net/netfilter/nf_log_syslog.c:932:9: error: too few arguments to function 'nf_log_buf_close'
  932 | nf_log_buf_close(m);
  | ^~~~
In file included from net/netfilter/nf_log_syslog.c:25:
./include/net/netfilter/nf_log.h:100:6: note: declared here
  100 | void nf_log_buf_close(struct nf_log_buf *m, struct net *net);
  |  ^~~~

Fixes: a245e8cecfbe ("ve/nf_log_syslog: virtualize packet logging per-ve")

Signed-off-by: Pavel Tikhomirov 
---
 net/netfilter/nf_log_syslog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_log_syslog.c b/net/netfilter/nf_log_syslog.c
index b05089283ddc..1ec5c15f426b 100644
--- a/net/netfilter/nf_log_syslog.c
+++ b/net/netfilter/nf_log_syslog.c
@@ -929,7 +929,7 @@ static void nf_log_unknown_packet(struct net *net, u_int8_t pf,
 
dump_mac_header(m, loginfo, skb);
 
-   nf_log_buf_close(m);
+   nf_log_buf_close(m, net);
 }
 
 static void nf_log_netdev_packet(struct net *net, u_int8_t pf,
-- 
2.39.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH9 1/2] ve: fix compilation warning in nf_log_allowed_ve

2023-01-10 Thread Pavel Tikhomirov
net/netfilter/nf_log_syslog.c: In function 'nf_log_allowed_ve':
net/netfilter/nf_log_syslog.c:52:31: warning: passing argument 1 of 'is_ve_init_net' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
   52 | return is_ve_init_net(net) || sysctl_nf_log_all_netns;
  |   ^~~
In file included from net/netfilter/nf_log_syslog.c:19:
./include/linux/ve.h:222:40: note: expected 'struct net *' but argument is of type 'const struct net *'
  222 | extern bool is_ve_init_net(struct net *net);
  |^~~

Fixes: e817f12be9be ("ve/nf_log_syslog: allow packet logging in ve init netns")

Signed-off-by: Pavel Tikhomirov 
---
 kernel/ve/ve.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 8f15aefcd6d0..e6b76e9b6175 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -297,7 +297,7 @@ struct net *ve_get_net_ns(struct ve_struct* ve)
 }
 EXPORT_SYMBOL(ve_get_net_ns);
 
-bool is_ve_init_net(struct net *net)
+bool is_ve_init_net(const struct net *net)
 {
struct ve_struct *ve = net->owner_ve;
struct nsproxy *ve_ns;
-- 
2.39.0

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [vz7 PATCH 2/2] devcg: Allow wildcard exceptions in DENY child cgroups PSBM-144033

2023-01-19 Thread Pavel Tikhomirov

The patch does not apply, please fix.

[snorch@turmoil vzkernel-vz7]$ git am -3 ~/Downloads/patches/nborisov/devcg/*
Applying: devcg: Move match_exception_partial before match_exception PSBM-144033

Applying: devcg: Allow wildcard exceptions in DENY child cgroups PSBM-144033
Using index info to reconstruct a base tree...
error: patch failed: security/device_cgroup.c:504
error: security/device_cgroup.c: patch does not apply
error: Did you hand edit your patch?
It does not apply to blobs recorded in its index.
Patch failed at 0002 devcg: Allow wildcard exceptions in DENY child cgroups PSBM-144033

hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

On 16.12.2022 17:38, Nikolay Borisov wrote:

In containerized environments there arise cases where we might want to
allow wildcard exceptions when the parent cg doesn't have such. This for
example arises when systemd services are being setup in containers. In
order to allow systemd to function we must allow it to write wildcard
(i.e b *:* rwm) rules in the child group. At the same time in order not
to break the fundamental invariant of the device cgroup hierarchy that
children cannot be more permissive than their parents instead of blindly
trusting those rules, simply skip them in the child cgroup and defer to
the parent's exceptions.

For example assume we have A/B, where A has default behavior 'deny' and
B was created afterwards and subsequently also has 'deny' default
behavior. With this patch it's possible to write "b *:* rwm" in B which
would also result in EPERM when trying to access any device that doesn't
contain an exception in A:

 mkdir A
 echo "a" > A/devices.deny
 mkdir A/B
 echo "c *:*" > A/B/devices.allow <-- allows to create the exception
 but it's essentially a noop

 echo "c 1:3 rw" > A < -- now trying to open /dev/nul (matching 1:3)
 by a process in B would succeed.

Implementing this doesn't really break the main invariant that children
shouldn't have more access than their ancestors.

Signed-off-by: Nikolay Borisov 
---
  security/device_cgroup.c | 54 +++-
  1 file changed, 42 insertions(+), 12 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index f7948334e318..302159d21d15 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -61,6 +61,11 @@ struct dev_cgroup {
struct list_head propagate_pending;
  };

+static bool is_wildcard_exception(struct dev_exception_item *ex)
+{
+   return ex->minor == ~0 || ex->major == ~0;
+}
+
  static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s)
  {
return container_of(s, struct dev_cgroup, css);
@@ -224,6 +229,7 @@ static int devcgroup_online(struct cgroup *cgroup)
if (!ret)
dev_cgroup->behavior = parent_dev_cgroup->behavior;
}
+
mutex_unlock(&devcgroup_mutex);

return ret;
@@ -434,9 +440,9 @@ static bool match_exception_partial(struct list_head *exceptions, short type,

  /**
   * match_exception- iterates the exception list trying to match a rule
- *   based on type, major, minor and access type. It is
- *   considered a match if an exception is found that
- *   will contain the entire range of provided parameters.
+ *   based on type, major, minor and access type. It is
+ *   considered a match if an exception is found that
+ *   will contain the entire range of provided parameters.
   * @exceptions: list of exceptions
   * @type: device type (DEV_BLOCK or DEV_CHAR)
   * @major: device file major number, ~0 to match all
@@ -446,10 +452,11 @@ static bool match_exception_partial(struct list_head *exceptions, short type,
   * returns: true in case it matches an exception completely
   */
  static bool match_exception(struct dev_cgroup *dev_cgroup, short type,
-   u32 major, u32 minor, short access)
+   u32 major, u32 minor, short access, bool check_parent)
  {
struct dev_exception_item *ex;
struct cgroup *cgrp = dev_cgroup->css.cgroup;
+   bool wildcard_skipped = false;
struct list_head *exceptions = &dev_cgroup->exceptions;

list_for_each_entry_rcu(ex, exceptions, list) {
@@ -464,6 +471,11 @@ static bool match_exception(struct dev_cgroup *dev_cgroup, short type,
continue;
if (ex->minor != ~0 && ex->minor != minor)
continue;
+   /* skip wildcard rule if we are child cg */
+   if (is_wildcard_exception(ex) && !test_bit(CGRP_

Re: [Devel] [vz7 PATCH 2/2] devcg: Allow wildcard exceptions in DENY child cgroups PSBM-144033

2023-01-19 Thread Pavel Tikhomirov




On 16.12.2022 17:38, Nikolay Borisov wrote:

In containerized environments there arise cases where we might want to
allow wildcard exceptions when the parent cg doesn't have such. This for
example arises when systemd services are being setup in containers. In
order to allow systemd to function we must allow it to write wildcard
(i.e b *:* rwm) rules in the child group. At the same time in order not
to break the fundamental invariant of the device cgroup hierarchy that
children cannot be more permissive than their parents instead of blindly
trusting those rules, simply skip them in the child cgroup and defer to
the parent's exceptions.

For example assume we have A/B, where A has default behavior 'deny' and
B was created afterwards and subsequently also has 'deny' default
behavior. With this patch it's possible to write "b *:* rwm" in B which
would also result in EPERM when trying to access any device that doesn't
contain an exception in A:

 mkdir A
 echo "a" > A/devices.deny
 mkdir A/B
 echo "c *:*" > A/B/devices.allow <-- allows to create the exception
 but it's essentially a noop

 echo "c 1:3 rw" > A < -- now trying to open /dev/nul (matching 1:3)
 by a process in B would succeed.

Implementing this doesn't really break the main invariant that children
shouldn't have more access than their ancestors.

Signed-off-by: Nikolay Borisov 
---
  security/device_cgroup.c | 54 +++-
  1 file changed, 42 insertions(+), 12 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index f7948334e318..302159d21d15 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -61,6 +61,11 @@ struct dev_cgroup {
struct list_head propagate_pending;
  };

+static bool is_wildcard_exception(struct dev_exception_item *ex)
+{
+   return ex->minor == ~0 || ex->major == ~0;
+}
+
  static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s)
  {
return container_of(s, struct dev_cgroup, css);
@@ -224,6 +229,7 @@ static int devcgroup_online(struct cgroup *cgroup)
if (!ret)
dev_cgroup->behavior = parent_dev_cgroup->behavior;
}
+
mutex_unlock(&devcgroup_mutex);

return ret;
@@ -434,9 +440,9 @@ static bool match_exception_partial(struct list_head *exceptions, short type,

  /**
   * match_exception- iterates the exception list trying to match a rule
- *   based on type, major, minor and access type. It is
- *   considered a match if an exception is found that
- *   will contain the entire range of provided parameters.
+ *   based on type, major, minor and access type. It is
+ *   considered a match if an exception is found that
+ *   will contain the entire range of provided parameters.
   * @exceptions: list of exceptions
   * @type: device type (DEV_BLOCK or DEV_CHAR)
   * @major: device file major number, ~0 to match all
@@ -446,10 +452,11 @@ static bool match_exception_partial(struct list_head *exceptions, short type,
   * returns: true in case it matches an exception completely
   */
  static bool match_exception(struct dev_cgroup *dev_cgroup, short type,
-   u32 major, u32 minor, short access)
+   u32 major, u32 minor, short access, bool check_parent)
  {
struct dev_exception_item *ex;
struct cgroup *cgrp = dev_cgroup->css.cgroup;
+   bool wildcard_skipped = false;
struct list_head *exceptions = &dev_cgroup->exceptions;

list_for_each_entry_rcu(ex, exceptions, list) {
@@ -464,6 +471,11 @@ static bool match_exception(struct dev_cgroup *dev_cgroup, short type,
continue;
if (ex->minor != ~0 && ex->minor != minor)
continue;
+   /* skip wildcard rule if we are child cg */
+   if (is_wildcard_exception(ex) && !test_bit(CGRP_VE_ROOT, &cgrp->flags)) {
+   wildcard_skipped = true;
+   continue;
+   }

/* provided access cannot have more than the exception rule */
mismatched_bits = access & (~ex->access) & ~ACC_MOUNT;
@@ -477,9 +489,26 @@ static bool match_exception(struct dev_cgroup *dev_cgroup, short type,
return true;
}

+   /* we matched a wildcard rule, so let's check for a
+* more specific one in the parent
+*/
+   if (wildcard_skipped && check_parent) {
+   struct cgroup *parent = cgrp->parent;
+   struct dev_cgroup *devcg_parent = cgroup_to_devcgroup(parent);
+   if (devcg_parent->behavior == DEVCG_DEFAULT_ALLOW)
+   /* Can't match any of the exceptions, even partially */
+ 

Re: [Devel] [vz7 PATCH 1/2] devcg: Move match_exception_partial before match_exception PSBM-144033

2023-01-19 Thread Pavel Tikhomirov




-static bool match_exception(struct list_head *exceptions, short type,
-   u32 major, u32 minor, short access)
+static bool match_exception(struct dev_cgroup *dev_cgroup, short type,
+   u32 major, u32 minor, short access)


Does it compile after this change? It seems you not only move this
function, but also break it.

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH vz9] mm: per memory cgroup page cache limit

2023-01-19 Thread Pavel Tikhomirov
Author should be changed back to Andrey Ryabinin, as on rebase we don't
change the author of the original patch.
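
(For reference, a sketch of restoring the author when re-committing; the
address here is a placeholder, not the real one:

  git commit --amend --author="Andrey Ryabinin <address>"

Applying the original mail with "git am" preserves the From: line by
itself.)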


In general the patch looks ok.

I feel a bit suspicious about removing the WARN_ON in mem_cgroup_migrate;
if you can explain why we don't need it, that would be nice.


See other comments inline:

On 19.01.2023 13:51, Alexander Atanasov wrote:

Forward port feature: mm: per memory cgroup page cache limit.

The original implementation consisted of these commits:
commit 758d52e33a67 ("configs: Enable CONFIG_PAGE_EXTENSION")
commit 741beaa93c89 ("mm: introduce page vz extension (using page_ext)")
commit d42d3c8b849d ("mm/memcg: limit page cache in memcg hack")

This port drops the page vz extensions and uses a bit in memcg_data
to mark the page as cache. The benefit is that the implementation
and porting got more simple. If we require new flags then the newly
introduced folio can be used.

Feature exposes two files to set limit and to check usage at
/sys/fs/cgroup/memory/pagecache_limiter/memory.cache.limit_in_bytes
/sys/fs/cgroup/memory/pagecache_limiter/memory.cache.usage_in_bytes
and is enabled via /sys/fs/cgroup/memory/pagecache_limiter/tasks.

https://jira.sw.ru/browse/PSBM-144609
Signed-off-by: Alexander Atanasov 
---
  include/linux/memcontrol.h |  23 +++-
  mm/filemap.c   |   3 +-
  mm/memcontrol.c| 211 +
  3 files changed, 191 insertions(+), 46 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 561db06f1fd8..dc450dce7049 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -273,6 +273,7 @@ struct mem_cgroup {
/* Legacy consumer-oriented counters */
struct page_counter kmem;   /* v1 only */
struct page_counter tcpmem; /* v1 only */
+   struct page_counter cache;
  
  	/* Range enforcement for interrupt charges */

struct work_struct high_work;
@@ -405,8 +406,10 @@ enum page_memcg_data_flags {
MEMCG_DATA_OBJCGS = (1UL << 0),
/* page has been accounted as a non-slab kernel page */
MEMCG_DATA_KMEM = (1UL << 1),
+   /* page has been accounted as a cache page */
+   MEMCG_DATA_PGCACHE = (1UL << 2),
/* the next bit after the last actual flag */
-   __NR_MEMCG_DATA_FLAGS  = (1UL << 2),
+   __NR_MEMCG_DATA_FLAGS  = (1UL << 3),
  };
  
  #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)

@@ -771,11 +774,25 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp);
  static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
gfp_t gfp)
  {
-   if (mem_cgroup_disabled())
-   return 0;
return __mem_cgroup_charge(folio, mm, gfp);
  }
  
+int mem_cgroup_charge_cache(struct folio *folio, struct mm_struct *mm,
+   gfp_t gfp);
+


You probably lost this hunk in `#else /* CONFIG_MEMCG */`:

@@  @@ static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
  	return 0;
   }

  +static inline int mem_cgroup_charge_cache(struct page *page, struct mm_struct *mm,
  +					   gfp_t gfp_mask)
  +{
  +	return 0;
  +}
  +

This would likely break compilation with disabled memcg.


+/*
+ * folio_memcg_cache - Check if the folio has the pgcache flag set.
+ * @folio: Pointer to the folio.
+ *
+ * Checks if the folio has page cache flag set. The caller must ensure
+ * that the folio has an associated memory cgroup. It's not safe to call
+ * this function against some types of folios, e.g. slab folios.
+ */
+static inline bool folio_memcg_cache(struct folio *folio)
+{
+   return folio->memcg_data & MEMCG_DATA_PGCACHE;
+}
+
  int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm,
  gfp_t gfp, swp_entry_t entry);
  void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
diff --git a/mm/filemap.c b/mm/filemap.c
index 2d63e53980e4..d568ffc0d416 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -841,7 +841,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
mapping_set_update(&xas, mapping);
  
  	if (!huge) {

-   int error = mem_cgroup_charge(folio, NULL, gfp);
+   int error = mem_cgroup_charge_cache(folio, NULL, gfp);
+
VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
if (error)
return error;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6fa13539f3e5..ae1cb300eebc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -218,6 +218,7 @@ enum res_type {
_OOM_TYPE,
_KMEM,
_TCP,
+   _CACHE,
  };
  
  #define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))

@@ -2207,6 +2208,7 @@ struct memcg_stock_pcp {
int nr_slab_unreclaimable_b;
  #endif
  
+	unsigned int cache_nr_pages;

struct work_struct work;
unsigned 

Re: [Devel] [vz7 v2 PATCH 1/2] devcg: Move match_exception_partial before match_exception PSBM-144033

2023-01-19 Thread Pavel Tikhomirov
Maybe just put a match_exception_partial declaration line above
match_exception, to decrease the work on the next rebase; this one-liner
could also be logically merged into the second patch. A sketch of what I
mean is below.
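
A sketch (a forward declaration, so the function body need not move;
signature taken from this patch):

	static bool match_exception_partial(struct list_head *exceptions,
					    short type, u32 major, u32 minor,
					    short access);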


On 19.01.2023 15:52, Nikolay Borisov wrote:

This is required as the latter would call the former in upcoming
patch.

Signed-off-by: Nikolay Borisov 
---
v2:
  * Fix compilation breakage
  * Removed irrelevant changes

  security/device_cgroup.c | 85 +---
  1 file changed, 44 insertions(+), 41 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index f9d205f95c25..2f6d5e0ffd00 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -387,42 +387,45 @@ static int devcgroup_seq_read(struct cgroup *cgroup, struct cftype *cft,
  }

  /**
- * match_exception - iterates the exception list trying to match a rule
- *   based on type, major, minor and access type. It is
- *   considered a match if an exception is found that
- *   will contain the entire range of provided parameters.
+ * match_exception_partial - iterates the exception list trying to match a rule
+ *  based on type, major, minor and access type. It is
+ *  considered a match if an exception's range is
+ *  found to contain *any* of the devices specified by
+ *  provided parameters. This is used to make sure no
+ *  extra access is being granted that is forbidden by
+ *  any of the exception list.
   * @exceptions: list of exceptions
   * @type: device type (DEV_BLOCK or DEV_CHAR)
   * @major: device file major number, ~0 to match all
   * @minor: device file minor number, ~0 to match all
   * @access: permission mask (ACC_READ, ACC_WRITE, ACC_MKNOD)
   *
- * returns: true in case it matches an exception completely
+ * returns: true in case the provided range mat matches an exception completely
   */
-static bool match_exception(struct list_head *exceptions, short type,
-   u32 major, u32 minor, short access)
+static bool match_exception_partial(struct list_head *exceptions, short type,
+   u32 major, u32 minor, short access)
  {
struct dev_exception_item *ex;

list_for_each_entry_rcu(ex, exceptions, list) {
-   short mismatched_bits;
-   bool allowed_mount;
-
if ((type & DEV_BLOCK) && !(ex->type & DEV_BLOCK))
continue;
if ((type & DEV_CHAR) && !(ex->type & DEV_CHAR))
continue;
-   if (ex->major != ~0 && ex->major != major)
+   /*
+* We must be sure that both the exception and the provided
+* range aren't masking all devices
+*/
+   if (ex->major != ~0 && major != ~0 && ex->major != major)
continue;
-   if (ex->minor != ~0 && ex->minor != minor)
+   if (ex->minor != ~0 && minor != ~0 && ex->minor != minor)
continue;
-   /* provided access cannot have more than the exception rule */
-   mismatched_bits = access & (~ex->access) & ~ACC_MOUNT;
-   allowed_mount = !(mismatched_bits & ~ACC_WRITE) &&
-   (ex->access & ACC_MOUNT) &&
-   (access & ACC_MOUNT);
-
-   if (mismatched_bits && !allowed_mount)
+   /*
+* In order to make sure the provided range isn't matching
+* an exception, all its access bits shouldn't match the
+* exception's access bits
+*/
+   if (!(access & ex->access))
continue;
return true;
}
@@ -430,48 +433,48 @@ static bool match_exception(struct list_head *exceptions, short type,
  }

  /**
- * match_exception_partial - iterates the exception list trying to match a rule
- *  based on type, major, minor and access type. It is
- *  considered a match if an exception's range is
- *  found to contain *any* of the devices specified by
- *  provided parameters. This is used to make sure no
- *  extra access is being granted that is forbidden by
- *  any of the exception list.
+ * match_exception - iterates the exception list trying to match a rule
+ *   based on type, major, minor and access type. It is
+ *   considered a match if an exception is found that
+ *   will contain the entire range of provided parameters.
   * @exceptions: list of exceptions
   * @type: device type (DEV_BLOCK or DEV_CHAR)
   * @major: de

Re: [Devel] [vz7 v2 PATCH 2/2] devcg: Allow wildcard exceptions in DENY child cgroups

2023-01-19 Thread Pavel Tikhomirov
I believe this is not covering all cases; for instance, it would break
adding rules to a second-level nested cgroup, if the second-level nested
cgroup is in "default deny" and its parent is in "default deny" and
none of them have CGRP_VE_ROOT set. The parent has an allowing wildcard
rule, and adding the same allowing wildcard rule to the child would fail,
as verify_new_ex -> match_exception would return false.
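
A hypothetical reproduction of the scenario (both levels default-deny, no
VE root involved):

  mkdir /sys/fs/cgroup/devices/A
  echo "a" > /sys/fs/cgroup/devices/A/devices.deny
  echo "b *:* rwm" > /sys/fs/cgroup/devices/A/devices.allow
  mkdir /sys/fs/cgroup/devices/A/B
  echo "b *:* rwm" > /sys/fs/cgroup/devices/A/B/devices.allow   # would now fail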


If you have a node with a kernel with this patch installed, I would like
to play with it; that would make review easier.


On 19.01.2023 15:52, Nikolay Borisov wrote:

In containerized environments there arise cases where we might want to
allow wildcard exceptions when the parent cg doesn't have such. This for
example arises when systemd services are being setup in containers. In
order to allow systemd to function we must allow it to write wildcard
(i.e b *:* rwm) rules in the child group. At the same time in order not
to break the fundamental invariant of the device cgroup hierarchy that
children cannot be more permissive than their parents instead of blindly
trusting those rules, simply skip them in the child cgroup and defer to
the parent's exceptions.

For example assume we have A/B, where A has default behavior 'deny' and
B was created afterwards and subsequently also has 'deny' default
behavior. With this patch it's possible to write "b *:* rwm" in B which
would also result in EPERM when trying to access any device that doesn't
contain an exception in A:

 mkdir A
 echo "a" > A/devices.deny
 mkdir A/B
 echo "c *:*" > A/B/devices.allow <-- allows to create the exception
 but it's essentially a noop

 echo "c 1:3 rw" > A < -- now trying to open /dev/nul (matching 1:3)
 by a process in B would succeed.

Implementing this doesn't really break the main invariant that children
shouldn't have more access than their ancestors.

Signed-off-by: Nikolay Borisov 
---

Changes in v2:
  * Patch is now self-contained.

  security/device_cgroup.c | 55 +++-
  1 file changed, 43 insertions(+), 12 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 2f6d5e0ffd00..0e5fdcc0cbff 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -61,6 +61,11 @@ struct dev_cgroup {
struct list_head propagate_pending;
  };

+static bool is_wildcard_exception(struct dev_exception_item *ex)
+{
+   return ex->minor == ~0 || ex->major == ~0;
+}
+
  static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s)
  {
return container_of(s, struct dev_cgroup, css);
@@ -434,10 +439,10 @@ static bool match_exception_partial(struct list_head *exceptions, short type,

  /**
   * match_exception- iterates the exception list trying to match a rule
- *   based on type, major, minor and access type. It is
- *   considered a match if an exception is found that
- *   will contain the entire range of provided parameters.
- * @exceptions: list of exceptions
+ *   based on type, major, minor and access type. It is
+ *   considered a match if an exception is found that
+ *   will contain the entire range of provided parameters.
+ * @dev_cgroup: cgroup whose exceptions we are checking
   * @type: device type (DEV_BLOCK or DEV_CHAR)
   * @major: device file major number, ~0 to match all
   * @minor: device file minor number, ~0 to match all
@@ -445,10 +450,13 @@ static bool match_exception_partial(struct list_head *exceptions, short type,
   *
   * returns: true in case it matches an exception completely
   */
-static bool match_exception(struct list_head *exceptions, short type,
-   u32 major, u32 minor, short access)
+static bool match_exception(struct dev_cgroup *dev_cgroup, short type,
+   u32 major, u32 minor, short access, bool check_parent)
  {
struct dev_exception_item *ex;
+   struct cgroup *cgrp = dev_cgroup->css.cgroup;
+   struct list_head *exceptions = &dev_cgroup->exceptions;
+   bool wildcard_skipped = false;

list_for_each_entry_rcu(ex, exceptions, list) {
short mismatched_bits;
@@ -462,6 +470,11 @@ static bool match_exception(struct list_head *exceptions, short type,
continue;
if (ex->minor != ~0 && ex->minor != minor)
continue;
+   /* skip wildcard rule if we are child cg */
+   if (is_wildcard_exception(ex) && !test_bit(CGRP_VE_ROOT, 
&cgrp->flags)) {


The comment does not correspond to what you do, so either the comment is
wrong or what you do is wrong. Note: the CGRP_VE_ROOT bit is only set on
root container cgroups while the container is running. Or maybe I'm not
getting the "child cg" thing.



+   wildcard_skipped = true;

Re: [Devel] [vz7 v2 PATCH 2/2] devcg: Allow wildcard exceptions in DENY child cgroups

2023-01-20 Thread Pavel Tikhomirov

I don't like this behavior:

[root@ptikh-vz7 ~]# mkdir /sys/fs/cgroup/devices/test1
[root@ptikh-vz7 ~]# echo "a *:* rwmM" > 
/sys/fs/cgroup/devices/test1/devices.deny

[root@ptikh-vz7 ~]# cat /sys/fs/cgroup/devices/test1/devices.list
[root@ptikh-vz7 ~]# echo "c *:* rwm" > 
/sys/fs/cgroup/devices/test1/devices.allow

[root@ptikh-vz7 ~]# mkdir /sys/fs/cgroup/devices/test1/test2
[root@ptikh-vz7 ~]# echo "b *:* rwm" > 
/sys/fs/cgroup/devices/test1/test2/devices.allow

[root@ptikh-vz7 ~]# cat /sys/fs/cgroup/devices/test1/test2/devices.list
c *:* rwm
b *:* rwm
[root@ptikh-vz7 ~]# echo "c *:* rwm" > 
/sys/fs/cgroup/devices/test1/devices.deny

[root@ptikh-vz7 ~]# cat /sys/fs/cgroup/devices/test1/test2/devices.list
[root@ptikh-vz7 ~]#

When we remove any exception in the ancestors, the "fake" wildcard rules disappear.

I don't like this behavior even more:

[root@ptikh-vz7 ~]# mkdir /sys/fs/cgroup/devices/test3
[root@ptikh-vz7 ~]# echo "a *:* rwmM" > 
/sys/fs/cgroup/devices/test3/devices.deny

[root@ptikh-vz7 ~]# cat /sys/fs/cgroup/devices/test3/devices.list
[root@ptikh-vz7 ~]# echo "c *:* rwm" > 
/sys/fs/cgroup/devices/test3/devices.allow

[root@ptikh-vz7 ~]# mkdir /sys/fs/cgroup/devices/test3/test4
[root@ptikh-vz7 ~]# cat /sys/fs/cgroup/devices/test3/test4/devices.list
c *:* rwm
[root@ptikh-vz7 ~]# echo $$ > /sys/fs/cgroup/devices/test3/test4/tasks
[root@ptikh-vz7 ~]# mknod test c 1 3
mknod: ‘test’: Operation not permitted

A cgroup which is allowed to mknod all char devices can't mknod /dev/null.

I believe there would be more corner cases which behave similarly badly.
I don't feel safe changing match_exception, as this can lead to security
issues, and properly reviewing such a change takes too much time.



If you have a node with a kernel with this patch installed, I would like to play
with it; that would make review easier.


Next time please just provide a node with a kernel with the patches
applied, so that I don't need to build it myself.


On 19.01.2023 18:15, nb wrote:



On 19.01.23 at 17:02, Pavel Tikhomirov wrote:
I believe this is not covering all cases; for instance, it would break
adding rules to a second-level nested cgroup, if the second-level nested
cgroup is in "default deny" and its parent is in "default deny" and
none of them have CGRP_VE_ROOT set. The parent has an allowing wildcard
rule, and adding the same allowing wildcard rule to the child would
fail, as verify_new_ex -> match_exception would return false.


If we don't have CGRP_VE_ROOT then the code should remain as is 
currently in upstream (see my comment about using ve_is_super below). 
After all this is only a problem for our container setup.



A  (default deny)
  \
   B (default deny) (allow wild card)
    \
     C <--- An added allow wildcard will be ignored and the check will be deferred to the parent.


Can you describe visually what scenario you have in mind?





@@ -445,10 +450,13 @@ static bool match_exception_partial(struct list_head *exceptions, short type,

   *
   * returns: true in case it matches an exception completely
   */
-static bool match_exception(struct list_head *exceptions, short type,
-    u32 major, u32 minor, short access)
+static bool match_exception(struct dev_cgroup *dev_cgroup, short type,
+    u32 major, u32 minor, short access, bool check_parent)
  {
  struct dev_exception_item *ex;
+    struct cgroup *cgrp = dev_cgroup->css.cgroup;
+    struct list_head *exceptions = &dev_cgroup->exceptions;
+    bool wildcard_skipped = false;

  list_for_each_entry_rcu(ex, exceptions, list) {
  short mismatched_bits;
@@ -462,6 +470,11 @@ static bool match_exception(struct list_head *exceptions, short type,

  continue;
  if (ex->minor != ~0 && ex->minor != minor)
  continue;
+    /* skip wildcard rule if we are child cg */
+    if (is_wildcard_exception(ex) && !test_bit(CGRP_VE_ROOT, &cgrp->flags)) {


The comment does not correspond to what you do, so either the comment is
wrong or what you do is wrong. Note: the CGRP_VE_ROOT bit is only set on
root container cgroups while the container is running. Or maybe I'm not
getting the "child cg" thing.


A child cg is really any cgroup under a CGRP_VE_ROOT group. What did
you think it was? Actually, thinking more about it, I think in this
condition I should also explicitly check whether we are in a container
context via ve_is_super().





+    wildcard_skipped = true;
+    continue;
+    }

  /* provided access cannot have more than the exception rule */
  mismatched_bits = access & (~ex->access) & ~ACC_MOUNT;






--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH vz9 v2] mm: per memory cgroup page cache limit

2023-01-20 Thread Pavel Tikhomirov




On 20.01.2023 11:47, Alexander Atanasov wrote:

  From: Andrey Ryabinin 

Forward port feature: mm: per memory cgroup page cache limit.

The original implementation consisted of these commits:
commit 758d52e33a67 ("configs: Enable CONFIG_PAGE_EXTENSION")
commit 741beaa93c89 ("mm: introduce page vz extension (using page_ext)")
commit d42d3c8b849d ("mm/memcg: limit page cache in memcg hack")

This port drops the page vz extensions in favor of using a memcg_data
bit to mark a page as cache. The benefit is that the implementation
and porting got more simple. If we require new flags then the newly
introduced folio can be used.


Why do you remove the interface description? Please bring it back.

Feature exposes two files to set limit and to check usage at
  /sys/fs/cgroup/memory/pagecache_limiter/memory.cache.limit_in_bytes
  /sys/fs/cgroup/memory/pagecache_limiter/memory.cache.usage_in_bytes
  and is enabled via /sys/fs/cgroup/memory/pagecache_limiter/tasks.
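
For clarity, usage would look roughly like this (a sketch based on the
paths above; the 512M value is only an example):

  echo $$ > /sys/fs/cgroup/memory/pagecache_limiter/tasks
  echo 512M > /sys/fs/cgroup/memory/pagecache_limiter/memory.cache.limit_in_bytes
  cat /sys/fs/cgroup/memory/pagecache_limiter/memory.cache.usage_in_bytes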



https://jira.sw.ru/browse/PSBM-144609
Signed-off-by: Alexander Atanasov 
  Signed-off-by: Andrey Ryabinin 
---
  include/linux/memcontrol.h |  29 -
  mm/filemap.c   |   3 +-
  mm/memcontrol.c| 219 ++---
  3 files changed, 207 insertions(+), 44 deletions(-)

v1->v2: addressing Pavel's comments for v1
 - fixed compilation without MEMCG
 - try to preserve author
 - fixed line alignment
 - add missed bug traps and WARN_ONs
 - fixed spelling error

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 561db06f1fd8..1a49416300c9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -273,6 +273,7 @@ struct mem_cgroup {
/* Legacy consumer-oriented counters */
struct page_counter kmem;   /* v1 only */
struct page_counter tcpmem; /* v1 only */
+   struct page_counter cache;
  
  	/* Range enforcement for interrupt charges */

struct work_struct high_work;
@@ -405,8 +406,10 @@ enum page_memcg_data_flags {
MEMCG_DATA_OBJCGS = (1UL << 0),
/* page has been accounted as a non-slab kernel page */
MEMCG_DATA_KMEM = (1UL << 1),
+   /* page has been accounted as a cache page */
+   MEMCG_DATA_PGCACHE = (1UL << 2),
/* the next bit after the last actual flag */
-   __NR_MEMCG_DATA_FLAGS  = (1UL << 2),
+   __NR_MEMCG_DATA_FLAGS  = (1UL << 3),
  };
  
  #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)

@@ -771,11 +774,25 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp);
  static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
gfp_t gfp)
  {
-   if (mem_cgroup_disabled())
-   return 0;
return __mem_cgroup_charge(folio, mm, gfp);
  }
  
+int mem_cgroup_charge_cache(struct folio *folio, struct mm_struct *mm,
+  gfp_t gfp);


One more space here, please.


+
+/*
+ * folio_memcg_cache - Check if the folio has the pgcache flag set.
+ * @folio: Pointer to the folio.
+ *
+ * Checks if the folio has page cache flag set. The caller must ensure
+ * that the folio has an associated memory cgroup. It's not safe to call
+ * this function against some types of folios, e.g. slab folios.
+ */
+static inline bool folio_memcg_cache(struct folio *folio)
+{
+   return folio->memcg_data & MEMCG_DATA_PGCACHE;
+}
+
  int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm,
  gfp_t gfp, swp_entry_t entry);
  void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
@@ -1339,6 +1356,12 @@ static inline int mem_cgroup_charge(struct folio *folio,
return 0;
  }
  
+static inline int mem_cgroup_charge_cache(struct folio *folio,
+struct mm_struct *mm, gfp_t gfp)
+{
+   return 0;
+}
+
  static inline int mem_cgroup_swapin_charge_page(struct page *page,
struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
  {
diff --git a/mm/filemap.c b/mm/filemap.c
index 2d63e53980e4..d568ffc0d416 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -841,7 +841,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
mapping_set_update(&xas, mapping);
  
  	if (!huge) {

-   int error = mem_cgroup_charge(folio, NULL, gfp);
+   int error = mem_cgroup_charge_cache(folio, NULL, gfp);
+
VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
if (error)
return error;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6fa13539f3e5..6b462152e77f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -218,6 +218,7 @@ enum res_type {
_OOM_TYPE,
_KMEM,
_TCP,
+   _CACHE,
  };
  
  #define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))

@@ -2207,6 +2208,7 @@ struct memcg_stock_pcp {
 

Re: [Devel] [PATCH vz9 1/2] ve/cgroups: fix use after free in ve_exit_ns

2023-02-01 Thread Pavel Tikhomirov

Nacked-by: Pavel Tikhomirov 

This is not the last reference to ve. The last reference is dropped when
we remove the cgroup directory with rmdir, and one can't remove a cgroup
while there is a process in it. ve_exit_ns is called from a task in this
ve cgroup.
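
To illustrate, the ve lifetime is roughly as follows (a simplified
sketch, not the exact call chain):

	mkdir of ve cgroup     /* ve allocated, reference held by the dir */
	ve_start_container()   /* get_ve(): extra reference taken */
	...
	ve_exit_ns()           /* put_ve(): drops only the start reference */
	rmdir of ve cgroup     /* css release -> ve_destroy(): the last put */

So the put_ve() in ve_exit_ns() cannot free the ve while the cgroup
directory still exists.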


On 30.01.2023 13:00, Alexander Atanasov wrote:

Release the lock before dropping the last reference to ve in
ve_exit_ns, which can lead to a call to ve_destroy, which in turn frees
the ve. In general it is always a bug to drop a reference to an object
while holding locks that live inside it.

https://jira.sw.ru/browse/PSBM-144580
Signed-off-by: Alexander Atanasov 
---
  kernel/ve/ve.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

I've checked vz7 and it does not have this issue.

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 407d7de6e071..80865161670e 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -857,9 +857,11 @@ void ve_exit_ns(struct pid_namespace *pid_ns)
ve_hook_iterate_fini(VE_SS_CHAIN, ve);
ve_list_del(ve);
ve_drop_context(ve);
+   up_write(&ve->op_sem);
+
printk(KERN_INFO "CT: %s: stopped\n", ve_name(ve));
+
put_ve(ve); /* from ve_start_container() */
-   up_write(&ve->op_sem);
  }
  
  u64 ve_get_monotonic(struct ve_struct *ve)


--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH] blk-cbt: Enable interrupts when unlocking in blk_cbt_update_size

2023-02-01 Thread Pavel Tikhomirov




On 01.02.2023 21:22, Nikolay Borisov wrote:

blk_cbt_update_size uses spin_lock_irq to lock the cbt while pages are
being copied and the new cbt is published at q->cbt. This lock is used
to synchronize against blk_cbt_release, which can be called within
softirq context. This function requires that unlocking be done with
spin_unlock_irq so that interrupts are properly re-enabled. Without
this fix the core on which blk_cbt_update_size ran would be left with
interrupts disabled.



Reviewed-by: Pavel Tikhomirov 
Fixes: e69ca16f4135 ("cbt: introduce changed block tracking")


Reported-by: Pavel Tikhomirov 
Signed-off-by: Nikolay Borisov 
---
  block/blk-cbt.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-cbt.c b/block/blk-cbt.c
index 32485c793484..083ed5785031 100644
--- a/block/blk-cbt.c
+++ b/block/blk-cbt.c
@@ -607,7 +607,7 @@ void blk_cbt_update_size(struct block_device *bdev)
}
rcu_assign_pointer(q->cbt, new);
in_use = cbt->count;
-   spin_unlock(&cbt->lock);
+   spin_unlock_irq(&cbt->lock);
if (!in_use)
call_rcu(&cbt->rcu, &cbt_release_callback);
  err_mtx:
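
For reference, the pairing rule being restored here (a minimal sketch,
not the actual cbt code):

	spin_lock_irq(&lock);		/* disables local interrupts */
	/* ... update the protected state ... */
	spin_unlock_irq(&lock);		/* re-enables them; a plain
					   spin_unlock() would leave
					   interrupts off on this CPU */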


--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH vz9 1/2] ve/cgroups: fix use after free in ve_exit_ns

2023-02-01 Thread Pavel Tikhomirov



On 02.02.2023 14:34, Alexander Atanasov wrote:

On 1.02.23 15:06, Pavel Tikhomirov wrote:

Nacked-by: Pavel Tikhomirov 

This is not the last reference to ve. The last reference is dropped when
we remove the cgroup directory with rmdir, and one can't remove a cgroup
while there is a process in it. ve_exit_ns is called from a task in this
ve cgroup.


Then should we fix vz7 to do the same?


I'm not saying that after your change we get bad code; I'm saying that
your explanation about the "last reference" is wrong.


But, yes, I can't find any reason why we need this put under the lock.
So, sorry, I was too hasty; we can still apply the patch, but please
remove the commit message part about the last reference.


And also probably mention that this change appeared at some point during
a rebase from vz7 and seems unneeded, so let's return to the vz7
variant.




If it is really not the last reference, why does it need to hold the
mutex to drop the reference?

And I haven't checked how they are destroyed on reboot.



On 30.01.2023 13:00, Alexander Atanasov wrote:

Release the lock before dropping the last reference to ve in
ve_exit_ns, which can lead to a call to ve_destroy, which in turn frees
the ve. In general it is always a bug to drop a reference to an object
while holding locks that live inside it.

https://jira.sw.ru/browse/PSBM-144580
Signed-off-by: Alexander Atanasov 
---
  kernel/ve/ve.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

I've checked vz7 and it does not have this issue.

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 407d7de6e071..80865161670e 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -857,9 +857,11 @@ void ve_exit_ns(struct pid_namespace *pid_ns)
  ve_hook_iterate_fini(VE_SS_CHAIN, ve);
  ve_list_del(ve);
  ve_drop_context(ve);
+    up_write(&ve->op_sem);
+
  printk(KERN_INFO "CT: %s: stopped\n", ve_name(ve));
+
  put_ve(ve); /* from ve_start_container() */
-    up_write(&ve->op_sem);
  }
  u64 ve_get_monotonic(struct ve_struct *ve)






--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v2 vz9 2/2] ve/cgroups: drop lock when stopping workqueue to avoid dead lock

2023-02-10 Thread Pavel Tikhomirov
Though I like the idea of explicit states, you must be really careful 
when you rework them: vzctl uses those states to wait for a container to 
start or to stop, and changing anything may break vzctl.


On 03.02.2023 04:26, Alexander Atanasov wrote:

Rework the is_running variable into a state, to avoid guessing the
state, to allow stricter checks of what state the ve is in, and to
verify state transitions.

This allows reducing the size of the critical sections that need to
take the lock. All entry points check for a good state before proceeding.

The deadlock is between ve->op_sem and kernfs_rwsem (tracked by lockdep
via kn->active). It is an AB-BA deadlock. When going through sysfs,
kernfs_rwsem is taken first and then the ve code takes ve->op_sem.
When the ve code is called from inside the kernel, ve->op_sem is taken
first, but the code can reach back into kernfs and deadlock.
Reboot and release_agent teardown is the reported case.
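
A sketch of the two lock orders (hypothetical call paths, for
illustration only):

	/* Path A (via sysfs):           Path B (in-kernel, e.g. reboot):
	 *   kernfs: hold kn->active       down_write(&ve->op_sem)
	 *   down_write(&ve->op_sem)       kernfs: wait on kn->active
	 *
	 * Each side holds one lock and waits for the other: AB-BA.
	 */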

https://jira.sw.ru/browse/PSBM-144580
Signed-off-by: Alexander Atanasov 
---
  include/linux/ve.h | 26 ++-
  kernel/ve/ve.c | 65 +++---
  2 files changed, 69 insertions(+), 22 deletions(-)

diff --git a/include/linux/ve.h b/include/linux/ve.h
index a023d9a8d14a..8e173628b532 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -27,6 +27,14 @@ struct user_namespace;
  struct cn_private;
  struct vfsmount;
  
+#define VE_STATE_STARTING 	0

+#define VE_STATE_RUNNING   1
+#define VE_STATE_STOPPING  2
+#define VE_STATE_STOPPED   3
+#define VE_STATE_DEAD  4
+
+#define VE_IS_RUNNING(ve) ((ve)->state == VE_STATE_RUNNING)
+
  struct ve_struct {
struct cgroup_subsys_state  css;
  
@@ -35,7 +43,7 @@ struct ve_struct {

	struct list_head	ve_list;
  
  	envid_t			veid;

-   int is_running;
+   int state;
u8  is_pseudosuper:1;
  
  	struct rw_semaphore	op_sem;

@@ -144,6 +152,18 @@ extern int nr_ve;
  extern unsigned int sysctl_ve_mount_nr;
  
  #ifdef CONFIG_VE

+static inline void ve_set_state(struct ve_struct *ve, int new_state)
+{
+   WARN_ONCE(ve->state == VE_STATE_DEAD, "VE is already DEAD");
+   WARN_ONCE(new_state == VE_STATE_RUNNING && ve->state != VE_STATE_STARTING,
+           "Invalid transition to running from %d\n", ve->state);
+   WARN_ONCE(new_state == VE_STATE_STOPPING && ve->state != VE_STATE_RUNNING,
+           "Invalid transition to stopping from %d\n", ve->state);
+   WARN_ONCE(new_state == VE_STATE_STOPPED && ve->state != VE_STATE_STOPPING,
+           "Invalid transition to stopped from %d\n", ve->state);
+   ve->state = new_state;
+}
+
  void ve_add_to_release_list(struct cgroup *cgrp);
  void ve_rm_from_release_list(struct cgroup *cgrp);
  
@@ -229,6 +249,10 @@ extern bool is_ve_init_net(const struct net *net);

  static inline void ve_stop_ns(struct pid_namespace *ns) { }
  static inline void ve_exit_ns(struct pid_namespace *ns) { }
  
+

+static inline void ve_set_state(struct ve_struct *ve, int new_state) {}
+
+


Don't add excess newlines.

@@ -249,9 +252,9 @@ extern bool is_ve_init_net(const struct net *net);
 static inline void ve_stop_ns(struct pid_namespace *ns) { }
 static inline void ve_exit_ns(struct pid_namespace *ns) { }

-
-static inline void ve_set_state(struct ve_struct *ve, int new_state) {}
-
+static inline void ve_set_state(struct ve_struct *ve, int new_state)
+{
+}


  #define ve_feature_set(ve, f) { true; }
  
  static inline bool current_user_ns_initial(void)

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 80865161670e..8afef0e631d5 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -70,7 +70,7 @@ struct ve_struct ve0 = {
  
  	RCU_POINTER_INITIALIZER(ve_ns, &init_nsproxy),
  
-	.is_running		= 1,

+   .state  = VE_STATE_RUNNING,
.is_pseudosuper = 1,
  
  	.init_cred		= &init_cred,

@@ -662,7 +662,7 @@ void ve_add_to_release_list(struct cgroup *cgrp)
if (!ve)
ve = &ve0;
  
-	if (!ve->is_running) {

+   if (!VE_IS_RUNNING(ve)) {
rcu_read_unlock();
return;
}
@@ -718,7 +718,7 @@ static int ve_start_container(struct ve_struct *ve)
  
  	ve_ns = rcu_dereference_protected(ve->ve_ns, lockdep_is_held(&ve->op_sem));
  
-	if (ve->is_running || ve_ns)

+   if (ve->state != VE_STATE_STARTING || ve_ns)
return -EBUSY;
  
  	if (tsk->task_ve != ve || !is_child_reaper(task_pid(tsk)))

@@ -766,7 +766,7 @@ static int ve_start_container(struct ve_struct *ve)
if (err)
goto err_release_agent_setup;
  
-	ve->is_running = 1;

+   ve_set_state(ve, VE_STATE_RUNNING);
  
  	printk(KERN_INFO "CT: %s: started\n", ve_name(ve));
  
@@ -805,19 +805,31 @@ void ve_stop_ns(struct pid_namespace *pid_ns)

if (!ve_ns || ve_ns->pid_ns_for_children != pid_ns)
goto unlock;
/*
-* Here the VE cha

[Devel] [PATCH RH7] net: don't skip device_rename for non-root container netns

2023-02-22 Thread Pavel Tikhomirov
This patch effectively reverts the commit:
b1c5e22266b5 ("ve/net: allow to rename devices in non-ve namespaces")

The patch says that it allows renaming devices, but instead it skips
the call to device_rename for non-root netnses of the container, ending
up with a non-renamed sysfs link for the renamed device. And if such an
inconsistent device, with differing device name and sysfs name, is moved
to the root netns of the container, systemd-udevd gets an event
notification about it with mixed names. Systemd obviously does not
expect this and goes mad if at the same time the old name of the moved
device collides with some other device name in the root netns of the
container: systemd then disables this other device (e.g. eth0) and
breaks the container network.

The original patch from vz6
diff-ve-net-allow-to-rename-devices-in-non-ve-namespaces
seems just to be a crutch for
diff-ve-vedev-dont-call-netdev_fixup_sysfs-if-device_add-was-not-called
so that sysfs entries of vedev don't break on netns creation.

But as we don't have the latter now we also don't need the former.

https://jira.sw.ru/browse/PSBM-145324

Signed-off-by: Pavel Tikhomirov 
---
 net/core/dev.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index ec3e2d31c203..addd254c9c3c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1196,14 +1196,11 @@ int dev_change_name(struct net_device *dev, const char *newname)
}
 
 rollback:
-   if (!dev_net(dev)->owner_ve->ve_netns ||
-   dev_net(dev)->owner_ve->ve_netns == dev->nd_net) {
-   ret = device_rename(&dev->dev, dev->name);
-   if (ret) {
-   memcpy(dev->name, oldname, IFNAMSIZ);
-   write_seqcount_end(&devnet_rename_seq);
-   return ret;
-   }
+   ret = device_rename(&dev->dev, dev->name);
+   if (ret) {
+   memcpy(dev->name, oldname, IFNAMSIZ);
+   write_seqcount_end(&devnet_rename_seq);
+   return ret;
}
 
write_seqcount_end(&devnet_rename_seq);
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH9] net: don't skip device_rename for non-root container netns

2023-02-22 Thread Pavel Tikhomirov
This patch effectively reverts the commit:

2e365f4fbe5d ("ve/net: allow to rename devices in non-ve namespaces")

The patch says that it allows renaming devices, but instead it skips
the call to device_rename for non-root netnses of the container, ending
up with a non-renamed sysfs link for the renamed device. And if such an
inconsistent device, with differing device name and sysfs name, is moved
to the root netns of the container, systemd-udevd gets an event
notification about it with mixed names. Systemd obviously does not
expect this and goes mad if at the same time the old name of the moved
device collides with some other device name in the root netns of the
container: systemd then disables this other device (e.g. eth0) and
breaks the container network.

The original patch from vz6
diff-ve-net-allow-to-rename-devices-in-non-ve-namespaces
seems just to be a crutch for
diff-ve-vedev-dont-call-netdev_fixup_sysfs-if-device_add-was-not-called
so that sysfs entries of vedev don't break on netns creation.

But as we don't have the latter now we also don't need the former.

https://jira.sw.ru/browse/PSBM-145324

Signed-off-by: Pavel Tikhomirov 
---
 include/linux/ve.h |  1 -
 kernel/ve/ve.c | 15 ---
 net/core/dev.c | 22 --
 3 files changed, 38 deletions(-)

diff --git a/include/linux/ve.h b/include/linux/ve.h
index a023d9a8d14a..678cd9b6a94a 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -218,7 +218,6 @@ extern int vz_security_protocol_check(struct net *net, int protocol);
 
 int ve_net_hide_sysctl(struct net *net);
 
-extern struct net *ve_get_net_ns(struct ve_struct* ve);
 extern bool is_ve_init_net(const struct net *net);
 
 #else  /* CONFIG_VE */
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 80865161670e..55d45b5f2fbf 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -32,7 +32,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
@@ -283,20 +282,6 @@ int ve_net_hide_sysctl(struct net *net)
 }
 EXPORT_SYMBOL(ve_net_hide_sysctl);
 
-struct net *ve_get_net_ns(struct ve_struct* ve)
-{
-   struct nsproxy *ve_ns;
-   struct net *net_ns;
-
-   rcu_read_lock();
-   ve_ns = rcu_dereference(ve->ve_ns);
-   net_ns = ve_ns ? get_net(ve_ns->net_ns) : NULL;
-   rcu_read_unlock();
-
-   return net_ns;
-}
-EXPORT_SYMBOL(ve_get_net_ns);
-
 bool is_ve_init_net(const struct net *net)
 {
struct ve_struct *ve = net->owner_ve;
diff --git a/net/core/dev.c b/net/core/dev.c
index 826584477edb..e901c0b28387 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1135,20 +1135,6 @@ static int dev_get_valid_name(struct net *net, struct net_device *dev,
return 0;
 }
 
-#ifdef CONFIG_VE
-static bool ve_dev_can_rename(struct net_device *dev)
-{
-   struct net *net;
-   bool can;
-
-   net = ve_get_net_ns(dev_net(dev)->owner_ve);
-   can = !net || net == dev_net(dev);
-   if (net)
-   put_net(net);
-   return can;
-}
-#endif
-
 /**
  * dev_change_name - change name of a device
  * @dev: device
@@ -1208,11 +1194,6 @@ int dev_change_name(struct net_device *dev, const char *newname)
dev->name_assign_type = NET_NAME_RENAMED;
 
 rollback:
-#ifdef CONFIG_VE
-   if (!ve_dev_can_rename(dev))
-   goto skip_rename;
-#endif
-
ret = device_rename(&dev->dev, dev->name);
if (ret) {
memcpy(dev->name, oldname, IFNAMSIZ);
@@ -1221,9 +1202,6 @@ int dev_change_name(struct net_device *dev, const char *newname)
return ret;
}
 
-#ifdef CONFIG_VE
-skip_rename:
-#endif
up_write(&devnet_rename_sem);
 
netdev_adjacent_rename_links(dev, oldname);
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH vz9] Revert "userns: associate user_struct with the user_namespace"

2023-03-06 Thread Pavel Tikhomirov
After this revert user_struct->pipe_bufs (and obviously other fields =) )
becomes shared between the root user of a container and the root user on
the host, which is likely not what we want: one CT can now make the whole
node starve for pipe buffers, see the PIPE_MIN_DEF_BUFFERS case in
alloc_pipe_info().
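
Roughly what happens there in alloc_pipe_info() (simplified from
mainline, for illustration only):

	user = get_current_user();  /* after the revert: one host-wide
				       user_struct for uid 0 in all CTs */
	if (too_many_pipe_buffers_soft(user_bufs))
		pipe_bufs = PIPE_MIN_DEF_BUFFERS;  /* everyone pays */

So a single container allocating many pipes degrades pipe buffer sizes
for every other container's root as well.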


I don't feel that this revert is right.

On 06.03.2023 21:16, Nikolay Borisov wrote:

In the current vz9 kernel ucounts are already "virtualized", since they
are tracked per-userns/per-uid. Let's remove the Virtuozzo code which
adds yet another level and breaks code that uses setuid. This fixes a
kernel warning generated when running the user08 test from LTP. The
reason for the warning was that on clone with CLONE_NEWUSER plus setuid
the NPROC rlimit would be tracked incorrectly across setuid(), since
with the per-ns user_struct the release_task() invocation on process
exit would account the nproc counts against the wrong user.

This reverts commit ff716deacf0c5c86ca53cee6793ff9382ad5aa02.

https://jira.sw.ru/browse/PSBM-145641

Signed-off-by: Nikolay Borisov 
---
  include/linux/sched/user.h |  1 -
  include/linux/user_namespace.h |  4 
  kernel/user.c  | 22 +-
  kernel/user_namespace.c| 13 -
  4 files changed, 9 insertions(+), 31 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 5945f42179fc..65e447bcb7a4 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -46,7 +46,6 @@ extern struct user_struct *find_user(kuid_t);
  extern struct user_struct root_user;
  #define INIT_USER (&root_user)

-extern struct user_struct * alloc_uid_ns(struct user_namespace *ns, kuid_t);

  /* per-UID process charging. */
  extern struct user_struct * alloc_uid(kuid_t);
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index e5f4c5e9e587..33a4240e6a6f 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -14,9 +14,6 @@
  #define UID_GID_MAP_MAX_BASE_EXTENTS 5
  #define UID_GID_MAP_MAX_EXTENTS 340

-#define UIDHASH_BITS   (CONFIG_BASE_SMALL ? 3 : 7)
-#define UIDHASH_SZ (1 << UIDHASH_BITS)
-
  struct uid_gid_extent {
u32 first;
u32 lower_first;
@@ -70,7 +67,6 @@ struct user_namespace {
struct uid_gid_map  uid_map;
struct uid_gid_map  gid_map;
struct uid_gid_map  projid_map;
-   struct hlist_head   uidhash_table[UIDHASH_SZ];
struct user_namespace   *parent;
int level;
kuid_t  owner;
diff --git a/kernel/user.c b/kernel/user.c
index ca0f7c78a045..c82399c1618a 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -9,7 +9,6 @@
   * able to have per-user limits for system resources.
   */

-#include 
  #include 
  #include 
  #include 
@@ -76,11 +75,14 @@ EXPORT_SYMBOL_GPL(init_user_ns);
   * when changing user ID's (ie setuid() and friends).
   */

+#define UIDHASH_BITS   (CONFIG_BASE_SMALL ? 3 : 7)
+#define UIDHASH_SZ (1 << UIDHASH_BITS)
  #define UIDHASH_MASK  (UIDHASH_SZ - 1)
  #define __uidhashfn(uid)  (((uid >> UIDHASH_BITS) + uid) & UIDHASH_MASK)
-#define uidhashentry(ns, uid)	((ns)->uidhash_table + __uidhashfn((__kuid_val(uid))))
+#define uidhashentry(uid)	(uidhash_table + __uidhashfn((__kuid_val(uid))))

  static struct kmem_cache *uid_cachep;
+static struct hlist_head uidhash_table[UIDHASH_SZ];

  /*
   * The uidhash_lock is mostly taken from process context, but it is
@@ -149,10 +151,9 @@ struct user_struct *find_user(kuid_t uid)
  {
struct user_struct *ret;
unsigned long flags;
-   struct user_namespace *ns = current_user_ns();

spin_lock_irqsave(&uidhash_lock, flags);
-   ret = uid_hash_find(uid, uidhashentry(ns, uid));
+   ret = uid_hash_find(uid, uidhashentry(uid));
spin_unlock_irqrestore(&uidhash_lock, flags);
return ret;
  }
@@ -168,9 +169,9 @@ void free_uid(struct user_struct *up)
free_user(up, flags);
  }

-struct user_struct *alloc_uid_ns(struct user_namespace *ns, kuid_t uid)
+struct user_struct *alloc_uid(kuid_t uid)
  {
-   struct hlist_head *hashent = uidhashentry(ns, uid);
+   struct hlist_head *hashent = uidhashentry(uid);
struct user_struct *up, *new;

spin_lock_irq(&uidhash_lock);
@@ -205,11 +206,6 @@ struct user_struct *alloc_uid_ns(struct user_namespace *ns, kuid_t uid)
return up;
  }

-struct user_struct *alloc_uid(kuid_t uid)
-{
-   return alloc_uid_ns(current_user_ns(), uid);
-}
-
  static int __init uid_cache_init(void)
  {
int n;
@@ -218,11 +214,11 @@ static int __init uid_cache_init(void)
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);

for(n = 0; n < UIDHASH_SZ; ++n)
-   INIT_HLIST_HEAD(init_user_ns.uidhash_table + n);
+   INIT_HLIST_HEAD(uidhash_table + n);

/* Insert the root user immediately (init already runs as root) */
spin

Re: [Devel] [PATCH vz9 1/2] prctl: add option to manage memory allocation scopes

2023-04-06 Thread Pavel Tikhomirov




On 05.04.2023 01:56, Alexander Atanasov wrote:

Currently there is no way to hint to the kernel that it should avoid
triggering page reclaim. Such a hint is useful for networked file
systems, which can deadlock in the synchronous reclaim path, and for
reducing the jitter that synchronous reclaim can induce when streaming.

To aid userspace, add an interface to manage the PF_MEMALLOC,
PF_MEMALLOC_NOIO, PF_MEMALLOC_NOFS and PF_MEMALLOC_PIN flags via prctl.

The interface is defined via the option PR_MEMALLOC_FLAGS and the
respective PR_MEMALLOC_GET_FLAGS, PR_MEMALLOC_SET_FLAGS and
PR_MEMALLOC_CLEAR_FLAGS commands. The flag values used are defined in
the kernel header include/linux/sched.h.

https://jira.sw.ru/browse/PSBM-141577
Signed-off-by: Alexander Atanasov 
---
  include/uapi/linux/prctl.h |  6 ++
  kernel/sys.c   | 33 +
  2 files changed, 39 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 4baf1c5b0be7..409bba71a92b 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -277,4 +277,10 @@ struct prctl_task_ct_fields {
__s64 start_boottime;
  };
  
+/* Set task memalloc flags */

+#define PR_MEMALLOC_FLAGS  1001
+#define PR_MEMALLOC_GET_FLAGS  1
+#define PR_MEMALLOC_SET_FLAGS  2
+#define PR_MEMALLOC_CLEAR_FLAGS3
+
  #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 54d7bc990e8f..170f179fa4e5 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2313,6 +2313,36 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
  
  #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
  
+#define MEMALLOC_FLAGS_MASK (PF_MEMALLOC | PF_MEMALLOC_NOFS | \
+			     PF_MEMALLOC_NOIO | PF_MEMALLOC_PIN)
+
+static int prctl_memalloc_flags(int opt, unsigned long flags)
+{
+   unsigned int pflags;
+
+#ifdef CONFIG_VE
+   if (!ve_is_super(get_exec_env()))
+   return -ENOSYS;
+#endif


Another, more generic, approach would be:

if (!capable(CAP_SYS_ADMIN))

so that only processes with the admin cap in the init userns would be
able to do it. We probably don't want to allow this feature to non-root
on the host.



+   switch(opt) {
+   case PR_MEMALLOC_GET_FLAGS:
+   return current->flags & MEMALLOC_FLAGS_MASK;
+   case PR_MEMALLOC_SET_FLAGS:
+   if (flags & ~MEMALLOC_FLAGS_MASK)
+   return -EINVAL;
+   pflags = current->flags & ~MEMALLOC_FLAGS_MASK;
+   current->flags = pflags | flags;
+   return current->flags;
+   case PR_MEMALLOC_CLEAR_FLAGS:
+   if (flags & ~MEMALLOC_FLAGS_MASK)
+   return -EINVAL;
+   current->flags &= ~flags;
+   return current->flags;
+   }
+
+   return -EINVAL;
+}
+
  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
  {
@@ -2585,6 +2615,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_TASK_CT_FIELDS:
error = prctl_set_task_ct_fields(me, arg2, arg3);
break;
+   case PR_MEMALLOC_FLAGS:
+   error = prctl_memalloc_flags(arg2, arg3);
+   break;
default:
error = -EINVAL;
break;
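
For completeness, usage from userspace would look roughly like this (an
illustrative sketch; the PF_* value must match the kernel's
include/linux/sched.h):

	#include <sys/prctl.h>

	#define PR_MEMALLOC_FLAGS	1001
	#define PR_MEMALLOC_SET_FLAGS	2
	#define PF_MEMALLOC_NOIO	0x00080000  /* from linux/sched.h */

	/* Mark the calling task NOIO; returns the new flags, -1 on error. */
	int ret = prctl(PR_MEMALLOC_FLAGS, PR_MEMALLOC_SET_FLAGS,
			PF_MEMALLOC_NOIO, 0, 0);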


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH9 1/6] mount: rename do_set_group to do_set_group_old

2023-04-06 Thread Pavel Tikhomirov
We have a VZ-only feature to copy mount sharing between mounts via the
mount syscall, used to handle mount restore efficiently in CRIU u15-u19.

In mainstream there is now a similar feature available through the
move_mount syscall.

To support both old CRIU and new CRIU (which uses the mainstream API)
at the same time, let's fix the name collision and keep both variants
for now; several updates later we can drop the old mount-syscall-based
API.
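
For reference, the two user-visible entry points side by side (a rough
sketch; MS_SET_GROUP is the VZ-only flag from include/uapi/linux/fs.h):

	#include <sys/mount.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <fcntl.h>	/* AT_FDCWD */

	/* Old VZ-only API, kept for CRIU u15-u19: */
	mount(source_path, target_path, NULL, MS_SET_GROUP, NULL);

	/* New mainstream API: */
	syscall(SYS_move_mount, AT_FDCWD, source_path,
		AT_FDCWD, target_path, MOVE_MOUNT_SET_GROUP);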

https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 5ab41c9f09f0..99f54669929f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2950,7 +2950,7 @@ static bool check_for_nsfs_mounts(struct mount *subtree)
return ret;
 }
 
-static int do_set_group(struct path *path, const char *sibling_name)
+static int do_set_group_old(struct path *path, const char *sibling_name)
 {
struct ve_struct *ve = get_exec_env();
struct mount *sibling, *mnt;
@@ -3554,7 +3554,7 @@ int path_mount(const char *dev_name, struct path *path,
if (flags & MS_MOVE)
return do_move_mount_old(path, dev_name);
if (flags & MS_SET_GROUP)
-   return do_set_group(path, dev_name);
+   return do_set_group_old(path, dev_name);
 
return do_new_mount(path, type_page, sb_flags, mnt_flags, dev_name,
data_page);
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH9 5/6] selftests/move_mount_set_group remove unneeded conversion to bool

2023-04-06 Thread Pavel Tikhomirov
From: Yang Guang 

The coccinelle report
./tools/testing/selftests/move_mount_set_group/move_mount_set_group_test.c:225:18-23:
WARNING: conversion to bool not needed here
Relational and logical operators evaluate to bool,
explicit conversion is overly verbose and unneeded.

Reported-by: Zeal Robot 
Signed-off-by: Yang Guang 
Signed-off-by: Shuah Khan 

https://jira.sw.ru/browse/PSBM-144416
(cherry picked from commit 009482c0932ae2420277dc0adaefa5bd51cb433e)
Signed-off-by: Pavel Tikhomirov 
---
 .../move_mount_set_group/move_mount_set_group_test.c   | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git 
a/tools/testing/selftests/move_mount_set_group/move_mount_set_group_test.c 
b/tools/testing/selftests/move_mount_set_group/move_mount_set_group_test.c
index 860198f83a53..50ed5d475dd1 100644
--- a/tools/testing/selftests/move_mount_set_group/move_mount_set_group_test.c
+++ b/tools/testing/selftests/move_mount_set_group/move_mount_set_group_test.c
@@ -191,7 +191,7 @@ static bool is_shared_mount(const char *path)
 #define SET_GROUP_FROM "/tmp/move_mount_set_group_supported_from"
 #define SET_GROUP_TO   "/tmp/move_mount_set_group_supported_to"
 
-static int move_mount_set_group_supported(void)
+static bool move_mount_set_group_supported(void)
 {
int ret;
 
@@ -222,7 +222,7 @@ static int move_mount_set_group_supported(void)
  AT_FDCWD, SET_GROUP_TO, MOVE_MOUNT_SET_GROUP);
umount2("/tmp", MNT_DETACH);
 
-   return ret < 0 ? false : true;
+   return ret >= 0;
 }
 
 FIXTURE(move_mount_set_group) {
@@ -232,7 +232,7 @@ FIXTURE(move_mount_set_group) {
 
 FIXTURE_SETUP(move_mount_set_group)
 {
-   int ret;
+   bool ret;
 
ASSERT_EQ(prepare_unpriv_mountns(), 0);
 
@@ -254,7 +254,7 @@ FIXTURE_SETUP(move_mount_set_group)
 
 FIXTURE_TEARDOWN(move_mount_set_group)
 {
-   int ret;
+   bool ret;
 
ret = move_mount_set_group_supported();
ASSERT_GE(ret, 0);
@@ -348,7 +348,7 @@ TEST_F(move_mount_set_group, complex_sharing_copying)
.shared = false,
};
pid_t pid;
-   int ret;
+   bool ret;
 
ret = move_mount_set_group_supported();
ASSERT_GE(ret, 0);
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH9 6/6] selftests: move_mount_set_group: Fix incorrect kernel headers search path

2023-04-06 Thread Pavel Tikhomirov
From: Mathieu Desnoyers 

Use $(KHDR_INCLUDES) as lookup path for kernel headers. This prevents
building against kernel headers from the build environment in scenarios
where kernel headers are installed into a specific output directory
(O=...).

Signed-off-by: Mathieu Desnoyers 
Cc: Shuah Khan 
Cc: linux-kselft...@vger.kernel.org
Cc: Ingo Molnar 
Cc:   # 5.18+
Signed-off-by: Shuah Khan 

https://jira.sw.ru/browse/PSBM-144416
(cherry picked from commit 65c68af0131bfef8e395c325735b6c40638cb931)
Signed-off-by: Pavel Tikhomirov 
---
 tools/testing/selftests/move_mount_set_group/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/move_mount_set_group/Makefile 
b/tools/testing/selftests/move_mount_set_group/Makefile
index 80c2d86812b0..94235846b6f9 100644
--- a/tools/testing/selftests/move_mount_set_group/Makefile
+++ b/tools/testing/selftests/move_mount_set_group/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 # Makefile for mount selftests.
-CFLAGS = -g -I../../../../usr/include/ -Wall -O2
+CFLAGS = -g $(KHDR_INCLUDES) -Wall -O2
 
 TEST_GEN_FILES += move_mount_set_group_test
 
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH9 3/6] tools include UAPI: Update linux/mount.h copy

2023-04-06 Thread Pavel Tikhomirov
From: Arnaldo Carvalho de Melo 

To pick the changes from:

  9ffb14ef61bab83f ("move_mount: allow to add a mount into an existing group")

That ends up adding support for the new MOVE_MOUNT_SET_GROUP move_mount
flag.

  $ tools/perf/trace/beauty/move_mount_flags.sh > before
  $ cp include/uapi/linux/mount.h tools/include/uapi/linux/mount.h
  $ tools/perf/trace/beauty/move_mount_flags.sh > after
  $ diff -u before after
  --- before2021-09-10 12:28:43.865279808 -0300
  +++ after 2021-09-10 12:28:50.183429184 -0300
  @@ -5,4 +5,5 @@
[ilog2(0x0010) + 1] = "T_SYMLINKS",
[ilog2(0x0020) + 1] = "T_AUTOMOUNTS",
[ilog2(0x0040) + 1] = "T_EMPTY_PATH",
  + [ilog2(0x0100) + 1] = "SET_GROUP",
   };
  $

So now one can use it in --filter expressions for tracepoints.

This silences this perf build warnings:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/mount.h' differs from 
latest version at 'include/uapi/linux/mount.h'
  diff -u tools/include/uapi/linux/mount.h include/uapi/linux/mount.h

Cc: Christian Brauner 
Cc: Pavel Tikhomirov 
Signed-off-by: Arnaldo Carvalho de Melo 

https://jira.sw.ru/browse/PSBM-144416
(cherry picked from commit 37ce9e4fc596cf10a4d32ced741679bd1b4fa7a5)
Signed-off-by: Pavel Tikhomirov 
---
 tools/include/uapi/linux/mount.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/mount.h b/tools/include/uapi/linux/mount.h
index dd7a166fdf9c..4d93967f8aea 100644
--- a/tools/include/uapi/linux/mount.h
+++ b/tools/include/uapi/linux/mount.h
@@ -73,7 +73,8 @@
 #define MOVE_MOUNT_T_SYMLINKS		0x00000010 /* Follow symlinks on to path */
 #define MOVE_MOUNT_T_AUTOMOUNTS	0x00000020 /* Follow automounts on to path */
 #define MOVE_MOUNT_T_EMPTY_PATH	0x00000040 /* Empty to path permitted */
-#define MOVE_MOUNT__MASK		0x00000077
+#define MOVE_MOUNT_SET_GROUP		0x00000100 /* Set sharing group instead */
+#define MOVE_MOUNT__MASK		0x00000177
 
 /*
  * fsopen() flags.
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH9 0/6] port move_mount_set_group related patches

2023-04-06 Thread Pavel Tikhomirov
We need this because in Virtuozzo CRIU, after the rebase to mainstream
CRIU in u20, we will switch to this new API for setting sharing groups
across mounts.

https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 

Arnaldo Carvalho de Melo (1):
  tools include UAPI: Update linux/mount.h copy

Mathieu Desnoyers (1):
  selftests: move_mount_set_group: Fix incorrect kernel headers search
path

Pavel Tikhomirov (3):
  mount: rename do_set_group to do_set_group_old
  move_mount: allow to add a mount into an existing group
  tests: add move_mount(MOVE_MOUNT_SET_GROUP) selftest

Yang Guang (1):
  selftests/move_mount_set_group remove unneeded conversion to bool

 fs/namespace.c|  81 +++-
 include/uapi/linux/mount.h|   3 +-
 tools/include/uapi/linux/mount.h  |   3 +-
 tools/testing/selftests/Makefile  |   1 +
 .../selftests/move_mount_set_group/.gitignore |   1 +
 .../selftests/move_mount_set_group/Makefile   |   7 +
 .../selftests/move_mount_set_group/config |   1 +
 .../move_mount_set_group_test.c   | 375 ++
 8 files changed, 467 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/move_mount_set_group/.gitignore
 create mode 100644 tools/testing/selftests/move_mount_set_group/Makefile
 create mode 100644 tools/testing/selftests/move_mount_set_group/config
 create mode 100644 
tools/testing/selftests/move_mount_set_group/move_mount_set_group_test.c

-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH9 4/6] tests: add move_mount(MOVE_MOUNT_SET_GROUP) selftest

2023-04-06 Thread Pavel Tikhomirov
Add a simple selftest for move_mount(MOVE_MOUNT_SET_GROUP). It tests
that, while in the parent userns, one can copy sharing from a mount in
one nested mntns (owned by one nested userns) to a mount in another
nested mntns (owned by another nested userns).

  TAP version 13
  1..1
  # Starting 1 tests from 2 test cases.
  #  RUN   move_mount_set_group.complex_sharing_copying ...
  #OK  move_mount_set_group.complex_sharing_copying
  ok 1 move_mount_set_group.complex_sharing_copying
  # PASSED: 1 / 1 tests passed.
  # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

Link: 
https://lore.kernel.org/r/20210715100714.120228-2-ptikhomi...@virtuozzo.com
Cc: Shuah Khan 
Cc: Eric W. Biederman 
Cc: Alexander Viro 
Cc: Christian Brauner 
Cc: Mattias Nissler 
Cc: Aleksa Sarai 
Cc: Andrei Vagin 
Cc: linux-fsde...@vger.kernel.org
Cc: linux-kselft...@vger.kernel.org
Cc: lkml 
Signed-off-by: Pavel Tikhomirov 
Signed-off-by: Christian Brauner 

https://jira.sw.ru/browse/PSBM-144416
(cherry picked from commit 8374f43123a5957326095d108a12c49ae509624f)
Signed-off-by: Pavel Tikhomirov 
---
 tools/testing/selftests/Makefile  |   1 +
 .../selftests/move_mount_set_group/.gitignore |   1 +
 .../selftests/move_mount_set_group/Makefile   |   7 +
 .../selftests/move_mount_set_group/config |   1 +
 .../move_mount_set_group_test.c   | 375 ++
 5 files changed, 385 insertions(+)
 create mode 100644 tools/testing/selftests/move_mount_set_group/.gitignore
 create mode 100644 tools/testing/selftests/move_mount_set_group/Makefile
 create mode 100644 tools/testing/selftests/move_mount_set_group/config
 create mode 100644 
tools/testing/selftests/move_mount_set_group/move_mount_set_group_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index fb010a35d61a..dd0388eab94d 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -35,6 +35,7 @@ TARGETS += memory-hotplug
 TARGETS += mincore
 TARGETS += mount
 TARGETS += mount_setattr
+TARGETS += move_mount_set_group
 TARGETS += mqueue
 TARGETS += nci
 TARGETS += net
diff --git a/tools/testing/selftests/move_mount_set_group/.gitignore 
b/tools/testing/selftests/move_mount_set_group/.gitignore
new file mode 100644
index ..f5e339268720
--- /dev/null
+++ b/tools/testing/selftests/move_mount_set_group/.gitignore
@@ -0,0 +1 @@
+move_mount_set_group_test
diff --git a/tools/testing/selftests/move_mount_set_group/Makefile 
b/tools/testing/selftests/move_mount_set_group/Makefile
new file mode 100644
index ..80c2d86812b0
--- /dev/null
+++ b/tools/testing/selftests/move_mount_set_group/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for mount selftests.
+CFLAGS = -g -I../../../../usr/include/ -Wall -O2
+
+TEST_GEN_FILES += move_mount_set_group_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/move_mount_set_group/config 
b/tools/testing/selftests/move_mount_set_group/config
new file mode 100644
index ..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/move_mount_set_group/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git 
a/tools/testing/selftests/move_mount_set_group/move_mount_set_group_test.c 
b/tools/testing/selftests/move_mount_set_group/move_mount_set_group_test.c
new file mode 100644
index ..860198f83a53
--- /dev/null
+++ b/tools/testing/selftests/move_mount_set_group/move_mount_set_group_test.c
@@ -0,0 +1,375 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest_harness.h"
+
+#ifndef CLONE_NEWNS
+#define CLONE_NEWNS 0x00020000
+#endif
+
+#ifndef CLONE_NEWUSER
+#define CLONE_NEWUSER 0x10000000
+#endif
+
+#ifndef MS_SHARED
+#define MS_SHARED (1 << 20)
+#endif
+
+#ifndef MS_PRIVATE
+#define MS_PRIVATE (1<<18)
+#endif
+
+#ifndef MOVE_MOUNT_SET_GROUP
+#define MOVE_MOUNT_SET_GROUP 0x00000100
+#endif
+
+#ifndef MOVE_MOUNT_F_EMPTY_PATH
+#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
+#endif
+
+#ifndef MOVE_MOUNT_T_EMPTY_PATH
+#define MOVE_MOUNT_T_EMPTY_PATH 0x00000040
+#endif
+
+static ssize_t write_nointr(int fd, const void *buf, size_t count)
+{
+   ssize_t ret;
+
+   do {
+   ret = write(fd, buf, count);
+   } while (ret < 0 && errno == EINTR);
+
+   return ret;
+}
+
+static int write_file(const char *path, const void *buf, size_t count)
+{
+   int fd;
+   ssize_t ret;
+
+   fd = open(path, O_WRONLY | O_CLOEXEC | O_NOCTTY | O_NOFOLLOW);
+   if (fd < 0)
+   return -1;
+
+   ret = write_nointr(fd, buf, count);
+   close(fd);
+   if (ret < 0 || (size_t)ret != count)
+   return -1;
+
+   return 0;
+}
+
+static int create_and_enter_userns(void)
+{
+   uid_t uid;
+   gid_t gi

[Devel] [PATCH RH9 2/6] move_mount: allow to add a mount into an existing group

2023-04-06 Thread Pavel Tikhomirov
Previously a sharing group (a pair of shared and master ids) could only
be inherited when a mount was created via bindmount. This patch adds the
ability to add an existing private mount into an existing sharing group.

With this functionality one can first create the desired mount tree from
only private mounts (without the need to care about undesired mount
propagation or the mount creation order implied by sharing group
dependencies), and then set up any desired mount sharing between the
mounts in the tree as needed.

This allows CRIU to restore any set of mount namespaces, mount trees and
sharing group trees for a container.

We have many issues with restoring mounts in CRIU related to sharing
groups and propagation:
- reverse sharing groups vs mount tree order requires complex mounts
  reordering which mostly implies also using some temporary mounts
(please see https://lkml.org/lkml/2021/3/23/569 for more info)

- mount() syscall creates tons of mounts due to propagation
- mount re-parenting due to propagation
- "Mount Trap" due to propagation
- "Non Uniform" propagation, meaning that with different tricks with
  mount order and temporary children-"lock" mounts one can create mount
  trees which can't be restored without those tricks
(see https://www.linuxplumbersconf.org/event/7/contributions/640/)

With this new functionality we can resolve all the problems with
propagation at once.
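
As a sketch of how a restorer can use this (illustrative only, with
made-up paths):

	#include <sys/mount.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <fcntl.h>

	/* 1) Build the tree from private mounts, in any convenient order. */
	mount("/dev/sda1", "/restore/a", "ext4", 0, NULL);
	mount("/dev/sda1", "/restore/b", "ext4", 0, NULL);

	/* 2) Recreate sharing: make one mount shared, then copy its group
	 *    onto the other (same superblock, "to" is still private). */
	mount(NULL, "/restore/a", NULL, MS_SHARED, NULL);
	syscall(SYS_move_mount, AT_FDCWD, "/restore/a",
		AT_FDCWD, "/restore/b", MOVE_MOUNT_SET_GROUP);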

Link: 
https://lore.kernel.org/r/20210715100714.120228-1-ptikhomi...@virtuozzo.com
Cc: Eric W. Biederman 
Cc: Alexander Viro 
Cc: Christian Brauner 
Cc: Mattias Nissler 
Cc: Aleksa Sarai 
Cc: Andrei Vagin 
Cc: linux-fsde...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: lkml 
Co-developed-by: Andrei Vagin 
Acked-by: Christian Brauner 
Signed-off-by: Pavel Tikhomirov 
Signed-off-by: Andrei Vagin 
Signed-off-by: Christian Brauner 

https://jira.sw.ru/browse/PSBM-144416
(cherry picked from commit 9ffb14ef61bab83fa818736bf3e7e6b6e182e8e2)
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 77 +-
 include/uapi/linux/mount.h |  3 +-
 2 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 99f54669929f..568414031b27 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3006,6 +3006,78 @@ static int do_set_group_old(struct path *path, const char *sibling_name)
return err;
 }
 
+static int do_set_group(struct path *from_path, struct path *to_path)
+{
+   struct mount *from, *to;
+   int err;
+
+   from = real_mount(from_path->mnt);
+   to = real_mount(to_path->mnt);
+
+   namespace_lock();
+
+   err = -EINVAL;
+   /* To and From must be mounted */
+   if (!is_mounted(&from->mnt))
+   goto out;
+   if (!is_mounted(&to->mnt))
+   goto out;
+
+   err = -EPERM;
+   /* We should be allowed to modify mount namespaces of both mounts */
+   if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
+   goto out;
+   if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
+   goto out;
+
+   err = -EINVAL;
+   /* To and From paths should be mount roots */
+   if (from_path->dentry != from_path->mnt->mnt_root)
+   goto out;
+   if (to_path->dentry != to_path->mnt->mnt_root)
+   goto out;
+
+   /* Setting sharing groups is only allowed across same superblock */
+   if (from->mnt.mnt_sb != to->mnt.mnt_sb)
+   goto out;
+
+   /* From mount root should be wider than To mount root */
+   if (!is_subdir(to->mnt.mnt_root, from->mnt.mnt_root))
+   goto out;
+
+   /* From mount should not have locked children in place of To's root */
+   if (has_locked_children(from, to->mnt.mnt_root))
+   goto out;
+
+   /* Setting sharing groups is only allowed on private mounts */
+   if (IS_MNT_SHARED(to) || IS_MNT_SLAVE(to))
+   goto out;
+
+   /* From should not be private */
+   if (!IS_MNT_SHARED(from) && !IS_MNT_SLAVE(from))
+   goto out;
+
+   if (IS_MNT_SLAVE(from)) {
+   struct mount *m = from->mnt_master;
+
+   list_add(&to->mnt_slave, &m->mnt_slave_list);
+   to->mnt_master = m;
+   }
+
+   if (IS_MNT_SHARED(from)) {
+   to->mnt_group_id = from->mnt_group_id;
+   list_add(&to->mnt_share, &from->mnt_share);
+   lock_mount_hash();
+   set_mnt_shared(to);
+   unlock_mount_hash();
+   }
+
+   err = 0;
+out:
+   namespace_unlock();
+   return err;
+}
+
 static int do_move_mount(struct path *old_path, struct path *new_path)
 {
struct mnt_namespace *ns;
@@ -4004,7 +4076,10 @@ SYSCALL_DEFINE5(move_mount,
if (ret < 0)
goto ou

[Devel] [PATCH RH7 14/14] fs: drop peer group ids under namespace lock

2023-04-13 Thread Pavel Tikhomirov
From: Christian Brauner 

When cleaning up peer group ids in the failure path we need to make sure
to hold on to the namespace lock. Otherwise another thread might just
turn the mount from a shared into a non-shared mount concurrently.

Link: https://lore.kernel.org/lkml/88694505f8132...@google.com
Fixes: 2a1867219c7b ("fs: add mount_setattr()")
Reported-by: syzbot+8ac3859139c685c4f...@syzkaller.appspotmail.com
Cc: sta...@vger.kernel.org # 5.12+
Message-Id: 
<20230330-vfs-mount_setattr-propagation-fix-v1-1-37548d915...@kernel.org>
Signed-off-by: Christian Brauner 

(cherry picked from commit cb2239c198ad9fbd5aced22cf93e45562da781eb)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index f37cae055dbf..49d972024249 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4280,9 +4280,9 @@ static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
unlock_mount_hash();
 
if (kattr->propagation) {
-   namespace_unlock();
if (err)
cleanup_group_ids(mnt, NULL);
+   namespace_unlock();
}
 
return err;
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 11/14] namespace: only take read lock in do_reconfigure_mnt()

2023-04-13 Thread Pavel Tikhomirov
From: Christian Brauner 

do_reconfigure_mnt() used to take the down_write(&sb->s_umount) lock
which seems unnecessary since we're not changing the superblock. We're
only checking whether it is already read-only. Setting other mount
attributes is protected by lock_mount_hash() afaict and not by s_umount.

The history of down_write(&sb->s_umount) lock being taken when setting
mount attributes dates back to the introduction of MNT_READONLY in [2].
This introduced the concept of having read-only mounts in contrast to
just having a read-only superblock. When it got introduced it was simply
plumbed into do_remount() which already took down_write(&sb->s_umount)
because it was only used to actually change the superblock before [2].
Afaict, it would've already been possible back then to only use
down_read(&sb->s_umount) for MS_BIND | MS_REMOUNT since actual mount
options were protected by the vfsmount lock already. But that would've
meant special casing the locking for MS_BIND | MS_REMOUNT in
do_remount() which people might not have considered worth it.
Then in [1] MS_BIND | MS_REMOUNT mount option changes were split out of
do_remount() into do_reconfigure_mnt() but the down_write(&sb->s_umount)
lock was simply copied over.
Now that we have this be a separate helper only take the
down_read(&sb->s_umount) lock since we're only interested in checking
whether the super block is currently read-only and blocking any writers
from changing it. Essentially, checking that the super block is
read-only has the advantage that we can avoid having to go into the
slowpath and through MNT_WRITE_HOLD and can simply set the read-only
flag on the mount in set_mount_attributes().

[1]: commit 43f5e655eff7 ("vfs: Separate changing mount flags full remount")
[2]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts 
at remount")

Link: 
https://lore.kernel.org/r/20210121131959.646623-32-christian.brau...@ubuntu.com
Cc: David Howells 
Cc: Al Viro 
Cc: linux-fsde...@vger.kernel.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Christian Brauner 

(cherry picked from commit e58ace1a0fa9d578f85f556b4b88c5fe9b871d08)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9e58a2ff16ea..c03e21575d08 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2701,10 +2701,6 @@ static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
return 0;
 }
 
-/*
- * Update the user-settable attributes on a mount.  The caller must hold
- * sb->s_umount for writing.
- */
 static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
 {
mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
@@ -2732,13 +2728,17 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
if (!can_change_locked_flags(mnt, mnt_flags))
return -EPERM;
 
-   down_write(&sb->s_umount);
+   /*
+* We're only checking whether the superblock is read-only not
+* changing it, so only take down_read(&sb->s_umount).
+*/
+   down_read(&sb->s_umount);
lock_mount_hash();
ret = change_mount_ro_state(mnt, mnt_flags);
if (ret == 0)
set_mount_attributes(mnt, mnt_flags);
unlock_mount_hash();
-   up_write(&sb->s_umount);
+   up_read(&sb->s_umount);
return ret;
 }
 
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 01/14] mount: rename do_set_group to do_set_group_old

2023-04-13 Thread Pavel Tikhomirov
We have a VZ-only feature to copy mount sharing between mounts via the
mount syscall, used to handle mount restore efficiently in CRIU u15-u19.

In mainstream there is now a similar feature available through the
move_mount syscall.

To support both old CRIU and new CRIU (which uses the mainstream API)
at the same time, let's fix the name collision and keep both variants
for now; several updates later we can drop the old mount-syscall-based
API.

https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index fcfe15ed28f2..94f1e308b354 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2939,7 +2939,7 @@ static inline int tree_contains_unbindable(struct mount *mnt)
return 0;
 }
 
-static int do_set_group(struct path *path, const char *sibling_name)
+static int do_set_group_old(struct path *path, const char *sibling_name)
 {
struct ve_struct *ve = get_exec_env();
struct mount *sibling, *mnt;
@@ -3525,7 +3525,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
else if (cmd & MS_MOVE)
retval = do_move_mount_old(&path, dev_name);
else if (cmd & MS_SET_GROUP)
-   retval = do_set_group(&path, dev_name);
+   retval = do_set_group_old(&path, dev_name);
else
retval = do_new_mount(&path, type_page, flags, mnt_flags,
  dev_name, data_page);
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 12/14] fs: split out functions to hold writers

2023-04-13 Thread Pavel Tikhomirov
From: Christian Brauner 

When a mount is marked read-only we set MNT_WRITE_HOLD on it if there
aren't currently any active writers. Split this logic out into simple
helpers that we can use in follow-up patches.

Link: 
https://lore.kernel.org/r/20210121131959.646623-33-christian.brau...@ubuntu.com
Cc: David Howells 
Cc: Al Viro 
Cc: linux-fsde...@vger.kernel.org
Suggested-by: Christoph Hellwig 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Christian Brauner 

(cherry picked from commit fbdc2f6c40f6528fa0db79c73e844451234f3e26)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c03e21575d08..a40a217f9871 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -570,10 +570,8 @@ void mnt_drop_write_file(struct file *file)
 }
 EXPORT_SYMBOL(mnt_drop_write_file);
 
-static int mnt_make_readonly(struct mount *mnt)
+static inline int mnt_hold_writers(struct mount *mnt)
 {
-   int ret = 0;
-
mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
/*
 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -598,15 +596,29 @@ static int mnt_make_readonly(struct mount *mnt)
 * we're counting up here.
 */
if (mnt_get_writers(mnt) > 0)
-   ret = -EBUSY;
-   else
-   mnt->mnt.mnt_flags |= MNT_READONLY;
+   return -EBUSY;
+
+   return 0;
+}
+
+static inline void mnt_unhold_writers(struct mount *mnt)
+{
/*
 * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
 * that become unheld will see MNT_READONLY.
 */
smp_wmb();
mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
+}
+
+static int mnt_make_readonly(struct mount *mnt)
+{
+   int ret;
+
+   ret = mnt_hold_writers(mnt);
+   if (!ret)
+   mnt->mnt.mnt_flags |= MNT_READONLY;
+   mnt_unhold_writers(mnt);
return ret;
 }
 
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 06/14] VFS: Handle lazytime in do_mount()

2023-04-13 Thread Pavel Tikhomirov
From: Markus Trippelsdorf 

Since commit e462ec50cb5fa ("VFS: Differentiate mount flags (MS_*) from
internal superblock flags") the lazytime mount option doesn't get passed
on anymore.

Fix the issue by handling the option in do_mount().

Reviewed-by: Lukas Czerner 
Signed-off-by: Markus Trippelsdorf 
Signed-off-by: Al Viro 

(cherry picked from commit d7ee946942bdd12394809305e3df05aa4c8b7b8f)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index b60e2eab0d85..33ccc11cf327 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3587,6 +3587,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
SB_DIRSYNC |
SB_SILENT |
SB_POSIXACL |
+   SB_LAZYTIME |
SB_I_VERSION);
 
if (flags & MS_REMOUNT)
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 02/14] move_mount: allow to add a mount into an existing group

2023-04-13 Thread Pavel Tikhomirov
Previously a sharing group (a pair of shared and master ids) could only
be inherited when a mount was created via bindmount. This patch adds the
ability to add an existing private mount into an existing sharing group.

With this functionality one can first create the desired mount tree from
only private mounts (without the need to care about undesired mount
propagation or the mount creation order implied by sharing group
dependencies), and then set up any desired mount sharing between the
mounts in the tree as needed.

This allows CRIU to restore any set of mount namespaces, mount trees and
sharing group trees for a container.

We have many issues with restoring mounts in CRIU related to sharing
groups and propagation:
- reverse sharing groups vs mount tree order requires complex mounts
  reordering which mostly implies also using some temporary mounts
(please see https://lkml.org/lkml/2021/3/23/569 for more info)

- mount() syscall creates tons of mounts due to propagation
- mount re-parenting due to propagation
- "Mount Trap" due to propagation
- "Non Uniform" propagation, meaning that with different tricks with
  mount order and temporary children-"lock" mounts one can create mount
  trees which can't be restored without those tricks
(see https://www.linuxplumbersconf.org/event/7/contributions/640/)

With this new functionality we can resolve all the problems with
propagation at once.

Link: 
https://lore.kernel.org/r/20210715100714.120228-1-ptikhomi...@virtuozzo.com
Cc: Eric W. Biederman 
Cc: Alexander Viro 
Cc: Christian Brauner 
Cc: Mattias Nissler 
Cc: Aleksa Sarai 
Cc: Andrei Vagin 
Cc: linux-fsde...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: lkml 
Co-developed-by: Andrei Vagin 
Acked-by: Christian Brauner 
Signed-off-by: Pavel Tikhomirov 
Signed-off-by: Andrei Vagin 
Signed-off-by: Christian Brauner 

https://jira.sw.ru/browse/PSBM-144416
(cherry picked from commit 9ffb14ef61bab83fa818736bf3e7e6b6e182e8e2)
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c  | 77 -
 include/uapi/linux/fs.h |  3 +-
 2 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 94f1e308b354..d10138869c91 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3017,6 +3017,78 @@ static bool check_for_nsfs_mounts(struct mount *subtree)
return ret;
 }
 
+static int do_set_group(struct path *from_path, struct path *to_path)
+{
+   struct mount *from, *to;
+   int err;
+
+   from = real_mount(from_path->mnt);
+   to = real_mount(to_path->mnt);
+
+   namespace_lock();
+
+   err = -EINVAL;
+   /* To and From must be mounted */
+   if (!is_mounted(&from->mnt))
+   goto out;
+   if (!is_mounted(&to->mnt))
+   goto out;
+
+   err = -EPERM;
+   /* We should be allowed to modify mount namespaces of both mounts */
+   if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
+   goto out;
+   if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
+   goto out;
+
+   err = -EINVAL;
+   /* To and From paths should be mount roots */
+   if (from_path->dentry != from_path->mnt->mnt_root)
+   goto out;
+   if (to_path->dentry != to_path->mnt->mnt_root)
+   goto out;
+
+   /* Setting sharing groups is only allowed across same superblock */
+   if (from->mnt.mnt_sb != to->mnt.mnt_sb)
+   goto out;
+
+   /* From mount root should be wider than To mount root */
+   if (!is_subdir(to->mnt.mnt_root, from->mnt.mnt_root))
+   goto out;
+
+   /* From mount should not have locked children in place of To's root */
+   if (has_locked_children(from, to->mnt.mnt_root))
+   goto out;
+
+   /* Setting sharing groups is only allowed on private mounts */
+   if (IS_MNT_SHARED(to) || IS_MNT_SLAVE(to))
+   goto out;
+
+   /* From should not be private */
+   if (!IS_MNT_SHARED(from) && !IS_MNT_SLAVE(from))
+   goto out;
+
+   if (IS_MNT_SLAVE(from)) {
+   struct mount *m = from->mnt_master;
+
+   list_add(&to->mnt_slave, &m->mnt_slave_list);
+   to->mnt_master = m;
+   }
+
+   if (IS_MNT_SHARED(from)) {
+   to->mnt_group_id = from->mnt_group_id;
+   list_add(&to->mnt_share, &from->mnt_share);
+   lock_mount_hash();
+   set_mnt_shared(to);
+   unlock_mount_hash();
+   }
+
+   err = 0;
+out:
+   namespace_unlock();
+   return err;
+}
+
 static int do_move_mount(struct path *old_path, struct path *new_path)
 {
struct path parent_path = {.mnt = NULL, .dentry = NULL};
@@ -3805,7 +3877,10 @@ SYSCALL_DEFINE5(move_mount,
if (ret < 0)
 

[Devel] [PATCH RH7 13/14] fs: add mount_setattr()

2023-04-13 Thread Pavel Tikhomirov
From: Christian Brauner 

This implements the missing mount_setattr() syscall. While the new mount
api allows to change the properties of a superblock there is currently
no way to change the properties of a mount or a mount tree using file
descriptors which the new mount api is based on. In addition the old
mount api has the restriction that mount options cannot be applied
recursively. This hasn't changed since changing mount options on a
per-mount basis was implemented in [1] and has been a frequent request
not just for convenience but also for security reasons. The legacy
mount syscall is unable to accommodate this behavior without introducing
a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
mount. Changing MS_REC to apply to the whole mount tree would mean
introducing a significant uapi change and would likely cause significant
regressions.

The new mount_setattr() syscall allows to recursively clear and set
mount options in one shot. Multiple calls to change mount options
requesting the same changes are idempotent:

int mount_setattr(int dfd, const char *path, unsigned flags,
  struct mount_attr *uattr, size_t usize);

Flags to modify path resolution behavior are specified in the @flags
argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
restrict path resolution as introduced with openat2() might be supported
in the future.

The mount_setattr() syscall can be expected to grow over time and is
designed with extensibility in mind. It follows the extensible syscall
pattern we have used with other syscalls such as openat2(), clone3(),
sched_{set,get}attr(), and others.
The set of mount options is passed in the uapi struct mount_attr which
currently has the following layout:

struct mount_attr {
__u64 attr_set;
__u64 attr_clr;
__u64 propagation;
__u64 userns_fd;
};

The @attr_set and @attr_clr members are used to clear and set mount
options. This way a user can e.g. request that a set of flags is to be
raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
@attr_set while at the same time requesting that another set of flags is
to be lowered such as removing noexec from a mount tree by specifying
MOUNT_ATTR_NOEXEC in @attr_clr.

Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
not a bitmap, users wanting to transition to a different atime setting
cannot simply specify the atime setting in @attr_set, but must also
specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
@attr_clr.

The @propagation field lets callers specify the propagation type of a
mount tree. Propagation is a single property that has four different
settings and as such is not really a flag argument but an enum.
Specifically, it would be unclear what setting and clearing propagation
settings in combination would amount to. The legacy mount() syscall thus
forbids the combination of multiple propagation settings too. The goal
is to keep the semantics of mount propagation somewhat simple as they
are overly complex as it is.

The @userns_fd field lets user specify a user namespace whose idmapping
becomes the idmapping of the mount. This is implemented and explained in
detail in the next patch.
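
To make the intended use concrete, a hypothetical caller making a whole
tree read-only while clearing noexec could look as follows (a sketch
only: the struct and constants are local copies, assuming pre-5.12 uapi
headers that lack them, and 442 is the syscall number from the tables
below):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/types.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local copies of the uapi bits (values per upstream; assumption). */
struct mount_attr {
	__u64 attr_set;
	__u64 attr_clr;
	__u64 propagation;
	__u64 userns_fd;
};
#ifndef MOUNT_ATTR_RDONLY
#define MOUNT_ATTR_RDONLY	0x00000001
#endif
#ifndef MOUNT_ATTR_NOEXEC
#define MOUNT_ATTR_NOEXEC	0x00000008
#endif
#ifndef AT_RECURSIVE
#define AT_RECURSIVE		0x8000
#endif

int main(void)
{
	struct mount_attr attr = {
		.attr_set = MOUNT_ATTR_RDONLY,	/* raise ro ...      */
		.attr_clr = MOUNT_ATTR_NOEXEC,	/* ... lower noexec  */
	};

	/* Apply recursively to the whole tree below /mnt. */
	return syscall(442 /* __NR_mount_setattr */, AT_FDCWD, "/mnt",
		       AT_RECURSIVE, &attr, sizeof(attr)) ? 1 : 0;
}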

[1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts 
at remount")

Link: 
https://lore.kernel.org/r/20210121131959.646623-35-christian.brau...@ubuntu.com
Cc: David Howells 
Cc: Aleksa Sarai 
Cc: Al Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Christian Brauner 

Changes: port syscall for x86 only, drop uapi hunks and ignore
MNT_NOSYMFOLLOW as it is not yet supported
(cherry picked from commit 2a1867219c7b27f928e2545782b86daaf9ad50bd)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/namespace.c   | 312 +++
 3 files changed, 314 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 978f07cb0ea1..7978137648ed 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -373,6 +373,7 @@
 
 428	i386	open_tree		sys_open_tree
 429	i386	move_mount		sys_move_mount
+442	i386	mount_setattr		sys_mount_setattr
 
 510	i386	getluid			sys_getluid
 511	i386	setluid			sys_setluid
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 3c86abac9a50..7175e94309fd 100644
--- a/arch/x86/syscalls

[Devel] [PATCH RH7 05/14] vfs: fix mounting a filesystem with i_version

2023-04-13 Thread Pavel Tikhomirov
From: Mimi Zohar 

The mount i_version flag is not enabled in the new sb_flags.  This patch
adds the missing SB_I_VERSION flag.

Fixes: e462ec5 "VFS: Differentiate mount flags (MS_*) from internal
   superblock flags"
Signed-off-by: Mimi Zohar 
Signed-off-by: Al Viro 

(cherry picked from commit 917086ff231f614e6705927d8fe3eb6aa74b21bf)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 1a62ab03692e..b60e2eab0d85 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3586,7 +3586,8 @@ long do_mount(const char *dev_name, const char __user 
*dir_name,
SB_MANDLOCK |
SB_DIRSYNC |
SB_SILENT |
-   SB_POSIXACL);
+   SB_POSIXACL |
+   SB_I_VERSION);
 
if (flags & MS_REMOUNT)
retval = do_remount(&path, flags, sb_flags, mnt_flags,
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 08/14] vfs: Separate changing mount flags full remount

2023-04-13 Thread Pavel Tikhomirov
From: David Howells 

Separate just the changing of mount flags (MS_REMOUNT|MS_BIND) from full
remount because the mount data will get parsed with the new fs_context
stuff prior to doing a remount - and this causes the syscall to fail under
some circumstances.

To quote Eric's explanation:

  [...] mount(..., MS_REMOUNT|MS_BIND, ...) now validates the mount options
  string, which breaks systemd unit files with ProtectControlGroups=yes
  (e.g.  systemd-networkd.service) when systemd does the following to
  change a cgroup (v1) mount to read-only:

mount(NULL, "/run/systemd/unit-root/sys/fs/cgroup/systemd", NULL,
  MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL)

  ... when the kernel has CONFIG_CGROUPS=y but no cgroup subsystems
  enabled, since in that case the error "cgroup1: Need name or subsystem
  set" is hit when the mount options string is empty.

  Probably it doesn't make sense to validate the mount options string at
  all in the MS_REMOUNT|MS_BIND case, though maybe you had something else
  in mind.

This is also worthwhile doing because we will need to add a mount_setattr()
syscall to take over the remount-bind function.

Reported-by: Eric Biggers 
Signed-off-by: David Howells 
Signed-off-by: Al Viro 
Reviewed-by: David Howells 

(cherry picked from commit 43f5e655eff7e124d4e484515689cba374ab698e)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c| 146 ++
 include/linux/mount.h |   2 +-
 2 files changed, 93 insertions(+), 55 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 10e294cb8cda..1b7b71c46a22 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -285,13 +285,9 @@ static struct mount *alloc_vfsmnt(const char *name)
  * mnt_want/drop_write() will _keep_ the filesystem
  * r/w.
  */
-int __mnt_is_readonly(struct vfsmount *mnt)
+bool __mnt_is_readonly(struct vfsmount *mnt)
 {
-   if (mnt->mnt_flags & MNT_READONLY)
-   return 1;
-   if (mnt->mnt_sb->s_flags & MS_RDONLY)
-   return 1;
-   return 0;
+   return (mnt->mnt_flags & MNT_READONLY) || (mnt->mnt_sb->s_flags & 
MS_RDONLY);
 }
 EXPORT_SYMBOL_GPL(__mnt_is_readonly);
 
@@ -606,11 +602,12 @@ static int mnt_make_readonly(struct mount *mnt)
return ret;
 }
 
-static void __mnt_unmake_readonly(struct mount *mnt)
+static int __mnt_unmake_readonly(struct mount *mnt)
 {
lock_mount_hash();
mnt->mnt.mnt_flags &= ~MNT_READONLY;
unlock_mount_hash();
+   return 0;
 }
 
 int sb_prepare_remount_readonly(struct super_block *sb)
@@ -2657,21 +2654,91 @@ SYSCALL_DEFINE3(open_tree, int, dfd, const char __user 
*, filename, unsigned, fl
return fd;
 }
 
-static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
+/*
+ * Don't allow locked mount flags to be cleared.
+ *
+ * No locks need to be held here while testing the various MNT_LOCK
+ * flags because those flags can never be cleared once they are set.
+ */
+static bool can_change_locked_flags(struct mount *mnt, unsigned int mnt_flags)
+{
+   unsigned int fl = mnt->mnt.mnt_flags;
+
+   if ((fl & MNT_LOCK_READONLY) &&
+   !(mnt_flags & MNT_READONLY))
+   return false;
+
+   if ((fl & MNT_LOCK_NODEV) &&
+   !(mnt_flags & MNT_NODEV))
+   return false;
+
+   if ((fl & MNT_LOCK_NOSUID) &&
+   !(mnt_flags & MNT_NOSUID))
+   return false;
+
+   if ((fl & MNT_LOCK_NOEXEC) &&
+   !(mnt_flags & MNT_NOEXEC))
+   return false;
+
+   if ((fl & MNT_LOCK_ATIME) &&
+   ((fl & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK)))
+   return false;
+
+   return true;
+}
+
+static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
 {
-   int error = 0;
-   int readonly_request = 0;
+   bool readonly_request = (mnt_flags & MNT_READONLY);
 
-   if (ms_flags & MS_RDONLY)
-   readonly_request = 1;
-   if (readonly_request == __mnt_is_readonly(mnt))
+   if (readonly_request == __mnt_is_readonly(&mnt->mnt))
return 0;
 
if (readonly_request)
-   error = mnt_make_readonly(real_mount(mnt));
-   else
-   __mnt_unmake_readonly(real_mount(mnt));
-   return error;
+   return mnt_make_readonly(mnt);
+
+   return __mnt_unmake_readonly(mnt);
+}
+
+/*
+ * Update the user-settable attributes on a mount.  The caller must hold
+ * sb->s_umount for writing.
+ */
+static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
+{
+   lock_mount_hash();
+   mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
+   mnt->mnt.mnt_flags = mnt_flags;
+   touch_m

[Devel] [PATCH RH7 04/14] VFS: Differentiate mount flags (MS_*) from internal superblock flags

2023-04-13 Thread Pavel Tikhomirov
From: David Howells 

Differentiate the MS_* flags passed to mount(2) from the internal flags set
in the super_block's s_flags.  s_flags are now called SB_*, with the names
and the values for the moment mirroring the MS_* flags that they're
equivalent to.

In this patch, just the headers are altered and some kernel code where
blind automated conversion isn't necessarily correct.

Note that this shows up some interesting issues:

 (1) Some MS_* flags get translated to MNT_* flags (such as MS_NODEV ->
 MNT_NODEV) without passing this on to the filesystem, but some
 filesystems set such flags anyway.

 (2) The ->remount_fs() methods of some filesystems adjust the *flags
 argument by setting MS_* flags in it, such as MS_NOATIME - but these
 flags are then scrubbed by do_remount_sb() (only the occupants of
 MS_RMT_MASK are permitted: MS_RDONLY, MS_SYNCHRONOUS, MS_MANDLOCK,
 MS_I_VERSION and MS_LAZYTIME)

I'm not sure what's the best way to solve all these cases.

Suggested-by: Al Viro 
Signed-off-by: David Howells 

Change: Get rid of our cmd variable in do_mount
(cherry picked from commit e462ec50cb5fad19f6003a3d8087f4a0945dd2b1)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 Documentation/filesystems/porting |  2 +-
 fs/namespace.c| 71 ---
 fs/super.c| 68 ++---
 include/linux/fs.h| 45 
 init/do_mounts.c  |  2 +-
 5 files changed, 108 insertions(+), 80 deletions(-)

diff --git a/Documentation/filesystems/porting 
b/Documentation/filesystems/porting
index fbfc7338f2bd..680e3401f7d8 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -228,7 +228,7 @@ anything from oops to silent memory corruption.
 ---
 [mandatory]
 
-   FS_NOMOUNT is gone.  If you use it - just set MS_NOUSER in flags
+   FS_NOMOUNT is gone.  If you use it - just set SB_NOUSER in flags
 (see rootfs for one kind of solution and bdev/socket/pipe for another).
 
 ---
diff --git a/fs/namespace.c b/fs/namespace.c
index d10138869c91..1a62ab03692e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1065,7 +1065,7 @@ vfs_kern_mount(struct file_system_type *type, int flags, 
const char *name, void
if (!mnt)
return ERR_PTR(-ENOMEM);
 
-   if (flags & MS_KERNMOUNT)
+   if (flags & SB_KERNMOUNT)
mnt->mnt.mnt_flags = MNT_INTERNAL;
 
root = mount_fs(type, flags, name, data);
@@ -1121,7 +1121,7 @@ vfs_submount(const struct dentry *mountpoint, struct 
file_system_type *type,
return ERR_PTR(-EPERM);
 #endif
 
-   return vfs_kern_mount(type, MS_SUBMOUNT, name, data);
+   return vfs_kern_mount(type, SB_SUBMOUNT, name, data);
 }
 EXPORT_SYMBOL_GPL(vfs_submount);
 
@@ -1828,7 +1828,7 @@ static int do_umount(struct mount *mnt, int flags)
return -EPERM;
down_write(&sb->s_umount);
if (!(sb->s_flags & MS_RDONLY))
-   retval = do_remount_sb(sb, MS_RDONLY, NULL, 0);
+   retval = do_remount_sb(sb, SB_RDONLY, NULL, 0);
up_write(&sb->s_umount);
return retval;
}
@@ -2419,7 +2419,7 @@ static void unlock_mount(struct mountpoint *where)
 
 static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint 
*mp)
 {
-   if (mnt->mnt.mnt_sb->s_flags & MS_NOUSER)
+   if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER)
return -EINVAL;
 
if (S_ISDIR(mp->m_dentry->d_inode->i_mode) !=
@@ -2433,9 +2433,9 @@ static int graft_tree(struct mount *mnt, struct mount *p, 
struct mountpoint *mp)
  * Sanity check the flags to change_mnt_propagation.
  */
 
-static int flags_to_propagation_type(int flags)
+static int flags_to_propagation_type(int ms_flags)
 {
-   int type = flags & ~(MS_REC | MS_SILENT);
+   int type = ms_flags & ~(MS_REC | MS_SILENT);
 
/* Fail if any non-propagation flags are set */
if (type & ~(MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
@@ -2449,18 +2449,18 @@ static int flags_to_propagation_type(int flags)
 /*
  * recursively change the type of the mountpoint.
  */
-static int do_change_type(struct path *path, int flag)
+static int do_change_type(struct path *path, int ms_flags)
 {
struct mount *m;
struct mount *mnt = real_mount(path->mnt);
-   int recurse = flag & MS_REC;
+   int recurse = ms_flags & MS_REC;
int type;
int err = 0;
 
if (path->dentry != path->mnt->mnt_root)
return -EINVAL;
 
-   type = flags_to_propagation_type(flag);
+   type = flags_to_propagation_type(ms_flags);
if (!type)
return -EINVAL;
 
@@ -2867,8 +2867,8 @@ static int

[Devel] [PATCH RH7 10/14] mount: make {lock, unlock}_mount_hash() static

2023-04-13 Thread Pavel Tikhomirov
From: Christian Brauner 

The lock_mount_hash() and unlock_mount_hash() helpers are never called
outside a single file. Remove them from the header and make them static
to reflect this fact. There's no need to have them callable from other
places right now, as Christoph observed.

Link: 
https://lore.kernel.org/r/20210121131959.646623-31-christian.brau...@ubuntu.com
Cc: David Howells 
Cc: Al Viro 
Cc: linux-fsde...@vger.kernel.org
Suggested-by: Christoph Hellwig 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Christian Brauner 

(cherry picked from commit d033cb6784c4f3a19a593cfe11f850e476197388)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/mount.h | 10 --
 fs/namespace.c | 10 ++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index b07b3934b746..f968bb548685 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -132,16 +132,6 @@ static inline void get_mnt_ns(struct mnt_namespace *ns)
 
 extern seqlock_t mount_lock;
 
-static inline void lock_mount_hash(void)
-{
-   write_seqlock(&mount_lock);
-}
-
-static inline void unlock_mount_hash(void)
-{
-   write_sequnlock(&mount_lock);
-}
-
 struct proc_mounts {
struct seq_file m;
struct mnt_namespace *ns;
diff --git a/fs/namespace.c b/fs/namespace.c
index 52a879628aba..9e58a2ff16ea 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -84,6 +84,16 @@ EXPORT_SYMBOL_GPL(fs_kobj);
  */
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(mount_lock);
 
+static inline void lock_mount_hash(void)
+{
+   write_seqlock(&mount_lock);
+}
+
+static inline void unlock_mount_hash(void)
+{
+   write_sequnlock(&mount_lock);
+}
+
 static inline struct hlist_head *m_hash(struct vfsmount *mnt, struct dentry 
*dentry)
 {
unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 09/14] namespace: take lock_mount_hash() directly when changing flags

2023-04-13 Thread Pavel Tikhomirov
From: Christian Brauner 

Changing mount options always ends up taking lock_mount_hash() but when
MNT_READONLY is requested and neither the mount nor the superblock are
MNT_READONLY we end up taking the lock, dropping it, and retaking it to
change the other mount attributes. Instead, let's acquire the lock once
when changing the mount attributes. This simplifies the locking in these
codepath, makes them easier to reason about and avoids having to
reacquire the lock right after dropping it.

Link: 
https://lore.kernel.org/r/20210121131959.646623-30-christian.brau...@ubuntu.com
Cc: David Howells 
Cc: Al Viro 
Cc: linux-fsde...@vger.kernel.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Christian Brauner 

(cherry picked from commit 68847c941700475575ced191108971d26e82ae29)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 1b7b71c46a22..52a879628aba 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -564,7 +564,6 @@ static int mnt_make_readonly(struct mount *mnt)
 {
int ret = 0;
 
-   lock_mount_hash();
mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
/*
 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -598,18 +597,9 @@ static int mnt_make_readonly(struct mount *mnt)
 */
smp_wmb();
mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
-   unlock_mount_hash();
return ret;
 }
 
-static int __mnt_unmake_readonly(struct mount *mnt)
-{
-   lock_mount_hash();
-   mnt->mnt.mnt_flags &= ~MNT_READONLY;
-   unlock_mount_hash();
-   return 0;
-}
-
 int sb_prepare_remount_readonly(struct super_block *sb)
 {
struct mount *mnt;
@@ -2697,7 +2687,8 @@ static int change_mount_ro_state(struct mount *mnt, 
unsigned int mnt_flags)
if (readonly_request)
return mnt_make_readonly(mnt);
 
-   return __mnt_unmake_readonly(mnt);
+   mnt->mnt.mnt_flags &= ~MNT_READONLY;
+   return 0;
 }
 
 /*
@@ -2706,11 +2697,9 @@ static int change_mount_ro_state(struct mount *mnt, 
unsigned int mnt_flags)
  */
 static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
 {
-   lock_mount_hash();
mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
mnt->mnt.mnt_flags = mnt_flags;
touch_mnt_namespace(mnt->mnt_ns);
-   unlock_mount_hash();
 }
 
 /*
@@ -2734,9 +2723,11 @@ static int do_reconfigure_mnt(struct path *path, 
unsigned int mnt_flags)
return -EPERM;
 
down_write(&sb->s_umount);
+   lock_mount_hash();
ret = change_mount_ro_state(mnt, mnt_flags);
if (ret == 0)
set_mount_attributes(mnt, mnt_flags);
+   unlock_mount_hash();
up_write(&sb->s_umount);
return ret;
 }
@@ -2958,8 +2949,11 @@ static int do_remount(struct path *path, int ms_flags, 
int sb_flags,
err = -EPERM;
if (ns_capable(sb->s_user_ns, CAP_SYS_ADMIN)) {
err = do_check_and_remount_sb(sb, sb_flags, data);
-   if (!err)
+   if (!err) {
+   lock_mount_hash();
set_mount_attributes(mnt, mnt_flags);
+   unlock_mount_hash();
+   }
}
up_write(&sb->s_umount);
return err;
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 07/14] vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion

2023-04-13 Thread Pavel Tikhomirov
From: David Howells 

In do_mount() when the MS_* flags are being converted to MNT_* flags,
MS_RDONLY got accidentally converted to SB_RDONLY.

Undo this change.

Fixes: e462ec50cb5f ("VFS: Differentiate mount flags (MS_*) from internal 
superblock flags")
Signed-off-by: David Howells 
Signed-off-by: Linus Torvalds 

(cherry picked from commit a9e5b73288cf1595ac2e05cf1acd1924ceea05fa)
https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 33ccc11cf327..10e294cb8cda 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3570,7 +3570,7 @@ long do_mount(const char *dev_name, const char __user 
*dir_name,
mnt_flags |= MNT_NODIRATIME;
if (flags & MS_STRICTATIME)
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
-   if (flags & SB_RDONLY)
+   if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
 
/* The default atime for remount is preservation */
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 00/14] port move_mount_set_group and mount_setattr

2023-04-13 Thread Pavel Tikhomirov
We need this because, after the rebase of Virtuozzo criu to mainstream
criu in u20, we will switch to this new API for setting sharing groups
across mounts.

https://jira.sw.ru/browse/PSBM-144416
Signed-off-by: Pavel Tikhomirov 

Aleksa Sarai (1):
  lib: introduce copy_struct_from_user() helper

Christian Brauner (6):
  namespace: take lock_mount_hash() directly when changing flags
  mount: make {lock,unlock}_mount_hash() static
  namespace: only take read lock in do_reconfigure_mnt()
  fs: split out functions to hold writers
  fs: add mount_setattr()
  fs: drop peer group ids under namespace lock

David Howells (3):
  VFS: Differentiate mount flags (MS_*) from internal superblock flags
  vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion
  vfs: Separate changing mount flags full remount

Markus Trippelsdorf (1):
  VFS: Handle lazytime in do_mount()

Mimi Zohar (1):
  vfs: fix mounting a filesystem with i_version

Pavel Tikhomirov (2):
  mount: rename do_set_group to do_set_group_old
  move_mount: allow to add a mount into an existing group

 Documentation/filesystems/porting |   2 +-
 arch/x86/syscalls/syscall_32.tbl  |   1 +
 arch/x86/syscalls/syscall_64.tbl  |   1 +
 fs/mount.h|  10 -
 fs/namespace.c| 646 +-
 fs/super.c|  68 ++--
 include/linux/fs.h|  45 ++-
 include/linux/mount.h |   2 +-
 include/linux/uaccess.h   |  70 
 include/linux/vz_bitops.h |  11 +
 include/uapi/linux/fs.h   |   3 +-
 init/do_mounts.c  |   2 +-
 lib/strnlen_user.c|   8 +-
 lib/usercopy.c|  76 
 14 files changed, 780 insertions(+), 165 deletions(-)
 create mode 100644 include/linux/vz_bitops.h

-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 03/14] lib: introduce copy_struct_from_user() helper

2023-04-13 Thread Pavel Tikhomirov
From: Aleksa Sarai 

A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
result in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases).

While this interface exists for communication in both directions, only
one interface is straightforward to have reasonable semantics for
(userspace passing a struct to the kernel). For kernel returns to
userspace, what the correct semantics are (whether there should be an
error if userspace is unaware of a new extension) is very
syscall-dependent and thus probably cannot be unified between syscalls
(a good example of this problem is [1]).

Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[2]). Future
patches replace common uses of this pattern to make use of
copy_struct_from_user().

Some in-kernel selftests ensure that the handling of alignment and
various byte patterns is identical to memchr_inv() usage.

[1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

[2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
 similar checks to copy_struct_from_user() while rt_sigprocmask(2)
 always rejects differently-sized struct arguments.
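
To make the extension-checking semantics concrete before the kernel
implementation below, here is a plain userspace model of the three size
cases (illustration only, not the kernel code — the real implementation
does word-sized zero checks on __user memory):

#include <errno.h>
#include <stddef.h>
#include <string.h>

static int copy_struct_model(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	size_t size = ksize < usize ? ksize : usize;
	size_t i;

	/* usize > ksize: fields unknown to the "kernel" must be zero. */
	for (i = ksize; i < usize; i++)
		if (((const char *)src)[i] != 0)
			return -E2BIG;

	/* usize < ksize: zero-fill fields unknown to "userspace". */
	if (usize < ksize)
		memset((char *)dst + usize, 0, ksize - usize);

	memcpy(dst, src, size);	/* the common prefix is copied verbatim */
	return 0;
}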

Suggested-by: Rasmus Villemoes 
Signed-off-by: Aleksa Sarai 
Reviewed-by: Kees Cook 
Reviewed-by: Christian Brauner 
Link: https://lore.kernel.org/r/20191001011055.19283-2-cyp...@cyphar.com
Signed-off-by: Christian Brauner 

- Need this for the mount_setattr syscall.
- Dropped lib/test_user_copy.c hunks when rebasing.
- Move aligned_byte_mask to vz_bitops.h, since something breaks on boot
when it is moved to bitops.h.
- Add unsafe_get_user, user_access_begin and user_access_end helpers.

https://jira.vzint.dev/browse/PSBM-144416
(cherry picked from commit f5a1a536fa14895ccff4e94e6a5af90901ce86aa)
Signed-off-by: Pavel Tikhomirov 
---
 include/linux/uaccess.h   | 70 
 include/linux/vz_bitops.h | 11 ++
 lib/strnlen_user.c|  8 +
 lib/usercopy.c| 76 +++
 4 files changed, 158 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/vz_bitops.h

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index b28cc68c1896..9f8da4aa31a1 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -112,6 +112,76 @@ static inline unsigned long __copy_from_user_nocache(void 
*to,
ret;\
})
 
+extern __must_check int check_zeroed_user(const void __user *from, size_t 
size);
+
+/**
+ * copy_struct_from_user: copy a struct from userspace
+ * @dst:   Destination address, in kernel space. This buffer must be @ksize
+ * bytes long.
+ * @ksize: Size of @dst struct.
+ * @src:   Source address, in userspace.
+ * @usize: (Alleged) size of @src struct.
+ *
+ * Copies a struct from userspace to kernel space, in a way that guarantees
+ * backwards-compatibility for struct syscall arguments (as long as future
+ * struct extensions are made such that all new fields are *appended* to the
+ * old struct, and zeroed-out new fields have the same meaning as the old
+ * struct).
+ *
+ * @ksize is just sizeof(*dst), and @usize should've been passed by userspace.
+ * The recommended usage is something like the following:
+ *
+ *   SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize)
+ *   {
+ *  int err;
+ *  struct foo karg = {};
+ *
+ *  if (usize > PAGE_SIZE)
+ *return -E2BIG;
+ *  if (usize < FOO_SIZE_VER0)
+ *return -EINVAL;
+ *
+ *  err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
+ *  if (err)
+ *return err;
+ *
+ *  // ...
+ *   }
+ *
+ * There are three cases to consider:
+ *  * If @usize == @ksize, then it's copied verbatim.
+ *  * If @usize < @ksize, then the userspace has passed an old struct to a
+ *newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize)
+ *are to be zero-filled.
+ *  * If @usize > @ksize, then the userspace has passed a new struct to an
+ *older kernel. The trailing bytes unknown to the kernel (@usize - @ksize)
+ *are checked to ensure they are zeroed, otherwise -E2BIG is returned.
+ *
+ * Returns (in all cases, some data may have been copied):
+ *  * -E2BIG:  (@usize > @ksize) and there are non-zero trailing bytes in @src.
+ *  * -EFAULT: access to userspace failed.
+ */
+static __always_inline __must_check int
+copy_struct_from_user(void *dst, size_t ksize, const void __user *src,
+ size_t usize)
+{
+   size_t size = min(ksize, usize);
+   size_t rest = max(ksize, usize) - size

Re: [Devel] [PATCH RH9 1/4] Revert "net: openvswitch: add capability to specify ifindex of new links"

2023-04-26 Thread Pavel Tikhomirov
Just a note: old criu will get EOPNOTSUPP on restore in create_one_vport
on a new kernel.


Probably it's OK and we can afford it, as we have only released a beta
version so far. And probably nobody will just update the kernel and
reboot without updating criu.


On 25.04.2023 18:11, Andrey Zhadchenko wrote:

This reverts commit 757ebade1eec8c6a3d1a150c8bd6f564c939c058.
We should use the version upstream accepted

https://jira.vzint.dev/browse/PSBM-105844
Signed-off-by: Andrey Zhadchenko 
---
  net/openvswitch/datapath.c   | 16 ++--
  net/openvswitch/vport-internal_dev.c |  1 -
  net/openvswitch/vport.h  |  2 --
  3 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 8033c97a8d65..7e8a39a35627 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -1739,7 +1739,6 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct 
genl_info *info)
struct vport *vport;
struct ovs_net *ovs_net;
int err;
-   struct ovs_header *ovs_header = info->userhdr;
  
  	err = -EINVAL;

if (!a[OVS_DP_ATTR_NAME] || !a[OVS_DP_ATTR_UPCALL_PID])
@@ -1780,7 +1779,6 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct 
genl_info *info)
parms.dp = dp;
parms.port_no = OVSP_LOCAL;
parms.upcall_portids = a[OVS_DP_ATTR_UPCALL_PID];
-   parms.desired_ifindex = ovs_header->dp_ifindex;
  
  	/* So far only local changes have been made, now need the lock. */

ovs_lock();
@@ -2201,10 +2199,7 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct 
genl_info *info)
if (!a[OVS_VPORT_ATTR_NAME] || !a[OVS_VPORT_ATTR_TYPE] ||
!a[OVS_VPORT_ATTR_UPCALL_PID])
return -EINVAL;
-
-   parms.type = nla_get_u32(a[OVS_VPORT_ATTR_TYPE]);
-
-   if (a[OVS_VPORT_ATTR_IFINDEX] && parms.type != OVS_VPORT_TYPE_INTERNAL)
+   if (a[OVS_VPORT_ATTR_IFINDEX])
return -EOPNOTSUPP;
  
  	port_no = a[OVS_VPORT_ATTR_PORT_NO]

@@ -2241,19 +2236,12 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, 
struct genl_info *info)
}
  
  	parms.name = nla_data(a[OVS_VPORT_ATTR_NAME]);

+   parms.type = nla_get_u32(a[OVS_VPORT_ATTR_TYPE]);
parms.options = a[OVS_VPORT_ATTR_OPTIONS];
parms.dp = dp;
parms.port_no = port_no;
parms.upcall_portids = a[OVS_VPORT_ATTR_UPCALL_PID];
  
-	if (parms.type == OVS_VPORT_TYPE_INTERNAL) {

-   if (a[OVS_VPORT_ATTR_IFINDEX])
-   parms.desired_ifindex =
-   nla_get_u32(a[OVS_VPORT_ATTR_IFINDEX]);
-   else
-   parms.desired_ifindex = 0;
-   }
-
vport = new_vport(&parms);
err = PTR_ERR(vport);
if (IS_ERR(vport)) {
diff --git a/net/openvswitch/vport-internal_dev.c 
b/net/openvswitch/vport-internal_dev.c
index 1c25158fbdf2..1e5468137c88 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -157,7 +157,6 @@ static struct vport *internal_dev_create(const struct 
vport_parms *parms)
if (vport->port_no == OVSP_LOCAL)
vport->dev->features |= NETIF_F_NETNS_LOCAL;
  
-	dev->ifindex = parms->desired_ifindex;

rtnl_lock();
err = register_netdevice(vport->dev);
if (err)
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index 24e1cba2f1ac..9de5030d9801 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -98,8 +98,6 @@ struct vport_parms {
enum ovs_vport_type type;
struct nlattr *options;
  
-	int desired_ifindex;

-
/* For ovs_vport_alloc(). */
struct datapath *dp;
u16 port_no;


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH9 2/4] openvswitch: fix memory leak at failed datapath creation

2023-04-26 Thread Pavel Tikhomirov




On 25.04.2023 18:11, Andrey Zhadchenko wrote:

ovs_dp_cmd_new()->ovs_dp_change()->ovs_dp_set_upcall_portids()
allocates array via kmalloc.
If for some reason new_vport() fails during ovs_dp_cmd_new()
dp->upcall_portids must be freed.
Add missing kfree.

Kmemleak example:
unreferenced object 0x88800c382500 (size 64):
   comm "dump_state", pid 323, jiffies 4294955418 (age 104.347s)
   hex dump (first 32 bytes):
 5e c2 79 e4 1f 7a 38 c7 09 21 38 0c 80 88 ff ff  ^.y..z8..!8.
 03 00 00 00 0a 00 00 00 14 00 00 00 28 00 00 00  (...
   backtrace:
 [<71bebc9f>] ovs_dp_set_upcall_portids+0x38/0xa0
 [<0187d8bd>] ovs_dp_change+0x63/0xe0
 [<2397e446>] ovs_dp_cmd_new+0x1f0/0x380
 [] genl_family_rcv_msg_doit+0xea/0x150
 [<8f583bc4>] genl_rcv_msg+0xdc/0x1e0
 [] netlink_rcv_skb+0x50/0x100
 [<4959cece>] genl_rcv+0x24/0x40
 [<4699ac7f>] netlink_unicast+0x23e/0x360
 [] netlink_sendmsg+0x24e/0x4b0
 [<6f4aa380>] sock_sendmsg+0x62/0x70
 [] sys_sendmsg+0x230/0x270
 [<12dacf7d>] ___sys_sendmsg+0x88/0xd0
 [<11776020>] __sys_sendmsg+0x59/0xa0
 [<2e8f2dc1>] do_syscall_64+0x3b/0x90
 [<3243e7cb>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: b83d23a2a38b ("openvswitch: Introduce per-cpu upcall dispatch")
Acked-by: Aaron Conole 
Signed-off-by: Andrey Zhadchenko 
Link: 
https://lore.kernel.org/r/20220825020326.664073-1-andrey.zhadche...@virtuozzo.com
Signed-off-by: Jakub Kicinski 

(cherry picked from ms commit a87406f4adee9c53b311d8a1ba2849c69e29a6d0)


Do we have a jira issue for it?


Signed-off-by: Andrey Zhadchenko 
---
  net/openvswitch/datapath.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 7e8a39a35627..6c9d153afbee 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -1802,7 +1802,7 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct 
genl_info *info)
ovs_dp_reset_user_features(skb, info);
}
  
-		goto err_unlock_and_destroy_meters;

+   goto err_destroy_portids;
}
  
  	err = ovs_dp_cmd_fill_info(dp, reply, info->snd_portid,

@@ -1817,6 +1817,8 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct 
genl_info *info)
ovs_notify(&dp_datapath_genl_family, reply, info);
return 0;
  
+err_destroy_portids:

+   kfree(rcu_dereference_raw(dp->upcall_portids));
  err_unlock_and_destroy_meters:
ovs_unlock();
ovs_meters_exit(dp);


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH9 0/4] net/openvswitch: pull ms API

2023-04-26 Thread Pavel Tikhomirov

Ported cleanly. Would be nice to have a jira issue tracking this.

Reviewed-by: Pavel Tikhomirov 

On 25.04.2023 18:11, Andrey Zhadchenko wrote:

Revert our patch for openvswitch, apply the ones that got accepted
into mainstream:
https://lore.kernel.org/all/20220825020450.664147-1-andrey.zhadche...@virtuozzo.com/

Andrey Zhadchenko (4):
   Revert "net: openvswitch: add capability to specify ifindex of new
 links"
   openvswitch: fix memory leak at failed datapath creation
   openvswitch: allow specifying ifindex of new interfaces
   openvswitch: add OVS_DP_ATTR_PER_CPU_PIDS to get requests

  include/uapi/linux/openvswitch.h |  3 +++
  net/openvswitch/datapath.c   | 29 
  net/openvswitch/vport-internal_dev.c |  2 +-
  net/openvswitch/vport.h  |  4 ++--
  4 files changed, 23 insertions(+), 15 deletions(-)



--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7] ve/cgroups: fix subgroups_limit error path handling

2023-05-18 Thread Pavel Tikhomirov
We do ida_simple_get on root->cgroup_ida just before checking
subgroups_limit, but in case subgroups_limit is reached we don't do the
corresponding ida_simple_remove to free the id. Let's fix it by jumping
to the proper goto label, err_free_id.

This may or may not be related to [1], found while investigating it.

https://jira.vzint.dev/browse/PSBM-147036 [1]
Fixes: 92faf0fad3e3 ("ve/cgroups: Introduce subgroups_limit control")
Signed-off-by: Pavel Tikhomirov 
---
 kernel/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f2952d7c18dc..3f8c49b9ebe0 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4888,7 +4888,7 @@ static long cgroup_create(struct cgroup *parent, struct 
dentry *dentry,
if (ve_root && ve_root->subgroups_limit > 0 &&
subgroups_count(ve_root) >= ve_root->subgroups_limit) {
err = -EACCES;
-   goto err_free_name;
+   goto err_free_id;
}
 
/*
-- 
2.39.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH vz7] ploop: increase logging on errors when opening new deltas

2023-06-06 Thread Pavel Tikhomirov

Looks good, see minor nits in inline comments.

On 22.05.2023 23:39, Alexander Atanasov wrote:

Occasionally we get EBUSY, but it is a bit overused,
so it is not clear what it means.

Add more logging to catch the source of the error.

https://jira.vzint.dev/browse/PSBM-146836
Signed-off-by: Alexander Atanasov 
---
  drivers/block/ploop/dev.c |  9 -
  drivers/block/ploop/fmt_ploop1.c  |  4 +++-
  drivers/block/ploop/io_kaio.c | 13 ++---
  drivers/block/ploop/io_kaio_map.c |  6 --
  4 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 6eb22168b5fe..75e427927713 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3572,8 +3572,11 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
   sizeof(struct ploop_ctl_chunk)))
return -EFAULT;
  
-	if (plo->maintenance_type != PLOOP_MNTN_OFF)

+   if (plo->maintenance_type != PLOOP_MNTN_OFF) {
+   if (printk_ratelimit())
+   PL_WARN(plo, "Attempt to replace while in maintenance 
mode\n");
return -EBUSY;
+   }
  
  	old_delta = find_delta(plo, ctl.pctl_level);

if (old_delta == NULL)
@@ -3586,6 +3589,10 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
if (IS_ERR(delta))
return PTR_ERR(delta);
  
+	WARN_ONCE(delta->ops != old_delta->ops,

+ "New delta uses different io %p vs %p\n",
+ delta->ops, old_delta->ops);
+
err = delta->ops->compose(delta, 1, &chunk);
if (err)
goto out_destroy;
diff --git a/drivers/block/ploop/fmt_ploop1.c b/drivers/block/ploop/fmt_ploop1.c
index e59a9eb50ac2..a89804561e57 100644
--- a/drivers/block/ploop/fmt_ploop1.c
+++ b/drivers/block/ploop/fmt_ploop1.c
@@ -314,8 +314,10 @@ ploop1_open(struct ploop_delta * delta)
if (!(delta->flags & PLOOP_FMT_RDONLY)) {
pvd_header_set_disk_in_use(vh);
err = delta->io.ops->sync_write(&delta->io, ph->dyn_page, 4096, 
0, 0);
-   if (err)
+   if (err) {
+   PL_ERR(delta->plo, "write failed updating in use\n");


Don't we need err printed here? (like in other places below where we 
have err printed)



goto out_err;
+   }
}
  
  	delta->io.alloc_head = ph->alloc_head;

diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index ab93d2c70bc5..b7258252deb2 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -997,17 +997,23 @@ static int kaio_open(struct ploop_io * io)
io->files.bdev = io->files.inode->i_sb->s_bdev;
  
  	err = io->ops->sync(io);

-   if (err)
+   if (err) {
+   PL_WARN(delta->plo, "open failed to sync err=%d\n", err);
return err;
+   }
  
  	mutex_lock(&io->files.inode->i_mutex);

err = kaio_invalidate_cache(io);
-   if (!err)
+   if (err)
+   PL_WARN(delta->plo, "invalidate_cache failed err=%d\n", err);
+   else
err = ploop_kaio_open(file, delta->flags & PLOOP_FMT_RDONLY);
mutex_unlock(&io->files.inode->i_mutex);
  
-	if (err)

+   if (err) {
+   PL_WARN(delta->plo, "open failed err=%d\n", err);
return err;
+   }
  
  	io->files.em_tree = &dummy_em_tree;
  
@@ -1019,6 +1025,7 @@ static int kaio_open(struct ploop_io * io)

err = PTR_ERR(io->fsync_thread);
io->fsync_thread = NULL;
ploop_kaio_close(io->files.mapping, 0);
+   PL_WARN(delta->plo, "fsync thread start failed=%d\n", 
err);
return err;
}
  
diff --git a/drivers/block/ploop/io_kaio_map.c b/drivers/block/ploop/io_kaio_map.c

index d4ff39d95e74..5a3d11e0f4a9 100644
--- a/drivers/block/ploop/io_kaio_map.c
+++ b/drivers/block/ploop/io_kaio_map.c
@@ -35,9 +35,11 @@ int ploop_kaio_open(struct file * file, int rdonly)
else
m->readers++;
} else {
-   if (m->readers)
+   if (m->readers) {
+   pr_warn("File is already active:%d\n",
+   m->readers);
err = -EBUSY;
-   else
+   } else
m->readers = -1;


I'd rather you follow the coding style here and add {} to both branches.

https://www.kernel.org/doc/html/v4.10/process/coding-style.html#placing-braces-and-spaces


}
goto kaio_open_done;


--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.

Re: [Devel] [PATCH vz7] ploop: properly restore old delta kobject on replace error

2023-06-06 Thread Pavel Tikhomirov




On 22.05.2023 23:19, Alexander Atanasov wrote:

Current code removes the old_delta kobject before the new delta
is ready to be used, in case there is an error while reading the BAT
(the "too short BAT" error we have seen).
The new delta and its objects are destroyed properly, but the original
delta kobject is never added back, so the pdelta sysfs dir stays empty
and userspace tools get confused, since they consult these files before
performing operations on the ploop device. At a later point, when the
ploop device is destroyed, a second kobject_del will be called, and it
can lead to a crash or corrupted memory.

To fix this, instead of deleting the delta early, rename it with a
prefix and move the deletion to the end; on error, rename it back to
the original name. This way, if something goes wrong, it will be
visible in sysfs as old_N, where N is the delta level.

Extract code to rename kobject into a function and cleanup its users.

Fixes: 5f3ee110e6f4 ("ploop: Repopulate holes_bitmap on changing delta")

https://jira.vzint.dev/browse/PSBM-146797
Signed-off-by: Alexander Atanasov 
---
  drivers/block/ploop/dev.c | 47 ++-
  1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 6eb22168b5fe..0ed63f9da1cb 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3557,6 +3557,16 @@ static int ploop_add_delta(struct ploop_device * plo, 
unsigned long arg)
return err;
  }
  
+/*
+ * Have to implement our own version of kobject_rename since it is a
+ * GPL-only symbol.
+ */
+static int ploop_rename_delta(struct ploop_delta *delta, int level, char *pref)
+{
+   kobject_del(&delta->kobj);
+   return KOBJECT_ADD(&delta->kobj, &delta->plo->kobj,
+ "%s%d", pref ? : "", level);
+}
+
  static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg)
  {
int err;
@@ -3594,15 +3604,9 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
if (err)
goto out_destroy;
  
-	kobject_del(&old_delta->kobj);

-
-   err = KOBJECT_ADD(&delta->kobj, kobject_get(&plo->kobj),
- "%d", delta->level);
-   /* _put below is a pair for _get for OLD delta */
-   kobject_put(&plo->kobj);
-
+   err = ploop_rename_delta(old_delta, old_delta->level, "old_");
if (err < 0) {
-   kobject_put(&plo->kobj);
+   PL_WARN(plo, "Failed to rename old delta kobj\n");
goto out_close;


I don't like this error path: if ploop_rename_delta failed, it means
that we did kobject_del(&old_delta->kobj) but didn't finish the
corresponding KOBJECT_ADD; then in out_close we would call
ploop_rename_delta again, doing a second kobject_del(&old_delta->kobj)
and the corresponding KOBJECT_ADD. This would do a double
kobject_del->kobj_kset_leave->kset_put; not sure if it would break
something or not though.


I mean, maybe it's better to preserve the original out_close and add a
new label, out_restore_rename.



}
  
@@ -3611,10 +3615,17 @@ static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg)

err = delta->ops->replace_delta(delta);
if (err) {
ploop_relax(plo);
-   goto out_kobj_del;
+   goto out_close;
}
}
  
+	err = KOBJECT_ADD(&delta->kobj, kobject_get(&plo->kobj),

+ "%d", delta->level);
+   if (err < 0) {
+   /* _put for failed _ADD */
+   kobject_put(&plo->kobj);
+   goto out_close;
+   }
ploop_map_destroy(&plo->map);
list_replace_init(&old_delta->list, &delta->list);
ploop_delta_list_changed(plo);
@@ -3631,13 +3642,15 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
  
  	old_delta->ops->stop(old_delta);

old_delta->ops->destroy(old_delta);
+   kobject_del(&old_delta->kobj);
kobject_put(&old_delta->kobj);
+   kobject_put(&plo->kobj);
return 0;
  
-out_kobj_del:

-   kobject_del(&delta->kobj);
-   kobject_put(&plo->kobj);
  out_close:
+   err = ploop_rename_delta(old_delta, old_delta->level, NULL);
+   if (err < 0)
+   PL_ERR(plo, "Failed to restore old delta kobj name\n");
delta->ops->stop(delta);
  out_destroy:
delta->ops->destroy(delta);
@@ -3974,15 +3987,7 @@ static void rename_deltas(struct ploop_device * plo, int 
level)
  
  		if (delta->level < level)

continue;
-#if 0
-   /* Oops, kobject_rename() is not exported! */
-   sprintf(nname, "%d", delta->level);
-   err = kobject_rename(&delta->kobj, nname);
-#else
-   kobject_del(&delta->kobj);
-   err = KOBJECT_ADD(&delta->kobj, &plo->kobj,
- "%d", delta->level);
-#endi

Re: [Devel] [PATCH vz7 v2] ploop: increase logging on errors when opening new deltas

2023-06-08 Thread Pavel Tikhomirov

On 09.06.2023 00:16, Alexander Atanasov wrote:

Occasionally we get EBUSY, but it is a bit overused,
so it is not clear what it means.

Add more logging to catch the source of the error.

https://jira.vzint.dev/browse/PSBM-146836


Reviewed-by: Pavel Tikhomirov 


Signed-off-by: Alexander Atanasov 
---
  drivers/block/ploop/dev.c |  9 -
  drivers/block/ploop/fmt_ploop1.c  |  4 +++-
  drivers/block/ploop/io_kaio.c | 13 ++---
  drivers/block/ploop/io_kaio_map.c |  7 +--
  4 files changed, 26 insertions(+), 7 deletions(-)

v1->v2: addressing review comments - print the error value and fix coding style

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 2833865b087f..8e238eafd9f2 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3582,8 +3582,11 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
   sizeof(struct ploop_ctl_chunk)))
return -EFAULT;
  
-	if (plo->maintenance_type != PLOOP_MNTN_OFF)

+   if (plo->maintenance_type != PLOOP_MNTN_OFF) {
+   if (printk_ratelimit())
+   PL_WARN(plo, "Attempt to replace while in maintenance 
mode\n");
return -EBUSY;
+   }
  
  	old_delta = find_delta(plo, ctl.pctl_level);

if (old_delta == NULL)
@@ -3596,6 +3599,10 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
if (IS_ERR(delta))
return PTR_ERR(delta);
  
+	WARN_ONCE(delta->ops != old_delta->ops,

+ "New delta uses different io %p vs %p\n",
+ delta->ops, old_delta->ops);
+
err = delta->ops->compose(delta, 1, &chunk);
if (err)
goto out_destroy;
diff --git a/drivers/block/ploop/fmt_ploop1.c b/drivers/block/ploop/fmt_ploop1.c
index e59a9eb50ac2..a0369db35c83 100644
--- a/drivers/block/ploop/fmt_ploop1.c
+++ b/drivers/block/ploop/fmt_ploop1.c
@@ -314,8 +314,10 @@ ploop1_open(struct ploop_delta * delta)
if (!(delta->flags & PLOOP_FMT_RDONLY)) {
pvd_header_set_disk_in_use(vh);
err = delta->io.ops->sync_write(&delta->io, ph->dyn_page, 4096, 
0, 0);
-   if (err)
+   if (err) {
+   PL_ERR(delta->plo, "failed to update in use err=%d\n", 
err);
goto out_err;
+   }
}
  
  	delta->io.alloc_head = ph->alloc_head;

diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index 35c0fad43baf..098f6c7b5f2d 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -966,17 +966,23 @@ static int kaio_open(struct ploop_io * io)
io->files.bdev = io->files.inode->i_sb->s_bdev;
  
  	err = io->ops->sync(io);

-   if (err)
+   if (err) {
+   PL_WARN(delta->plo, "open failed to sync err=%d\n", err);
return err;
+   }
  
  	mutex_lock(&io->files.inode->i_mutex);

err = kaio_invalidate_cache(io);
-   if (!err)
+   if (err)
+   PL_WARN(delta->plo, "invalidate_cache failed err=%d\n", err);
+   else
err = ploop_kaio_open(file, delta->flags & PLOOP_FMT_RDONLY);
mutex_unlock(&io->files.inode->i_mutex);
  
-	if (err)

+   if (err) {
+   PL_WARN(delta->plo, "open failed err=%d\n", err);
return err;
+   }
  
  	io->files.em_tree = &dummy_em_tree;
  
@@ -988,6 +994,7 @@ static int kaio_open(struct ploop_io * io)

err = PTR_ERR(io->fsync_thread);
io->fsync_thread = NULL;
ploop_kaio_close(io->files.mapping, 0);
+   PL_WARN(delta->plo, "fsync thread start failed=%d\n", 
err);
return err;
}
  
diff --git a/drivers/block/ploop/io_kaio_map.c b/drivers/block/ploop/io_kaio_map.c

index d4ff39d95e74..59e35c562ef9 100644
--- a/drivers/block/ploop/io_kaio_map.c
+++ b/drivers/block/ploop/io_kaio_map.c
@@ -35,10 +35,13 @@ int ploop_kaio_open(struct file * file, int rdonly)
else
m->readers++;
} else {
-   if (m->readers)
+   if (m->readers) {
+   pr_warn("File is already active:%d\n",
+   m->readers);
err = -EBUSY;
-   else
+   } else {
m->readers = -1;
+   }
 

Re: [Devel] [PATCH vz7 v2] ploop: properly restore old delta kobject on replace error

2023-06-08 Thread Pavel Tikhomirov




On 09.06.2023 00:19, Alexander Atanasov wrote:

Current code removes the old_delta kobject before the new delta
is ready to be used in case there is an error while reading BAT,
the "too short BAT" error we have seen.


The above sentence seems incomplete: even in the opposite case, where
there is no error, the old_delta kobject would have been removed before
the new delta is ready anyway =)



The new delta and its objects are destroyed properly, but the original
delta kobject is never added back, so the pdelta sysfs dir stays empty
and userspace tools get confused, since they consult these files before
performing operations on the ploop device. At a later point, when the
ploop device is destroyed, a second kobject_del will be called, and it
can lead to a crash or corrupted memory.

To fix this, instead of deleting the delta early, move the deletion to
the end; on error, restore the original.

Extract code to rename kobject into a function and cleanup its users.
Since kobject_add can fail, get an extra reference on error so later
kobject_del/kobject_put would not destroy it unexpectedly. Object
should be intact except that it would not be visible in sysfs.

Fixes: 5f3ee110e6f4 ("ploop: Repopulate holes_bitmap on changing delta")

https://jira.vzint.dev/browse/PSBM-146797
Signed-off-by: Alexander Atanasov 
---
  drivers/block/ploop/dev.c | 63 +++
  1 file changed, 37 insertions(+), 26 deletions(-)

v1->v2: Addressing review comments. Removed the rename to minimize
possibility of errors. added getting extra refs after failed add.

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 6eb22168b5fe..2833865b087f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3557,6 +3557,16 @@ static int ploop_add_delta(struct ploop_device * plo, 
unsigned long arg)
return err;
  }
  
+/*
+ * Have to implement our own version of kobject_rename since it is a
+ * GPL-only symbol.
+ */
+static int ploop_rename_delta(struct ploop_delta *delta, int level, char *pref)
+{
+   kobject_del(&delta->kobj);
+   return KOBJECT_ADD(&delta->kobj, &delta->plo->kobj,
+ "%s%d", pref ? : "", level);
+}
+


It would probably be cleaner to put this function just before the 
function using it (rename_deltas).



  static int ploop_replace_delta(struct ploop_device * plo, unsigned long arg)
  {
int err;
@@ -3594,27 +3604,25 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
if (err)
goto out_destroy;
  
-	kobject_del(&old_delta->kobj);

-
-   err = KOBJECT_ADD(&delta->kobj, kobject_get(&plo->kobj),
- "%d", delta->level);
-   /* _put below is a pair for _get for OLD delta */
-   kobject_put(&plo->kobj);
-
-   if (err < 0) {
-   kobject_put(&plo->kobj);
-   goto out_close;
-   }
  


^ double newline after code removal


ploop_quiesce(plo);
if (delta->ops->replace_delta) {
err = delta->ops->replace_delta(delta);
if (err) {
ploop_relax(plo);
-   goto out_kobj_del;
+   goto out_close;
}
}
  
+	/* Remove old delta kobj to avoid name collision with the new one */

+   kobject_del(&old_delta->kobj);
+   err = KOBJECT_ADD(&delta->kobj, kobject_get(&plo->kobj),
+ "%d", delta->level);
+   if (err < 0) {
+   /* _put for failed _ADD */
+   kobject_put(&plo->kobj);
+   goto out_kobj_restore;
+   }
ploop_map_destroy(&plo->map);
list_replace_init(&old_delta->list, &delta->list);
ploop_delta_list_changed(plo);
@@ -3632,11 +3640,19 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
old_delta->ops->stop(old_delta);
old_delta->ops->destroy(old_delta);
kobject_put(&old_delta->kobj);
-   return 0;
-
-out_kobj_del:
-   kobject_del(&delta->kobj);
kobject_put(&plo->kobj);
+   return 0;
+out_kobj_restore:
+   /* we haven't dropped our plo->kobj ref just add back */
+   err = KOBJECT_ADD(&old_delta->kobj, &plo->kobj, "%d", old_delta->level);
+   if (err < 0) {
+   PL_ERR(plo, "Failed to restore old delta kobject\n");
+   /*
+* Get an extra ref to the parent object as kobject_add does, so a
+* later kobject_del doesn't destroy it
+*/
+   kobject_get(&plo->kobj);


A later kobject_del would not see the parent object, as the first
kobject_del did "kobj->parent = NULL;", so in this respect you should
not be taking a reference here.


note: the problem I was talking about in my previous message was the kobj->kset put.


+   }
  out_close:
delta->ops->stop(delta);
  out_destroy:
@@ -3974,17 +3990,12 @@ static void rename_deltas(struct ploop_device * plo, 
int level)
  
  		if (

Re: [Devel] [PATCH vz7 v3] ploop: properly restore old delta kobject on replace error

2023-06-14 Thread Pavel Tikhomirov

On 15.06.2023 04:23, Alexander Atanasov wrote:

Current code removes the old_delta kobject before the new delta is
completely ready to be used. In case there is an error while reading
the BAT (the "too short BAT" error we have seen), the new delta and its
objects are destroyed properly, but the original delta kobject is not
restored, so the pdelta sysfs dir stays empty and userspace tools get
confused, since they consult these files before performing operations
on the ploop device.

To fix this, instead of deleting the delta's kobject early, move the
deletion to the end, and on error restore the original kobject.

Extract code to rename kobject into a function and cleanup its users.
Since kobject_add can fail, get an extra reference on error so later
kobject_del/kobject_put would not destroy it unexpectedly. Object
should be intact except that it would not be visible in sysfs.

Fixes: 5f3ee110e6f4 ("ploop: Repopulate holes_bitmap on changing delta")

https://jira.vzint.dev/browse/PSBM-146797


Reviewed-by: Pavel Tikhomirov 


Signed-off-by: Alexander Atanasov 
---
  drivers/block/ploop/dev.c | 54 ---
  1 file changed, 28 insertions(+), 26 deletions(-)

v1->v2: Addressing review comments. Removed the rename to minimize
possibility of errors. added getting extra refs after failed add.

v2->v3: dropped the idea of extra refs since they are not necessary;
a double kobject_put is not an issue since the parent is cleared and
kset objects are not used.

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 6eb22168b5fe..0d6272b39863 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3594,27 +3594,24 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
if (err)
goto out_destroy;
  
-	kobject_del(&old_delta->kobj);

-
-   err = KOBJECT_ADD(&delta->kobj, kobject_get(&plo->kobj),
- "%d", delta->level);
-   /* _put below is a pair for _get for OLD delta */
-   kobject_put(&plo->kobj);
-
-   if (err < 0) {
-   kobject_put(&plo->kobj);
-   goto out_close;
-   }
-
ploop_quiesce(plo);
if (delta->ops->replace_delta) {
err = delta->ops->replace_delta(delta);
if (err) {
ploop_relax(plo);
-   goto out_kobj_del;
+   goto out_close;
}
}
  
+	/* Remove old delta kobj to avoid name collision with the new one */

+   kobject_del(&old_delta->kobj);
+   err = KOBJECT_ADD(&delta->kobj, kobject_get(&plo->kobj),
+ "%d", delta->level);
+   if (err < 0) {
+   /* _put for failed _ADD */
+   kobject_put(&plo->kobj);
+   goto out_kobj_restore;
+   }
ploop_map_destroy(&plo->map);
list_replace_init(&old_delta->list, &delta->list);
ploop_delta_list_changed(plo);
@@ -3632,11 +3629,14 @@ static int ploop_replace_delta(struct ploop_device * 
plo, unsigned long arg)
old_delta->ops->stop(old_delta);
old_delta->ops->destroy(old_delta);
kobject_put(&old_delta->kobj);
-   return 0;
-
-out_kobj_del:
-   kobject_del(&delta->kobj);
kobject_put(&plo->kobj);
+   return 0;
+out_kobj_restore:
+   /* we haven't dropped our plo->kobj ref, just add it back */
+   err = KOBJECT_ADD(&old_delta->kobj, &plo->kobj, "%d", old_delta->level);
+   if (err < 0)
+   /* Nothing we can do unfortunately */
+   PL_ERR(plo, "Failed to restore old delta kobject\n");
  out_close:
delta->ops->stop(delta);
  out_destroy:
@@ -3965,6 +3965,16 @@ static void renumber_deltas(struct ploop_device * plo)
}
  }
  
+/*
+ * Have to implement our own version of kobject_rename since it is a
+ * GPL-only symbol.
+ */
+static int ploop_rename_delta(struct ploop_delta *delta, int level, char *pref)
+{
+   kobject_del(&delta->kobj);
+   return KOBJECT_ADD(&delta->kobj, &delta->plo->kobj,
+ "%s%d", pref ? : "", level);
+}
+
  static void rename_deltas(struct ploop_device * plo, int level)
  {
struct ploop_delta * delta;
@@ -3974,15 +3984,7 @@ static void rename_deltas(struct ploop_device * plo, int 
level)
  
  		if (delta->level < level)

continue;
-#if 0
-   /* Oops, kobject_rename() is not exported! */
-   sprintf(nname, "%d", delta->level);
-   err = kobject_rename(&delta->kobj, nname);
-#else
-   kobject_del(&delta->kobj);
-   err = KOBJECT_ADD(&delta->kobj, &

[Devel] [PATCH RH7 5/9] mm/memcg: fix refcount error while moving and swapping

2023-07-04 Thread Pavel Tikhomirov
From: Hugh Dickins 

It was hard to keep a test running, moving tasks between memcgs with
move_charge_at_immigrate, while swapping: mem_cgroup_id_get_many()'s
refcount is discovered to be 0 (supposedly impossible), so it is then
forced to REFCOUNT_SATURATED, and after thousands of warnings in quick
succession, the test is at last put out of misery by being OOM killed.

This is because of the way moved_swap accounting was saved up until the
task move gets completed in __mem_cgroup_clear_mc(), deferred from when
mem_cgroup_move_swap_account() actually exchanged old and new ids.
Concurrent activity can free up swap quicker than the task is scanned,
bringing id refcount down to 0 (which should only be possible when
offlining).

Just skip that optimization: do that part of the accounting immediately.
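
To make the race easier to see, here is a toy userspace model of the
deferred accounting (an illustration only, not kernel code; "ref"
stands for the destination memcg's id refcount):

	#include <assert.h>
	#include <stdio.h>

	int main(void)
	{
		int ref = 1;		/* base ref held while the memcg is online */
		int moved_swap = 0;	/* gets the mover defers to the end */

		moved_swap++;		/* swap entry moved, ref++ deferred */
		moved_swap++;		/* another entry moved */

		/* Concurrent swap-in frees both entries before the move
		 * completes; each put drops a ref that was never taken. */
		ref--;
		ref--;

		printf("deferred gets: %d, refcount seen by put: %d\n",
		       moved_swap, ref);	/* prints 2 and -1 */
		assert(ref < 0);	/* underflow: REFCOUNT_SATURATED splat */

		/* The fix: do ref++ at move time, once per entry, so the
		 * concurrent puts can never drive the count below zero. */
		return 0;
	}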

Fixes: 615d66c37c75 ("mm: memcontrol: fix memcg id ref counter on swap charge move")
Signed-off-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Reviewed-by: Alex Shi 
Cc: Johannes Weiner 
Cc: Alex Shi 
Cc: Shakeel Butt 
Cc: Michal Hocko 
Cc: 
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2007071431050.4726@eggly.anvils
Signed-off-by: Linus Torvalds 

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 8d22a9351035ef2ff12ef163a1091b8b8cf1e49c)
Signed-off-by: Pavel Tikhomirov 
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8b78c7c8d3e3..8c14585f5e86 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7465,7 +7465,6 @@ static void __mem_cgroup_clear_mc(void)
page_counter_uncharge(&mc.to->memory, mc.moved_swap);
}
 
-   mem_cgroup_id_get_many(mc.to, mc.moved_swap);
 
mc.moved_swap = 0;
}
@@ -7625,7 +7624,8 @@ put:  /* get_mctgt_type() gets the page */
ent = target.ent;
		if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) {
mc.precharge--;
-   /* we fixup refcnts and charges later. */
+   mem_cgroup_id_get_many(mc.to, 1);
+   /* we fixup other refcnts and charges later. */
mc.moved_swap++;
}
break;
-- 
2.40.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 0/9] memcg: release id when offlining cgroup

2023-07-04 Thread Pavel Tikhomirov
We see that a container user can deplete memory cgroup ids on the
system (64k of them) and prevent further memory cgroup creation. In a
crash dump collected by our customer in such a situation we see that
mem_cgroup_idr is full of cgroups from one container, all with the same
path (the cgroup of the docker service). These cgroups are not released
because they still hold kmem charges; each charge is for a tmpfs dentry
allocated from that cgroup. (And on the vz7 kernel it seems that such a
dentry is only released after unmounting the tmpfs or removing the
corresponding file from it.)

So there is a valid way to pin a kmem cgroup for a long time. A similar
problem was mentioned in mainstream, with page cache pinning a kmem
cgroup for a long time, and the proposed way to deal with it was to
release the cgroup id early so that new cgroups can be allocated
immediately, as sketched below.
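
Roughly, as in the upstream commits backported by this series (a
simplified sketch with names as in upstream mm/memcontrol.c, not
necessarily the exact vz7 code after the backport):

	static void mem_cgroup_id_put(struct mem_cgroup *memcg)
	{
		if (atomic_dec_and_test(&memcg->id.ref)) {
			/* The 16-bit id goes back to the IDR and is
			 * reusable for new cgroups right away... */
			idr_remove(&mem_cgroup_idr, memcg->id.id);
			memcg->id.id = 0;
			/* ...while the memcg itself may hang around
			 * offline until its last charge is gone. */
			css_put(&memcg->css);
		}
	}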

Reproduce:
https://git.vzint.dev/users/ptikhomirov/repos/helpers/browse/memcg-related/test-mycg-tmpfs.sh

After this fix the number of memory cgroups in /proc/cgroups can show
> 64k, as we now allow memory cgroups to hang around offline after
their ids have been released.

Note: it may be a bad idea to let a container eat kernel memory with
such hanging cgroups, but I don't have a better idea yet.

https://jira.vzint.dev/browse/PSBM-147473
https://jira.vzint.dev/browse/PSBM-147036
Signed-off-by: Pavel Tikhomirov 

Arnd Bergmann (1):
  mm: memcontrol: avoid unused function warning

Hugh Dickins (1):
  mm/memcg: fix refcount error while moving and swapping

Johannes Weiner (2):
  mm: memcontrol: uncharge pages on swapout
  mm: memcontrol: fix cgroup creation failure after many small jobs

Kirill Tkhai (1):
  memcg: remove memcg_cgroup::id from IDR on mem_cgroup_css_alloc() failure

Qian Cai (1):
  mm/memcontrol.c: fix a -Wunused-function warning

Vladimir Davydov (3):
  mm: memcontrol: fix swap counter leak on swapout from offline cgroup
  mm: memcontrol: fix memcg id ref counter on swap charge move
  mm: memcontrol: add sanity checks for memcg->id.ref on get/put

 mm/memcontrol.c | 134 ++--
 1 file changed, 106 insertions(+), 28 deletions(-)

-- 
2.40.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


  1   2   3   4   5   6   7   8   9   >