[Devel] [PATCH vz8] netfilter: ipset: Fix "INFO: rcu detected stall in hash_xxx" reports

2020-12-17 Thread Andrey Ryabinin
From: Jozsef Kadlecsik 

In the case of huge hash:* types of sets, because a set has a single
spinlock, processing the whole set under spinlock protection could take
too long.

There were four places where the whole hash table of the set was processed
from bucket to bucket while holding the spinlock:

- During resizing a set, the original set was locked to exclude kernel side
  add/del element operations (userspace add/del is excluded by the
  nfnetlink mutex). The original set is actually just read during the
  resize, so the spinlocking is replaced with rcu locking of regions.
  However, kernel side add/del of entries can then run in parallel with
  the resize. In order not to lose those operations, a backlog is added
  and replayed after a successful resize.
- Garbage collection of timed out entries was also protected by the spinlock.
  In order not to lock too long, region locking is introduced and a single
  region is processed in one gc run. Also, the simple timer based gc running
  is replaced with a workqueue based solution. The internal book-keeping
  (number of elements, size of extensions) is moved to region level due to
  the region locking.
- Adding elements: when the max number of the elements is reached, the gc
  was called to evict the timed out entries. The new approach is that the gc
  is called just for the matching region, assuming that if the region
  (proportionally) seems to be full, then the whole set is too. We could scan
  the other regions to check every entry under rcu locking, but for huge
  sets it'd mean a slowdown at adding elements.
- Listing the set header data: when the set was defined with timeout
  support, the garbage collector was called to clean up timed out entries
  to get the correct element numbers and set size values. Now the set is
  scanned to check non-timed out entries, without actually calling the gc
  for the whole set.

Thanks to Florian Westphal for helping me solve the SOFTIRQ-safe ->
SOFTIRQ-unsafe lock order issues while working on the patch.

Reported-by: syzbot+4b0e9d4ff3cf11783...@syzkaller.appspotmail.com
Reported-by: syzbot+c27b8d5010f45c666...@syzkaller.appspotmail.com
Reported-by: syzbot+68a806795ac89df3a...@syzkaller.appspotmail.com
Fixes: 23c42a403a9c ("netfilter: ipset: Introduction of new commands and protocol version 7")
Signed-off-by: Jozsef Kadlecsik 

https://jira.sw.ru/browse/PSBM-123524
(cherry picked from commit f66ee0410b1c3481ee75e5db9b34547b4d582465)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/netfilter/ipset/ip_set.h |  11 +-
 net/netfilter/ipset/ip_set_core.c  |  34 +-
 net/netfilter/ipset/ip_set_hash_gen.h  | 633 +
 3 files changed, 472 insertions(+), 206 deletions(-)

diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
index e499d170f12d..3c49b540c701 100644
--- a/include/linux/netfilter/ipset/ip_set.h
+++ b/include/linux/netfilter/ipset/ip_set.h
@@ -124,6 +124,7 @@ struct ip_set_ext {
u32 timeout;
u8 packets_op;
u8 bytes_op;
+   bool target;
 };
 
 struct ip_set;
@@ -190,6 +191,14 @@ struct ip_set_type_variant {
/* Return true if "b" set is the same as "a"
 * according to the create set parameters */
bool (*same_set)(const struct ip_set *a, const struct ip_set *b);
+   /* Region-locking is used */
+   bool region_lock;
+};
+
+struct ip_set_region {
+   spinlock_t lock;/* Region lock */
+   size_t ext_size;/* Size of the dynamic extensions */
+   u32 elements;   /* Number of elements vs timeout */
 };
 
 /* The core set type structure */
@@ -461,7 +470,7 @@ bitmap_bytes(u32 a, u32 b)
 #include <linux/netfilter/ipset/ip_set_timeout.h>
 
 #define IP_SET_INIT_KEXT(skb, opt, set)\
-   { .bytes = (skb)->len, .packets = 1,\
+   { .bytes = (skb)->len, .packets = 1, .target = true,\
  .timeout = ip_set_adt_opt_timeout(opt, set) }
 
 #define IP_SET_INIT_UEXT(set)  \
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 56b59904feca..615b5791edf2 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -558,6 +558,20 @@ ip_set_rcu_get(struct net *net, ip_set_id_t index)
return set;
 }
 
+static inline void
+ip_set_lock(struct ip_set *set)
+{
+   if (!set->variant->region_lock)
+   spin_lock_bh(&set->lock);
+}
+
+static inline void
+ip_set_unlock(struct ip_set *set)
+{
+   if (!set->variant->region_lock)
+   spin_unlock_bh(&set->lock);
+}
+
 int
 ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
const struct xt_action_param *par, struct ip_set_adt_opt *opt)
@@ -579,9 +593,9 @@ ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
if (ret == -EAGAIN) {
/* Type requests element to be completed */
pr_debug("element must be completed, ADD is triggered\n");
-   

[Devel] [PATCH RHEL8 COMMIT] dm-ploop: Actually zero tail of tail page

2020-12-17 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-240.1.1.vz8.5.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh8-4.18.0-240.1.1.vz8.5.2
-->
commit f9ebd8db47ba16a3d5f023172ab10fb87463bcab
Author: Kirill Tkhai 
Date:   Thu Dec 17 18:59:04 2020 +0300

dm-ploop: Actually zero tail of tail page

@from is a temporary buffer. We have to zero the tail of @to instead.

https://jira.sw.ru/browse/PSBM-123784
Fixes: 0497d745e201 "ploop: Zero tail of tail page"
Signed-off-by: Kirill Tkhai 
---
 drivers/md/dm-ploop-bat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm-ploop-bat.c b/drivers/md/dm-ploop-bat.c
index da18dd2e4638..3fcdf248c490 100644
--- a/drivers/md/dm-ploop-bat.c
+++ b/drivers/md/dm-ploop-bat.c
@@ -169,7 +169,7 @@ static int ploop_read_bat(struct ploop *ploop, struct bio *bio)
memcpy(to, from, nr_copy * sizeof(map_index_t));
kunmap(bio->bi_io_vec[page].bv_page);
if (unlikely(nr_copy < BAT_ENTRIES_PER_PAGE)) {
-   memset(from + nr_copy, 0, sizeof(map_index_t) *
+   memset(to + nr_copy, 0, sizeof(map_index_t) *
   (BAT_ENTRIES_PER_PAGE - nr_copy));
}
 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8] ptrace: fix task_join_group_stop() for the case when current is traced

2020-12-17 Thread Andrey Ryabinin
From: Oleg Nesterov 

This testcase

#include 
#include 
#include 
#include 
#include 
#include 
#include 

void *tf(void *arg)
{
return NULL;
}

int main(void)
{
int pid = fork();
if (!pid) {
kill(getpid(), SIGSTOP);

pthread_t th;
pthread_create(&th, NULL, tf, NULL);

return 0;
}

waitpid(pid, NULL, WSTOPPED);

ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACECLONE);
waitpid(pid, NULL, 0);

ptrace(PTRACE_CONT, pid, 0,0);
waitpid(pid, NULL, 0);

int status;
int thread = waitpid(-1, &status, 0);
assert(thread > 0 && thread != pid);
assert(status == 0x80137f);

return 0;
}

fails and triggers WARN_ON_ONCE(!signr) in do_jobctl_trap().

This is because task_join_group_stop() has 2 problems when current is traced:

1. We can't rely on the "JOBCTL_STOP_PENDING" check, a stopped tracee
   can be woken up by debugger and it can clone another thread which
   should join the group-stop.

   We need to check group_stop_count || SIGNAL_STOP_STOPPED.

2. If SIGNAL_STOP_STOPPED is already set, we should not increment
   sig->group_stop_count and add JOBCTL_STOP_CONSUME. The new thread
   should stop without another do_notify_parent_cldstop() report.

To clarify, the problem is very old and we should blame
ptrace_init_task().  But now that we have task_join_group_stop() it makes
more sense to fix this helper to avoid the code duplication.

Reported-by: syzbot+3485e3773f7da290e...@syzkaller.appspotmail.com
Signed-off-by: Oleg Nesterov 
Signed-off-by: Andrew Morton 
Cc: Jens Axboe 
Cc: Christian Brauner 
Cc: "Eric W . Biederman" 
Cc: Zhiqiang Liu 
Cc: Tejun Heo 
Cc: 
Link: https://lkml.kernel.org/r/20201019134237.ga18...@redhat.com
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-123525
(cherry picked from commit 7b3c36fc4c231ca532120bbc0df67a12f09c1d96)
Signed-off-by: Andrey Ryabinin 
---
 kernel/signal.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 177cd7f04acb..171f7496f811 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -388,16 +388,17 @@ static bool task_participate_group_stop(struct task_struct *task)
 
 void task_join_group_stop(struct task_struct *task)
 {
+   unsigned long mask = current->jobctl & JOBCTL_STOP_SIGMASK;
+   struct signal_struct *sig = current->signal;
+
+   if (sig->group_stop_count) {
+   sig->group_stop_count++;
+   mask |= JOBCTL_STOP_CONSUME;
+   } else if (!(sig->flags & SIGNAL_STOP_STOPPED))
+   return;
+
/* Have the new thread join an on-going signal group stop */
-   unsigned long jobctl = current->jobctl;
-   if (jobctl & JOBCTL_STOP_PENDING) {
-   struct signal_struct *sig = current->signal;
-   unsigned long signr = jobctl & JOBCTL_STOP_SIGMASK;
-   unsigned long gstop = JOBCTL_STOP_PENDING | JOBCTL_STOP_CONSUME;
-   if (task_set_jobctl_pending(task, signr | gstop)) {
-   sig->group_stop_count++;
-   }
-   }
+   task_set_jobctl_pending(task, mask | JOBCTL_STOP_PENDING);
 }
 
 /*
-- 
2.26.2
