Re: [Devel] [RFC PATCH 2/2] autofs: sent 32-bit sized packet for 32-bit process
On Thu, Aug 31, 2017 at 05:57:11PM +0400, Stanislav Kinsburskiy wrote:
> The structure autofs_v5_packet (except name) is not aligned by 8 bytes, which
> leads to different sizes on 32- and 64-bit architectures.
> Let's form a 32-bit compatible packet when the daemon uses 32-bit addressing.
>
> Signed-off-by: Stanislav Kinsburskiy
> ---
>  fs/autofs4/waitq.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
> index 309ca6b..484cf2e 100644
> --- a/fs/autofs4/waitq.c
> +++ b/fs/autofs4/waitq.c
> @@ -153,12 +153,19 @@ static void autofs4_notify_daemon(struct autofs_sb_info *sbi,
>  {
>      struct autofs_v5_packet *packet = &pkt.v5_pkt.v5_packet;
>      struct user_namespace *user_ns = sbi->pipe->f_cred->user_ns;
> +    size_t name_offset;
>
> -    pktsz = sizeof(*packet);
> +    if (sbi->is32bit)
> +        name_offset = offsetof(struct autofs_v5_packet, len) +
> +                sizeof(packet->len);
> +    else
> +        name_offset = offsetof(struct autofs_v5_packet, name);

This doesn't help at all because the offset of struct autofs_v5_packet.name
does not change.

> +    pktsz = name_offset + sizeof(packet->name);

What changes is pktsz: it's either sizeof(struct autofs_v5_packet) or 4 bytes
less, depending on the architecture.  For example,

#ifdef CONFIG_COMPAT
    if (__alignof__(compat_u64) < __alignof__(u64) && sbi->is32bit)
        pktsz = offsetofend(struct autofs_v5_packet, name);
    else
#endif
        pktsz = sizeof(*packet);

--
ldv

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] mm: Count list_lru_one::nr_items lockless
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.5
--> commit 6fd774dbf6fd05eca8cfa192753bf35dac694368
Author: Kirill Tkhai
Date:   Thu Aug 31 18:25:20 2017 +0300

    mm: Count list_lru_one::nr_items lockless

    During slab reclaim of a memcg, shrink_slab() iterates over all
    registered shrinkers in the system and tries to count and consume
    objects related to the cgroup. Under memory pressure, this behaves
    badly: I observe high system time and much time spent in
    list_lru_count_one() for many processes:

      0,50%  nixstatsagent  [kernel.vmlinux]  [k] _raw_spin_lock               [k] _raw_spin_lock
      0,26%  nixstatsagent  [kernel.vmlinux]  [k] shrink_slab                  [k] shrink_slab
      0,23%  nixstatsagent  [kernel.vmlinux]  [k] super_cache_count            [k] super_cache_count
      0,15%  nixstatsagent  [kernel.vmlinux]  [k] __list_lru_count_one.isra.2  [k] _raw_spin_lock
      0,15%  nixstatsagent  [kernel.vmlinux]  [k] list_lru_count_one           [k] __list_lru_count_one.isra.2

      0,94%  mysqld         [kernel.vmlinux]  [k] _raw_spin_lock               [k] _raw_spin_lock
      0,57%  mysqld         [kernel.vmlinux]  [k] shrink_slab                  [k] shrink_slab
      0,51%  mysqld         [kernel.vmlinux]  [k] super_cache_count            [k] super_cache_count
      0,32%  mysqld         [kernel.vmlinux]  [k] __list_lru_count_one.isra.2  [k] _raw_spin_lock
      0,32%  mysqld         [kernel.vmlinux]  [k] list_lru_count_one           [k] __list_lru_count_one.isra.2

      0,73%  sshd           [kernel.vmlinux]  [k] _raw_spin_lock               [k] _raw_spin_lock
      0,35%  sshd           [kernel.vmlinux]  [k] shrink_slab                  [k] shrink_slab
      0,32%  sshd           [kernel.vmlinux]  [k] super_cache_count            [k] super_cache_count
      0,21%  sshd           [kernel.vmlinux]  [k] __list_lru_count_one.isra.2  [k] _raw_spin_lock
      0,21%  sshd           [kernel.vmlinux]  [k] list_lru_count_one           [k] __list_lru_count_one.isra.2

    This patch aims to make super_cache_count() more effective. It makes
    __list_lru_count_one() count nr_items locklessly, to minimize the
    overhead introduced by the locking operation and to make parallel
    reclaims more scalable.

    The lock won't be taken in shrinker::count_objects(); it will be taken
    only for the real shrink by the thread that performs it.

    https://jira.sw.ru/browse/PSBM-69296

    Signed-off-by: Kirill Tkhai
    Acked-by: Andrey Ryabinin
---
 mm/list_lru.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index b166eff..5adc6621 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -160,10 +160,10 @@ static unsigned long __list_lru_count_one(struct list_lru *lru,
     struct list_lru_one *l;
     unsigned long count;

-    spin_lock(&nlru->lock);
+    rcu_read_lock();
     l = list_lru_from_memcg_idx(nlru, memcg_idx);
     count = l->nr_items;
-    spin_unlock(&nlru->lock);
+    rcu_read_unlock();

     return count;
 }
[Devel] [PATCH RHEL7 COMMIT] mm: Make list_lru_node::memcg_lrus RCU protected
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.5
--> commit 5db3da0bf7112c551ca9ce90b1c0e8a2bcad9ac1
Author: Kirill Tkhai
Date:   Thu Aug 31 18:25:20 2017 +0300

    mm: Make list_lru_node::memcg_lrus RCU protected

    The array list_lru_node::memcg_lrus::list_lru_one[] only grows and
    never shrinks. The growth happens in memcg_update_list_lru_node(),
    and the old array's members remain the same after it. So access to
    the array's members may become RCU protected, making it possible to
    avoid taking list_lru_node::lock to dereference it. This will be
    used in the next patch to get a list's nr_items locklessly.

    Signed-off-by: Kirill Tkhai
    Acked-by: Andrey Ryabinin
---
 include/linux/list_lru.h |  2 +-
 mm/list_lru.c            | 59 ++++++++++++++++++++++++++---------------
 2 files changed, 40 insertions(+), 21 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 7bf4251..00a339b 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -43,7 +43,7 @@ struct list_lru_node {
     struct list_lru_one lru;
 #ifdef CONFIG_MEMCG_KMEM
     /* for cgroup aware lrus points to per cgroup lists, otherwise NULL */
-    struct list_lru_memcg *memcg_lrus;
+    struct list_lru_memcg __rcu *memcg_lrus;
 #endif
 } cacheline_aligned_in_smp;

diff --git a/mm/list_lru.c b/mm/list_lru.c
index cb53462..b166eff 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -42,19 +42,24 @@ static void list_lru_unregister(struct list_lru *lru)
 #ifdef CONFIG_MEMCG_KMEM
 static inline bool list_lru_memcg_aware(struct list_lru *lru)
 {
-    return !!lru->node[0].memcg_lrus;
+    struct list_lru_memcg *memcg_lrus;
+
+    /* Here we only check the pointer is not NULL, so RCU lock isn't need */
+    memcg_lrus = rcu_dereference_check(lru->node[0].memcg_lrus, true);
+    return !!memcg_lrus;
 }

 static inline struct list_lru_one *
 list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
 {
+    struct list_lru_memcg *memcg_lrus;
     /*
-     * The lock protects the array of per cgroup lists from relocation
-     * (see memcg_update_list_lru_node).
+     * Either lock and RCU protects the array of per cgroup lists
+     * from relocation (see memcg_update_list_lru_node).
      */
-    lockdep_assert_held(&nlru->lock);
-    if (nlru->memcg_lrus && idx >= 0)
-        return nlru->memcg_lrus->lru[idx];
+    memcg_lrus = rcu_dereference_check(nlru->memcg_lrus,
+                       lockdep_is_held(&nlru->lock));
+    if (memcg_lrus && idx >= 0)
+        return memcg_lrus->lru[idx];
     return &nlru->lru;
 }

@@ -62,9 +67,12 @@ list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
 static inline struct list_lru_one *
 list_lru_from_kmem(struct list_lru_node *nlru, void *ptr)
 {
+    struct list_lru_memcg *memcg_lrus;
     struct mem_cgroup *memcg;

-    if (!nlru->memcg_lrus)
+    memcg_lrus = rcu_dereference_check(nlru->memcg_lrus,
+                       lockdep_is_held(&nlru->lock));
+    if (!memcg_lrus)
         return &nlru->lru;

     memcg = mem_cgroup_from_kmem(ptr);
@@ -311,25 +319,34 @@ static int __memcg_init_list_lru_node(struct list_lru_memcg *memcg_lrus,

 static int memcg_init_list_lru_node(struct list_lru_node *nlru)
 {
+    struct list_lru_memcg *memcg_lrus;
     int size = memcg_nr_cache_ids;

-    nlru->memcg_lrus = kmalloc(sizeof(struct list_lru_memcg) +
-                   size * sizeof(void *), GFP_KERNEL);
-    if (!nlru->memcg_lrus)
+    memcg_lrus = kmalloc(sizeof(*memcg_lrus) +
+                 size * sizeof(void *), GFP_KERNEL);
+    if (!memcg_lrus)
         return -ENOMEM;

-    if (__memcg_init_list_lru_node(nlru->memcg_lrus, 0, size)) {
-        kfree(nlru->memcg_lrus);
+    if (__memcg_init_list_lru_node(memcg_lrus, 0, size)) {
+        kfree(memcg_lrus);
         return -ENOMEM;
     }
+    rcu_assign_pointer(nlru->memcg_lrus, memcg_lrus);

     return 0;
 }

 static void memcg_destroy_list_lru_node(struct list_lru_node *nlru)
 {
-    __memcg_destroy_list_lru_node(nlru->memcg_lrus, 0, memcg_nr_cache_ids);
-    kfree(nlru->memcg_lrus);
+    struct list_lru_memcg *memcg_lrus;
+
+    /*
+     * This is called when shrinker has already been unregistered,
+     * so nobody can use it.
+     */
+    memcg_lrus = rcu_dereference_check(nlru->memcg_lrus, true);
+    __memcg_destroy_list_lru_node(memcg_lrus, 0, memcg_nr_cache_ids);
+    kfree(memcg_lrus);
 }

 static int memcg_update_list_lru_node(struct list_lru_node *nlru,
@@ -338,8 +355,10 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
     struct list_lr
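The publish side of the pattern above — fully initialize a grown copy of the array, then make it visible in one pointer store so readers see either the old or the new array but never a half-built one — has a close userspace analog. This is a hypothetical sketch, with `rcu_assign_pointer`/`rcu_dereference` approximated by C11 release/acquire atomics (real RCU also defers freeing the old array past a grace period, which the sketch simply leaks):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Grow-only array of per-cgroup list pointers, like list_lru_memcg::lru[].
 * Old elements keep their slots after a grow, so a reader holding an index
 * stays valid across republications. Names are illustrative. */
struct arr {
    size_t size;
    int *slot[];                /* flexible array member */
};

static _Atomic(struct arr *) published;

static void publish_grown(size_t new_size)
{
    struct arr *old = atomic_load_explicit(&published, memory_order_acquire);
    struct arr *new = malloc(sizeof(*new) + new_size * sizeof(int *));

    new->size = new_size;
    memset(new->slot, 0, new_size * sizeof(int *));
    if (old)
        memcpy(new->slot, old->slot, old->size * sizeof(int *));
    /* Initialize fully, then publish (rcu_assign_pointer analog). */
    atomic_store_explicit(&published, new, memory_order_release);
    /* Real code would kfree_rcu(old) after a grace period; leaked here. */
}

static void set_slot(size_t idx, int *p)
{
    struct arr *a = atomic_load_explicit(&published, memory_order_acquire);
    a->slot[idx] = p;           /* writer-side update; real code holds a lock */
}

static int *read_slot(size_t idx)  /* rcu_dereference analog */
{
    struct arr *a = atomic_load_explicit(&published, memory_order_acquire);
    return idx < a->size ? a->slot[idx] : NULL;
}
```

The release store pairs with the acquire load exactly as `rcu_assign_pointer()` pairs with `rcu_dereference()`: a reader that sees the new pointer is guaranteed to see the copied slot contents too.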
[Devel] [PATCH RHEL7 COMMIT] mm: Add rcu field to struct list_lru_memcg
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.5
--> commit b3b3ea1125f07f57ea0f95b29ad368934cc7bb53
Author: Kirill Tkhai
Date:   Thu Aug 31 18:25:19 2017 +0300

    mm: Add rcu field to struct list_lru_memcg

    Patchset description: Make count list_lru_one::nr_items lockless

    This series aims to improve the scalability of list_lru shrinking and
    to make list_lru_count_one() work more effectively.

    Kirill Tkhai (3):
      mm: Add rcu field to struct list_lru_memcg
      mm: Make list_lru_node::memcg_lrus RCU protected
      mm: Count list_lru_one::nr_items lockless

    https://jira.sw.ru/browse/PSBM-69296

    =
    This patch description:

    This patch adds the new field and teaches kmalloc() to allocate
    memory for it.

    Signed-off-by: Kirill Tkhai
    Acked-by: Andrey Ryabinin
---
 include/linux/list_lru.h | 1 +
 mm/list_lru.c            | 7 ---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 2a6b994..7bf4251 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -31,6 +31,7 @@ struct list_lru_one {
 };

 struct list_lru_memcg {
+    struct rcu_head rcu;
     /* array of per cgroup lists, indexed by memcg_cache_id */
     struct list_lru_one *lru[0];
 };

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 84b4c21..cb53462 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -313,7 +313,8 @@ static int memcg_init_list_lru_node(struct list_lru_node *nlru)
 {
     int size = memcg_nr_cache_ids;

-    nlru->memcg_lrus = kmalloc(size * sizeof(void *), GFP_KERNEL);
+    nlru->memcg_lrus = kmalloc(sizeof(struct list_lru_memcg) +
+                   size * sizeof(void *), GFP_KERNEL);
     if (!nlru->memcg_lrus)
         return -ENOMEM;

@@ -339,7 +340,7 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
     BUG_ON(old_size > new_size);

     old = nlru->memcg_lrus;
-    new = kmalloc(new_size * sizeof(void *), GFP_KERNEL);
+    new = kmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
     if (!new)
         return -ENOMEM;

@@ -348,7 +349,7 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
         return -ENOMEM;
     }

-    memcpy(new, old, old_size * sizeof(void *));
+    memcpy(&new->lru, &old->lru, old_size * sizeof(void *));

     /*
      * The lock guarantees that we won't race with a reader
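The two hunks above encode one invariant: once a header field precedes a flexible array, both the allocation size and the copy of old entries must account for the header. A minimal userspace sketch (names loosely mirror the patch; `fake_rcu_head` is a stand-in for `struct rcu_head`):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct fake_rcu_head { void *next; void (*func)(void *); };

struct list_lru_memcg {
    struct fake_rcu_head rcu;   /* new header field from the patch */
    void *lru[];                /* per-cgroup lists, flexible array */
};

static struct list_lru_memcg *alloc_lrus(size_t n)
{
    /* sizeof(*p) covers the header; n * sizeof(void *) covers the array. */
    return calloc(1, sizeof(struct list_lru_memcg) + n * sizeof(void *));
}

static struct list_lru_memcg *grow_lrus(struct list_lru_memcg *old,
                                        size_t old_n, size_t new_n)
{
    struct list_lru_memcg *new = alloc_lrus(new_n);

    /* Copy new->lru, not 'new': copying from the struct base would
     * clobber the rcu header with array entries -- the mistake the
     * memcpy(&new->lru, &old->lru, ...) hunk avoids. */
    memcpy(new->lru, old->lru, old_n * sizeof(void *));
    return new;
}
```

Without the `sizeof(*new)` term, writes to the last array slots would run past the allocation; without the `&new->lru` base, the copy would overwrite the `rcu` header.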
[Devel] [PATCH RHEL7 COMMIT] tcache: Cleanup unused expression from tcache_lru_isolate()
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.5
--> commit 389f1b056f987726601af0399791b18f107436c5
Author: Kirill Tkhai
Date:   Thu Aug 31 18:18:18 2017 +0300

    tcache: Cleanup unused expression from tcache_lru_isolate()

    Nobody uses nr_to_isolate after this point. It seems to be a
    historical leftover.

    Signed-off-by: Kirill Tkhai
    Acked-by: Andrey Ryabinin
---
 mm/tcache.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/tcache.c b/mm/tcache.c
index ab70af2..0e57ae6 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -1049,7 +1049,6 @@ tcache_lru_isolate(int nid, struct page **pages, int nr_to_isolate)
     nr = __tcache_lru_isolate(pni, pages, nr_to_isolate);
     ni->nr_pages -= nr;
     nr_isolated += nr;
-    nr_to_isolate -= nr;

     if (!list_empty(&pni->lru))
         __tcache_insert_reclaim_node(ni, pni);
[Devel] [PATCH RHEL7 COMMIT] tcache: Make tcache_lru_isolate() keep ni->lock less
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.5
--> commit 05e159ab7e5981fc76950e8e999d5f855d9313f7
Author: Kirill Tkhai
Date:   Thu Aug 31 18:18:20 2017 +0300

    tcache: Make tcache_lru_isolate() keep ni->lock less

    Grab the pool using RCU techniques, and do not use ni->lock for that.
    This refactors the function and will be used further on.

    v2: Use tcache_nodeinfo::rb_first

    Signed-off-by: Kirill Tkhai
    Acked-by: Andrey Ryabinin
---
 mm/tcache.c | 40 ++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 12 deletions(-)

diff --git a/mm/tcache.c b/mm/tcache.c
index 40608ec..3d9c5ac 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -1044,33 +1044,49 @@ tcache_lru_isolate(int nid, struct page **pages, int nr_to_isolate)
     int nr_isolated = 0;
     struct rb_node *rbn;

-    spin_lock_irq(&ni->lock);
+    rcu_read_lock();
 again:
-    rbn = rb_first(&ni->reclaim_tree);
-    if (!rbn)
+    rbn = rcu_dereference(ni->rb_first);
+    if (!rbn) {
+        rcu_read_unlock();
         goto out;
-
-    rb_erase(rbn, &ni->reclaim_tree);
-    RB_CLEAR_NODE(rbn);
-    update_ni_rb_first(ni);
+    }

     pni = rb_entry(rbn, struct tcache_pool_nodeinfo, reclaim_node);
-    if (!tcache_grab_pool(pni->pool))
+    if (!tcache_grab_pool(pni->pool)) {
+        spin_lock_irq(&ni->lock);
+        if (!RB_EMPTY_NODE(rbn) && list_empty(&pni->lru)) {
+            rb_erase(rbn, &ni->reclaim_tree);
+            RB_CLEAR_NODE(rbn);
+            update_ni_rb_first(ni);
+        }
+        spin_unlock_irq(&ni->lock);
         goto again;
+    }
+    rcu_read_unlock();

+    spin_lock_irq(&ni->lock);
     spin_lock(&pni->lock);
     nr_isolated = __tcache_lru_isolate(pni, pages, nr_to_isolate);
+
+    if (!nr_isolated)
+        goto unlock;
+
     ni->nr_pages -= nr_isolated;

-    if (!list_empty(&pni->lru)) {
-        __tcache_insert_reclaim_node(ni, pni);
-        update_ni_rb_first(ni);
+    if (!RB_EMPTY_NODE(rbn)) {
+        rb_erase(rbn, &ni->reclaim_tree);
+        RB_CLEAR_NODE(rbn);
     }

+    if (!list_empty(&pni->lru))
+        __tcache_insert_reclaim_node(ni, pni);
+    update_ni_rb_first(ni);
+unlock:
     spin_unlock(&pni->lock);
+    spin_unlock_irq(&ni->lock);
     tcache_put_pool(pni->pool);
 out:
-    spin_unlock_irq(&ni->lock);
     return nr_isolated;
 }
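The `tcache_grab_pool()` call above is a speculative "get-unless-dead" reference grab: under RCU the pool structure is guaranteed to still exist, but it may already be on its way to destruction, so the grab succeeds only if the refcount is still nonzero and the caller retries with the next candidate otherwise. A hedged userspace sketch of that primitive (names illustrative, equivalent in spirit to the kernel's `atomic_inc_not_zero()`):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative pool object: refcnt == 0 means it is being destroyed and
 * must never be resurrected by a racing reader. */
struct pool {
    atomic_int refcnt;
};

static bool pool_grab(struct pool *p)
{
    int c = atomic_load_explicit(&p->refcnt, memory_order_relaxed);

    /* Take a reference only if the object is still live. */
    while (c != 0) {
        if (atomic_compare_exchange_weak_explicit(&p->refcnt, &c, c + 1,
                memory_order_acquire, memory_order_relaxed))
            return true;        /* c was live; reference taken */
    }
    return false;               /* dead: caller retries with another pool */
}

static void pool_put(struct pool *p)
{
    atomic_fetch_sub_explicit(&p->refcnt, 1, memory_order_release);
}
```

This is why the `goto again` loop is safe: a failed grab means the pool is dying, its tree node gets erased under `ni->lock` if still linked, and the next `rb_first` candidate is tried.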
[Devel] [PATCH RHEL7 COMMIT] tcache: Use ni->lock only for inserting and erasing from rbtree.
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.5
--> commit 342e800a8b114e74c372374268812ce2612a26aa
Author: Kirill Tkhai
Date:   Thu Aug 31 18:18:22 2017 +0300

    tcache: Use ni->lock only for inserting and erasing from rbtree.

    This patch completes the splitting of ni->lock into ni->lock and
    pni->lock. Now the global ni->lock is used only for insertion into
    tcache_nodeinfo::reclaim_tree, which happens just once per ~1024 page
    insertions or erasures. For the other LRU operations pni->lock is
    used; it is per-filesystem (i.e., per-container) and does not affect
    other containers.

    Also, the lock order is changed to:

        spin_lock(&pni->lock);
        spin_lock(&ni->lock);

    v3: Disable irqs in tcache_lru_isolate().

    Signed-off-by: Kirill Tkhai
    Acked-by: Andrey Ryabinin
---
 mm/tcache.c | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/mm/tcache.c b/mm/tcache.c
index 202834c..5faa390 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -261,7 +261,6 @@ static void tcache_lru_add(struct tcache_pool *pool, struct page *page)
     struct tcache_nodeinfo *ni = &tcache_nodeinfo[nid];
     struct tcache_pool_nodeinfo *pni = &pool->nodeinfo[nid];

-    spin_lock(&ni->lock);
     spin_lock(&pni->lock);
     atomic_long_inc(&ni->nr_pages);
     pni->nr_pages++;
@@ -274,13 +273,14 @@ static void tcache_lru_add(struct tcache_pool *pool, struct page *page)
     }

     if (tcache_check_events(pni) || RB_EMPTY_NODE(&pni->reclaim_node)) {
+        spin_lock(&ni->lock);
         if (!RB_EMPTY_NODE(&pni->reclaim_node))
             rb_erase(&pni->reclaim_node, &ni->reclaim_tree);
         __tcache_insert_reclaim_node(ni, pni);
         update_ni_rb_first(ni);
+        spin_unlock(&ni->lock);
     }
     spin_unlock(&pni->lock);
-    spin_unlock(&ni->lock);
 }

 static void __tcache_lru_del(struct tcache_pool_nodeinfo *pni,
@@ -301,7 +301,6 @@ static void tcache_lru_del(struct tcache_pool *pool, struct page *page,
     struct tcache_nodeinfo *ni = &tcache_nodeinfo[nid];
     struct tcache_pool_nodeinfo *pni = &pool->nodeinfo[nid];

-    spin_lock(&ni->lock);
     spin_lock(&pni->lock);

     /* Raced with reclaimer? */
@@ -315,14 +314,15 @@ static void tcache_lru_del(struct tcache_pool *pool, struct page *page,
     pni->recent_gets++;

     if (tcache_check_events(pni)) {
+        spin_lock(&ni->lock);
         if (!RB_EMPTY_NODE(&pni->reclaim_node))
             rb_erase(&pni->reclaim_node, &ni->reclaim_tree);
         __tcache_insert_reclaim_node(ni, pni);
         update_ni_rb_first(ni);
+        spin_unlock(&ni->lock);
     }
 out:
     spin_unlock(&pni->lock);
-    spin_unlock(&ni->lock);
 }

 static int tcache_create_pool(void)
@@ -1065,8 +1065,7 @@ tcache_lru_isolate(int nid, struct page **pages, int nr_to_isolate)
     }
     rcu_read_unlock();

-    spin_lock_irq(&ni->lock);
-    spin_lock(&pni->lock);
+    spin_lock_irq(&pni->lock);
     nr_isolated = __tcache_lru_isolate(pni, pages, nr_to_isolate);

     if (!nr_isolated)
@@ -1074,17 +1073,19 @@ tcache_lru_isolate(int nid, struct page **pages, int nr_to_isolate)

     atomic_long_sub(nr_isolated, &ni->nr_pages);

-    if (!RB_EMPTY_NODE(rbn)) {
-        rb_erase(rbn, &ni->reclaim_tree);
-        RB_CLEAR_NODE(rbn);
+    if (!RB_EMPTY_NODE(rbn) || !list_empty(&pni->lru)) {
+        spin_lock(&ni->lock);
+        if (!RB_EMPTY_NODE(rbn))
+            rb_erase(rbn, &ni->reclaim_tree);
+        if (!list_empty(&pni->lru))
+            __tcache_insert_reclaim_node(ni, pni);
+        else
+            RB_CLEAR_NODE(rbn);
+        update_ni_rb_first(ni);
+        spin_unlock(&ni->lock);
     }

-    if (!list_empty(&pni->lru))
-        __tcache_insert_reclaim_node(ni, pni);
-    update_ni_rb_first(ni);
-
 unlock:
-    spin_unlock(&pni->lock);
-    spin_unlock_irq(&ni->lock);
+    spin_unlock_irq(&pni->lock);
     tcache_put_pool(pni->pool);
 out:
     return nr_isolated;
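The structure of this change — take the fine-grained per-container lock on every operation, and nest the global lock inside it only for the rare tree rebalance — can be sketched in userspace. This is an illustrative analog (names and the 1024 threshold mirror the commit message, everything else is hypothetical); the key point is that every path uses the same lock order, which is what keeps the two-lock scheme deadlock-free.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t ni_lock  = PTHREAD_MUTEX_INITIALIZER; /* global, per-node  */
static pthread_mutex_t pni_lock = PTHREAD_MUTEX_INITIALIZER; /* per-container     */
static long pni_nr_pages;       /* protected by pni_lock */
static long tree_updates;       /* protected by ni_lock  */

static void lru_add_page(void)
{
    pthread_mutex_lock(&pni_lock);          /* fine-grained lock first */
    pni_nr_pages++;
    if (pni_nr_pages % 1024 == 0) {         /* rare: rebalance the global tree */
        pthread_mutex_lock(&ni_lock);       /* global lock nested inside */
        tree_updates++;                     /* stands in for rb_erase/insert */
        pthread_mutex_unlock(&ni_lock);
    }
    pthread_mutex_unlock(&pni_lock);
}
```

With the old order (global first), every page add in every container serialized on `ni_lock`; with this order the global lock is touched roughly once per 1024 operations, so containers mostly contend only on their own lock.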
[Devel] [PATCH RHEL7 COMMIT] tcache: Remove excess variable from tcache_lru_isolate()
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.5
--> commit e6c8082f25609c977202364e85e72f6c2442d4b5
Author: Kirill Tkhai
Date:   Thu Aug 31 18:18:19 2017 +0300

    tcache: Remove excess variable from tcache_lru_isolate()

    We have two variables (nr and nr_isolated) which hold the same value.
    Kill one of them.

    v2: new

    Signed-off-by: Kirill Tkhai
    Acked-by: Andrey Ryabinin
---
 mm/tcache.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/tcache.c b/mm/tcache.c
index 0e57ae6..0f15e8e 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -1029,7 +1029,7 @@ tcache_lru_isolate(int nid, struct page **pages, int nr_to_isolate)
 {
     struct tcache_nodeinfo *ni = &tcache_nodeinfo[nid];
     struct tcache_pool_nodeinfo *pni;
-    int nr, nr_isolated = 0;
+    int nr_isolated = 0;
     struct rb_node *rbn;

     spin_lock_irq(&ni->lock);
@@ -1046,9 +1046,8 @@ tcache_lru_isolate(int nid, struct page **pages, int nr_to_isolate)
         goto again;

     spin_lock(&pni->lock);
-    nr = __tcache_lru_isolate(pni, pages, nr_to_isolate);
-    ni->nr_pages -= nr;
-    nr_isolated += nr;
+    nr_isolated = __tcache_lru_isolate(pni, pages, nr_to_isolate);
+    ni->nr_pages -= nr_isolated;

     if (!list_empty(&pni->lru))
         __tcache_insert_reclaim_node(ni, pni);
[Devel] [PATCH RHEL7 COMMIT] tcache: Cache rb_first() of reclaim tree in tcache_nodeinfo::rb_first
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.5
--> commit 5a95787003bdb2cbd00fa9111a3ef67aec05468c
Author: Kirill Tkhai
Date:   Thu Aug 31 18:18:20 2017 +0300

    tcache: Cache rb_first() of reclaim tree in tcache_nodeinfo::rb_first

    Set rb_first via RCU and, thus, allow lockless access to it.

    v3: Move update_ni_rb_first() from patch "tcache: Move erase-insert
        logic out of tcache_check_events()".
    v2: New

    Signed-off-by: Kirill Tkhai
    Acked-by: Andrey Ryabinin
---
 mm/tcache.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/mm/tcache.c b/mm/tcache.c
index 0f15e8e..40608ec 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -157,6 +157,7 @@ struct tcache_nodeinfo {

     /* tree of pools, sorted by reclaim prio */
     struct rb_root reclaim_tree;
+    struct rb_node __rcu *rb_first;

     /* total number of pages on all LRU lists corresponding to this node */
     unsigned long nr_pages;
@@ -205,6 +206,13 @@ node_tree_from_key(struct tcache_pool *pool,
     return &pool->node_tree[key_hash(key) & (num_node_trees - 1)];
 }

+static struct rb_node *update_ni_rb_first(struct tcache_nodeinfo *ni)
+{
+    struct rb_node *first = rb_first(&ni->reclaim_tree);
+    rcu_assign_pointer(ni->rb_first, first);
+    return first;
+}
+
 static void __tcache_insert_reclaim_node(struct tcache_nodeinfo *ni,
                      struct tcache_pool_nodeinfo *pni);

@@ -242,6 +250,7 @@ static inline void __tcache_check_events(struct tcache_nodeinfo *ni,
         rb_erase(&pni->reclaim_node, &ni->reclaim_tree);

     __tcache_insert_reclaim_node(ni, pni);
+    update_ni_rb_first(ni);
 }

 /*
@@ -270,8 +279,10 @@ static void tcache_lru_add(struct tcache_pool *pool, struct page *page)

     __tcache_check_events(ni, pni);

-    if (unlikely(RB_EMPTY_NODE(&pni->reclaim_node)))
+    if (unlikely(RB_EMPTY_NODE(&pni->reclaim_node))) {
         __tcache_insert_reclaim_node(ni, pni);
+        update_ni_rb_first(ni);
+    }

     spin_unlock(&pni->lock);
     spin_unlock(&ni->lock);
@@ -934,6 +945,7 @@ tcache_remove_from_reclaim_trees(struct tcache_pool *pool)
     spin_lock_irq(&ni->lock);
     if (!RB_EMPTY_NODE(&pni->reclaim_node)) {
         rb_erase(&pni->reclaim_node, &ni->reclaim_tree);
+        update_ni_rb_first(ni);
         /*
          * Clear the node for __tcache_check_events() not to
          * reinsert the pool back into the tree.
@@ -1040,6 +1052,7 @@ tcache_lru_isolate(int nid, struct page **pages, int nr_to_isolate)

     rb_erase(rbn, &ni->reclaim_tree);
     RB_CLEAR_NODE(rbn);
+    update_ni_rb_first(ni);

     pni = rb_entry(rbn, struct tcache_pool_nodeinfo, reclaim_node);
     if (!tcache_grab_pool(pni->pool))
@@ -1049,8 +1062,10 @@ tcache_lru_isolate(int nid, struct page **pages, int nr_to_isolate)
     nr_isolated = __tcache_lru_isolate(pni, pages, nr_to_isolate);
     ni->nr_pages -= nr_isolated;

-    if (!list_empty(&pni->lru))
+    if (!list_empty(&pni->lru)) {
         __tcache_insert_reclaim_node(ni, pni);
+        update_ni_rb_first(ni);
+    }

     spin_unlock(&pni->lock);
     tcache_put_pool(pni->pool);
@@ -1349,6 +1364,7 @@ static int __init tcache_nodeinfo_init(void)
         ni = &tcache_nodeinfo[i];
         spin_lock_init(&ni->lock);
         ni->reclaim_tree = RB_ROOT;
+        update_ni_rb_first(ni);
     }
     return 0;
 }
[Devel] [PATCH RHEL7 COMMIT] tcache: Add tcache_pool_nodeinfo::lock
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.5
--> commit 13afaf53ede5cb733a5dba3319bcffea95fe9f48
Author: Kirill Tkhai
Date:   Thu Aug 31 18:18:18 2017 +0300

    tcache: Add tcache_pool_nodeinfo::lock

    Currently, tcache_nodeinfo::lock is used to protect all LRU lists.
    There is only one such lock per NUMA node, shared by all containers,
    and it is taken whenever any container adds a page to an LRU list.
    This makes it a "big tcache lock" which does not scale well.

    The patch introduces a new lock protecting the fields of struct
    tcache_pool_nodeinfo, in particular its LRU list. The LRU lists of
    different filesystems (i.e., containers) are independent of each
    other, so separate locks allow better scaling.

    This patch only introduces the lock; the lock order at the moment is:

        tcache_nodeinfo::lock -> tcache_pool_nodeinfo::lock

    The next patches will gradually allow changing it vice versa.

    Note that updates of tcache_pool_nodeinfo::nr_pages and
    tcache_nodeinfo::nr_pages now happen under different locks.

    v3: Add spin_lock_init() for lockdep

    Signed-off-by: Kirill Tkhai
    Acked-by: Andrey Ryabinin
---
 mm/tcache.c | 17 ++++++++++++-----
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/tcache.c b/mm/tcache.c
index 9f296dc..ab70af2 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -66,6 +66,7 @@ struct tcache_pool_nodeinfo {
     /* increased on every LRU add/del, reset once it gets big enough;
      * used for rate limiting rebalancing of reclaim_tree */
     unsigned long events;
+    spinlock_t lock;
 } cacheline_aligned_in_smp;

 /*
@@ -255,6 +256,7 @@ static void tcache_lru_add(struct tcache_pool *pool, struct page *page)
     struct tcache_pool_nodeinfo *pni = &pool->nodeinfo[nid];

     spin_lock(&ni->lock);
+    spin_lock(&pni->lock);
     ni->nr_pages++;
     pni->nr_pages++;

@@ -271,6 +273,7 @@ static void tcache_lru_add(struct tcache_pool *pool, struct page *page)
     if (unlikely(RB_EMPTY_NODE(&pni->reclaim_node)))
         __tcache_insert_reclaim_node(ni, pni);

+    spin_unlock(&pni->lock);
     spin_unlock(&ni->lock);
 }

@@ -293,6 +296,7 @@ static void tcache_lru_del(struct tcache_pool *pool, struct page *page,
     struct tcache_pool_nodeinfo *pni = &pool->nodeinfo[nid];

     spin_lock(&ni->lock);
+    spin_lock(&pni->lock);

     /* Raced with reclaimer? */
     if (unlikely(list_empty(&page->lru)))
@@ -306,6 +310,7 @@ static void tcache_lru_del(struct tcache_pool *pool, struct page *page,
     __tcache_check_events(ni, pni);

 out:
+    spin_unlock(&pni->lock);
     spin_unlock(&ni->lock);
 }

@@ -342,6 +347,7 @@ static int tcache_create_pool(void)
         pni->pool = pool;
         RB_CLEAR_NODE(&pni->reclaim_node);
         INIT_LIST_HEAD(&pni->lru);
+        spin_lock_init(&pni->lock);
     }

     idr_preload(GFP_KERNEL);
@@ -1039,6 +1045,7 @@ tcache_lru_isolate(int nid, struct page **pages, int nr_to_isolate)
     if (!tcache_grab_pool(pni->pool))
         goto again;

+    spin_lock(&pni->lock);
     nr = __tcache_lru_isolate(pni, pages, nr_to_isolate);
     ni->nr_pages -= nr;
     nr_isolated += nr;
@@ -1047,6 +1054,7 @@ tcache_lru_isolate(int nid, struct page **pages, int nr_to_isolate)
     if (!list_empty(&pni->lru))
         __tcache_insert_reclaim_node(ni, pni);

+    spin_unlock(&pni->lock);
     tcache_put_pool(pni->pool);
 out:
     spin_unlock_irq(&ni->lock);
@@ -1091,14 +1099,17 @@ tcache_try_to_reclaim_page(struct tcache_pool *pool, int nid)

     local_irq_save(flags);

-    spin_lock(&ni->lock);
+    spin_lock(&pni->lock);
     ret = __tcache_lru_isolate(pni, &page, 1);
-    ni->nr_pages -= ret;
-    spin_unlock(&ni->lock);
+    spin_unlock(&pni->lock);

     if (!ret)
         goto out;

+    spin_lock(&ni->lock);
+    ni->nr_pages -= ret;
+    spin_unlock(&ni->lock);
+
     if (!__tcache_reclaim_page(page))
         page = NULL;
     else

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] tcache: Move add/sub out of pni->lock
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.5
--> commit 673358de1fce85596dcd17e1bde8b7a9639fcc1c
Author: Kirill Tkhai
Date:   Thu Aug 31 18:18:22 2017 +0300

    tcache: Move add/sub out of pni->lock

    This minimizes the number of operations happening under pni->lock.
    Note that we do the add before linking to the list, so a parallel
    shrink cannot make nr_pages negative.

    Signed-off-by: Kirill Tkhai
    Acked-by: Andrey Ryabinin
---
 mm/tcache.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/tcache.c b/mm/tcache.c
index 5faa390..d1a2c53 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -261,8 +261,9 @@ static void tcache_lru_add(struct tcache_pool *pool, struct page *page)
     struct tcache_nodeinfo *ni = &tcache_nodeinfo[nid];
     struct tcache_pool_nodeinfo *pni = &pool->nodeinfo[nid];

-    spin_lock(&pni->lock);
     atomic_long_inc(&ni->nr_pages);
+
+    spin_lock(&pni->lock);
     pni->nr_pages++;

     list_add_tail(&page->lru, &pni->lru);
@@ -300,6 +301,7 @@ static void tcache_lru_del(struct tcache_pool *pool, struct page *page,
     int nid = page_to_nid(page);
     struct tcache_nodeinfo *ni = &tcache_nodeinfo[nid];
     struct tcache_pool_nodeinfo *pni = &pool->nodeinfo[nid];
+    bool deleted = false;

     spin_lock(&pni->lock);

@@ -308,7 +310,7 @@ static void tcache_lru_del(struct tcache_pool *pool, struct page *page,
         goto out;

     __tcache_lru_del(pni, page);
-    atomic_long_dec(&ni->nr_pages);
+    deleted = true;

     if (reused)
         pni->recent_gets++;
@@ -323,6 +325,8 @@ static void tcache_lru_del(struct tcache_pool *pool, struct page *page,
     }
 out:
     spin_unlock(&pni->lock);
+    if (deleted)
+        atomic_long_dec(&ni->nr_pages);
 }

 static int tcache_create_pool(void)
@@ -1071,8 +1075,6 @@ tcache_lru_isolate(int nid, struct page **pages, int nr_to_isolate)
     if (!nr_isolated)
         goto unlock;

-    atomic_long_sub(nr_isolated, &ni->nr_pages);
-
     if (!RB_EMPTY_NODE(rbn)) {
@@ -1088,6 +1090,8 @@ tcache_lru_isolate(int nid, struct page **pages, int nr_to_isolate)
     spin_unlock_irq(&pni->lock);
     tcache_put_pool(pni->pool);
 out:
+    if (nr_isolated)
+        atomic_long_sub(nr_isolated, &ni->nr_pages);
     return nr_isolated;
 }
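The invariant called out in the commit message — increment the global counter *before* the page becomes visible on the list, decrement only *after* it is taken off — is what lets the counter move outside the lock without ever going negative. A hedged userspace sketch (all names illustrative; a plain `long` guarded by a mutex stands in for the LRU list):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_long ni_nr_pages;                         /* global, lock-free */
static pthread_mutex_t pni_lock = PTHREAD_MUTEX_INITIALIZER;
static long pni_list_len;                               /* stands in for the LRU list */

static void lru_add(void)
{
    atomic_fetch_add(&ni_nr_pages, 1);  /* add BEFORE the page is visible */
    pthread_mutex_lock(&pni_lock);
    pni_list_len++;                     /* "link into the list" */
    pthread_mutex_unlock(&pni_lock);
}

static long lru_isolate(long want)
{
    long got;

    pthread_mutex_lock(&pni_lock);
    got = want < pni_list_len ? want : pni_list_len;
    pni_list_len -= got;                /* "unlink from the list" */
    pthread_mutex_unlock(&pni_lock);
    if (got)
        atomic_fetch_sub(&ni_nr_pages, got); /* sub AFTER unlinking */
    return got;
}
```

A shrinker can only isolate pages that are already linked, and every linked page has already been counted, so at any instant the atomic counter is greater than or equal to the number of linked pages and never dips below zero.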
[Devel] [PATCH RHEL7 COMMIT] tcache: Decrement removed from LRU pages out of __tcache_lru_del()
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.5
--> commit eb34be224b7ca575751cc6f9752a7f8171c5c4f7
Author: Kirill Tkhai
Date:   Thu Aug 31 18:18:17 2017 +0300

    tcache: Decrement removed from LRU pages out of __tcache_lru_del()

    Patchset description: tcache: Manage LRU lists under per-filesystem lock

    Changes to v2:
      - Disable irqs in tcache_lru_isolate() [9/10]
      - Move update_ni_rb_first() to "tcache: Cache rb_first() of reclaim
        tree in tcache_nodeinfo::rb_first"
      - Add spin_lock_init() for lockdep [2/10]

    Kirill Tkhai (10):
      tcache: Decrement removed from LRU pages out of __tcache_lru_del()
      tcache: Add tcache_pool_nodeinfo::lock
      tcache: Cleanup unused expression from tcache_lru_isolate()
      tcache: Remove excess variable from tcache_lru_isolate()
      tcache: Cache rb_first() of reclaim tree in tcache_nodeinfo::rb_first
      tcache: Make tcache_lru_isolate() keep ni->lock less
      tcache: Move erase-insert logic out of tcache_check_events()
      tcache: Make tcache_nodeinfo::nr_pages atomic_long_t
      tcache: Use ni->lock only for inserting and erasing from rbtree.
      tcache: Move add/sub out of pni->lock

    https://jira.sw.ru/browse/PSBM-69296

    This patchset decreases the CPU usage on writing big files in
    Containers.

    ==
    This patch description:

    Move the subtraction out of __tcache_lru_del(); this will be used in
    the next patches. Also, delete the ni argument of the function.

    Signed-off-by: Kirill Tkhai
    Acked-by: Andrey Ryabinin
---
 mm/tcache.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/tcache.c b/mm/tcache.c
index 0bfbb69..9f296dc 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -274,11 +274,9 @@ static void tcache_lru_add(struct tcache_pool *pool, struct page *page)
     spin_unlock(&ni->lock);
 }

-static void __tcache_lru_del(struct tcache_nodeinfo *ni,
-                 struct tcache_pool_nodeinfo *pni,
+static void __tcache_lru_del(struct tcache_pool_nodeinfo *pni,
                  struct page *page)
 {
-    ni->nr_pages--;
     pni->nr_pages--;
     list_del_init(&page->lru);
 }
@@ -300,7 +298,8 @@ static void tcache_lru_del(struct tcache_pool *pool, struct page *page,
     if (unlikely(list_empty(&page->lru)))
         goto out;

-    __tcache_lru_del(ni, pni, page);
+    __tcache_lru_del(pni, page);
+    ni->nr_pages--;

     if (reused)
         pni->recent_gets++;
@@ -988,8 +987,7 @@ __tcache_insert_reclaim_node(struct tcache_nodeinfo *ni,
 }

 static noinline_for_stack int
-__tcache_lru_isolate(struct tcache_nodeinfo *ni,
-             struct tcache_pool_nodeinfo *pni,
+__tcache_lru_isolate(struct tcache_pool_nodeinfo *pni,
              struct page **pages, int nr_to_scan)
 {
     struct tcache_node *node;
@@ -1002,7 +1000,7 @@ __tcache_lru_isolate(struct tcache_nodeinfo *ni,
         if (unlikely(!page_cache_get_speculative(page)))
             continue;

-        __tcache_lru_del(ni, pni, page);
+        __tcache_lru_del(pni, page);

         /*
          * A node can be destroyed only if all its pages have been
@@ -1041,7 +1039,8 @@ tcache_lru_isolate(int nid, struct page **pages, int nr_to_isolate)
     if (!tcache_grab_pool(pni->pool))
         goto again;

-    nr = __tcache_lru_isolate(ni, pni, pages, nr_to_isolate);
+    nr = __tcache_lru_isolate(pni, pages, nr_to_isolate);
+    ni->nr_pages -= nr;
     nr_isolated += nr;
     nr_to_isolate -= nr;

@@ -1093,7 +1092,8 @@ tcache_try_to_reclaim_page(struct tcache_pool *pool, int nid)
     local_irq_save(flags);

     spin_lock(&ni->lock);
-    ret = __tcache_lru_isolate(ni, pni, &page, 1);
+    ret = __tcache_lru_isolate(pni, &page, 1);
+    ni->nr_pages -= ret;
     spin_unlock(&ni->lock);

     if (!ret)

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] tcache: Make tcache_nodeinfo::nr_pages atomic_long_t
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 89f8a885e1deeff230554cf1c4dcd323fcbaa9ea Author: Kirill Tkhai Date: Thu Aug 31 18:18:21 2017 +0300 tcache: Make tcache_nodeinfo::nr_pages atomic_long_t This allows nr_pages to be changed without taking tcache_nodeinfo::lock. Signed-off-by: Kirill Tkhai Acked-by: Andrey Ryabinin --- mm/tcache.c | 21 - 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/mm/tcache.c b/mm/tcache.c index 6962097..202834c 100644 --- a/mm/tcache.c +++ b/mm/tcache.c @@ -160,7 +160,7 @@ struct tcache_nodeinfo { struct rb_node __rcu *rb_first; /* total number of pages on all LRU lists corresponding to this node */ - unsigned long nr_pages; + atomic_long_t nr_pages; } cacheline_aligned_in_smp; /* @@ -263,8 +263,7 @@ static void tcache_lru_add(struct tcache_pool *pool, struct page *page) spin_lock(&ni->lock); spin_lock(&pni->lock); - - ni->nr_pages++; + atomic_long_inc(&ni->nr_pages); pni->nr_pages++; list_add_tail(&page->lru, &pni->lru); @@ -310,7 +309,7 @@ static void tcache_lru_del(struct tcache_pool *pool, struct page *page, goto out; __tcache_lru_del(pni, page); - ni->nr_pages--; + atomic_long_dec(&ni->nr_pages); if (reused) pni->recent_gets++; @@ -1073,7 +1072,7 @@ tcache_lru_isolate(int nid, struct page **pages, int nr_to_isolate) if (!nr_isolated) goto unlock; - ni->nr_pages -= nr_isolated; + atomic_long_sub(nr_isolated, &ni->nr_pages); if (!RB_EMPTY_NODE(rbn)) { rb_erase(rbn, &ni->reclaim_tree); @@ -1136,9 +1135,7 @@ tcache_try_to_reclaim_page(struct tcache_pool *pool, int nid) if (!ret) goto out; - spin_lock(&ni->lock); - ni->nr_pages -= ret; - spin_unlock(&ni->lock); + atomic_long_dec(&ni->nr_pages); if (!__tcache_reclaim_page(page)) page = NULL; @@ -1163,7 +1160,12 @@ static struct page *tcache_alloc_page(struct tcache_pool *pool) static unsigned long tcache_shrink_count(struct shrinker *shrink, struct
shrink_control *sc) { - return tcache_nodeinfo[sc->nid].nr_pages; + atomic_long_t *nr_pages = &tcache_nodeinfo[sc->nid].nr_pages; + long ret; + + ret = atomic_long_read(nr_pages); + WARN_ON(ret < 0); + return ret >= 0 ? ret : 0; } #define TCACHE_SCAN_BATCH 128UL @@ -1380,6 +1382,7 @@ static int __init tcache_nodeinfo_init(void) for (i = 0; i < nr_node_ids; i++) { ni = &tcache_nodeinfo[i]; spin_lock_init(&ni->lock); + atomic_long_set(&ni->nr_pages, 0); ni->reclaim_tree = RB_ROOT; update_ni_rb_first(ni); } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] tcache: Move erase-insert logic out of tcache_check_events()
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit e6e93d14b403bd4176358427d6f7f0e1c252ea5e Author: Kirill Tkhai Date: Thu Aug 31 18:18:21 2017 +0300 tcache: Move erase-insert logic out of tcache_check_events() Make the function return true, when erase-insert (requeue) should be executed. Move erase-insert out of the function. v3: Move update_ni_rb_first() to "tcache: Cache rb_first() of reclaim tree in tcache_nodeinfo::rb_first". Signed-off-by: Kirill Tkhai Acked-by: Andrey Ryabinin --- mm/tcache.c | 29 +++-- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/mm/tcache.c b/mm/tcache.c index 3d9c5ac..6962097 100644 --- a/mm/tcache.c +++ b/mm/tcache.c @@ -216,8 +216,7 @@ static struct rb_node *update_ni_rb_first(struct tcache_nodeinfo *ni) static void __tcache_insert_reclaim_node(struct tcache_nodeinfo *ni, struct tcache_pool_nodeinfo *pni); -static inline void __tcache_check_events(struct tcache_nodeinfo *ni, -struct tcache_pool_nodeinfo *pni) +static inline bool tcache_check_events(struct tcache_pool_nodeinfo *pni) { /* * We don't want to rebalance reclaim_tree on each get/put, because it @@ -228,7 +227,7 @@ static inline void __tcache_check_events(struct tcache_nodeinfo *ni, */ pni->events++; if (likely(pni->events < 1024)) - return; + return false; pni->events = 0; @@ -238,7 +237,7 @@ static inline void __tcache_check_events(struct tcache_nodeinfo *ni, * it will be done by the shrinker once it tries to scan it. */ if (unlikely(list_empty(&pni->lru))) - return; + return false; /* * This can only happen if the node was removed from the tree on pool @@ -246,11 +245,9 @@ static inline void __tcache_check_events(struct tcache_nodeinfo *ni, * then. 
*/ if (unlikely(RB_EMPTY_NODE(&pni->reclaim_node))) - return; + return false; - rb_erase(&pni->reclaim_node, &ni->reclaim_tree); - __tcache_insert_reclaim_node(ni, pni); - update_ni_rb_first(ni); + return true; } /* @@ -277,13 +274,12 @@ static void tcache_lru_add(struct tcache_pool *pool, struct page *page) pni->recent_puts /= 2; } - __tcache_check_events(ni, pni); - - if (unlikely(RB_EMPTY_NODE(&pni->reclaim_node))) { + if (tcache_check_events(pni) || RB_EMPTY_NODE(&pni->reclaim_node)) { + if (!RB_EMPTY_NODE(&pni->reclaim_node)) + rb_erase(&pni->reclaim_node, &ni->reclaim_tree); __tcache_insert_reclaim_node(ni, pni); update_ni_rb_first(ni); } - spin_unlock(&pni->lock); spin_unlock(&ni->lock); } @@ -319,7 +315,12 @@ static void tcache_lru_del(struct tcache_pool *pool, struct page *page, if (reused) pni->recent_gets++; - __tcache_check_events(ni, pni); + if (tcache_check_events(pni)) { + if (!RB_EMPTY_NODE(&pni->reclaim_node)) + rb_erase(&pni->reclaim_node, &ni->reclaim_tree); + __tcache_insert_reclaim_node(ni, pni); + update_ni_rb_first(ni); + } out: spin_unlock(&pni->lock); spin_unlock(&ni->lock); @@ -947,7 +948,7 @@ tcache_remove_from_reclaim_trees(struct tcache_pool *pool) rb_erase(&pni->reclaim_node, &ni->reclaim_tree); update_ni_rb_first(ni); /* -* Clear the node for __tcache_check_events() not to +* Clear the node for tcache_check_events() not to * reinsert the pool back into the tree. */ RB_CLEAR_NODE(&pni->reclaim_node); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] Revert "autofs: fix autofs_v5_packet structure for compat mode"
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit a24e586ec36bf182a3261a9608e8515d424b242e Author: Konstantin Khorenko Date: Thu Aug 31 17:56:44 2017 +0300 Revert "autofs: fix autofs_v5_packet structure for compat mode" This reverts commit e484b0abe8af8793f58e6434060a3779261d3151. The patch in question increases the offsetof(struct autofs_v5_packet, name) by 4, which is not good; the patch is to be reworked. Thanks to Dmitry V. Levin for noticing it. https://jira.sw.ru/browse/PSBM-71078 Signed-off-by: Konstantin Khorenko --- include/uapi/linux/auto_fs4.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/include/uapi/linux/auto_fs4.h b/include/uapi/linux/auto_fs4.h index 8729a47..e02982f 100644 --- a/include/uapi/linux/auto_fs4.h +++ b/include/uapi/linux/auto_fs4.h @@ -137,8 +137,6 @@ struct autofs_v5_packet { __u32 pid; __u32 tgid; __u32 len; - __u32 blob; /* This is needed to align structure up to 8 - bytes for ALL archs including 32-bit */ char name[NAME_MAX+1]; }; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH RHEL7 COMMIT] ms/mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp
Please consider preparing a ReadyKernel patch for it. https://readykernel.com/ -- Best regards, Konstantin Khorenko, Virtuozzo Linux Kernel Team On 08/31/2017 05:51 PM, Konstantin Khorenko wrote: The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit f5b413ea4e53d819c8b4e4a4927fb563bd3ec24f Author: Keno Fischer Date: Thu Aug 31 17:51:25 2017 +0300 ms/mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp commit 8310d48b125d19fcd9521d83b8293e63eb1646aa upstream. In commit 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()"), the mm code was changed from unsetting FOLL_WRITE after a COW was resolved to setting the (newly introduced) FOLL_COW instead. Simultaneously, the check in gup.c was updated to still allow writes with FOLL_FORCE set if FOLL_COW had also been set. However, a similar check in huge_memory.c was forgotten. As a result, remote memory writes to ro regions of memory backed by transparent huge pages cause an infinite loop in the kernel (handle_mm_fault sets FOLL_COW and returns 0 causing a retry, but follow_trans_huge_pmd bails out immediately because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is true). While in this state the process is still SIGKILLable, but little else works (e.g. no ptrace attach, no other signals).
This is easily reproduced with the following code (assuming thp are set to always): #include #include #include #include #include #include #include #include #include #include #define TEST_SIZE 5 * 1024 * 1024 int main(void) { int status; pid_t child; int fd = open("/proc/self/mem", O_RDWR); void *addr = mmap(NULL, TEST_SIZE, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); assert(addr != MAP_FAILED); pid_t parent_pid = getpid(); if ((child = fork()) == 0) { void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); assert(addr2 != MAP_FAILED); memset(addr2, 'a', TEST_SIZE); pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr); return 0; } assert(child == waitpid(child, &status, 0)); assert(WIFEXITED(status) && WEXITSTATUS(status) == 0); return 0; } Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously to the update in gup.c in the original commit. The same pattern exists in follow_devmap_pmd. However, we should not be able to reach that check with FOLL_COW set, so add WARN_ONCE to make sure we notice if we ever do. [a...@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170106015025.ga38...@juliacomputing.com Signed-off-by: Keno Fischer Acked-by: Kirill A. Shutemov Cc: Greg Thelen Cc: Nicholas Piggin Cc: Willy Tarreau Cc: Oleg Nesterov Cc: Kees Cook Cc: Andy Lutomirski Cc: Michal Hocko Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds [bwh: Backported to 3.2: - Drop change to follow_devmap_pmd() - pmd_dirty() is not available; check the page flags as in can_follow_write_pte() - Adjust context] Signed-off-by: Ben Hutchings [mhocko: This has been forward ported from the 3.2 stable tree. And fixed to return NULL.] 
Reviewed-by: Michal Hocko Signed-off-by: Jiri Slaby Signed-off-by: Willy Tarreau https://jira.sw.ru/browse/PSBM-70151 Signed-off-by: Andrey Ryabinin --- mm/huge_memory.c | 19 --- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 477610d..5a07e76 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1317,6 +1317,18 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, return ret; } +/* + * foll_force can write to even unwritable pmd's, but only + * after we've gone through a cow cycle and they are dirty. + */ +static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page, + unsigned int flags) +{ + return pmd_write(pmd) || + ((flags & FOLL_FORCE) && (flags & FOLL_COW) && +page && PageAnon(page)); +} + struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd, @@ -1327,9 +1339,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, assert_spin_locked(pmd_lockptr(mm, pmd)); - if (flags & FOLL_WRITE && !pmd_write(*pmd)) -
[Devel] [PATCH RHEL7 COMMIT] ms/mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit f5b413ea4e53d819c8b4e4a4927fb563bd3ec24f Author: Keno Fischer Date: Thu Aug 31 17:51:25 2017 +0300 ms/mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp commit 8310d48b125d19fcd9521d83b8293e63eb1646aa upstream. In commit 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()"), the mm code was changed from unsetting FOLL_WRITE after a COW was resolved to setting the (newly introduced) FOLL_COW instead. Simultaneously, the check in gup.c was updated to still allow writes with FOLL_FORCE set if FOLL_COW had also been set. However, a similar check in huge_memory.c was forgotten. As a result, remote memory writes to ro regions of memory backed by transparent huge pages cause an infinite loop in the kernel (handle_mm_fault sets FOLL_COW and returns 0 causing a retry, but follow_trans_huge_pmd bails out immediately because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is true). While in this state the process is still SIGKILLable, but little else works (e.g. no ptrace attach, no other signals).
This is easily reproduced with the following code (assuming thp are set to always): #include #include #include #include #include #include #include #include #include #include #define TEST_SIZE 5 * 1024 * 1024 int main(void) { int status; pid_t child; int fd = open("/proc/self/mem", O_RDWR); void *addr = mmap(NULL, TEST_SIZE, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); assert(addr != MAP_FAILED); pid_t parent_pid = getpid(); if ((child = fork()) == 0) { void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); assert(addr2 != MAP_FAILED); memset(addr2, 'a', TEST_SIZE); pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr); return 0; } assert(child == waitpid(child, &status, 0)); assert(WIFEXITED(status) && WEXITSTATUS(status) == 0); return 0; } Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously to the update in gup.c in the original commit. The same pattern exists in follow_devmap_pmd. However, we should not be able to reach that check with FOLL_COW set, so add WARN_ONCE to make sure we notice if we ever do. [a...@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170106015025.ga38...@juliacomputing.com Signed-off-by: Keno Fischer Acked-by: Kirill A. Shutemov Cc: Greg Thelen Cc: Nicholas Piggin Cc: Willy Tarreau Cc: Oleg Nesterov Cc: Kees Cook Cc: Andy Lutomirski Cc: Michal Hocko Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds [bwh: Backported to 3.2: - Drop change to follow_devmap_pmd() - pmd_dirty() is not available; check the page flags as in can_follow_write_pte() - Adjust context] Signed-off-by: Ben Hutchings [mhocko: This has been forward ported from the 3.2 stable tree. And fixed to return NULL.] 
Reviewed-by: Michal Hocko Signed-off-by: Jiri Slaby Signed-off-by: Willy Tarreau https://jira.sw.ru/browse/PSBM-70151 Signed-off-by: Andrey Ryabinin --- mm/huge_memory.c | 19 --- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 477610d..5a07e76 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1317,6 +1317,18 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, return ret; } +/* + * foll_force can write to even unwritable pmd's, but only + * after we've gone through a cow cycle and they are dirty. + */ +static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page, + unsigned int flags) +{ + return pmd_write(pmd) || + ((flags & FOLL_FORCE) && (flags & FOLL_COW) && +page && PageAnon(page)); +} + struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd, @@ -1327,9 +1339,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, assert_spin_locked(pmd_lockptr(mm, pmd)); - if (flags & FOLL_WRITE && !pmd_write(*pmd)) - goto out; - /* Avoid dumping huge zero page */ if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd)) return ERR_PTR(-EFAULT)
[Devel] [PATCH RHEL7 COMMIT] proc connector: use generic event helper for coredump event
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit b6d449038da008a26835e1ae16292869b1fe80aa Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:36 2017 +0300 proc connector: use generic event helper for coredump event Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 28 +++- 1 file changed, 7 insertions(+), 21 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 2d5ff7c..312f30f 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -222,31 +222,17 @@ void proc_comm_connector(struct task_struct *task) proc_event_connector(task, PROC_EVENT_COMM, 0, fill_comm_event); } -void proc_coredump_connector(struct task_struct *task) +static bool fill_coredump_event(struct proc_event *ev, struct task_struct *task, + int unused) { - struct cn_msg *msg; - struct proc_event *ev; - __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); - struct timespec ts; - - if (atomic_read(&proc_event_num_listeners) < 1) - return; - - msg = buffer_to_cn_msg(buffer); - ev = (struct proc_event *)msg->data; - memset(&ev->event_data, 0, sizeof(ev->event_data)); - get_seq(&msg->seq, &ev->cpu); - ktime_get_ts(&ts); /* get high res monotonic timestamp */ - ev->timestamp_ns = timespec_to_ns(&ts); - ev->what = PROC_EVENT_COREDUMP; ev->event_data.coredump.process_pid = task->pid; ev->event_data.coredump.process_tgid = task->tgid; + return true; +} - memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id)); - msg->ack = 0; /* not used */ - msg->len = sizeof(*ev); - msg->flags = 0; /* not used */ - cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL); +void proc_coredump_connector(struct task_struct *task) +{ + proc_event_connector(task, PROC_EVENT_COREDUMP, 0, fill_coredump_event); } void proc_exit_connector(struct task_struct *task) ___ Devel mailing list Devel@openvz.org 
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] proc connector: take number of listeners and per-cpu counters from VE
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 06bed1f6e4442906e11d86763920b00f107a2112 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:40 2017 +0300 proc connector: take number of listeners and per-cpu conters from VE Instead of static variables. Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 50 + 1 file changed, 32 insertions(+), 18 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 7a1124a..ff99f06 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -50,21 +50,17 @@ static inline struct cn_msg *buffer_to_cn_msg(__u8 *buffer) return (struct cn_msg *)(buffer + 4); } -static atomic_t proc_event_num_listeners = ATOMIC_INIT(0); static struct cb_id cn_proc_event_id = { CN_IDX_PROC, CN_VAL_PROC }; -/* proc_event_counts is used as the sequence number of the netlink message */ -static DEFINE_PER_CPU(__u32, proc_event_counts) = { 0 }; - -static inline void get_seq(__u32 *ts, int *cpu) +static inline void get_seq(struct ve_struct *ve, __u32 *ts, int *cpu) { preempt_disable(); - *ts = __this_cpu_inc_return(proc_event_counts) - 1; + *ts = __this_cpu_inc_return(*ve->cn->proc_event_counts) - 1; *cpu = smp_processor_id(); preempt_enable(); } -static struct cn_msg *cn_msg_fill(__u8 *buffer, +static struct cn_msg *cn_msg_fill(__u8 *buffer, struct ve_struct *ve, struct task_struct *task, int what, int cookie, bool (*fill_event)(struct proc_event *ev, @@ -78,7 +74,7 @@ static struct cn_msg *cn_msg_fill(__u8 *buffer, msg = buffer_to_cn_msg(buffer); ev = (struct proc_event *)msg->data; - get_seq(&msg->seq, &ev->cpu); + get_seq(ve, &msg->seq, &ev->cpu); memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id)); msg->ack = 0; /* not used */ msg->len = sizeof(*ev); @@ -92,6 +88,13 @@ static struct cn_msg *cn_msg_fill(__u8 *buffer, return 
fill_event(ev, task, cookie) ? msg : NULL; } +static int proc_event_num_listeners(struct ve_struct *ve) +{ + if (ve->cn) + return atomic_read(&ve->cn->proc_event_num_listeners); + return 0; +} + static void proc_event_connector(struct task_struct *task, int what, int cookie, bool (*fill_event)(struct proc_event *ev, @@ -100,11 +103,12 @@ static void proc_event_connector(struct task_struct *task, { struct cn_msg *msg; __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); + struct ve_struct *ve = task->task_ve; - if (atomic_read(&proc_event_num_listeners) < 1) + if (proc_event_num_listeners(ve) < 1) return; - msg = cn_msg_fill(buffer, task, what, cookie, fill_event); + msg = cn_msg_fill(buffer, ve, task, what, cookie, fill_event); if (!msg) return; @@ -258,14 +262,14 @@ void proc_exit_connector(struct task_struct *task) * values because it's not being returned via syscall return * mechanisms. */ -static void cn_proc_ack(int err, int rcvd_seq, int rcvd_ack) +static void cn_proc_ack(struct ve_struct *ve, int err, int rcvd_seq, int rcvd_ack) { struct cn_msg *msg; struct proc_event *ev; __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); struct timespec ts; - if (atomic_read(&proc_event_num_listeners) < 1) + if (proc_event_num_listeners(ve) < 1) return; msg = buffer_to_cn_msg(buffer); @@ -292,6 +296,7 @@ static void cn_proc_mcast_ctl(struct cn_msg *msg, struct netlink_skb_parms *nsp) { enum proc_cn_mcast_op *mc_op = NULL; + struct ve_struct *ve = get_exec_env(); int err = 0; if (msg->len != sizeof(*mc_op)) @@ -315,10 +320,10 @@ static void cn_proc_mcast_ctl(struct cn_msg *msg, mc_op = (enum proc_cn_mcast_op *)msg->data; switch (*mc_op) { case PROC_CN_MCAST_LISTEN: - atomic_inc(&proc_event_num_listeners); + atomic_inc(&ve->cn->proc_event_num_listeners); break; case PROC_CN_MCAST_IGNORE: - atomic_dec(&proc_event_num_listeners); + atomic_dec(&ve->cn->proc_event_num_listeners); break; default: err = EINVAL; @@ -326,22 +331,31 @@ static void cn_proc_mcast_ctl(struct cn_msg *msg, } out: - 
cn_proc_ack(err, msg->seq, msg->ack); + cn_proc_ack(ve, err, msg->seq, msg->ack); } int cn_proc_init_ve(struct ve_struct *ve) { - int err = cn_add_callback_ve(ve, &cn_proc_event_id, -
[Devel] [PATCH RHEL7 COMMIT] connector: store all private data on VE structure
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit a74b5b56cac3c2212351dbc1e9ca957789221347 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:28 2017 +0300 connector: store all private data on VE structure This is needed to containerize connector and its proc part. Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- include/linux/connector.h | 9 + include/linux/ve.h| 4 2 files changed, 13 insertions(+) diff --git a/include/linux/connector.h b/include/linux/connector.h index 4c4d2b9..9e05e28 100644 --- a/include/linux/connector.h +++ b/include/linux/connector.h @@ -67,6 +67,15 @@ struct cn_dev { struct cn_queue_dev *cbdev; }; +struct cn_private { + struct cn_dev cdev; + int cn_already_initialized; + + atomic_tproc_event_num_listeners; + u32 __percpu*proc_event_counts; + +}; + int cn_add_callback(struct cb_id *id, const char *name, void (*callback)(struct cn_msg *, struct netlink_skb_parms *)); void cn_del_callback(struct cb_id *); diff --git a/include/linux/ve.h b/include/linux/ve.h index c9b0af4..d63edee 100644 --- a/include/linux/ve.h +++ b/include/linux/ve.h @@ -30,6 +30,7 @@ struct file_system_type; struct veip_struct; struct nsproxy; struct user_namespace; +struct cn_private; extern struct user_namespace init_user_ns; struct ve_struct { @@ -123,6 +124,9 @@ struct ve_struct { #ifdef CONFIG_COREDUMP charcore_pattern[CORENAME_MAX_SIZE]; #endif +#ifdef CONFIG_CONNECTOR + struct cn_private *cn; +#endif }; struct ve_devmnt { ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] proc connector: use generic event helper for comm event
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 5e8a090a6347dc8364c23612aaf6a225254a0c53 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:36 2017 +0300 proc connector: use generic event helper for comm event Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 28 +++- 1 file changed, 7 insertions(+), 21 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 36a53fd..2d5ff7c 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -208,32 +208,18 @@ void proc_ptrace_connector(struct task_struct *task, int ptrace_id) fill_ptrace_event); } -void proc_comm_connector(struct task_struct *task) +static bool fill_comm_event(struct proc_event *ev, struct task_struct *task, + int unused) { - struct cn_msg *msg; - struct proc_event *ev; - struct timespec ts; - __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); - - if (atomic_read(&proc_event_num_listeners) < 1) - return; - - msg = buffer_to_cn_msg(buffer); - ev = (struct proc_event *)msg->data; - memset(&ev->event_data, 0, sizeof(ev->event_data)); - get_seq(&msg->seq, &ev->cpu); - ktime_get_ts(&ts); /* get high res monotonic timestamp */ - ev->timestamp_ns = timespec_to_ns(&ts); - ev->what = PROC_EVENT_COMM; ev->event_data.comm.process_pid = task->pid; ev->event_data.comm.process_tgid = task->tgid; get_task_comm(ev->event_data.comm.comm, task); + return true; +} - memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id)); - msg->ack = 0; /* not used */ - msg->len = sizeof(*ev); - msg->flags = 0; /* not used */ - cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL); +void proc_comm_connector(struct task_struct *task) +{ + proc_event_connector(task, PROC_EVENT_COMM, 0, fill_comm_event); } void proc_coredump_connector(struct task_struct *task) ___ Devel mailing list Devel@openvz.org 
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] connector: add VE SS hook
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 07e77673691685713f04bd6b84fc0e07eae57158 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:45 2017 +0300 connector: add VE SS hook And thus containerize connector finally. https://jira.sw.ru/browse/PSBM-60227 Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/connector.c | 23 --- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c index 81854bf..752c692 100644 --- a/drivers/connector/connector.c +++ b/drivers/connector/connector.c @@ -273,8 +273,9 @@ static const struct file_operations cn_file_ops = { .release = single_release }; -static int cn_init_ve(struct ve_struct *ve) +static int cn_init_ve(void *data) { + struct ve_struct *ve = data; struct cn_dev *dev; struct netlink_kernel_cfg cfg = { .groups = CN_NETLINK_USERS + 0xf, @@ -326,8 +327,9 @@ static int cn_init_ve(struct ve_struct *ve) return err; } -static void cn_fini_ve(struct ve_struct *ve) +static void cn_fini_ve(void *data) { + struct ve_struct *ve = data; struct cn_dev *dev = get_cdev(ve); struct net *net = ve->ve_netns; @@ -344,13 +346,28 @@ static void cn_fini_ve(struct ve_struct *ve) ve->cn = NULL; } +static struct ve_hook cn_ss_hook = { + .init = cn_init_ve, + .fini = cn_fini_ve, + .priority = HOOK_PRIO_DEFAULT, + .owner = THIS_MODULE, +}; + static int cn_init(void) { - return cn_init_ve(get_ve0()); + int err; + + err = cn_init_ve(get_ve0()); + if (err) + return err; + + ve_hook_register(VE_SS_CHAIN, &cn_ss_hook); + return 0; } static void cn_fini(void) { + ve_hook_unregister(&cn_ss_hook); return cn_fini_ve(get_ve0()); } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] proc connector: send events to both VEs if not in VE#0
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 4ba0afbe02d33bf2e906209521bb59e7fa0def73 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:43 2017 +0300 proc connector: send events to both VEs if not in VE#0 This is needed to preserve current behaviour, when process in initial pid and user namespaces (i.e. in VE#0) can receive events from all the processes in the system. Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 29 ++--- 1 file changed, 22 insertions(+), 7 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 17e0247..81f2e56 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -96,16 +96,16 @@ static int proc_event_num_listeners(struct ve_struct *ve) return 0; } -static void proc_event_connector(struct task_struct *task, -int what, int cookie, -bool (*fill_event)(struct proc_event *ev, - struct ve_struct *ve, - struct task_struct *task, - int cookie)) +static void proc_event_connector_ve(struct task_struct *task, + struct ve_struct *ve, + int what, int cookie, + bool (*fill_event)(struct proc_event *ev, + struct ve_struct *ve, + struct task_struct *task, + int cookie)) { struct cn_msg *msg; __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); - struct ve_struct *ve = task->task_ve; if (proc_event_num_listeners(ve) < 1) return; @@ -118,6 +118,21 @@ static void proc_event_connector(struct task_struct *task, cn_netlink_send_ve(ve, msg, CN_IDX_PROC, GFP_KERNEL); } +static void proc_event_connector(struct task_struct *task, +int what, int cookie, +bool (*fill_event)(struct proc_event *ev, + struct ve_struct *ve, + struct task_struct *task, + int cookie)) +{ + struct ve_struct *ve = task->task_ve; + + if (!ve_is_super(ve)) + proc_event_connector_ve(task, ve, what, cookie, fill_event); + + proc_event_connector_ve(task, get_ve0(), what, 
cookie, fill_event); +} + static bool fill_fork_event(struct proc_event *ev, struct ve_struct *ve, struct task_struct *task, int unused) { ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] proc connector: add pid namespace awareness
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit d53ad1ca8439459567dbb732ea568ae75cb9a6b3 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:38 2017 +0300 proc connector: add pid namespace awareness This is precursor patch. Later VE pid ns will be used. Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 40 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 4ee1640..17a8c8c 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -119,11 +119,11 @@ static bool fill_fork_event(struct proc_event *ev, struct task_struct *task, rcu_read_lock(); parent = rcu_dereference(task->real_parent); - ev->event_data.fork.parent_pid = parent->pid; - ev->event_data.fork.parent_tgid = parent->tgid; + ev->event_data.fork.parent_pid = task_pid_nr_ns(parent, &init_pid_ns); + ev->event_data.fork.parent_tgid = task_tgid_nr_ns(parent, &init_pid_ns); rcu_read_unlock(); - ev->event_data.fork.child_pid = task->pid; - ev->event_data.fork.child_tgid = task->tgid; + ev->event_data.fork.child_pid = task_pid_nr_ns(task, &init_pid_ns); + ev->event_data.fork.child_tgid = task_tgid_nr_ns(task, &init_pid_ns); return true; } @@ -135,8 +135,8 @@ void proc_fork_connector(struct task_struct *task) static bool fill_exec_event(struct proc_event *ev, struct task_struct *task, int unused) { - ev->event_data.exec.process_pid = task->pid; - ev->event_data.exec.process_tgid = task->tgid; + ev->event_data.exec.process_pid = task_pid_nr_ns(task, &init_pid_ns); + ev->event_data.exec.process_tgid = task_tgid_nr_ns(task, &init_pid_ns); return true; } @@ -150,8 +150,8 @@ static bool fill_id_event(struct proc_event *ev, struct task_struct *task, { const struct cred *cred; - ev->event_data.id.process_pid = task->pid; - ev->event_data.id.process_tgid = 
task->tgid; + ev->event_data.id.process_pid = task_pid_nr_ns(task, &init_pid_ns); + ev->event_data.id.process_tgid = task_tgid_nr_ns(task, &init_pid_ns); rcu_read_lock(); cred = __task_cred(task); if (which_id == PROC_EVENT_UID) { @@ -176,8 +176,8 @@ void proc_id_connector(struct task_struct *task, int which_id) static bool fill_sid_event(struct proc_event *ev, struct task_struct *task, int unused) { - ev->event_data.sid.process_pid = task->pid; - ev->event_data.sid.process_tgid = task->tgid; + ev->event_data.sid.process_pid = task_pid_nr_ns(task, &init_pid_ns); + ev->event_data.sid.process_tgid = task_tgid_nr_ns(task, &init_pid_ns); return true; } @@ -189,11 +189,11 @@ void proc_sid_connector(struct task_struct *task) static bool fill_ptrace_event(struct proc_event *ev, struct task_struct *task, int ptrace_id) { - ev->event_data.ptrace.process_pid = task->pid; - ev->event_data.ptrace.process_tgid = task->tgid; + ev->event_data.ptrace.process_pid = task_pid_nr_ns(task, &init_pid_ns); + ev->event_data.ptrace.process_tgid = task_tgid_nr_ns(task, &init_pid_ns); if (ptrace_id == PTRACE_ATTACH) { - ev->event_data.ptrace.tracer_pid = current->pid; - ev->event_data.ptrace.tracer_tgid = current->tgid; + ev->event_data.ptrace.tracer_pid = task_pid_nr_ns(current, &init_pid_ns); + ev->event_data.ptrace.tracer_tgid = task_tgid_nr_ns(current, &init_pid_ns); } else if (ptrace_id == PTRACE_DETACH) { ev->event_data.ptrace.tracer_pid = 0; ev->event_data.ptrace.tracer_tgid = 0; @@ -211,8 +211,8 @@ void proc_ptrace_connector(struct task_struct *task, int ptrace_id) static bool fill_comm_event(struct proc_event *ev, struct task_struct *task, int unused) { - ev->event_data.comm.process_pid = task->pid; - ev->event_data.comm.process_tgid = task->tgid; + ev->event_data.comm.process_pid = task_pid_nr_ns(task, &init_pid_ns); + ev->event_data.comm.process_tgid = task_tgid_nr_ns(task, &init_pid_ns); get_task_comm(ev->event_data.comm.comm, task); return true; } @@ -225,8 +225,8 @@ void 
proc_comm_connector(struct task_struct *task) static bool fill_coredump_event(struct proc_event *ev, struct task_struct *task, int unused) { - ev->event_data.coredump.process_pid = task->pid; - ev->event_data.coredump.process_tgid = task->tgid; + ev->event_data.coredump.process_pid = task_pid_nr_ns(task, &init_pid_ns); + ev->event_data.coredump.pro
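The patch above swaps raw `task->pid` reads for `task_pid_nr_ns(task, &init_pid_ns)`. The idea behind that call is that one task has a different numeric pid at each pid-namespace nesting level, and the caller must name the namespace it wants the number for. The following userspace sketch models that with a toy struct (all names here are illustrative, not kernel API; the kernel's real structure is `struct pid` with its `numbers[]` array of `struct upid`):

```c
/* Toy model of a namespaced pid: one numeric id per pid-namespace
 * level.  Level 0 stands for the init namespace, deeper levels for
 * nested containers.  Illustrative only, not the kernel layout. */
#define TOY_MAX_LEVELS 4

struct toy_pid {
	int level;                /* deepest namespace this pid exists in */
	int nr[TOY_MAX_LEVELS];   /* numeric id as seen at each level */
};

/* Toy counterpart of task_pid_nr_ns(): the id the task has when viewed
 * from the namespace at ns_level, or 0 if it is not visible there. */
static int toy_pid_nr_ns(const struct toy_pid *pid, int ns_level)
{
	if (ns_level > pid->level)
		return 0;	/* pid not visible from a deeper namespace */
	return pid->nr[ns_level];
}
```

A containerized task might be pid 4321 at level 0 (the host view) and pid 7 at level 1 (inside its container); reporting the level-0 number is what this patch pins down explicitly with `&init_pid_ns`, so that a later patch can substitute the container's namespace.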
[Devel] [PATCH RHEL7 COMMIT] connector: take VE from socket upon callback
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5

--> commit d7f362627da257bcb656a806fa0ece3743371fd4
Author: Stanislav Kinsburskiy
Date: Thu Aug 31 17:40:44 2017 +0300

    connector: take VE from socket upon callback

    This is needed to attach the listener to the right device, i.e. to the
    right source of events (in terms of the container).

    Signed-off-by: Stanislav Kinsburskiy
    Reviewed-by: Andrey Ryabinin
---
 drivers/connector/connector.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
index 771dadf..81854bf 100644
--- a/drivers/connector/connector.c
+++ b/drivers/connector/connector.c
@@ -130,7 +130,7 @@ EXPORT_SYMBOL_GPL(cn_netlink_send);
 static int cn_call_callback(struct sk_buff *skb)
 {
 	struct cn_callback_entry *i, *cbq = NULL;
-	struct cn_dev *dev = get_cdev(get_ve0());
+	struct cn_dev *dev = get_cdev(skb->sk->sk_net->owner_ve);
 	struct cn_msg *msg = nlmsg_data(nlmsg_hdr(skb));
 	struct netlink_skb_parms *nsp = &NETLINK_CB(skb);
 	int err = -ENODEV;
[Devel] [PATCH RHEL7 COMMIT] connector: use device stored in VE
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 0773323bf46b0b99e6095a74cc1e1cd46dd18752 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:30 2017 +0300 connector: use device stored in VE Instead of global static device. Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/connector.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c index f5484b2..bc2308a 100644 --- a/drivers/connector/connector.c +++ b/drivers/connector/connector.c @@ -38,8 +38,6 @@ MODULE_AUTHOR("Evgeniy Polyakov "); MODULE_DESCRIPTION("Generic userspace <-> kernelspace connector."); MODULE_ALIAS_NET_PF_PROTO(PF_NETLINK, NETLINK_CONNECTOR); -static struct cn_dev cdev; - static int cn_already_initialized; /* @@ -66,7 +64,7 @@ static int cn_already_initialized; static struct cn_dev *get_cdev(struct ve_struct *ve) { - return &cdev; + return &ve->cn->cdev; } int cn_netlink_send(struct cn_msg *msg, u32 __group, gfp_t gfp_mask) @@ -261,7 +259,7 @@ static const struct file_operations cn_file_ops = { static int cn_init_ve(struct ve_struct *ve) { - struct cn_dev *dev = get_cdev(get_ve0()); + struct cn_dev *dev; struct netlink_kernel_cfg cfg = { .groups = CN_NETLINK_USERS + 0xf, .input = cn_rx_skb, @@ -272,6 +270,8 @@ static int cn_init_ve(struct ve_struct *ve) if (!ve->cn) return -ENOMEM; + dev = &ve->cn->cdev; + dev->nls = netlink_kernel_create(net, NETLINK_CONNECTOR, &cfg); if (!dev->nls) return -EIO; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] proc connector: pass VE to event fillers
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 472f0bf7498a2c07fb5e3764cda8036314497bf9 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:40 2017 +0300 proc connector: pass VE to event fillers Precursor patch. VE will be used later to get proper pid and user namespaces for correct event generation. Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 36 +++- 1 file changed, 19 insertions(+), 17 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index ff99f06..b66fde8 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -64,6 +64,7 @@ static struct cn_msg *cn_msg_fill(__u8 *buffer, struct ve_struct *ve, struct task_struct *task, int what, int cookie, bool (*fill_event)(struct proc_event *ev, +struct ve_struct *ve, struct task_struct *task, int cookie)) { @@ -85,7 +86,7 @@ static struct cn_msg *cn_msg_fill(__u8 *buffer, struct ve_struct *ve, ev->timestamp_ns = timespec_to_ns(&ts); ev->what = what; - return fill_event(ev, task, cookie) ? msg : NULL; + return fill_event(ev, ve, task, cookie) ? 
msg : NULL; } static int proc_event_num_listeners(struct ve_struct *ve) @@ -98,6 +99,7 @@ static int proc_event_num_listeners(struct ve_struct *ve) static void proc_event_connector(struct task_struct *task, int what, int cookie, bool (*fill_event)(struct proc_event *ev, + struct ve_struct *ve, struct task_struct *task, int cookie)) { @@ -116,8 +118,8 @@ static void proc_event_connector(struct task_struct *task, cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL); } -static bool fill_fork_event(struct proc_event *ev, struct task_struct *task, - int unused) +static bool fill_fork_event(struct proc_event *ev, struct ve_struct *ve, + struct task_struct *task, int unused) { struct task_struct *parent; @@ -136,8 +138,8 @@ void proc_fork_connector(struct task_struct *task) proc_event_connector(task, PROC_EVENT_FORK, 0, fill_fork_event); } -static bool fill_exec_event(struct proc_event *ev, struct task_struct *task, - int unused) +static bool fill_exec_event(struct proc_event *ev, struct ve_struct *ve, + struct task_struct *task, int unused) { ev->event_data.exec.process_pid = task_pid_nr_ns(task, &init_pid_ns); ev->event_data.exec.process_tgid = task_tgid_nr_ns(task, &init_pid_ns); @@ -149,8 +151,8 @@ void proc_exec_connector(struct task_struct *task) proc_event_connector(task, PROC_EVENT_EXEC, 0, fill_exec_event); } -static bool fill_id_event(struct proc_event *ev, struct task_struct *task, - int which_id) +static bool fill_id_event(struct proc_event *ev, struct ve_struct *ve, + struct task_struct *task, int which_id) { const struct cred *cred; @@ -177,8 +179,8 @@ void proc_id_connector(struct task_struct *task, int which_id) proc_event_connector(task, which_id, which_id, fill_id_event); } -static bool fill_sid_event(struct proc_event *ev, struct task_struct *task, - int unused) +static bool fill_sid_event(struct proc_event *ev, struct ve_struct *ve, + struct task_struct *task, int unused) { ev->event_data.sid.process_pid = task_pid_nr_ns(task, &init_pid_ns); 
ev->event_data.sid.process_tgid = task_tgid_nr_ns(task, &init_pid_ns); @@ -190,8 +192,8 @@ void proc_sid_connector(struct task_struct *task) proc_event_connector(task, PROC_EVENT_SID, 0, fill_sid_event); } -static bool fill_ptrace_event(struct proc_event *ev, struct task_struct *task, - int ptrace_id) +static bool fill_ptrace_event(struct proc_event *ev, struct ve_struct *ve, + struct task_struct *task, int ptrace_id) { ev->event_data.ptrace.process_pid = task_pid_nr_ns(task, &init_pid_ns); ev->event_data.ptrace.process_tgid = task_tgid_nr_ns(task, &init_pid_ns); @@ -212,8 +214,8 @@ void proc_ptrace_connector(struct task_struct *task, int ptrace_id) fill_ptrace_event); } -static bool fill_comm_event(struct proc_event *ev, struct task_struct *task, -
[Devel] [PATCH RHEL7 COMMIT] proc connector: call proc-related init and fini routines explicitly
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 33a6978beb7622e8e97837904db45d7432776bb5 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:39 2017 +0300 proc connector: call proc-related init and fini routines explicitly This allows to support per-container connector creation and destruction. Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 19 --- drivers/connector/connector.c | 33 - 2 files changed, 28 insertions(+), 24 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 8998335..7a1124a 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -345,22 +345,3 @@ void cn_proc_fini_ve(struct ve_struct *ve) { cn_del_callback_ve(ve, &cn_proc_event_id); } - -/* - * cn_proc_init - initialization entry point - * - * Adds the connector callback to the connector driver. 
- */ -static int __init cn_proc_init(void) -{ - int err = cn_add_callback(&cn_proc_event_id, - "cn_proc", - &cn_proc_mcast_ctl); - if (err) { - pr_warn("cn_proc failed to register\n"); - return err; - } - return 0; -} - -module_init(cn_proc_init); diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c index 110637b..59d81a3 100644 --- a/drivers/connector/connector.c +++ b/drivers/connector/connector.c @@ -281,6 +281,7 @@ static int cn_init_ve(struct ve_struct *ve) .input = cn_rx_skb, }; struct net *net = ve->ve_netns; + int err; ve->cn = kzalloc(sizeof(*ve->cn), GFP_KERNEL); if (!ve->cn) @@ -289,20 +290,40 @@ static int cn_init_ve(struct ve_struct *ve) dev = &ve->cn->cdev; dev->nls = netlink_kernel_create(net, NETLINK_CONNECTOR, &cfg); - if (!dev->nls) - return -EIO; + if (!dev->nls) { + err = -EIO; + goto free_cn; + } dev->cbdev = cn_queue_alloc_dev("cqueue", dev->nls); if (!dev->cbdev) { - netlink_kernel_release(dev->nls); - return -EINVAL; + err = -EINVAL; + goto netlink_release; } ve->cn->cn_already_initialized = 1; - proc_create("connector", S_IRUGO, net->proc_net, &cn_file_ops); + if (!proc_create("connector", S_IRUGO, net->proc_net, &cn_file_ops)) { + err = -ENOMEM; + goto free_cdev; + } + + err = cn_proc_init_ve(ve); + if (err) + goto remove_proc; return 0; + +remove_proc: + remove_proc_entry("connector", net->proc_net); +free_cdev: + cn_queue_free_dev(dev->cbdev); +netlink_release: + netlink_kernel_release(dev->nls); +free_cn: + kfree(ve->cn); + ve->cn = NULL; + return err; } static void cn_fini_ve(struct ve_struct *ve) @@ -312,6 +333,8 @@ static void cn_fini_ve(struct ve_struct *ve) ve->cn->cn_already_initialized = 0; + cn_proc_fini_ve(ve); + remove_proc_entry("connector", net->proc_net); cn_queue_free_dev(dev->cbdev); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
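The reworked `cn_init_ve()` above is a textbook example of kernel-style staged initialization: each allocation step gets a matching unwind label, and a failure at step N releases steps N-1..1 in reverse order before returning. A minimal userspace sketch of the same control flow, with plain `malloc()` standing in for the netlink socket, queue device, and proc entry (the `fail_at` parameter is an illustrative fault-injection knob, not part of the original code):

```c
#include <stdlib.h>

struct toy_ctx { void *state; void *dev; void *entry; };

/* fail_at: 0 = no injected failure, N = allocation step N fails.
 * Mirrors the goto-unwinding shape of the reworked cn_init_ve(). */
static int toy_init(struct toy_ctx *ctx, int fail_at)
{
	int err;

	ctx->state = (fail_at == 1) ? NULL : malloc(16);
	if (!ctx->state) { err = -1; goto out; }

	ctx->dev = (fail_at == 2) ? NULL : malloc(16);
	if (!ctx->dev) { err = -2; goto free_state; }

	ctx->entry = (fail_at == 3) ? NULL : malloc(16);
	if (!ctx->entry) { err = -3; goto free_dev; }

	return 0;		/* fully initialized */

free_dev:
	free(ctx->dev);
	ctx->dev = NULL;
free_state:
	free(ctx->state);
	ctx->state = NULL;
out:
	return err;
}
```

The payoff is exactly what the patch needed: adding one more init step (`cn_proc_init_ve()`) only required one more label (`remove_proc:`) at the top of the unwind chain, with no duplication of the earlier cleanup calls.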
[Devel] [PATCH RHEL7 COMMIT] proc connector: take namespaces from VE
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit ea9dfef19a855fe11f8caab1aaee1ca8263176fe Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:41 2017 +0300 proc connector: take namespaces from VE Intead of hardcoded "init" namespaces. Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 69 +++-- 1 file changed, 42 insertions(+), 27 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index b66fde8..df6553d 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -122,14 +122,15 @@ static bool fill_fork_event(struct proc_event *ev, struct ve_struct *ve, struct task_struct *task, int unused) { struct task_struct *parent; + struct pid_namespace *pid_ns = ve->ve_ns->pid_ns; rcu_read_lock(); parent = rcu_dereference(task->real_parent); - ev->event_data.fork.parent_pid = task_pid_nr_ns(parent, &init_pid_ns); - ev->event_data.fork.parent_tgid = task_tgid_nr_ns(parent, &init_pid_ns); + ev->event_data.fork.parent_pid = task_pid_nr_ns(parent, pid_ns); + ev->event_data.fork.parent_tgid = task_tgid_nr_ns(parent, pid_ns); rcu_read_unlock(); - ev->event_data.fork.child_pid = task_pid_nr_ns(task, &init_pid_ns); - ev->event_data.fork.child_tgid = task_tgid_nr_ns(task, &init_pid_ns); + ev->event_data.fork.child_pid = task_pid_nr_ns(task, pid_ns); + ev->event_data.fork.child_tgid = task_tgid_nr_ns(task, pid_ns); return true; } @@ -141,8 +142,10 @@ void proc_fork_connector(struct task_struct *task) static bool fill_exec_event(struct proc_event *ev, struct ve_struct *ve, struct task_struct *task, int unused) { - ev->event_data.exec.process_pid = task_pid_nr_ns(task, &init_pid_ns); - ev->event_data.exec.process_tgid = task_tgid_nr_ns(task, &init_pid_ns); + struct pid_namespace *pid_ns = ve->ve_ns->pid_ns; + + ev->event_data.exec.process_pid = task_pid_nr_ns(task, 
pid_ns); + ev->event_data.exec.process_tgid = task_tgid_nr_ns(task, pid_ns); return true; } @@ -155,17 +158,19 @@ static bool fill_id_event(struct proc_event *ev, struct ve_struct *ve, struct task_struct *task, int which_id) { const struct cred *cred; + struct pid_namespace *pid_ns = ve->ve_ns->pid_ns; + struct user_namespace *user_ns = ve->init_cred->user_ns; - ev->event_data.id.process_pid = task_pid_nr_ns(task, &init_pid_ns); - ev->event_data.id.process_tgid = task_tgid_nr_ns(task, &init_pid_ns); + ev->event_data.id.process_pid = task_pid_nr_ns(task, pid_ns); + ev->event_data.id.process_tgid = task_tgid_nr_ns(task, pid_ns); rcu_read_lock(); cred = __task_cred(task); if (which_id == PROC_EVENT_UID) { - ev->event_data.id.r.ruid = from_kuid_munged(&init_user_ns, cred->uid); - ev->event_data.id.e.euid = from_kuid_munged(&init_user_ns, cred->euid); + ev->event_data.id.r.ruid = from_kuid_munged(user_ns, cred->uid); + ev->event_data.id.e.euid = from_kuid_munged(user_ns, cred->euid); } else if (which_id == PROC_EVENT_GID) { - ev->event_data.id.r.rgid = from_kgid_munged(&init_user_ns, cred->gid); - ev->event_data.id.e.egid = from_kgid_munged(&init_user_ns, cred->egid); + ev->event_data.id.r.rgid = from_kgid_munged(user_ns, cred->gid); + ev->event_data.id.e.egid = from_kgid_munged(user_ns, cred->egid); } else { rcu_read_unlock(); return false; @@ -182,8 +187,10 @@ void proc_id_connector(struct task_struct *task, int which_id) static bool fill_sid_event(struct proc_event *ev, struct ve_struct *ve, struct task_struct *task, int unused) { - ev->event_data.sid.process_pid = task_pid_nr_ns(task, &init_pid_ns); - ev->event_data.sid.process_tgid = task_tgid_nr_ns(task, &init_pid_ns); + struct pid_namespace *pid_ns = ve->ve_ns->pid_ns; + + ev->event_data.sid.process_pid = task_pid_nr_ns(task, pid_ns); + ev->event_data.sid.process_tgid = task_tgid_nr_ns(task, pid_ns); return true; } @@ -195,11 +202,13 @@ void proc_sid_connector(struct task_struct *task) static bool 
fill_ptrace_event(struct proc_event *ev, struct ve_struct *ve, struct task_struct *task, int ptrace_id) { - ev->event_data.ptrace.process_pid = task_pid_nr_ns(task, &init_pid_ns); - ev->event_data.ptrace.process_tgid = task_tgid_nr_ns(task, &init_pid_ns); + struct pid_namespace *pid_ns = ve->ve_ns->pid_ns; + + ev->event_data.ptrace.process_p
[Devel] [PATCH RHEL7 COMMIT] connector: introduce VE-aware get_cdev() helper
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 1dd02e8904050497fc1eb9c74485c526184679b0 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:28 2017 +0300 connector: introduce VE-aware get_cdev() helper Once containerized, device won't be one and for all. Thus make a helper template and use it instead of direct device object access. Use ve0 for now. Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/connector.c | 20 +--- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c index da26064..407fe52 100644 --- a/drivers/connector/connector.c +++ b/drivers/connector/connector.c @@ -63,6 +63,12 @@ static int cn_already_initialized; * a new message. * */ + +static struct cn_dev *get_cdev(struct ve_struct *ve) +{ + return &cdev; +} + int cn_netlink_send(struct cn_msg *msg, u32 __group, gfp_t gfp_mask) { struct cn_callback_entry *__cbq; @@ -70,7 +76,7 @@ int cn_netlink_send(struct cn_msg *msg, u32 __group, gfp_t gfp_mask) struct sk_buff *skb; struct nlmsghdr *nlh; struct cn_msg *data; - struct cn_dev *dev = &cdev; + struct cn_dev *dev = get_cdev(get_ve0()); u32 group = 0; int found = 0; @@ -123,7 +129,7 @@ EXPORT_SYMBOL_GPL(cn_netlink_send); static int cn_call_callback(struct sk_buff *skb) { struct cn_callback_entry *i, *cbq = NULL; - struct cn_dev *dev = &cdev; + struct cn_dev *dev = get_cdev(get_ve0()); struct cn_msg *msg = nlmsg_data(nlmsg_hdr(skb)); struct netlink_skb_parms *nsp = &NETLINK_CB(skb); int err = -ENODEV; @@ -190,7 +196,7 @@ int cn_add_callback(struct cb_id *id, const char *name, struct netlink_skb_parms *)) { int err; - struct cn_dev *dev = &cdev; + struct cn_dev *dev = get_cdev(get_ve0()); if (!cn_already_initialized) return -EAGAIN; @@ -213,7 +219,7 @@ EXPORT_SYMBOL_GPL(cn_add_callback); */ void cn_del_callback(struct cb_id 
*id) { - struct cn_dev *dev = &cdev; + struct cn_dev *dev = get_cdev(get_ve0()); cn_queue_del_callback(dev->cbdev, id); } @@ -221,7 +227,7 @@ EXPORT_SYMBOL_GPL(cn_del_callback); static int cn_proc_show(struct seq_file *m, void *v) { - struct cn_queue_dev *dev = cdev.cbdev; + struct cn_queue_dev *dev = get_cdev(get_ve0())->cbdev; struct cn_callback_entry *cbq; seq_printf(m, "NameID\n"); @@ -255,7 +261,7 @@ static const struct file_operations cn_file_ops = { static int cn_init(void) { - struct cn_dev *dev = &cdev; + struct cn_dev *dev = get_cdev(get_ve0()); struct netlink_kernel_cfg cfg = { .groups = CN_NETLINK_USERS + 0xf, .input = cn_rx_skb, @@ -280,7 +286,7 @@ static int cn_init(void) static void cn_fini(void) { - struct cn_dev *dev = &cdev; + struct cn_dev *dev = get_cdev(get_ve0()); cn_already_initialized = 0; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
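The `get_cdev()` helper above is a deliberate two-step refactoring: first route every caller through an accessor that already takes the future per-container argument (while still returning the global), then, in a later patch, change only the accessor body. A toy sketch of both steps, with illustrative names standing in for `cn_dev`/`ve_struct`:

```c
/* Step-wise containerization of a global singleton.  Names are
 * illustrative stand-ins for cn_dev / ve_struct / get_cdev(). */
struct toy_dev { int id; };
struct toy_ve  { struct toy_dev *dev; };	/* later: per-VE storage */

static struct toy_dev toy_global_dev = { .id = 0 };

/* Step 1 (this patch): accept the context but ignore it. */
static struct toy_dev *toy_get_dev_v1(struct toy_ve *ve)
{
	(void)ve;
	return &toy_global_dev;
}

/* Step 2 (a later patch): identical signature, per-context lookup.
 * No caller has to change between the two steps. */
static struct toy_dev *toy_get_dev_v2(struct toy_ve *ve)
{
	return ve->dev;
}
```

Because the signature is fixed from the start, the churn of touching every call site happens once, in a patch that is trivially correct (behavior unchanged), and the behavioral change later is a one-line diff.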
[Devel] [PATCH RHEL7 COMMIT] proc connector: use per-ve netlink sender helper
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit df6a3526acfae69476e008569e659ac52374950c Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:42 2017 +0300 proc connector: use per-ve netlink sender helper Required to send event in the network to the right listener. Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index df6553d..17e0247 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -115,7 +115,7 @@ static void proc_event_connector(struct task_struct *task, return; /* If cn_netlink_send() failed, the data is not sent */ - cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL); + cn_netlink_send_ve(ve, msg, CN_IDX_PROC, GFP_KERNEL); } static bool fill_fork_event(struct proc_event *ev, struct ve_struct *ve, @@ -302,7 +302,7 @@ static void cn_proc_ack(struct ve_struct *ve, int err, int rcvd_seq, int rcvd_ac msg->ack = rcvd_ack + 1; msg->len = sizeof(*ev); msg->flags = 0; /* not used */ - cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL); + cn_netlink_send_ve(ve, msg, CN_IDX_PROC, GFP_KERNEL); } /** ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
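The `cn_netlink_send()` / `cn_netlink_send_ve()` split used above follows a common pattern: the context-aware variant does the real work, and the legacy entry point becomes a thin wrapper that supplies the host (VE0) context, so existing callers keep compiling and behaving as before. A hedged userspace sketch (all names are toy stand-ins; the "send" just encodes which container saw the message so the routing is observable):

```c
/* Wrapper-with-default-context pattern behind cn_netlink_send{,_ve}().
 * Toy stand-ins only: "sending" returns veid * 1000 + msg so a test
 * can observe which context handled the message. */
struct toy_ve { int veid; };

static struct toy_ve toy_ve0 = { .veid = 0 };	/* host context */

static int toy_send_ve(struct toy_ve *ve, int msg)
{
	return ve->veid * 1000 + msg;	/* kernel: netlink_broadcast() */
}

/* Legacy API: unchanged signature, always targets the host. */
static int toy_send(int msg)
{
	return toy_send_ve(&toy_ve0, msg);
}
```

Call sites that genuinely care about the container (like `proc_event_connector()` in this patch) migrate to the `_ve` variant; everything else stays on the wrapper untouched.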
[Devel] [PATCH RHEL7 COMMIT] proc connector: add per-ve init and fini routines

The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5

--> commit ed6801f36adefd236c8d87418518763e876fb1ad
Author: Stanislav Kinsburskiy
Date: Thu Aug 31 17:40:38 2017 +0300

    proc connector: add per-ve init and fini routines

    These routines will be called from the main connector per-ve init and
    fini routines.

    Signed-off-by: Stanislav Kinsburskiy
    Reviewed-by: Andrey Ryabinin
---
 drivers/connector/cn_proc.c | 17 +++++++++++++++++
 include/linux/connector.h   |  3 +++
 2 files changed, 20 insertions(+)

diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c
index 17a8c8c..8998335 100644
--- a/drivers/connector/cn_proc.c
+++ b/drivers/connector/cn_proc.c
@@ -329,6 +329,23 @@ static void cn_proc_mcast_ctl(struct cn_msg *msg,
 	cn_proc_ack(err, msg->seq, msg->ack);
 }

+int cn_proc_init_ve(struct ve_struct *ve)
+{
+	int err = cn_add_callback_ve(ve, &cn_proc_event_id,
+				     "cn_proc",
+				     &cn_proc_mcast_ctl);
+	if (err) {
+		pr_warn("VE#%d: cn_proc failed to register\n", ve->veid);
+		return err;
+	}
+	return 0;
+}
+
+void cn_proc_fini_ve(struct ve_struct *ve)
+{
+	cn_del_callback_ve(ve, &cn_proc_event_id);
+}
+
 /*
  * cn_proc_init - initialization entry point
  *
diff --git a/include/linux/connector.h b/include/linux/connector.h
index 8b44bf0..60eb089 100644
--- a/include/linux/connector.h
+++ b/include/linux/connector.h
@@ -76,6 +76,9 @@ struct cn_private {
 };

+int cn_proc_init_ve(struct ve_struct *ve);
+void cn_proc_fini_ve(struct ve_struct *ve);
+
 int cn_add_callback_ve(struct ve_struct *ve, struct cb_id *id,
 		       const char *name,
 		       void (*callback)(struct cn_msg *,
[Devel] [PATCH RHEL7 COMMIT] connector: remove redundant input callback from cn_dev
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit b36f8c16abd69f33268c7b57613f529252a28075 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:27 2017 +0300 connector: remove redundant input callback from cn_dev Patchset description: proc connector: containerize on per-VE basis This feature is requested by customer and needed by cgred service. https://jira.sw.ru/browse/PSBM-60227 What's ne in v2: 1) Containerization is done on per-VE basis 2) Event in container is also sent to VE#0 Stanislav Kinsburskiy (27): connector: remove redundant input callback from cn_dev connector: store all private data on VE structure connector: introduce VE-aware get_cdev() helper connector: per-ve init and fini helpers introduced connector: use device stored in VE connector: per-ve helpers intoruduced connector: take cn_already_initialized from VE proc connector: generic proc_event_connector() helper introduced proc connector: use generic event helper for fork event proc connector: use generic event helper for exec event proc connector: use generic event helper for id event proc connector: use generic event helper for sid event proc connector: use generic event helper for ptrace event proc connector: use generic event helper for comm event proc connector: use generic event helper for coredump event proc connector: use generic event helper for exit event proc connector: add pid namespace awareness proc connector: add per-ve init and fini foutines proc connector: call proc-related init and fini routines explicitly proc connector: take number of listeners and per-cpu conters from VE proc connector: pass VE to event fillers proc connector: take namespaces from VE proc connector: use per-ve netlink sender helper proc connector: send events to both VEs if not in VE#0 connector: containerize "connector" proc entry connector: take VE from socket upon callback connector: 
add VE SS hook = This patch description: A small cleanup: this callback is never used. Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/connector.c | 6 +- include/linux/connector.h | 1 - 2 files changed, 1 insertion(+), 6 deletions(-) diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c index 0daa11e..da26064 100644 --- a/drivers/connector/connector.c +++ b/drivers/connector/connector.c @@ -253,16 +253,12 @@ static const struct file_operations cn_file_ops = { .release = single_release }; -static struct cn_dev cdev = { - .input = cn_rx_skb, -}; - static int cn_init(void) { struct cn_dev *dev = &cdev; struct netlink_kernel_cfg cfg = { .groups = CN_NETLINK_USERS + 0xf, - .input = dev->input, + .input = cn_rx_skb, }; dev->nls = netlink_kernel_create(&init_net, NETLINK_CONNECTOR, &cfg); diff --git a/include/linux/connector.h b/include/linux/connector.h index b2b5a41..4c4d2b9 100644 --- a/include/linux/connector.h +++ b/include/linux/connector.h @@ -63,7 +63,6 @@ struct cn_dev { u32 seq, groups; struct sock *nls; - void (*input) (struct sk_buff *skb); struct cn_queue_dev *cbdev; }; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] proc connector: generic proc_event_connector() helper introduced
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit b4a281062d0770311132bc2a19b6797f63abe161 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:32 2017 +0300 proc connector: generic proc_event_connector() helper introduced A lot of code is duplicated in proc connector events handling. This patch introduces generic even handler, which will be used by different events. Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 50 + 1 file changed, 50 insertions(+) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 3165811..808b22a 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -64,6 +64,54 @@ static inline void get_seq(__u32 *ts, int *cpu) preempt_enable(); } +static struct cn_msg *cn_msg_fill(__u8 *buffer, + struct task_struct *task, + int what, int cookie, + bool (*fill_event)(struct proc_event *ev, +struct task_struct *task, +int cookie)) +{ + struct cn_msg *msg; + struct proc_event *ev; + struct timespec ts; + + msg = buffer_to_cn_msg(buffer); + ev = (struct proc_event *)msg->data; + + get_seq(&msg->seq, &ev->cpu); + memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id)); + msg->ack = 0; /* not used */ + msg->len = sizeof(*ev); + msg->flags = 0; /* not used */ + + memset(&ev->event_data, 0, sizeof(ev->event_data)); + ktime_get_ts(&ts); /* get high res monotonic timestamp */ + ev->timestamp_ns = timespec_to_ns(&ts); + ev->what = what; + + return fill_event(ev, task, cookie) ? 
msg : NULL; +} + +static void proc_event_connector(struct task_struct *task, +int what, int cookie, +bool (*fill_event)(struct proc_event *ev, + struct task_struct *task, + int cookie)) +{ + struct cn_msg *msg; + __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); + + if (atomic_read(&proc_event_num_listeners) < 1) + return; + + msg = cn_msg_fill(buffer, task, what, cookie, fill_event); + if (!msg) + return; + + /* If cn_netlink_send() failed, the data is not sent */ + cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL); +} + void proc_fork_connector(struct task_struct *task) { struct cn_msg *msg; @@ -72,6 +120,8 @@ void proc_fork_connector(struct task_struct *task) struct timespec ts; struct task_struct *parent; + (void) proc_event_connector; + if (atomic_read(&proc_event_num_listeners) < 1) return; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
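The `proc_event_connector()` helper introduced above removes the copy-pasted header setup from every event path by taking a per-event "filler" callback: the common code builds the message skeleton, the callback contributes only the type-specific payload, and returning `false` vetoes the event (as the id filler does for unknown id kinds). A compact userspace sketch of that callback-based deduplication, with toy types standing in for `cn_msg`/`proc_event`:

```c
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for struct proc_event. */
struct toy_event { int what; int pid; int cookie; };

typedef bool (*toy_fill_fn)(struct toy_event *ev, int pid, int cookie);

/* One filler per event type; this one mimics an exit-style event. */
static bool toy_fill_exit(struct toy_event *ev, int pid, int cookie)
{
	ev->pid = pid;
	ev->cookie = cookie;	/* e.g. the exit code */
	return true;
}

/* Common path: builds the event skeleton, delegates the type-specific
 * part, and only "sends" (here: returns 0) if the filler agreed. */
static int toy_emit_event(int what, int pid, int cookie,
			  toy_fill_fn fill, struct toy_event *out)
{
	memset(out, 0, sizeof(*out));
	out->what = what;
	if (!fill(out, pid, cookie))
		return -1;	/* filler vetoed the event */
	return 0;		/* kernel: cn_netlink_send() here */
}
```

The temporary `(void) proc_event_connector;` cast in the patch exists only to silence the unused-function warning until the follow-up patches convert each event path to the helper.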
[Devel] [PATCH RHEL7 COMMIT] connector: containerize "connector" proc entry
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5

--> commit 2100a680437f0c26d65ab4e304cc274399ccbcf3
Author: Stanislav Kinsburskiy
Date: Thu Aug 31 17:40:43 2017 +0300

    connector: containerize "connector" proc entry

    Needed to expose "/proc/net/connector" in CT and show the right content.

    Signed-off-by: Stanislav Kinsburskiy
    Reviewed-by: Andrey Ryabinin
---
 drivers/connector/connector.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
index 59d81a3..771dadf 100644
--- a/drivers/connector/connector.c
+++ b/drivers/connector/connector.c
@@ -241,7 +241,7 @@ EXPORT_SYMBOL_GPL(cn_del_callback);
 static int cn_proc_show(struct seq_file *m, void *v)
 {
-	struct cn_queue_dev *dev = get_cdev(get_ve0())->cbdev;
+	struct cn_queue_dev *dev = get_cdev(get_exec_env())->cbdev;
 	struct cn_callback_entry *cbq;

 	seq_printf(m, "NameID\n");
@@ -303,7 +303,7 @@ static int cn_init_ve(struct ve_struct *ve)

 	ve->cn->cn_already_initialized = 1;

-	if (!proc_create("connector", S_IRUGO, net->proc_net, &cn_file_ops)) {
+	if (!proc_net_create("connector", S_IRUGO, net->proc_net, &cn_file_ops)) {
 		err = -ENOMEM;
 		goto free_cdev;
 	}
[Devel] [PATCH RHEL7 COMMIT] proc connector: use generic event helper for id event
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 34e9dc939d3adba763404a3e97b38a4255dd1e02 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:34 2017 +0300 proc connector: use generic event helper for id event Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 29 - 1 file changed, 8 insertions(+), 21 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 06fd6b3..0647fcf 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -145,21 +145,11 @@ void proc_exec_connector(struct task_struct *task) proc_event_connector(task, PROC_EVENT_EXEC, 0, fill_exec_event); } -void proc_id_connector(struct task_struct *task, int which_id) +static bool fill_id_event(struct proc_event *ev, struct task_struct *task, + int which_id) { - struct cn_msg *msg; - struct proc_event *ev; - __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); - struct timespec ts; const struct cred *cred; - if (atomic_read(&proc_event_num_listeners) < 1) - return; - - msg = buffer_to_cn_msg(buffer); - ev = (struct proc_event *)msg->data; - memset(&ev->event_data, 0, sizeof(ev->event_data)); - ev->what = which_id; ev->event_data.id.process_pid = task->pid; ev->event_data.id.process_tgid = task->tgid; rcu_read_lock(); @@ -172,18 +162,15 @@ void proc_id_connector(struct task_struct *task, int which_id) ev->event_data.id.e.egid = from_kgid_munged(&init_user_ns, cred->egid); } else { rcu_read_unlock(); - return; + return false; } rcu_read_unlock(); - get_seq(&msg->seq, &ev->cpu); - ktime_get_ts(&ts); /* get high res monotonic timestamp */ - ev->timestamp_ns = timespec_to_ns(&ts); + return true; +} - memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id)); - msg->ack = 0; /* not used */ - msg->len = sizeof(*ev); - msg->flags = 0; /* not used */ - cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL); +void 
proc_id_connector(struct task_struct *task, int which_id) +{ + proc_event_connector(task, which_id, which_id, fill_id_event); } void proc_sid_connector(struct task_struct *task) ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] connector: per-ve helpers introduced
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 596e20e4cfc9660a390027c3d5b5d2d9fc61b203 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:30 2017 +0300 connector: per-ve helpers intoruduced This is precursor patch. Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/connector.c | 48 +-- include/linux/connector.h | 7 +++ 2 files changed, 40 insertions(+), 15 deletions(-) diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c index bc2308a..bba667d 100644 --- a/drivers/connector/connector.c +++ b/drivers/connector/connector.c @@ -67,14 +67,14 @@ static struct cn_dev *get_cdev(struct ve_struct *ve) return &ve->cn->cdev; } -int cn_netlink_send(struct cn_msg *msg, u32 __group, gfp_t gfp_mask) +int cn_netlink_send_ve(struct ve_struct *ve, struct cn_msg *msg, u32 __group, gfp_t gfp_mask) { struct cn_callback_entry *__cbq; unsigned int size; struct sk_buff *skb; struct nlmsghdr *nlh; struct cn_msg *data; - struct cn_dev *dev = get_cdev(get_ve0()); + struct cn_dev *dev = get_cdev(ve); u32 group = 0; int found = 0; @@ -119,6 +119,11 @@ int cn_netlink_send(struct cn_msg *msg, u32 __group, gfp_t gfp_mask) return netlink_broadcast(dev->nls, skb, 0, group, gfp_mask); } + +int cn_netlink_send(struct cn_msg *msg, u32 __group, gfp_t gfp_mask) +{ + return cn_netlink_send_ve(get_ve0(), msg, __group, gfp_mask); +} EXPORT_SYMBOL_GPL(cn_netlink_send); /* @@ -183,18 +188,13 @@ static void cn_rx_skb(struct sk_buff *__skb) } } -/* - * Callback add routing - adds callback with given ID and name. - * If there is registered callback with the same ID it will not be added. - * - * May sleep. 
- */ -int cn_add_callback(struct cb_id *id, const char *name, - void (*callback)(struct cn_msg *, -struct netlink_skb_parms *)) +int cn_add_callback_ve(struct ve_struct *ve, + struct cb_id *id, const char *name, + void (*callback)(struct cn_msg *, + struct netlink_skb_parms *)) { int err; - struct cn_dev *dev = get_cdev(get_ve0()); + struct cn_dev *dev = get_cdev(ve); if (!cn_already_initialized) return -EAGAIN; @@ -205,8 +205,28 @@ int cn_add_callback(struct cb_id *id, const char *name, return 0; } + +/* + * Callback add routing - adds callback with given ID and name. + * If there is registered callback with the same ID it will not be added. + * + * May sleep. + */ +int cn_add_callback(struct cb_id *id, const char *name, + void (*callback)(struct cn_msg *, +struct netlink_skb_parms *)) +{ + return cn_add_callback_ve(get_ve0(), id, name, callback); +} EXPORT_SYMBOL_GPL(cn_add_callback); +void cn_del_callback_ve(struct ve_struct *ve, struct cb_id *id) +{ + struct cn_dev *dev = get_cdev(ve); + + cn_queue_del_callback(dev->cbdev, id); +} + /* * Callback remove routing - removes callback * with given ID. 
@@ -217,9 +237,7 @@ EXPORT_SYMBOL_GPL(cn_add_callback); */ void cn_del_callback(struct cb_id *id) { - struct cn_dev *dev = get_cdev(get_ve0()); - - cn_queue_del_callback(dev->cbdev, id); + cn_del_callback_ve(get_ve0(), id); } EXPORT_SYMBOL_GPL(cn_del_callback); diff --git a/include/linux/connector.h b/include/linux/connector.h index 9e05e28..8b44bf0 100644 --- a/include/linux/connector.h +++ b/include/linux/connector.h @@ -76,6 +76,13 @@ struct cn_private { }; +int cn_add_callback_ve(struct ve_struct *ve, + struct cb_id *id, const char *name, + void (*callback)(struct cn_msg *, + struct netlink_skb_parms *)); +void cn_del_callback_ve(struct ve_struct *ve, struct cb_id *id); +int cn_netlink_send_ve(struct ve_struct *ve, struct cn_msg *, u32, gfp_t); + int cn_add_callback(struct cb_id *id, const char *name, void (*callback)(struct cn_msg *, struct netlink_skb_parms *)); void cn_del_callback(struct cb_id *); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] proc connector: use generic event helper for exit event
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 72677a7d7de095a6c32f7f1c41e32fc3173337fd Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:37 2017 +0300 proc connector: use generic event helper for exit event Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 28 +++- 1 file changed, 7 insertions(+), 21 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 312f30f..4ee1640 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -235,33 +235,19 @@ void proc_coredump_connector(struct task_struct *task) proc_event_connector(task, PROC_EVENT_COREDUMP, 0, fill_coredump_event); } -void proc_exit_connector(struct task_struct *task) +static bool fill_exit_event(struct proc_event *ev, struct task_struct *task, + int unused) { - struct cn_msg *msg; - struct proc_event *ev; - __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); - struct timespec ts; - - if (atomic_read(&proc_event_num_listeners) < 1) - return; - - msg = buffer_to_cn_msg(buffer); - ev = (struct proc_event *)msg->data; - memset(&ev->event_data, 0, sizeof(ev->event_data)); - get_seq(&msg->seq, &ev->cpu); - ktime_get_ts(&ts); /* get high res monotonic timestamp */ - ev->timestamp_ns = timespec_to_ns(&ts); - ev->what = PROC_EVENT_EXIT; ev->event_data.exit.process_pid = task->pid; ev->event_data.exit.process_tgid = task->tgid; ev->event_data.exit.exit_code = task->exit_code; ev->event_data.exit.exit_signal = task->exit_signal; + return true; +} - memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id)); - msg->ack = 0; /* not used */ - msg->len = sizeof(*ev); - msg->flags = 0; /* not used */ - cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL); +void proc_exit_connector(struct task_struct *task) +{ + proc_event_connector(task, PROC_EVENT_EXIT, 0, fill_exit_event); } /* ___ Devel mailing list 
Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] proc connector: use generic event helper for sid event
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 9de4dc2591367ad8b7276ba1b8c723cb9960e9e9 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:34 2017 +0300 proc connector: use generic event helper for sid event Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 28 +++- 1 file changed, 7 insertions(+), 21 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 0647fcf..2ad2587 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -173,31 +173,17 @@ void proc_id_connector(struct task_struct *task, int which_id) proc_event_connector(task, which_id, which_id, fill_id_event); } -void proc_sid_connector(struct task_struct *task) +static bool fill_sid_event(struct proc_event *ev, struct task_struct *task, + int unused) { - struct cn_msg *msg; - struct proc_event *ev; - struct timespec ts; - __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); - - if (atomic_read(&proc_event_num_listeners) < 1) - return; - - msg = buffer_to_cn_msg(buffer); - ev = (struct proc_event *)msg->data; - memset(&ev->event_data, 0, sizeof(ev->event_data)); - get_seq(&msg->seq, &ev->cpu); - ktime_get_ts(&ts); /* get high res monotonic timestamp */ - ev->timestamp_ns = timespec_to_ns(&ts); - ev->what = PROC_EVENT_SID; ev->event_data.sid.process_pid = task->pid; ev->event_data.sid.process_tgid = task->tgid; + return true; +} - memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id)); - msg->ack = 0; /* not used */ - msg->len = sizeof(*ev); - msg->flags = 0; /* not used */ - cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL); +void proc_sid_connector(struct task_struct *task) +{ + proc_event_connector(task, PROC_EVENT_SID, 0, fill_sid_event); } void proc_ptrace_connector(struct task_struct *task, int ptrace_id) ___ Devel mailing list Devel@openvz.org 
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] proc connector: use generic event helper for ptrace event
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit c4c0ba8521053013532de1e7db5ec3b5d27276c2 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:35 2017 +0300 proc connector: use generic event helper for ptrace event Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 31 +-- 1 file changed, 9 insertions(+), 22 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 2ad2587..36a53fd 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -186,23 +186,9 @@ void proc_sid_connector(struct task_struct *task) proc_event_connector(task, PROC_EVENT_SID, 0, fill_sid_event); } -void proc_ptrace_connector(struct task_struct *task, int ptrace_id) +static bool fill_ptrace_event(struct proc_event *ev, struct task_struct *task, + int ptrace_id) { - struct cn_msg *msg; - struct proc_event *ev; - struct timespec ts; - __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); - - if (atomic_read(&proc_event_num_listeners) < 1) - return; - - msg = buffer_to_cn_msg(buffer); - ev = (struct proc_event *)msg->data; - memset(&ev->event_data, 0, sizeof(ev->event_data)); - get_seq(&msg->seq, &ev->cpu); - ktime_get_ts(&ts); /* get high res monotonic timestamp */ - ev->timestamp_ns = timespec_to_ns(&ts); - ev->what = PROC_EVENT_PTRACE; ev->event_data.ptrace.process_pid = task->pid; ev->event_data.ptrace.process_tgid = task->tgid; if (ptrace_id == PTRACE_ATTACH) { @@ -212,13 +198,14 @@ void proc_ptrace_connector(struct task_struct *task, int ptrace_id) ev->event_data.ptrace.tracer_pid = 0; ev->event_data.ptrace.tracer_tgid = 0; } else - return; + return false; + return true; +} - memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id)); - msg->ack = 0; /* not used */ - msg->len = sizeof(*ev); - msg->flags = 0; /* not used */ - cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL); +void 
proc_ptrace_connector(struct task_struct *task, int ptrace_id) +{ + proc_event_connector(task, PROC_EVENT_PTRACE, ptrace_id, +fill_ptrace_event); } void proc_comm_connector(struct task_struct *task) ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] proc connector: use generic event helper for exec event
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit ea2114f455580db5ab66460c31c19efbb7f716b2 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:33 2017 +0300 proc connector: use generic event helper for exec event Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 28 +++- 1 file changed, 7 insertions(+), 21 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index ffda79b..06fd6b3 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -132,31 +132,17 @@ void proc_fork_connector(struct task_struct *task) proc_event_connector(task, PROC_EVENT_FORK, 0, fill_fork_event); } -void proc_exec_connector(struct task_struct *task) +static bool fill_exec_event(struct proc_event *ev, struct task_struct *task, + int unused) { - struct cn_msg *msg; - struct proc_event *ev; - struct timespec ts; - __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); - - if (atomic_read(&proc_event_num_listeners) < 1) - return; - - msg = buffer_to_cn_msg(buffer); - ev = (struct proc_event *)msg->data; - memset(&ev->event_data, 0, sizeof(ev->event_data)); - get_seq(&msg->seq, &ev->cpu); - ktime_get_ts(&ts); /* get high res monotonic timestamp */ - ev->timestamp_ns = timespec_to_ns(&ts); - ev->what = PROC_EVENT_EXEC; ev->event_data.exec.process_pid = task->pid; ev->event_data.exec.process_tgid = task->tgid; + return true; +} - memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id)); - msg->ack = 0; /* not used */ - msg->len = sizeof(*ev); - msg->flags = 0; /* not used */ - cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL); +void proc_exec_connector(struct task_struct *task) +{ + proc_event_connector(task, PROC_EVENT_EXEC, 0, fill_exec_event); } void proc_id_connector(struct task_struct *task, int which_id) ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] connector: take cn_already_initialized from VE
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit a66190eb61ac389d0060e3cff22f76cff0bf4c3d Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:31 2017 +0300 connector: take cn_already_initialized from VE Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/connector.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c index bba667d..110637b 100644 --- a/drivers/connector/connector.c +++ b/drivers/connector/connector.c @@ -38,8 +38,6 @@ MODULE_AUTHOR("Evgeniy Polyakov "); MODULE_DESCRIPTION("Generic userspace <-> kernelspace connector."); MODULE_ALIAS_NET_PF_PROTO(PF_NETLINK, NETLINK_CONNECTOR); -static int cn_already_initialized; - /* * msg->seq and msg->ack are used to determine message genealogy. * When someone sends message it puts there locally unique sequence @@ -196,7 +194,7 @@ int cn_add_callback_ve(struct ve_struct *ve, int err; struct cn_dev *dev = get_cdev(ve); - if (!cn_already_initialized) + if (!ve->cn->cn_already_initialized) return -EAGAIN; err = cn_queue_add_callback(dev->cbdev, name, id, callback); @@ -300,7 +298,7 @@ static int cn_init_ve(struct ve_struct *ve) return -EINVAL; } - cn_already_initialized = 1; + ve->cn->cn_already_initialized = 1; proc_create("connector", S_IRUGO, net->proc_net, &cn_file_ops); @@ -312,7 +310,7 @@ static void cn_fini_ve(struct ve_struct *ve) struct cn_dev *dev = get_cdev(ve); struct net *net = ve->ve_netns; - cn_already_initialized = 0; + ve->cn->cn_already_initialized = 0; remove_proc_entry("connector", net->proc_net); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] proc connector: use generic event helper for fork event
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit b9b0ba3dfa697a80078cbef06b13caf3c14ec249 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:32 2017 +0300 proc connector: use generic event helper for fork event Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/cn_proc.c | 30 +++--- 1 file changed, 7 insertions(+), 23 deletions(-) diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c index 808b22a..ffda79b 100644 --- a/drivers/connector/cn_proc.c +++ b/drivers/connector/cn_proc.c @@ -112,26 +112,11 @@ static void proc_event_connector(struct task_struct *task, cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL); } -void proc_fork_connector(struct task_struct *task) +static bool fill_fork_event(struct proc_event *ev, struct task_struct *task, + int unused) { - struct cn_msg *msg; - struct proc_event *ev; - __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8); - struct timespec ts; struct task_struct *parent; - (void) proc_event_connector; - - if (atomic_read(&proc_event_num_listeners) < 1) - return; - - msg = buffer_to_cn_msg(buffer); - ev = (struct proc_event *)msg->data; - memset(&ev->event_data, 0, sizeof(ev->event_data)); - get_seq(&msg->seq, &ev->cpu); - ktime_get_ts(&ts); /* get high res monotonic timestamp */ - ev->timestamp_ns = timespec_to_ns(&ts); - ev->what = PROC_EVENT_FORK; rcu_read_lock(); parent = rcu_dereference(task->real_parent); ev->event_data.fork.parent_pid = parent->pid; @@ -139,13 +124,12 @@ void proc_fork_connector(struct task_struct *task) rcu_read_unlock(); ev->event_data.fork.child_pid = task->pid; ev->event_data.fork.child_tgid = task->tgid; + return true; +} - memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id)); - msg->ack = 0; /* not used */ - msg->len = sizeof(*ev); - msg->flags = 0; /* not used */ - /* If cn_netlink_send() failed, the data is not sent */ - 
cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL); +void proc_fork_connector(struct task_struct *task) +{ + proc_event_connector(task, PROC_EVENT_FORK, 0, fill_fork_event); } void proc_exec_connector(struct task_struct *task) ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] connector: per-ve init and fini helpers introduced
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 37c6a11416ce88290d381482e0b8bf568dc59e97 Author: Stanislav Kinsburskiy Date: Thu Aug 31 17:40:29 2017 +0300 connector: per-ve init and fini helpers introduced This helpers will be used later to initialize per-container connector. Signed-off-by: Stanislav Kinsburskiy Reviewed-by: Andrey Ryabinin --- drivers/connector/connector.c | 31 +-- 1 file changed, 25 insertions(+), 6 deletions(-) diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c index 407fe52..f5484b2 100644 --- a/drivers/connector/connector.c +++ b/drivers/connector/connector.c @@ -259,15 +259,20 @@ static const struct file_operations cn_file_ops = { .release = single_release }; -static int cn_init(void) +static int cn_init_ve(struct ve_struct *ve) { struct cn_dev *dev = get_cdev(get_ve0()); struct netlink_kernel_cfg cfg = { .groups = CN_NETLINK_USERS + 0xf, .input = cn_rx_skb, }; + struct net *net = ve->ve_netns; + + ve->cn = kzalloc(sizeof(*ve->cn), GFP_KERNEL); + if (!ve->cn) + return -ENOMEM; - dev->nls = netlink_kernel_create(&init_net, NETLINK_CONNECTOR, &cfg); + dev->nls = netlink_kernel_create(net, NETLINK_CONNECTOR, &cfg); if (!dev->nls) return -EIO; @@ -279,21 +284,35 @@ static int cn_init(void) cn_already_initialized = 1; - proc_create("connector", S_IRUGO, init_net.proc_net, &cn_file_ops); + proc_create("connector", S_IRUGO, net->proc_net, &cn_file_ops); return 0; } -static void cn_fini(void) +static void cn_fini_ve(struct ve_struct *ve) { - struct cn_dev *dev = get_cdev(get_ve0()); + struct cn_dev *dev = get_cdev(ve); + struct net *net = ve->ve_netns; cn_already_initialized = 0; - remove_proc_entry("connector", init_net.proc_net); + remove_proc_entry("connector", net->proc_net); cn_queue_free_dev(dev->cbdev); netlink_kernel_release(dev->nls); + + kfree(ve->cn); + ve->cn = NULL; +} + +static int 
cn_init(void) +{ + return cn_init_ve(get_ve0()); +} + +static void cn_fini(void) +{ + return cn_fini_ve(get_ve0()); } subsys_initcall(cn_init); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [RFC PATCH 2/2] autofs: send 32-bit sized packet for 32-bit process
The structure autofs_v5_packet (except name) is not aligned by 8 bytes, which leads to different sizes on 32- and 64-bit architectures. Let's form a 32-bit compatible packet when the daemon uses 32-bit addressing.

Signed-off-by: Stanislav Kinsburskiy
---
 fs/autofs4/waitq.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 309ca6b..484cf2e 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -153,12 +153,19 @@ static void autofs4_notify_daemon(struct autofs_sb_info *sbi,
 {
 	struct autofs_v5_packet *packet = &pkt.v5_pkt.v5_packet;
 	struct user_namespace *user_ns = sbi->pipe->f_cred->user_ns;
+	size_t name_offset;

-	pktsz = sizeof(*packet);
+	if (sbi->is32bit)
+		name_offset = offsetof(struct autofs_v5_packet, len) +
+				sizeof(packet->len);
+	else
+		name_offset = offsetof(struct autofs_v5_packet, name);
+
+	pktsz = name_offset + sizeof(packet->name);

 	packet->wait_queue_token = wq->wait_queue_token;
 	packet->len = wq->name.len;
-	memcpy(packet->name, wq->name.name, wq->name.len);
+	memcpy(packet + name_offset, wq->name.name, wq->name.len);
 	packet->name[wq->name.len] = '\0';
 	packet->dev = wq->dev;
 	packet->ino = wq->ino;

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [RFC PATCH 1/2] autofs: set compat flag on sbi when daemon uses 32-bit addressing
Signed-off-by: Stanislav Kinsburskiy
---
 fs/autofs4/inode.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index b23cf2a..989ac38 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -217,6 +217,7 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
 	int pgrp;
 	bool pgrp_set = false;
 	int ret = -EINVAL;
+	struct task_struct *tsk;

 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -281,10 +282,25 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
 				pgrp);
 			goto fail_dput;
 		}
+		tsk = get_pid_task(sbi->oz_pgrp, PIDTYPE_PGID);
+		if (!tsk) {
+			pr_warn("autofs: could not find process group leader %d\n",
+				pgrp);
+			goto fail_put_pid;
+		}
 	} else {
 		sbi->oz_pgrp = get_task_pid(current, PIDTYPE_PGID);
+		get_task_struct(current);
+		tsk = current;
 	}

+	if (test_tsk_thread_flag(tsk, TIF_ADDR32))
+		sbi->is32bit = 1;
+	else
+		sbi->is32bit = 0;
+
+	put_task_struct(tsk);
+
 	if (autofs_type_trigger(sbi->type))
 		__managed_dentry_set_managed(root);

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [RFC PATCH 0/2] autofs: add "compat" support
The idea is simple: reduce autofs_v5_packet for a 32-bit daemon on 64-bit architectures.

---

Stanislav Kinsburskiy (2):
      autofs: set compat flag on sbi when daemon uses 32-bit addressing
      autofs: send 32-bit sized packet for 32-bit process

 fs/autofs4/inode.c | 16 
 fs/autofs4/waitq.c | 11 +--
 2 files changed, 25 insertions(+), 2 deletions(-)

--

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] tswap: Add support for zero-filled pages
On 08/03/2017 12:54 PM, Kirill Tkhai wrote:
>  static int tswap_frontswap_store(unsigned type, pgoff_t offset,
>  			struct page *page)
>  {
>  	swp_entry_t entry = swp_entry(type, offset);
> +	int zero_filled = -1, err = 0;
>  	struct page *cache_page;
> -	int err = 0;
>
>  	if (!tswap_active)
>  		return -1;
>
>  	cache_page = tswap_lookup_page(entry);
> -	if (cache_page)
> -		goto copy;
> +	if (cache_page) {
> +		zero_filled = is_zero_filled_page(page);
> +		/* If type of page has not changed, just reuse it */
> +		if (zero_filled == (cache_page == ZERO_PAGE(0)))
> +			goto copy;
> +		tswap_delete_page(entry, NULL);
> +		put_page(cache_page);

I think if we race with tswap_frontswap_load() this will lead to double put_page().

> +	}
>
>  	if (!(current->flags & PF_MEMCG_RECLAIM))
>  		return -1;

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH] autofs: fix autofs_v5_packet structure for compat mode
31.08.2017 15:05, Dmitry V. Levin пишет: > On Thu, Aug 31, 2017 at 02:40:23PM +0300, Dmitry V. Levin wrote: >> On Thu, Aug 31, 2017 at 01:48:27PM +0300, Stanislav Kinsburskiy wrote: >>> >>> >>> 31.08.2017 13:38, Dmitry V. Levin пишет: On Thu, Aug 31, 2017 at 02:11:34PM +0400, Stanislav Kinsburskiy wrote: > Due to integer variables alignment size of struct autofs_v5_packet in 300 > bytes in 32-bit architectures (instead of 304 bytes in 64-bits > architectures). > > This may lead to memory corruption (64 bits kernel always send 304 bytes, > while 32-bit userspace application expects for 300). > > https://jira.sw.ru/browse/PSBM-71078 > > Signed-off-by: Stanislav Kinsburskiy > --- > include/uapi/linux/auto_fs4.h |2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/include/uapi/linux/auto_fs4.h b/include/uapi/linux/auto_fs4.h > index e02982f..8729a47 100644 > --- a/include/uapi/linux/auto_fs4.h > +++ b/include/uapi/linux/auto_fs4.h > @@ -137,6 +137,8 @@ struct autofs_v5_packet { > __u32 pid; > __u32 tgid; > __u32 len; > + __u32 blob; /* This is needed to align structure up to 8 > +bytes for ALL archs including 32-bit */ > char name[NAME_MAX+1]; > }; This change breaks ABI because it changes offsetof(struct autofs_v5_packet, name). If you need to fix the alignment, use __attribute__((aligned(8))). >>> >>> Nice to know you're watching. >>> Yes, attribute is better. >>> But how ABI is broken? On x86_64 this alignment is implied, so nothing is >>> changed. >> >> Your change increases offsetof(struct autofs_v5_packet, name) by 4 on all >> architectures. On architectures where the structure is 32-bit aligned >> this also leads to increase of its size by 4. >> An alignment change would also be an ABI breakage on 32-bit architectures, though. >>> >>> True. >>> But from my POW better have it working on 64bit archs for 32bit apps. >>> But anyway, upstream guys will device, whether they want 32-bit autofs >>> applications properly work on 64 or 32 bits. 
>> >> Let's fix old bugs without introducing new bugs. >> The right fix here seems to be a compat structure, that is, both 64-bit >> and 32-bit kernels should send the same 32-bit aligned structure, and >> it has to be the same structure sent by traditional 32-bit kernels. > > Alternatively, a much more simple fix would be to change 64-bit kernels > not to send the trailing 4 padding bytes of 64-bit aligned > struct autofs_v5_packet. That is, just send > offsetofend(struct autofs_v5_packet, name) bytes instead of > sizeof(struct autofs_v5_packet) regardless of architecture. > Fair enough, thanks! But this approach won't work, because autofs pipe has O_DIRECT flag. Compat structure looks more promising, but not yet clear to me, how to define one properly. Probably by replacing "len" with char array like this: /* autofs v5 common packet struct */ struct autofs_v5_packet { struct autofs_packet_hdr hdr; autofs_wqt_t wait_queue_token; __u32 dev; __u64 ino; __u32 uid; __u32 gid; __u32 pid; __u32 tgid; __u8 len[4]; char name[NAME_MAX+1]; }; What do you think? ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7 2/2] mm/memcg: reclaim only kmem if kmem limit reached.
On 08/31/2017 12:58 PM, Konstantin Khorenko wrote:
> Do we want to push it to mainstream as well?
>

I don't think so. Distributions are slowly moving towards the v2 cgroup, where the kmem limit simply doesn't exist. And for legacy cgroup v1, the lack of reclaim on kmem limit hit wasn't a mistake but a deliberate choice. There is no clear use case for this, and it adds a lot of complexity to the reclaim code and just looks a bit ugly.

> --
> Best regards,
>
> Konstantin Khorenko,
> Virtuozzo Linux Kernel Team

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH] autofs: fix autofs_v5_packet structure for compat mode
On Thu, Aug 31, 2017 at 02:40:23PM +0300, Dmitry V. Levin wrote: > On Thu, Aug 31, 2017 at 01:48:27PM +0300, Stanislav Kinsburskiy wrote: > > > > > > 31.08.2017 13:38, Dmitry V. Levin пишет: > > > On Thu, Aug 31, 2017 at 02:11:34PM +0400, Stanislav Kinsburskiy wrote: > > >> Due to integer variables alignment size of struct autofs_v5_packet in 300 > > >> bytes in 32-bit architectures (instead of 304 bytes in 64-bits > > >> architectures). > > >> > > >> This may lead to memory corruption (64 bits kernel always send 304 bytes, > > >> while 32-bit userspace application expects for 300). > > >> > > >> https://jira.sw.ru/browse/PSBM-71078 > > >> > > >> Signed-off-by: Stanislav Kinsburskiy > > >> --- > > >> include/uapi/linux/auto_fs4.h |2 ++ > > >> 1 file changed, 2 insertions(+) > > >> > > >> diff --git a/include/uapi/linux/auto_fs4.h > > >> b/include/uapi/linux/auto_fs4.h > > >> index e02982f..8729a47 100644 > > >> --- a/include/uapi/linux/auto_fs4.h > > >> +++ b/include/uapi/linux/auto_fs4.h > > >> @@ -137,6 +137,8 @@ struct autofs_v5_packet { > > >> __u32 pid; > > >> __u32 tgid; > > >> __u32 len; > > >> +__u32 blob; /* This is needed to align structure up > > >> to 8 > > >> + bytes for ALL archs including 32-bit > > >> */ > > >> char name[NAME_MAX+1]; > > >> }; > > > > > > This change breaks ABI because it changes offsetof(struct > > > autofs_v5_packet, name). > > > If you need to fix the alignment, use __attribute__((aligned(8))). > > > > > > > Nice to know you're watching. > > Yes, attribute is better. > > But how ABI is broken? On x86_64 this alignment is implied, so nothing is > > changed. > > Your change increases offsetof(struct autofs_v5_packet, name) by 4 on all > architectures. On architectures where the structure is 32-bit aligned > this also leads to increase of its size by 4. > > > > An alignment change would also be an ABI breakage on 32-bit architectures, > > > though. > > > > > > > True. 
> > But from my POW better have it working on 64bit archs for 32bit apps. > > But anyway, upstream guys will device, whether they want 32-bit autofs > > applications properly work on 64 or 32 bits. > > Let's fix old bugs without introducing new bugs. > The right fix here seems to be a compat structure, that is, both 64-bit > and 32-bit kernels should send the same 32-bit aligned structure, and > it has to be the same structure sent by traditional 32-bit kernels. Alternatively, a much more simple fix would be to change 64-bit kernels not to send the trailing 4 padding bytes of 64-bit aligned struct autofs_v5_packet. That is, just send offsetofend(struct autofs_v5_packet, name) bytes instead of sizeof(struct autofs_v5_packet) regardless of architecture. -- ldv signature.asc Description: PGP signature ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] tswap: Add support for zero-filled pages
Andrey, please review the patch. -- Best regards, Konstantin Khorenko, Virtuozzo Linux Kernel Team On 08/03/2017 12:54 PM, Kirill Tkhai wrote: This patch makes tswap to do not allocate a new page, if swapped page is zero-filled, and to use ZERO_PAGE() pointer to decode it instead. The same optimization is made in zram, and it may help VMs to reduce memory usage in some way. Signed-off-by: Kirill Tkhai --- mm/tswap.c | 65 +++- 1 file changed, 51 insertions(+), 14 deletions(-) diff --git a/mm/tswap.c b/mm/tswap.c index 15f5adc2dc9..6a3cb917059 100644 --- a/mm/tswap.c +++ b/mm/tswap.c @@ -54,16 +54,20 @@ static void tswap_lru_add(struct page *page) { struct tswap_lru *lru = &tswap_lru_node[page_to_nid(page)]; - list_add_tail(&page->lru, &lru->list); - lru->nr_items++; + if (page != ZERO_PAGE(0)) { + list_add_tail(&page->lru, &lru->list); + lru->nr_items++; + } } static void tswap_lru_del(struct page *page) { struct tswap_lru *lru = &tswap_lru_node[page_to_nid(page)]; - list_del(&page->lru); - lru->nr_items--; + if (page != ZERO_PAGE(0)) { + list_del(&page->lru); + lru->nr_items--; + } } static struct page *tswap_lookup_page(swp_entry_t entry) @@ -73,7 +77,7 @@ static struct page *tswap_lookup_page(swp_entry_t entry) spin_lock(&tswap_lock); page = radix_tree_lookup(&tswap_page_tree, entry.val); spin_unlock(&tswap_lock); - BUG_ON(page && page_private(page) != entry.val); + BUG_ON(page && page != ZERO_PAGE(0) && page_private(page) != entry.val); return page; } @@ -85,7 +89,8 @@ static int tswap_insert_page(swp_entry_t entry, struct page *page) if (err) return err; - set_page_private(page, entry.val); + if (page != ZERO_PAGE(0)) + set_page_private(page, entry.val); spin_lock(&tswap_lock); err = radix_tree_insert(&tswap_page_tree, entry.val, page); if (!err) { @@ -111,7 +116,7 @@ static struct page *tswap_delete_page(swp_entry_t entry, struct page *expected) spin_unlock(&tswap_lock); if (page) { BUG_ON(expected && page != expected); - BUG_ON(page_private(page) != 
entry.val); + BUG_ON(page_private(page) != entry.val && page != ZERO_PAGE(0)); } return page; } @@ -274,26 +279,57 @@ static void tswap_frontswap_init(unsigned type) */ } +static bool is_zero_filled_page(struct page *page) +{ + bool zero_filled = true; + unsigned long *v; + int i; + + v = kmap_atomic(page); + for (i = 0; i < PAGE_SIZE / sizeof(*v); i++) { + if (v[i] != 0) { + zero_filled = false; + break; + } + } + kunmap_atomic(v); + return zero_filled; +} + static int tswap_frontswap_store(unsigned type, pgoff_t offset, struct page *page) { swp_entry_t entry = swp_entry(type, offset); + int zero_filled = -1, err = 0; struct page *cache_page; - int err = 0; if (!tswap_active) return -1; cache_page = tswap_lookup_page(entry); - if (cache_page) - goto copy; + if (cache_page) { + zero_filled = is_zero_filled_page(page); + /* If type of page has not changed, just reuse it */ + if (zero_filled == (cache_page == ZERO_PAGE(0))) + goto copy; + tswap_delete_page(entry, NULL); + put_page(cache_page); + } if (!(current->flags & PF_MEMCG_RECLAIM)) return -1; - cache_page = alloc_page(TSWAP_GFP_MASK | __GFP_HIGHMEM); - if (!cache_page) - return -1; + if (zero_filled == -1) + zero_filled = is_zero_filled_page(page); + + if (!zero_filled) { + cache_page = alloc_page(TSWAP_GFP_MASK | __GFP_HIGHMEM); + if (!cache_page) + return -1; + } else { + cache_page = ZERO_PAGE(0); + get_page(cache_page); + } err = tswap_insert_page(entry, cache_page); if (err) { @@ -306,7 +342,8 @@ static int tswap_frontswap_store(unsigned type, pgoff_t offset, return -1; } copy: - copy_highpage(cache_page, page); + if (cache_page != ZERO_PAGE(0)) + copy_highpage(cache_page, page); return 0; } . ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
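The zero-detection at the core of the patch can be sketched in userspace. The following is a minimal analog of the patch's is_zero_filled_page(): scan the page one machine word at a time and bail out on the first non-zero word (the kernel version additionally wraps the scan in kmap_atomic()/kunmap_atomic()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* Userspace analog of is_zero_filled_page() from the patch: return
 * true only if every word of the page is zero. */
static bool is_zero_filled(const void *page)
{
	const unsigned long *v = page;
	size_t i;

	for (i = 0; i < PAGE_SIZE / sizeof(*v); i++) {
		if (v[i] != 0)
			return false;
	}
	return true;
}
```

When this check succeeds, the store path takes a reference on ZERO_PAGE(0) instead of allocating a cache page, which is where the memory saving comes from.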
Re: [Devel] [PATCH] autofs: fix autofs_v5_packet structure for compat mode
On Thu, Aug 31, 2017 at 01:48:27PM +0300, Stanislav Kinsburskiy wrote: > > > 31.08.2017 13:38, Dmitry V. Levin пишет: > > On Thu, Aug 31, 2017 at 02:11:34PM +0400, Stanislav Kinsburskiy wrote: > >> Due to integer variables alignment size of struct autofs_v5_packet in 300 > >> bytes in 32-bit architectures (instead of 304 bytes in 64-bits > >> architectures). > >> > >> This may lead to memory corruption (64 bits kernel always send 304 bytes, > >> while 32-bit userspace application expects for 300). > >> > >> https://jira.sw.ru/browse/PSBM-71078 > >> > >> Signed-off-by: Stanislav Kinsburskiy > >> --- > >> include/uapi/linux/auto_fs4.h |2 ++ > >> 1 file changed, 2 insertions(+) > >> > >> diff --git a/include/uapi/linux/auto_fs4.h b/include/uapi/linux/auto_fs4.h > >> index e02982f..8729a47 100644 > >> --- a/include/uapi/linux/auto_fs4.h > >> +++ b/include/uapi/linux/auto_fs4.h > >> @@ -137,6 +137,8 @@ struct autofs_v5_packet { > >>__u32 pid; > >>__u32 tgid; > >>__u32 len; > >> + __u32 blob; /* This is needed to align structure up to 8 > >> + bytes for ALL archs including 32-bit */ > >>char name[NAME_MAX+1]; > >> }; > > > > This change breaks ABI because it changes offsetof(struct autofs_v5_packet, > > name). > > If you need to fix the alignment, use __attribute__((aligned(8))). > > > > Nice to know you're watching. > Yes, attribute is better. > But how ABI is broken? On x86_64 this alignment is implied, so nothing is > changed. Your change increases offsetof(struct autofs_v5_packet, name) by 4 on all architectures. On architectures where the structure is 32-bit aligned this also leads to increase of its size by 4. > > An alignment change would also be an ABI breakage on 32-bit architectures, > > though. > > > > True. > But from my POW better have it working on 64bit archs for 32bit apps. > But anyway, upstream guys will device, whether they want 32-bit autofs > applications properly work on 64 or 32 bits. Let's fix old bugs without introducing new bugs. 
The right fix here seems to be a compat structure, that is, both 64-bit and 32-bit kernels should send the same 32-bit aligned structure, and it has to be the same structure sent by traditional 32-bit kernels. -- ldv signature.asc Description: PGP signature ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
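The size computation behind the suggested compat fix can be sketched in userspace. The struct below is a simplified stand-in (not the real autofs_v5_packet layout): its trailing u64-aligned member forces tail padding on 64-bit ABIs, and a 32-bit daemon should receive a packet truncated right after name[], with no tail padding. offsetofend() is reproduced here as defined in the kernel:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAME_MAX 255

/* offsetofend() as defined in the kernel: offset of the first byte
 * past MEMBER. */
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

/* Simplified stand-in for autofs_v5_packet: the u64 member gives the
 * struct 8-byte alignment on 64-bit ABIs, so sizeof() includes tail
 * padding that a 32-bit daemon does not expect. */
struct v5_packet {
	uint64_t wait_queue_token;
	uint32_t len;
	char name[NAME_MAX + 1];
};

/* Sketch of the pktsz computation: send a packet that ends exactly
 * after name[] to a 32-bit daemon, the full padded struct otherwise. */
static size_t v5_pktsz(int is32bit)
{
	if (is32bit)
		return offsetofend(struct v5_packet, name);
	return sizeof(struct v5_packet);
}
```

The key property is that both sizes cover the same members at the same offsets; only the trailing alignment padding is dropped for the 32-bit consumer.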
Re: [Devel] [PATCH] autofs: fix autofs_v5_packet structure for compat mode
On 31.08.2017 13:38, Dmitry V. Levin wrote: > On Thu, Aug 31, 2017 at 02:11:34PM +0400, Stanislav Kinsburskiy wrote: >> Due to integer variables alignment size of struct autofs_v5_packet in 300 >> bytes in 32-bit architectures (instead of 304 bytes in 64-bits >> architectures). >> >> This may lead to memory corruption (64 bits kernel always send 304 bytes, >> while 32-bit userspace application expects for 300). >> >> https://jira.sw.ru/browse/PSBM-71078 >> >> Signed-off-by: Stanislav Kinsburskiy >> --- >> include/uapi/linux/auto_fs4.h |2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/include/uapi/linux/auto_fs4.h b/include/uapi/linux/auto_fs4.h >> index e02982f..8729a47 100644 >> --- a/include/uapi/linux/auto_fs4.h >> +++ b/include/uapi/linux/auto_fs4.h >> @@ -137,6 +137,8 @@ struct autofs_v5_packet { >> __u32 pid; >> __u32 tgid; >> __u32 len; >> +__u32 blob; /* This is needed to align structure up to 8 >> + bytes for ALL archs including 32-bit */ >> char name[NAME_MAX+1]; >> }; > > This change breaks ABI because it changes offsetof(struct autofs_v5_packet, name). > If you need to fix the alignment, use __attribute__((aligned(8))). > Nice to know you're watching. Yes, the attribute is better. But how is the ABI broken? On x86_64 this alignment is implied, so nothing is changed. > An alignment change would also be an ABI breakage on 32-bit architectures, > though. > True. But from my POV it's better to have it working on 64-bit archs for 32-bit apps. But anyway, the upstream guys will decide whether they want 32-bit autofs applications to work properly on 64 or 32 bits. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH] autofs: fix autofs_v5_packet structure for compat mode
On Thu, Aug 31, 2017 at 02:11:34PM +0400, Stanislav Kinsburskiy wrote: > Due to integer variables alignment size of struct autofs_v5_packet in 300 > bytes in 32-bit architectures (instead of 304 bytes in 64-bits architectures). > > This may lead to memory corruption (64 bits kernel always send 304 bytes, > while 32-bit userspace application expects for 300). > > https://jira.sw.ru/browse/PSBM-71078 > > Signed-off-by: Stanislav Kinsburskiy > --- > include/uapi/linux/auto_fs4.h |2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/include/uapi/linux/auto_fs4.h b/include/uapi/linux/auto_fs4.h > index e02982f..8729a47 100644 > --- a/include/uapi/linux/auto_fs4.h > +++ b/include/uapi/linux/auto_fs4.h > @@ -137,6 +137,8 @@ struct autofs_v5_packet { > __u32 pid; > __u32 tgid; > __u32 len; > + __u32 blob; /* This is needed to align structure up to 8 > +bytes for ALL archs including 32-bit */ > char name[NAME_MAX+1]; > }; This change breaks ABI because it changes offsetof(struct autofs_v5_packet, name). If you need to fix the alignment, use __attribute__((aligned(8))). An alignment change would also be an ABI breakage on 32-bit architectures, though. -- ldv signature.asc Description: PGP signature ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
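The ABI breakage described above is easy to demonstrate. The structs below are hypothetical stand-ins keeping only the trailing members of autofs_v5_packet; inserting the proposed blob field moves name by 4 bytes on every architecture, so every existing binary that computes the name offset from the old layout reads the wrong bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAME_MAX 255

/* Tail of the packet as shipped: name follows len directly. */
struct packet_old {
	uint32_t pid;
	uint32_t tgid;
	uint32_t len;
	char name[NAME_MAX + 1];
};

/* Tail of the packet with the proposed padding field inserted. */
struct packet_blob {
	uint32_t pid;
	uint32_t tgid;
	uint32_t len;
	uint32_t blob;	/* shifts name by 4 bytes on ALL architectures */
	char name[NAME_MAX + 1];
};
```

This is why a field insertion in the middle of a UAPI struct is never ABI-safe, regardless of what it does to the overall alignment.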
Re: [Devel] [PATCH] autofs: fix autofs_v5_packet structure for compat mode
Yes. On 31.08.2017 13:37, Konstantin Khorenko wrote: > Will you send it to mainstream as well? > > -- > Best regards, > > Konstantin Khorenko, > Virtuozzo Linux Kernel Team > > On 08/31/2017 01:11 PM, Stanislav Kinsburskiy wrote: >> Due to integer variables alignment size of struct autofs_v5_packet in 300 >> bytes in 32-bit architectures (instead of 304 bytes in 64-bits >> architectures). >> >> This may lead to memory corruption (64 bits kernel always send 304 bytes, >> while 32-bit userspace application expects for 300). >> >> https://jira.sw.ru/browse/PSBM-71078 >> >> Signed-off-by: Stanislav Kinsburskiy >> --- >> include/uapi/linux/auto_fs4.h |2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/include/uapi/linux/auto_fs4.h b/include/uapi/linux/auto_fs4.h >> index e02982f..8729a47 100644 >> --- a/include/uapi/linux/auto_fs4.h >> +++ b/include/uapi/linux/auto_fs4.h >> @@ -137,6 +137,8 @@ struct autofs_v5_packet { >> __u32 pid; >> __u32 tgid; >> __u32 len; >> +__u32 blob;/* This is needed to align structure up to 8 >> + bytes for ALL archs including 32-bit */ >> char name[NAME_MAX+1]; >> }; >> >> >> . >> ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH] autofs: fix autofs_v5_packet structure for compat mode
Will you send it to mainstream as well? -- Best regards, Konstantin Khorenko, Virtuozzo Linux Kernel Team On 08/31/2017 01:11 PM, Stanislav Kinsburskiy wrote: Due to integer variables alignment size of struct autofs_v5_packet in 300 bytes in 32-bit architectures (instead of 304 bytes in 64-bits architectures). This may lead to memory corruption (64 bits kernel always send 304 bytes, while 32-bit userspace application expects for 300). https://jira.sw.ru/browse/PSBM-71078 Signed-off-by: Stanislav Kinsburskiy --- include/uapi/linux/auto_fs4.h |2 ++ 1 file changed, 2 insertions(+) diff --git a/include/uapi/linux/auto_fs4.h b/include/uapi/linux/auto_fs4.h index e02982f..8729a47 100644 --- a/include/uapi/linux/auto_fs4.h +++ b/include/uapi/linux/auto_fs4.h @@ -137,6 +137,8 @@ struct autofs_v5_packet { __u32 pid; __u32 tgid; __u32 len; + __u32 blob; /* This is needed to align structure up to 8 + bytes for ALL archs including 32-bit */ char name[NAME_MAX+1]; }; . ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] autofs: fix autofs_v5_packet structure for compat mode
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit e484b0abe8af8793f58e6434060a3779261d3151 Author: Stanislav Kinsburskiy Date: Thu Aug 31 13:36:59 2017 +0300 autofs: fix autofs_v5_packet structure for compat mode Due to integer variable alignment, the size of struct autofs_v5_packet is 300 bytes on 32-bit architectures (instead of 304 bytes on 64-bit architectures). This may lead to memory corruption (a 64-bit kernel always sends 304 bytes, while a 32-bit userspace application expects 300). https://jira.sw.ru/browse/PSBM-71078 Signed-off-by: Stanislav Kinsburskiy --- include/uapi/linux/auto_fs4.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/uapi/linux/auto_fs4.h b/include/uapi/linux/auto_fs4.h index e02982f..8729a47 100644 --- a/include/uapi/linux/auto_fs4.h +++ b/include/uapi/linux/auto_fs4.h @@ -137,6 +137,8 @@ struct autofs_v5_packet { __u32 pid; __u32 tgid; __u32 len; + __u32 blob; /* This is needed to align structure up to 8 + bytes for ALL archs including 32-bit */ char name[NAME_MAX+1]; }; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH RHEL7 COMMIT] ms/workqueue: fix ghost PENDING flag while doing MQ IO
Please consider to release it as a ReadyKernel patch. https://readykernel.com/ -- Best regards, Konstantin Khorenko, Virtuozzo Linux Kernel Team On 08/31/2017 01:28 PM, Konstantin Khorenko wrote: The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit f24bbb53d5035c7b13b5ecb61728d5f12240f139 Author: Roman Pen Date: Thu Aug 31 13:28:46 2017 +0300 ms/workqueue: fix ghost PENDING flag while doing MQ IO We have the hole node hang, many processes hang on similar stack as here: crash> ps -m 8802b7f0 [0 00:20:36.663] [UN] PID: 22713 TASK: 8802b7f0 CPU: 1 COMMAND: "worker" crash> bt 8802b7f0 PID: 22713 TASK: 8802b7f0 CPU: 1 COMMAND: "worker" #0 [88031b04f980] __schedule at 8256cdd1 #1 [88031b04f9f8] schedule at 8256e239 #2 [88031b04fa18] schedule_timeout at 82561cea #3 [88031b04fb88] io_schedule_timeout at 8256c0d9 #4 [88031b04fbb8] wait_for_completion_io at 8256f3e0 #5 [88031b04fc90] blkdev_issue_flush at 8193a207 #6 [88031b04fe08] ext4_sync_file at a0af6d34 [ext4] #7 [88031b04fe68] vfs_fsync_range at 8173212c #8 [88031b04fec8] do_fsync at 817330dc #9 [88031b04ff68] sys_fdatasync at 8173437e RIP: 7f474a581ddd RSP: 7f46ba3fe8a0 RFLAGS: 0282 RAX: 004b RBX: 8258f609 RCX: RDX: 7f4754ffd458 RSI: RDI: 0011 RBP: R8: R9: 58b9 R10: 7f46ba3fe8b0 R11: 0293 R12: 7f475be25d80 R13: 8173437e R14: 88031b04ff78 R15: 7f4755141452 ORIG_RAX: 004b CS: 0033 SS: 002b crash> ps -m 8802b7f0 [0 00:20:36.663] [UN] PID: 22713 TASK: 8802b7f0 CPU: 1 COMMAND: "worker" Sleeps for 20 minutes on bio completion: blkdev_issue_flush: submit_bio(WRITE_FLUSH, bio); here>wait_for_completion_io(&wait); As bio->bi_rw = (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FLUSH), we had: submit_bio->generic_make_request->dm_make_request->queue_io->queue_work So in wait_for_completion_io we wait for dm_wq_work to complete these bio. 
But work is not in the workqueue already, as work->entry is empty list, so the work seem completed. That could happen only if md->flags had DMF_BLOCK_IO_FOR_SUSPEND bit set. But it is already unset, when we clear the bit we queue another dm_wq_work on these wq in dm_queue_flush. So what could've happened here is that operation reordering loads DMF_BLOCK_IO_FOR_SUSPEND bit in dm_wq_work before it was cleared in dm_queue_flush. Adding smp_mb in set_work_pool_and_clear_pending should order operations properly. https://jira.sw.ru/browse/PSBM-69788 original commit message: The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list with the following backtrace: [ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds. [ 601.347574] Tainted: G O4.4.5-1-storage+ #6 [ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 601.348142] kworker/u129:5 D 880803077988 0 1636 2 0x [ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server] [ 601.348999] 880803077988 88080466b900 8808033f9c80 880803078000 [ 601.349662] 880807c95000 7fff 815b0920 880803077ad0 [ 601.350333] 8808030779a0 815b01d5 880803077a38 [ 601.350965] Call Trace: [ 601.351203] [] ? bit_wait+0x60/0x60 [ 601.351444] [] schedule+0x35/0x80 [ 601.351709] [] schedule_timeout+0x192/0x230 [ 601.351958] [] ? blk_flush_plug_list+0xc7/0x220 [ 601.352208] [] ? ktime_get+0x37/0xa0 [ 601.352446] [] ? bit_wait+0x60/0x60 [ 601.352688] [] io_schedule_timeout+0xa4/0x110 [ 601.352951] [] ? _raw_spin_unlock_irqrestore+0xe/0x10 [ 601.353196] [] bit_wait_io+0x1b/0x70 [ 601.353440] [] __wait_on_bit+0x5d/0x90 [ 601.353689] [] wait_on_page_bit+0xc0/0xd0 [ 601.353958] [] ? 
autoremove_wake_function+0x40/0x40 [ 601.354200] [] __filemap_fdatawait_range+0xe4/0x140 [ 601.354441] [] filemap_fdatawait_range+0x14/0x30 [ 601.354688] [] filemap_write_and_wait_range+0x3f/0x70 [ 601.354932] [] blkdev_fsync+0x1b/0x50 [ 601.355193] [] vfs_fsync_range+0x49/0xa0 [ 601.355432] [] blkdev_write_iter+0xca/0x100 [ 601.355679] [] __vfs_write+0xaa/0xe0 [ 601.355925] [] vfs_write+0xa9/0x1a0 [ 601.356164] []
[Devel] [PATCH RHEL7 COMMIT] ms/workqueue: fix ghost PENDING flag while doing MQ IO
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit f24bbb53d5035c7b13b5ecb61728d5f12240f139 Author: Roman Pen Date: Thu Aug 31 13:28:46 2017 +0300 ms/workqueue: fix ghost PENDING flag while doing MQ IO We have the hole node hang, many processes hang on similar stack as here: crash> ps -m 8802b7f0 [0 00:20:36.663] [UN] PID: 22713 TASK: 8802b7f0 CPU: 1 COMMAND: "worker" crash> bt 8802b7f0 PID: 22713 TASK: 8802b7f0 CPU: 1 COMMAND: "worker" #0 [88031b04f980] __schedule at 8256cdd1 #1 [88031b04f9f8] schedule at 8256e239 #2 [88031b04fa18] schedule_timeout at 82561cea #3 [88031b04fb88] io_schedule_timeout at 8256c0d9 #4 [88031b04fbb8] wait_for_completion_io at 8256f3e0 #5 [88031b04fc90] blkdev_issue_flush at 8193a207 #6 [88031b04fe08] ext4_sync_file at a0af6d34 [ext4] #7 [88031b04fe68] vfs_fsync_range at 8173212c #8 [88031b04fec8] do_fsync at 817330dc #9 [88031b04ff68] sys_fdatasync at 8173437e RIP: 7f474a581ddd RSP: 7f46ba3fe8a0 RFLAGS: 0282 RAX: 004b RBX: 8258f609 RCX: RDX: 7f4754ffd458 RSI: RDI: 0011 RBP: R8: R9: 58b9 R10: 7f46ba3fe8b0 R11: 0293 R12: 7f475be25d80 R13: 8173437e R14: 88031b04ff78 R15: 7f4755141452 ORIG_RAX: 004b CS: 0033 SS: 002b crash> ps -m 8802b7f0 [0 00:20:36.663] [UN] PID: 22713 TASK: 8802b7f0 CPU: 1 COMMAND: "worker" Sleeps for 20 minutes on bio completion: blkdev_issue_flush: submit_bio(WRITE_FLUSH, bio); here> wait_for_completion_io(&wait); As bio->bi_rw = (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FLUSH), we had: submit_bio->generic_make_request->dm_make_request->queue_io->queue_work So in wait_for_completion_io we wait for dm_wq_work to complete these bio. But work is not in the workqueue already, as work->entry is empty list, so the work seem completed. That could happen only if md->flags had DMF_BLOCK_IO_FOR_SUSPEND bit set. 
But it is already unset, when we clear the bit we queue another dm_wq_work on these wq in dm_queue_flush. So what could've happened here is that operation reordering loads DMF_BLOCK_IO_FOR_SUSPEND bit in dm_wq_work before it was cleared in dm_queue_flush. Adding smp_mb in set_work_pool_and_clear_pending should order operations properly. https://jira.sw.ru/browse/PSBM-69788 original commit message: The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list with the following backtrace: [ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds. [ 601.347574] Tainted: G O4.4.5-1-storage+ #6 [ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 601.348142] kworker/u129:5 D 880803077988 0 1636 2 0x [ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server] [ 601.348999] 880803077988 88080466b900 8808033f9c80 880803078000 [ 601.349662] 880807c95000 7fff 815b0920 880803077ad0 [ 601.350333] 8808030779a0 815b01d5 880803077a38 [ 601.350965] Call Trace: [ 601.351203] [] ? bit_wait+0x60/0x60 [ 601.351444] [] schedule+0x35/0x80 [ 601.351709] [] schedule_timeout+0x192/0x230 [ 601.351958] [] ? blk_flush_plug_list+0xc7/0x220 [ 601.352208] [] ? ktime_get+0x37/0xa0 [ 601.352446] [] ? bit_wait+0x60/0x60 [ 601.352688] [] io_schedule_timeout+0xa4/0x110 [ 601.352951] [] ? _raw_spin_unlock_irqrestore+0xe/0x10 [ 601.353196] [] bit_wait_io+0x1b/0x70 [ 601.353440] [] __wait_on_bit+0x5d/0x90 [ 601.353689] [] wait_on_page_bit+0xc0/0xd0 [ 601.353958] [] ? 
autoremove_wake_function+0x40/0x40 [ 601.354200] [] __filemap_fdatawait_range+0xe4/0x140 [ 601.354441] [] filemap_fdatawait_range+0x14/0x30 [ 601.354688] [] filemap_write_and_wait_range+0x3f/0x70 [ 601.354932] [] blkdev_fsync+0x1b/0x50 [ 601.355193] [] vfs_fsync_range+0x49/0xa0 [ 601.355432] [] blkdev_write_iter+0xca/0x100 [ 601.355679] [] __vfs_write+0xaa/0xe0 [ 601.355925] [] vfs_write+0xa9/0x1a0 [ 601.356164] [] kernel_write+0x38/0x50 The underlying device is a null_blk, with default parameters: queue_mode= MQ submit_queues
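The ordering requirement the patch enforces can be illustrated with C11 atomics. This is not the kernel code and does not reproduce the race itself; it is a sketch, using atomic_thread_fence(memory_order_seq_cst) as a stand-in for smp_mb(), of the pattern: the flag clear in dm_queue_flush() must be globally visible before the requeued dm_wq_work() loads it, otherwise the work function can see a stale DMF_BLOCK_IO_FOR_SUSPEND bit and leave the queued bios unprocessed:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BLOCK_IO_BIT 0x1u

/* Flags word shared between the flush path and the work item. */
static atomic_uint flags;

/* Analog of dm_queue_flush(): clear the "block io" flag, then fence so
 * that the subsequent queue_work() cannot be ordered before the clear. */
static void queue_flush(void)
{
	atomic_fetch_and_explicit(&flags, ~BLOCK_IO_BIT, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() stand-in */
}

/* Analog of the check in dm_wq_work(): with a matching fence (the one
 * the patch adds in set_work_pool_and_clear_pending()), this load is
 * guaranteed to observe the clear made before the work was queued. */
static bool io_blocked(void)
{
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&flags, memory_order_relaxed) & BLOCK_IO_BIT;
}
```

A single-threaded test can only exercise the functional behavior; the actual bug is a cross-CPU reordering window that the paired barriers close.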
[Devel] [PATCH RHEL7 COMMIT] fs-writeback: add endless writeback debug
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 1069e544ff85161d41fd3679c3d3b47dc3af5139 Author: Dmitry Monakhov Date: Fri Aug 25 13:16:52 2017 +0400 fs-writeback: add endless writeback debug This is temporary debug patch, it will be rolled back before the release. https://jira.sw.ru/browse/PSBM-69587 Signed-off-by: Dmitry Monakhov --- fs/fs-writeback.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index f34ae6c..a54c0bd 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -787,11 +787,15 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb, { unsigned long start_time = jiffies; long wrote = 0; + int trace = 0; while (!list_empty(&wb->b_io)) { struct inode *inode = wb_inode(wb->b_io.prev); struct super_block *sb = inode->i_sb; + if (time_is_before_jiffies(start_time + 15* HZ)) + trace = 1; + if (!grab_super_passive(sb)) { /* * grab_super_passive() may fail consistently due to @@ -799,6 +803,9 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb, * requeue_io() to avoid busy retrying the inode/sb. */ redirty_tail(inode, wb); + if (trace) + printk("%s:%d writeback is taking too long ino:%ld sb(%p):%s\n", + __FUNCTION__, __LINE__, inode->i_ino, sb, sb->s_id); continue; } wrote += writeback_sb_inodes(sb, wb, work); @@ -890,6 +897,7 @@ static long wb_writeback(struct bdi_writeback *wb, unsigned long oldest_jif; struct inode *inode; long progress; + int trace = 0; oldest_jif = jiffies; work->older_than_this = &oldest_jif; @@ -902,6 +910,9 @@ static long wb_writeback(struct bdi_writeback *wb, if (work->nr_pages <= 0) break; + if (time_is_before_jiffies(wb_start + 15* HZ)) + trace = 1; + /* * Background writeout and kupdate-style writeback may * run forever. 
Stop them if there is other work to do @@ -973,6 +984,10 @@ static long wb_writeback(struct bdi_writeback *wb, inode = wb_inode(wb->b_more_io.prev); spin_lock(&inode->i_lock); spin_unlock(&wb->list_lock); + if (trace) + printk("%s:%d writeback is taking too long ino:%ld st:%ld sb(%p):%s\n", + __FUNCTION__, __LINE__, inode->i_ino, + inode->i_state, inode->i_sb, inode->i_sb->s_id); /* This function drops i_lock... */ inode_sleep_on_writeback(inode); spin_lock(&wb->list_lock); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
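The debug pattern the patch adds is: remember when the writeback loop started, and flip a trace flag once it has been running longer than 15 seconds — time_is_before_jiffies(start_time + 15 * HZ) becomes true once jiffies has passed that deadline. A userspace sketch of the same predicate, using plain seconds instead of jiffies:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define TRACE_AFTER_SECS 15

/* Userspace analog of
 *	if (time_is_before_jiffies(start_time + 15 * HZ))
 *		trace = 1;
 * i.e. start printing diagnostics once the loop has run for more than
 * 15 seconds. */
static bool should_trace(time_t start, time_t now)
{
	return now - start > TRACE_AFTER_SECS;
}
```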
[Devel] [PATCH] autofs: fix autofs_v5_packet structure for compat mode
Due to integer variable alignment, the size of struct autofs_v5_packet is 300 bytes on 32-bit architectures (instead of 304 bytes on 64-bit architectures). This may lead to memory corruption (a 64-bit kernel always sends 304 bytes, while a 32-bit userspace application expects 300). https://jira.sw.ru/browse/PSBM-71078 Signed-off-by: Stanislav Kinsburskiy --- include/uapi/linux/auto_fs4.h |2 ++ 1 file changed, 2 insertions(+) diff --git a/include/uapi/linux/auto_fs4.h b/include/uapi/linux/auto_fs4.h index e02982f..8729a47 100644 --- a/include/uapi/linux/auto_fs4.h +++ b/include/uapi/linux/auto_fs4.h @@ -137,6 +137,8 @@ struct autofs_v5_packet { __u32 pid; __u32 tgid; __u32 len; + __u32 blob; /* This is needed to align structure up to 8 + bytes for ALL archs including 32-bit */ char name[NAME_MAX+1]; }; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] mm/memcg: reclaim only kmem if kmem limit reached
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit aa84e9472d88646f993f8bf1f2eb03a6abad93cd Author: Andrey Ryabinin Date: Thu Aug 31 13:03:24 2017 +0300 mm/memcg: reclaim only kmem if kmem limit reached If kmem limit on memcg reached, we go into memory reclaim, and reclaim everything we can, including page cache and anon. Reclaiming page cache or anon won't help since we need to lower only kmem usage. This patch fixes the problem by avoiding non-kmem reclaim on hitting the kmem limit. https://jira.sw.ru/browse/PSBM-69226 Signed-off-by: Andrey Ryabinin --- include/linux/memcontrol.h | 10 ++ include/linux/swap.h | 2 +- mm/memcontrol.c| 30 -- mm/vmscan.c| 31 --- 4 files changed, 51 insertions(+), 22 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1a52e58..1d6bc80 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -45,6 +45,16 @@ struct mem_cgroup_reclaim_cookie { unsigned int generation; }; +/* + * Reclaim flags for mem_cgroup_hierarchical_reclaim + */ +#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0 +#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT) +#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1 +#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT) +#define MEM_CGROUP_RECLAIM_KMEM_BIT0x2 +#define MEM_CGROUP_RECLAIM_KMEM(1 << MEM_CGROUP_RECLAIM_KMEM_BIT) + #ifdef CONFIG_MEMCG int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask, struct mem_cgroup **memcgp); diff --git a/include/linux/swap.h b/include/linux/swap.h index bd162f9..bd47451 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -324,7 +324,7 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); extern unsigned long try_to_free_mem_cgroup_pages(struct 
mem_cgroup *mem, unsigned long nr_pages, - gfp_t gfp_mask, bool noswap); + gfp_t gfp_mask, int flags); extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, gfp_t gfp_mask, bool noswap, struct zone *zone, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 09ce016..5372151 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -511,16 +511,6 @@ enum res_type { #define OOM_CONTROL(0) /* - * Reclaim flags for mem_cgroup_hierarchical_reclaim - */ -#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0 -#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT) -#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1 -#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT) -#define MEM_CGROUP_RECLAIM_KMEM_BIT0x2 -#define MEM_CGROUP_RECLAIM_KMEM(1 << MEM_CGROUP_RECLAIM_KMEM_BIT) - -/* * The memcg_create_mutex will be held whenever a new cgroup is created. * As a consequence, any change that needs to protect against new child cgroups * appearing has to hold it as well. @@ -2137,7 +2127,7 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg, if (loop) drain_all_stock_async(memcg); total += try_to_free_mem_cgroup_pages(memcg, SWAP_CLUSTER_MAX, - gfp_mask, noswap); + gfp_mask, flags); if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) return 1; @@ -2150,6 +2140,16 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg, break; if (mem_cgroup_margin(memcg, flags & MEM_CGROUP_RECLAIM_KMEM)) break; + + /* +* Try harder to reclaim dcache. dcache reclaim may +* temporarly fail due to dcache->dlock being held +* by someone else. We must try harder to avoid premature +* slab allocation failures. +*/ + if (flags & MEM_CGROUP_RECLAIM_KMEM && + page_counter_read(&memcg->dcache)) + continue; /* * If nothing was reclaimed after two attempts, there * may be no reclaimable pages in this hierarchy. @@ -2778,11 +2778,13 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, bool kmem_charge struct mem_cg
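The decision the patch makes can be reduced to one flag check: when the reclaim was triggered by the kmem limit, freeing page cache or anon cannot lower kmem usage, so the scan-control setup should restrict reclaim to slab. The function below is a hypothetical simplification (the field names on struct scan_control here are illustrative, not the kernel's) of that logic:

```c
#include <assert.h>
#include <stdbool.h>

#define MEM_CGROUP_RECLAIM_NOSWAP	(1 << 0)
#define MEM_CGROUP_RECLAIM_SHRINK	(1 << 1)
#define MEM_CGROUP_RECLAIM_KMEM	(1 << 2)

/* Illustrative scan-control knobs; not the real struct scan_control. */
struct scan_control {
	bool may_swap;		/* scan anonymous pages */
	bool may_pagecache;	/* scan page cache */
	bool slab_only;		/* only shrink slab/kmem */
};

/* Sketch of the patch's reclaim-mode decision: a kmem-limit hit
 * reclaims only kernel memory. */
static void setup_scan_control(int flags, struct scan_control *sc)
{
	sc->may_swap = !(flags & MEM_CGROUP_RECLAIM_NOSWAP);
	sc->slab_only = flags & MEM_CGROUP_RECLAIM_KMEM;
	sc->may_pagecache = !sc->slab_only;
}
```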
[Devel] [PATCH RHEL7 COMMIT] ms/mm: use sc->priority for slab shrink targets
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-514.26.1.vz7.35.5 --> commit 5a99b388025b5981f44c29588a22dc37607f990c Author: Josef Bacik Date: Thu Aug 31 13:03:23 2017 +0300 ms/mm: use sc->priority for slab shrink targets Previously we were using the ratio of the number of lru pages scanned to the number of eligible lru pages to determine the number of slab objects to scan. The problem with this is that these two things have nothing to do with each other, so in slab heavy work loads where there is little to no page cache we can end up with the pages scanned being a very low number. This means that we reclaim next to no slab pages and waste a lot of time reclaiming small amounts of space. Instead use sc->priority in the same way we use it to determine scan amounts for the lru's. This generally equates to pages. Consider the following slab_pages = (nr_objects * object_size) / PAGE_SIZE What we would like to do is scan = slab_pages >> sc->priority but we don't know the number of slab pages each shrinker controls, only the objects. However say that theoretically we knew how many pages a shrinker controlled, we'd still have to convert this to objects, which would look like the following scan = shrinker_pages >> sc->priority scan_objects = (PAGE_SIZE / object_size) * scan or written another way scan_objects = (shrinker_pages >> sc->priority) * (PAGE_SIZE / object_size) which can thus be written scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >> sc->priority which is just scan_objects = nr_objects >> sc->priority We don't need to know exactly how many pages each shrinker represents, it's objects are all the information we need. Making this change allows us to place an appropriate amount of pressure on the shrinker pools for their relative size. 
Signed-off-by: Josef Bacik https://jira.sw.ru/browse/PSBM-69226 Signed-off-by: Andrey Ryabinin --- include/trace/events/vmscan.h | 23 ++ mm/vmscan.c | 44 --- 2 files changed, 22 insertions(+), 45 deletions(-) diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index 132a985..d98fb0a 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -181,23 +181,22 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re TRACE_EVENT(mm_shrink_slab_start, TP_PROTO(struct shrinker *shr, struct shrink_control *sc, - long nr_objects_to_shrink, unsigned long pgs_scanned, - unsigned long lru_pgs, unsigned long cache_items, - unsigned long long delta, unsigned long total_scan), + long nr_objects_to_shrink, unsigned long cache_items, + unsigned long long delta, unsigned long total_scan, + int priority), - TP_ARGS(shr, sc, nr_objects_to_shrink, pgs_scanned, lru_pgs, - cache_items, delta, total_scan), + TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan, + priority), TP_STRUCT__entry( __field(struct shrinker *, shr) __field(void *, shrink) __field(long, nr_objects_to_shrink) __field(gfp_t, gfp_flags) - __field(unsigned long, pgs_scanned) - __field(unsigned long, lru_pgs) __field(unsigned long, cache_items) __field(unsigned long long, delta) __field(unsigned long, total_scan) + __field(int, priority) ), TP_fast_assign( @@ -205,23 +204,21 @@ TRACE_EVENT(mm_shrink_slab_start, __entry->shrink = shr->scan_objects; __entry->nr_objects_to_shrink = nr_objects_to_shrink; __entry->gfp_flags = sc->gfp_mask; - __entry->pgs_scanned = pgs_scanned; - __entry->lru_pgs = lru_pgs; __entry->cache_items = cache_items; __entry->delta = delta; __entry->total_scan = total_scan; + __entry->priority = priority; ), - TP_printk("%pF %p: objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld", + TP_printk("%pF %p: objects to shrink %ld gfp_flags %s cache items %ld delta 
%lld total_scan %ld priority %d", __entry->shrink, __entry->shr, __entry->nr_objects_to_shrink, show_gfp_flags(__entry->gfp_flags), - __entry->pgs_scanned, - __entry->lru_pgs, __entry->cach
Re: [Devel] [PATCH rh7 2/2] mm/memcg: reclaim only kmem if kmem limit reached.
Do we want to push it to mainstream as well?

--
Best regards,
Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 08/25/2017 06:38 PM, Andrey Ryabinin wrote:

If the kmem limit on a memcg is reached, we go into memory reclaim and
reclaim everything we can, including page cache and anon. Reclaiming page
cache or anon won't help, since we need to lower only the kmem usage.
This patch fixes the problem by avoiding non-kmem reclaim on hitting the
kmem limit.

https://jira.sw.ru/browse/PSBM-69226
Signed-off-by: Andrey Ryabinin
---
 include/linux/memcontrol.h | 10 ++
 include/linux/swap.h       |  2 +-
 mm/memcontrol.c            | 30 --
 mm/vmscan.c                | 31 ---
 4 files changed, 51 insertions(+), 22 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1a52e58ab7de..1d6bc80c4c90 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -45,6 +45,16 @@ struct mem_cgroup_reclaim_cookie {
 	unsigned int generation;
 };
 
+/*
+ * Reclaim flags for mem_cgroup_hierarchical_reclaim
+ */
+#define MEM_CGROUP_RECLAIM_NOSWAP_BIT	0x0
+#define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
+#define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
+#define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
+#define MEM_CGROUP_RECLAIM_KMEM_BIT	0x2
+#define MEM_CGROUP_RECLAIM_KMEM		(1 << MEM_CGROUP_RECLAIM_KMEM_BIT)
+
 #ifdef CONFIG_MEMCG
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp);

diff --git a/include/linux/swap.h b/include/linux/swap.h
index bd162f9bef0d..bd47451ec95a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -324,7 +324,7 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  unsigned long nr_pages,
-						  gfp_t gfp_mask, bool noswap);
+						  gfp_t gfp_mask, int flags);
 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						 gfp_t gfp_mask, bool noswap,
 						 struct zone *zone,

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 97824e281d7a..f9a5f3819a31 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -511,16 +511,6 @@ enum res_type {
 #define OOM_CONTROL		(0)
 
 /*
- * Reclaim flags for mem_cgroup_hierarchical_reclaim
- */
-#define MEM_CGROUP_RECLAIM_NOSWAP_BIT	0x0
-#define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
-#define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
-#define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
-#define MEM_CGROUP_RECLAIM_KMEM_BIT	0x2
-#define MEM_CGROUP_RECLAIM_KMEM		(1 << MEM_CGROUP_RECLAIM_KMEM_BIT)
-
-/*
  * The memcg_create_mutex will be held whenever a new cgroup is created.
  * As a consequence, any change that needs to protect against new child cgroups
  * appearing has to hold it as well.
@@ -2137,7 +2127,7 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
 		if (loop)
 			drain_all_stock_async(memcg);
 		total += try_to_free_mem_cgroup_pages(memcg, SWAP_CLUSTER_MAX,
-						      gfp_mask, noswap);
+						      gfp_mask, flags);
 		if (test_thread_flag(TIF_MEMDIE) ||
 		    fatal_signal_pending(current))
 			return 1;
@@ -2150,6 +2140,16 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
 			break;
 		if (mem_cgroup_margin(memcg, flags & MEM_CGROUP_RECLAIM_KMEM))
 			break;
+
+		/*
+		 * Try harder to reclaim dcache. dcache reclaim may
+		 * temporarly fail due to dcache->dlock being held
+		 * by someone else. We must try harder to avoid premature
+		 * slab allocation failures.
+		 */
+		if (flags & MEM_CGROUP_RECLAIM_KMEM &&
+		    page_counter_read(&memcg->dcache))
+			continue;
 		/*
 		 * If nothing was reclaimed after two attempts, there
 		 * may be no reclaimable pages in this hierarchy.
@@ -2778,11 +2778,13 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, bool kmem_charge
 	struct mem_cgroup *mem_over_limit;
 	struct page_counter *counter;
 	unsigned long nr_reclaimed;
-	unsigned long flags = 0;
+	unsigned long flags;
 
 	if (mem_cgroup_is_r
[Devel] [PATCH] zdtm: fix packet memory allocation in autofs.c
Plus some cleanup.

https://jira.sw.ru/browse/PSBM-71078
Signed-off-by: Stanislav Kinsburskiy
---
 test/zdtm/static/autofs.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/test/zdtm/static/autofs.c b/test/zdtm/static/autofs.c
index 8d917ee..882289f 100644
--- a/test/zdtm/static/autofs.c
+++ b/test/zdtm/static/autofs.c
@@ -460,10 +460,10 @@ static int automountd_loop(int pipe, const char *mountpoint, struct autofs_param
 {
 	union autofs_v5_packet_union *packet;
 	ssize_t bytes;
-	size_t psize = sizeof(*packet) * 2;
+	size_t psize = sizeof(*packet);
 	int err = 0;
 
-	packet = malloc(psize);
+	packet = malloc(psize * 2);
 	if (!packet) {
 		pr_err("failed to allocate autofs packet\n");
 		return -ENOMEM;
@@ -473,7 +473,7 @@ static int automountd_loop(int pipe, const char *mountpoint, struct autofs_param
 	siginterrupt(SIGUSR2, 1);
 
 	while (!stop && !err) {
-		memset(packet, 0, sizeof(*packet));
+		memset(packet, 0, psize * 2);
 
 		bytes = read(pipe, packet, psize);
 		if (bytes < 0) {
@@ -483,12 +483,12 @@ static int automountd_loop(int pipe, const char *mountpoint, struct autofs_param
 			}
 			continue;
 		}
-		if (bytes > psize) {
-			pr_err("read more that expected: %zd > %zd\n", bytes, psize);
-			return -EINVAL;
-		}
-		if (bytes != sizeof(*packet)) {
-			pr_err("read less than expected: %zd\n", bytes);
+		if (bytes != psize) {
+			pr_err("read %s that expected: %zd %s %zd\n",
+			       (bytes > psize) ? "more" : "less",
+			       bytes,
+			       (bytes > psize) ? ">" : "<",
+			       psize);
 			return -EINVAL;
 		}
 		err = automountd_serve(mountpoint, param, packet);
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH RHEL7 COMMIT] ms/block: Check for gaps on front and back merges
Please consider releasing a RK patch for it.
https://readykernel.com/

--
Best regards,
Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 08/31/2017 11:59 AM, Konstantin Khorenko wrote:

The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and
will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.5
--> commit 36178f3003689a99f0b58b6c12e235186952d9a9
Author: Maxim Patlasov
Date: Thu Aug 31 11:59:13 2017 +0300

    ms/block: Check for gaps on front and back merges

    Backport 5e7c4274a70aa2d6f485996d0ca1dad52d0039ca from ml.

    Before the patch, front merge incorrectly used the same
    req_gap_to_prev() as back merge.

    Original patch description:

    block: Check for gaps on front and back merges

    We are checking for gaps to previous bio_vec, which can only detect
    back merge gaps. Moreover, at the point where we check for a gap, we
    don't know if we will attempt a back or a front merge. Thus, check
    for a gap to prev in a back merge attempt and check for a gap to
    next in a front merge attempt.

    Signed-off-by: Jens Axboe
    [sagig: Minor rename change]
    Signed-off-by: Sagi Grimberg

    https://jira.sw.ru/browse/PSBM-70321
    Signed-off-by: Maxim Patlasov
---
 block/blk-merge.c      | 18 +-
 include/linux/blkdev.h | 20 
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index b0ce46d..0e8b7f2 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -294,6 +294,8 @@ static inline int ll_new_hw_segment(struct request_queue *q,
 int ll_back_merge_fn(struct request_queue *q, struct request *req,
 		     struct bio *bio)
 {
+	if (req_gap_back_merge(req, bio))
+		return 0;
 	if (blk_rq_sectors(req) + bio_sectors(bio) >
 	    blk_rq_get_max_sectors(req)) {
 		req->cmd_flags |= REQ_NOMERGE;
@@ -312,6 +314,8 @@ int ll_back_merge_fn(struct request_queue *q, struct request *req,
 int ll_front_merge_fn(struct request_queue *q, struct request *req,
 		      struct bio *bio)
 {
+	if (req_gap_front_merge(req, bio))
+		return 0;
 	if (blk_rq_sectors(req) + bio_sectors(bio) >
 	    blk_rq_get_max_sectors(req)) {
 		req->cmd_flags |= REQ_NOMERGE;
@@ -338,14 +342,6 @@ static bool req_no_special_merge(struct request *req)
 	return !q->mq_ops && req->special;
 }
 
-static int req_gap_to_prev(struct request *req, struct bio *next)
-{
-	struct bio *prev = req->biotail;
-
-	return bvec_gap_to_prev(req->q, &prev->bi_io_vec[prev->bi_vcnt - 1],
-				next->bi_io_vec[0].bv_offset);
-}
-
 static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
 				struct request *next)
 {
@@ -360,7 +356,7 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
 	if (req_no_special_merge(req) || req_no_special_merge(next))
 		return 0;
 
-	if (req_gap_to_prev(req, next->bio))
+	if (req_gap_back_merge(req, next->bio))
 		return 0;
 
 	/*
@@ -568,10 +564,6 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
 	    !blk_write_same_mergeable(rq->bio, bio))
 		return false;
 
-	/* Only check gaps if the bio carries data */
-	if (bio_has_data(bio) && req_gap_to_prev(rq, bio))
-		return false;
-
 	return true;
 }

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e1662f9..2b9bc88 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1493,6 +1493,26 @@ static inline bool bvec_gap_to_prev(struct request_queue *q,
 		((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
 }
 
+static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
+				struct bio *next)
+{
+	if (!bio_has_data(prev))
+		return false;
+
+	return bvec_gap_to_prev(q, &prev->bi_io_vec[prev->bi_vcnt - 1],
+				next->bi_io_vec[0].bv_offset);
+}
+
+static inline bool req_gap_back_merge(struct request *req, struct bio *bio)
+{
+	return bio_will_gap(req->q, req->biotail, bio);
+}
+
+static inline bool req_gap_front_merge(struct request *req, struct bio *bio)
+{
+	return bio_will_gap(req->q, bio, req->bio);
+}
+
 struct work_struct;
 int kblockd_schedule_work(struct work_struct *work);
 int kblockd_schedule_delayed_work(struct delayed_work *dwork, unsigned long delay);
Re: [Devel] [PATCH RHEL7 COMMIT] mm/memcg: add missing kmem charge
Please consider releasing it as a ReadyKernel patch.
https://readykernel.com/
(required only for the vz7.33.22 kernel)

--
Best regards,
Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 08/31/2017 11:52 AM, Konstantin Khorenko wrote:

The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and
will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.5
--> commit 822bec288dcaf5f69c1ed3e64734230320c798ba
Author: Andrey Ryabinin
Date: Thu Aug 31 11:52:13 2017 +0300

    mm/memcg: add missing kmem charge

    Since de3a106e28d5 ("mm/memcg: reclaim memory on reaching kmem limit."),
    if try_charge() decides to bypass the memory limit, memcg_charge_kmem()
    will charge only ->memory/->memsw but not ->kmem. This may lead to
    deadlocks during cgroup destruction, as the condition

        (page_counter_read(&memcg->memory) - page_counter_read(&memcg->kmem) > 0)

    in mem_cgroup_reparent_charges() won't ever come true.

    https://jira.sw.ru/browse/PSBM-70556
    Fixes: de3a106e28d5 ("mm/memcg: reclaim memory on reaching kmem limit.")
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Vasily Averin
---
 mm/memcontrol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 97824e2..09ce016 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3087,6 +3087,8 @@ int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
 		page_counter_charge(&memcg->memory, nr_pages);
 		if (do_swap_account)
 			page_counter_charge(&memcg->memsw, nr_pages);
+		page_counter_charge(&memcg->kmem, nr_pages);
+		ret = 0;
 	}