[Devel] [PATCH RHEL7 COMMIT] docker: Revert "vfs: take stat's dev from mnt->sb"
The commit is pushed to "branch-rh7-3.10.0-327.28.2.vz7.17.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.28.2.vz7.17.3
------>
commit 11a02aff96d8a6140b5ca71eacaa6e9b995db7d5
Author: Pavel Tikhomirov
Date:   Wed Aug 24 20:19:01 2016 +0400

    docker: Revert "vfs: take stat's dev from mnt->sb"

    This reverts commit ecfa1b0f7ba985e200e807941b5838943b266cb3.

    All non-directory objects on overlayfs should report an st_dev from
    the lower or upper filesystem that is providing the object when
    stat(2) is called on them (Documentation/filesystems/overlayfs.txt).
    But in our case a file on overlayfs reports the st_dev of overlayfs
    itself. E.g., on a VZ7 host the device is 57 but should be 64768:

      mkdir /lower /upper /merged /work
      mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\
            workdir=/work /merged
      touch /merged/file
      stat /merged/file | grep Device | awk '{print $1$2}'
      Device:39h/57d
      cat /proc/self/mountinfo
      63 1 253:0 / / rw,relatime shared:1 - ext4 /dev/mapper/virtuozzo_pcs7-root rw,data=ordered
      149 63 0:57 / /merged rw,relatime shared:81 - overlay overlay rw,lowerdir=/lower,upperdir=/upper,workdir=/work

    The reason is in sys_stat->vfs_stat->vfs_fstatat->vfs_getattr:

    1) Call ovl_getattr()->ovl_path_real() - find the real object's path
    2) Call ovl_getattr()->vfs_getattr() - find the real object's s_dev
    3) Replace it with overlay's s_dev, which is wrong.

    We do not have simfs - remove step (3) and stat will be fine again.

    Note: we need this for docker - when stat and fstat give different
    s_dev, ldconfig in glibc breaks, and thus docker-ui tests break.

    https://jira.sw.ru/browse/PSBM-51255

    Signed-off-by: Pavel Tikhomirov
    Reviewed-by: Kirill Tkhai
---
 fs/stat.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/fs/stat.c b/fs/stat.c
index a423f6351e27..d0ea7ef75e26 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -14,7 +14,6 @@
 #include
 #include
 #include
-#include
 #include
 #include

@@ -48,14 +47,10 @@ int vfs_getattr(struct path *path, struct kstat *stat)
 		return retval;

 	if (inode->i_op->getattr)
-		retval = inode->i_op->getattr(path->mnt, path->dentry, stat);
-	else
-		generic_fillattr(inode, stat);
+		return inode->i_op->getattr(path->mnt, path->dentry, stat);

-	if (!retval)
-		stat->dev = path->mnt->mnt_sb->s_dev;
-
-	return retval;
+	generic_fillattr(inode, stat);
+	return 0;
 }
 EXPORT_SYMBOL(vfs_getattr);
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] tty: Fix task hang if one of peers is sitting in read
The commit is pushed to "branch-rh7-3.10.0-327.28.2.vz7.17.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.28.2.vz7.17.3
------>
commit a7fb20c4cd83a1add67efe3edaa8500cc6edc6d1
Author: Cyrill Gorcunov
Date:   Wed Aug 24 16:01:42 2016 +0400

    tty: Fix task hang if one of peers is sitting in read

    We reverted the former fix (ae93b8e96941c9ad) in commit
    9539e4b2c5eee61f, but the changes ported by the RH team are still
    not enough, so bring ae93b8e96941c9ad back.

    https://jira.sw.ru/browse/PSBM-51273

    Signed-off-by: Cyrill Gorcunov
    CC: Igor Sukhih
    CC: Vladimir Davydov
    CC: Konstantin Khorenko
---
 drivers/tty/tty_ldisc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/tty_ldisc.c b/drivers/tty/tty_ldisc.c
index fd2b20d6af80..4c82aaad8566 100644
--- a/drivers/tty/tty_ldisc.c
+++ b/drivers/tty/tty_ldisc.c
@@ -685,7 +685,7 @@ void tty_ldisc_hangup(struct tty_struct *tty)
 	 *
 	 * Avoid racing set_ldisc or tty_ldisc_release
 	 */
-	tty_ldisc_lock_pair(tty, tty->link);
+	tty_ldisc_lock(tty, MAX_SCHEDULE_TIMEOUT);

 	if (tty->ldisc) {
@@ -707,7 +707,7 @@ void tty_ldisc_hangup(struct tty_struct *tty)
 			WARN_ON(tty_ldisc_open(tty, tty->ldisc));
 		}
 	}
-	tty_ldisc_enable_pair(tty, tty->link);
+	tty_ldisc_unlock(tty);

 	if (reset)
 		tty_reset_termios(tty);
[Devel] [PATCH RHEL7 COMMIT] mm: memcontrol: add memory.numa_migrate file
The commit is pushed to "branch-rh7-3.10.0-327.28.2.vz7.17.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.28.2.vz7.17.2 --> commit 9c08b7e913cf502ea61ca00814247488bdb1f65f Author: Vladimir Davydov Date: Tue Aug 23 17:08:57 2016 +0400 mm: memcontrol: add memory.numa_migrate file The new file is supposed to be used for migrating pages accounted to a memory cgroup to a particular set of numa nodes. The reason to add it is that currently there's no API for migrating unmapped file pages used for storing page cache (neither migrate_pages syscall nor cpuset subsys doesn't provide this functionality). The file is added to the memory cgroup and has the following format: NODELIST[ MAX_SCAN] where NODELIST is a comma-separated list of ranges N1-N2 specifying the set of nodes to migrate pages of this cgroup to, and the optional MAX_SCAN imposes a limit on the number of pages that can be migrated in one go. The call may be interrupted by a signal, in which case -EINTR is returned. https://jira.sw.ru/browse/PSBM-50875 Signed-off-by: Vladimir Davydov Reviewed-by: Andrey Ryabinin Cc: Igor Redko Cc: Konstantin Neumoin --- mm/memcontrol.c | 226 1 file changed, 226 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0d0e31e0917e..69189490ed68 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -54,6 +54,7 @@ #include #include #include +#include #include "internal.h" #include #include @@ -5697,6 +5698,226 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, seq_putc(m, '\n'); return 0; } + +/* + * memcg_numa_migrate_new_page() private argument. @target_nodes specifies the + * set of nodes to allocate pages from. @current_node is the current preferable + * node, it gets rotated after each allocation. + */ +struct memcg_numa_migrate_struct { + nodemask_t *target_nodes; + int current_node; +}; + +/* + * Used as an argument for migrate_pages(). Allocated pages are spread evenly + * among destination nodes. 
+ */ +static struct page *memcg_numa_migrate_new_page(struct page *page, + unsigned long private, int **result) +{ + struct memcg_numa_migrate_struct *ms = (void *)private; + gfp_t gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_NORETRY | __GFP_NOWARN; + + ms->current_node = next_node(ms->current_node, *ms->target_nodes); + if (ms->current_node >= MAX_NUMNODES) { + ms->current_node = first_node(*ms->target_nodes); + VM_BUG_ON(ms->current_node >= MAX_NUMNODES); + } + + return __alloc_pages_nodemask(gfp_mask, 0, + node_zonelist(ms->current_node, gfp_mask), + ms->target_nodes); +} + +/* + * Isolate at most @nr_to_scan pages from @lruvec for further migration and + * store them in @dst. Returns the number of pages scanned. Return value of 0 + * means that @lruved is empty. + */ +static long memcg_numa_isolate_pages(struct lruvec *lruvec, enum lru_list lru, +long nr_to_scan, struct list_head *dst) +{ + struct list_head *src = &lruvec->lists[lru]; + struct zone *zone = lruvec_zone(lruvec); + long scanned = 0, taken = 0; + + spin_lock_irq(&zone->lru_lock); + while (!list_empty(src) && scanned < nr_to_scan && taken < nr_to_scan) { + struct page *page = list_last_entry(src, struct page, lru); + int nr_pages; + + scanned++; + + switch (__isolate_lru_page(page, ISOLATE_ASYNC_MIGRATE)) { + case 0: + nr_pages = hpage_nr_pages(page); + mem_cgroup_update_lru_size(lruvec, lru, -nr_pages); + list_move(&page->lru, dst); + taken += nr_pages; + break; + + case -EBUSY: + list_move(&page->lru, src); + continue; + + default: + BUG(); + } + } + __mod_zone_page_state(zone, NR_LRU_BASE + lru, -taken); + __mod_zone_page_state(zone, NR_ISOLATED_ANON + is_file_lru(lru), taken); + spin_unlock_irq(&zone->lru_lock); + + return scanned; +} + +static long __memcg_numa_migrate_pages(struct lruvec *lruvec, enum lru_list lru, + nodemask_t *target_nodes, long nr_to_scan) +{ + struct memcg_numa_migrate_struct ms = { + .target_nodes = target_nodes, + .current_node = -1, + }; + LIST_HEAD(pages); + long 
total_scanned = 0; + + /
[Devel] [PATCH rh7 v2] mm: memcontrol: add memory.numa_migrate file
The new file is supposed to be used for migrating pages accounted to a memory cgroup to a particular set of numa nodes. The reason to add it is that currently there's no API for migrating unmapped file pages used for storing page cache (neither migrate_pages syscall nor cpuset subsys doesn't provide this functionality). The file is added to the memory cgroup and has the following format: NODELIST[ MAX_SCAN] where NODELIST is a comma-separated list of ranges N1-N2 specifying the set of nodes to migrate pages of this cgroup to, and the optional MAX_SCAN imposes a limit on the number of pages that can be migrated in one go. The call may be interrupted by a signal, in which case -EINTR is returned. https://jira.sw.ru/browse/PSBM-50875 Signed-off-by: Vladimir Davydov Cc: Andrey Ryabinin Cc: Igor Redko Cc: Konstantin Neumoin --- Changes in v2: - break loop if not making any progress (fixes softlockup) - drop useless VM_BUG_ON_PAGE in memcg_numa_isolate_pages and replace BUG_ON with VM_BUG_ON in memcg_numa_migrate_new_page mm/memcontrol.c | 226 1 file changed, 226 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e3a16b99ccc6..bfb56a649225 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -54,6 +54,7 @@ #include #include #include +#include #include "internal.h" #include #include @@ -5697,6 +5698,226 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, seq_putc(m, '\n'); return 0; } + +/* + * memcg_numa_migrate_new_page() private argument. @target_nodes specifies the + * set of nodes to allocate pages from. @current_node is the current preferable + * node, it gets rotated after each allocation. + */ +struct memcg_numa_migrate_struct { + nodemask_t *target_nodes; + int current_node; +}; + +/* + * Used as an argument for migrate_pages(). Allocated pages are spread evenly + * among destination nodes. 
+ */ +static struct page *memcg_numa_migrate_new_page(struct page *page, + unsigned long private, int **result) +{ + struct memcg_numa_migrate_struct *ms = (void *)private; + gfp_t gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_NORETRY | __GFP_NOWARN; + + ms->current_node = next_node(ms->current_node, *ms->target_nodes); + if (ms->current_node >= MAX_NUMNODES) { + ms->current_node = first_node(*ms->target_nodes); + VM_BUG_ON(ms->current_node >= MAX_NUMNODES); + } + + return __alloc_pages_nodemask(gfp_mask, 0, + node_zonelist(ms->current_node, gfp_mask), + ms->target_nodes); +} + +/* + * Isolate at most @nr_to_scan pages from @lruvec for further migration and + * store them in @dst. Returns the number of pages scanned. Return value of 0 + * means that @lruved is empty. + */ +static long memcg_numa_isolate_pages(struct lruvec *lruvec, enum lru_list lru, +long nr_to_scan, struct list_head *dst) +{ + struct list_head *src = &lruvec->lists[lru]; + struct zone *zone = lruvec_zone(lruvec); + long scanned = 0, taken = 0; + + spin_lock_irq(&zone->lru_lock); + while (!list_empty(src) && scanned < nr_to_scan && taken < nr_to_scan) { + struct page *page = list_last_entry(src, struct page, lru); + int nr_pages; + + scanned++; + + switch (__isolate_lru_page(page, ISOLATE_ASYNC_MIGRATE)) { + case 0: + nr_pages = hpage_nr_pages(page); + mem_cgroup_update_lru_size(lruvec, lru, -nr_pages); + list_move(&page->lru, dst); + taken += nr_pages; + break; + + case -EBUSY: + list_move(&page->lru, src); + continue; + + default: + BUG(); + } + } + __mod_zone_page_state(zone, NR_LRU_BASE + lru, -taken); + __mod_zone_page_state(zone, NR_ISOLATED_ANON + is_file_lru(lru), taken); + spin_unlock_irq(&zone->lru_lock); + + return scanned; +} + +static long __memcg_numa_migrate_pages(struct lruvec *lruvec, enum lru_list lru, + nodemask_t *target_nodes, long nr_to_scan) +{ + struct memcg_numa_migrate_struct ms = { + .target_nodes = target_nodes, + .current_node = -1, + }; + LIST_HEAD(pages); + long 
total_scanned = 0; + + /* +* If no limit on the maximal number of migrated pages is specified, +* assume the caller wants to migrate them all. +*/ + if (nr_to_scan < 0) + nr_to_scan = mem_cgroup_get_lru_size(lruvec, lru); + +
Re: [Devel] [PATCH rh7] mm: memcontrol: add memory.numa_migrate file
On Tue, Aug 23, 2016 at 12:57:53PM +0300, Andrey Ryabinin wrote: ... > echo "0 100" > /sys/fs/cgroup/memory/machine.slice/100/memory.numa_migrate > > [ 296.073002] BUG: soft lockup - CPU#1 stuck for 22s! [bash:4028] Thanks for catching, will fix in v2. > > +static struct page *memcg_numa_migrate_new_page(struct page *page, > > + unsigned long private, int **result) > > +{ > > + struct memcg_numa_migrate_struct *ms = (void *)private; > > + gfp_t gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_NORETRY | __GFP_NOWARN; > > + > > + ms->current_node = next_node(ms->current_node, *ms->target_nodes); > > + if (ms->current_node >= MAX_NUMNODES) { > > + ms->current_node = first_node(*ms->target_nodes); > > + BUG_ON(ms->current_node >= MAX_NUMNODES); > > Maybe WARN_ON() or VM_BUG_ON() ? Will replace with VM_BUG_ON. > > + } > > + > > + return __alloc_pages_nodemask(gfp_mask, 0, > > + node_zonelist(ms->current_node, gfp_mask), > > + ms->target_nodes); > > +} > > + > > +/* > > + * Isolate at most @nr_to_scan pages from @lruvec for further migration and > > + * store them in @dst. Returns the number of pages scanned. Return value > > of 0 > > + * means that @lruved is empty. > > + */ > > +static long memcg_numa_isolate_pages(struct lruvec *lruvec, enum lru_list > > lru, > > +long nr_to_scan, struct list_head *dst) > > +{ > > + struct list_head *src = &lruvec->lists[lru]; > > + struct zone *zone = lruvec_zone(lruvec); > > + long scanned = 0, taken = 0; > > + > > + spin_lock_irq(&zone->lru_lock); > > + while (!list_empty(src) && scanned < nr_to_scan && taken < nr_to_scan) { > > + struct page *page = list_last_entry(src, struct page, lru); > > + int nr_pages; > > + > > + VM_BUG_ON_PAGE(!PageLRU(page), page); > > + > > __isolate_lru_page() will return -EINVAL for !PageLRU, so either this or the > BUG() bellow is unnecessary. OK, will remove the VM_BUG_ON_PAGE. ... 
> > +static int memcg_numa_migrate_pages(struct mem_cgroup *memcg, > > + nodemask_t *target_nodes, long nr_to_scan) > > +{ > > + struct mem_cgroup *mi; > > + long total_scanned = 0; > > + > > +again: > > + for_each_mem_cgroup_tree(mi, memcg) { > > + struct zone *zone; > > + > > + for_each_populated_zone(zone) { > > + struct lruvec *lruvec; > > + enum lru_list lru; > > + long scanned; > > + > > + if (node_isset(zone_to_nid(zone), *target_nodes)) > > + continue; > > + > > + lruvec = mem_cgroup_zone_lruvec(zone, mi); > > + /* > > +* For the sake of simplicity, do not attempt to migrate > > +* unevictable pages. It should be fine as long as there > > +* aren't too many of them, which is usually true. > > +*/ > > + for_each_evictable_lru(lru) { > > + scanned = __memcg_numa_migrate_pages(lruvec, > > + lru, target_nodes, > > + nr_to_scan > 0 ? > > + SWAP_CLUSTER_MAX : -1); > > Shouldn't we just pass nr_to_scan here? No, I want to migrate memory evenly from all nodes. I.e. if you have 2 source nodes and nr_to_scan=100, there should be ~50 pages migrated from one node and ~50 from another, not 100-vs-0. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] vzprivnet: vzprivnet_hook: fix crash if skb->dev == NULL
The commit is pushed to "branch-rh7-3.10.0-327.28.2.vz7.17.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.28.2.vz7.17.2 --> commit 96e55a820e144b35e9254e97cdc0afbb4f879b3f Author: Vladimir Davydov Date: Tue Aug 23 13:13:19 2016 +0400 vzprivnet: vzprivnet_hook: fix crash if skb->dev == NULL For the sake of Docker, we only call vzprivnet rules if skb comes from the host [1]. To check that, we look at skb->dev->nd_net->owner_ve. This works fine when skb is retransmitted by a device (as it is the case in case of a bridged network), but this results in KP when skb is sent directly to a veth or venet device (via sendto) provided sysctl net.vzpriv_filter_host is enabled: BUG: unable to handle kernel NULL pointer dereference at 03e8 IP: [] vzprivnet_hook+0x9/0xb0 [ip_vzprivnet] Oops: [#1] SMP CPU: 0 PID: 3669 Comm: sendmail ve: 2ee1e66b-d1d4-4cb9-b65e-56af4cdd60b7 Not tainted 3.10.0-327.28.2.vz7.17.1 #1 17.1 task: 880039e94c20 ti: 880036428000 task.ti: 880036428000 RIP: 0010:[] [] vzprivnet_hook+0x9/0xb0 [ip_vzprivnet] RSP: 0018:88003642ba78 EFLAGS: 00010202 RAX: RBX: 88003642bb08 RCX: 88003a451000 RDX: 0001 RSI: 0001 RDI: 88000509d000 RBP: 88003642ba80 R08: 88003642bb08 R09: 0024 R10: 880035f78360 R11: 0006 R12: 88003642bad0 R13: 88000509d000 R14: 81a6c1f0 R15: FS: 7fcc702ca840() GS:88003de0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 03e8 CR3: 39c1d000 CR4: 06f0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Stack: a03059f2 88003642bac0 81562950 a0308680 43838e8f 88000509d000 88003642bb08 88000509d000 0024 88003642baf8 81562a38 a0308680 Call Trace: [] ? vzprivnet_out_hook+0x32/0x40 [ip_vzprivnet] [] nf_iterate+0x70/0xb0 [] nf_hook_slow+0xa8/0x110 [] __ip_local_out_sk+0xee/0x100 [] ? ip_make_skb+0x22/0x120 [] ? ip_forward_options+0x1c0/0x1c0 [] ip_local_out_sk+0x1b/0x40 [] ip_send_skb+0x16/0x50 [] udp_send_skb+0x170/0x380 [] ? ip_copy_metadata+0x170/0x170 [] udp_sendmsg+0x2f7/0x9d0 [] ? 
link_path_walk+0x81/0x860 [] inet_sendmsg+0x64/0xb0 [] ? radix_tree_lookup_slot+0x22/0x50 [] sock_sendmsg+0x87/0xc0 [] ? unlock_page+0x2b/0x30 [] SYSC_sendto+0x121/0x1c0 [] ? __do_page_fault+0x164/0x450 [] ? do_page_fault+0x23/0x80 [] SyS_sendto+0xe/0x10 [] system_call_fastpath+0x16/0x1b This happens, because in this case skb->dev is NULL (there's no device this skb is arrived to). To avoid crash there, let's take owner_ve from the net namespace which the socket is assigned to. https://jira.sw.ru/browse/PSBM-51041 Fixes: 32efdd408fad ("vzprivnet: Do not execute vzprivnet_hook inside CT") [1] Signed-off-by: Vladimir Davydov Acked-by: Pavel Tikhomirov --- net/ipv4/netfilter/ip_vzprivnet.c | 7 ++- net/ipv6/netfilter/ip6_vzprivnet.c | 7 ++- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/net/ipv4/netfilter/ip_vzprivnet.c b/net/ipv4/netfilter/ip_vzprivnet.c index 293b2802b4c2..6e2bbe2d42ef 100644 --- a/net/ipv4/netfilter/ip_vzprivnet.c +++ b/net/ipv4/netfilter/ip_vzprivnet.c @@ -250,8 +250,13 @@ static unsigned int vzprivnet_hook(struct sk_buff *skb, int can_be_bridge) { struct dst_entry *dst; unsigned int pmark = VZPRIV_MARK_UNKNOWN; + struct net *src_net; - if (!ve_is_super(skb->dev->nd_net->owner_ve)) + if (WARN_ON_ONCE(!skb->dev && !skb->sk)) + return NF_ACCEPT; + + src_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk); + if (!ve_is_super(src_net->owner_ve)) return NF_ACCEPT; dst = skb_dst(skb); diff --git a/net/ipv6/netfilter/ip6_vzprivnet.c b/net/ipv6/netfilter/ip6_vzprivnet.c index ee9c3c972637..0a491afd81b6 100644 --- a/net/ipv6/netfilter/ip6_vzprivnet.c +++ b/net/ipv6/netfilter/ip6_vzprivnet.c @@ -484,8 +484,13 @@ static unsigned int vzprivnet6_hook(struct sk_buff *skb, int can_be_bridge) int verdict = NF_DROP; struct vzprivnet *dst, *src; struct ipv6hdr *hdr; + struct net *src_net; - if (!ve_is_super(skb->dev->nd_net->owner_ve)) + if (WARN_ON_ONCE(!skb->dev && !skb->sk)) + return NF_ACCEPT; + + src_net = skb->dev
[Devel] [PATCH rh7 v2] vzprivnet: vzprivnet_hook: fix crash if skb->dev == NULL
For the sake of Docker, we only call vzprivnet rules if skb comes from the host [1]. To check that, we look at skb->dev->nd_net->owner_ve. This works fine when skb is retransmitted by a device (as it is the case in case of a bridged network), but this results in KP when skb is sent directly to a veth or venet device (via sendto) provided sysctl net.vzpriv_filter_host is enabled: BUG: unable to handle kernel NULL pointer dereference at 03e8 IP: [] vzprivnet_hook+0x9/0xb0 [ip_vzprivnet] Oops: [#1] SMP CPU: 0 PID: 3669 Comm: sendmail ve: 2ee1e66b-d1d4-4cb9-b65e-56af4cdd60b7 Not tainted 3.10.0-327.28.2.vz7.17.1 #1 17.1 task: 880039e94c20 ti: 880036428000 task.ti: 880036428000 RIP: 0010:[] [] vzprivnet_hook+0x9/0xb0 [ip_vzprivnet] RSP: 0018:88003642ba78 EFLAGS: 00010202 RAX: RBX: 88003642bb08 RCX: 88003a451000 RDX: 0001 RSI: 0001 RDI: 88000509d000 RBP: 88003642ba80 R08: 88003642bb08 R09: 0024 R10: 880035f78360 R11: 0006 R12: 88003642bad0 R13: 88000509d000 R14: 81a6c1f0 R15: FS: 7fcc702ca840() GS:88003de0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 03e8 CR3: 39c1d000 CR4: 06f0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Stack: a03059f2 88003642bac0 81562950 a0308680 43838e8f 88000509d000 88003642bb08 88000509d000 0024 88003642baf8 81562a38 a0308680 Call Trace: [] ? vzprivnet_out_hook+0x32/0x40 [ip_vzprivnet] [] nf_iterate+0x70/0xb0 [] nf_hook_slow+0xa8/0x110 [] __ip_local_out_sk+0xee/0x100 [] ? ip_make_skb+0x22/0x120 [] ? ip_forward_options+0x1c0/0x1c0 [] ip_local_out_sk+0x1b/0x40 [] ip_send_skb+0x16/0x50 [] udp_send_skb+0x170/0x380 [] ? ip_copy_metadata+0x170/0x170 [] udp_sendmsg+0x2f7/0x9d0 [] ? link_path_walk+0x81/0x860 [] inet_sendmsg+0x64/0xb0 [] ? radix_tree_lookup_slot+0x22/0x50 [] sock_sendmsg+0x87/0xc0 [] ? unlock_page+0x2b/0x30 [] SYSC_sendto+0x121/0x1c0 [] ? __do_page_fault+0x164/0x450 [] ? 
do_page_fault+0x23/0x80 [] SyS_sendto+0xe/0x10 [] system_call_fastpath+0x16/0x1b This happens, because in this case skb->dev is NULL (there's no device this skb is arrived to). To avoid crash there, let's take owner_ve from the net namespace which the socket is assigned to. https://jira.sw.ru/browse/PSBM-51041 Fixes: 32efdd408fad ("vzprivnet: Do not execute vzprivnet_hook inside CT") [1] Signed-off-by: Vladimir Davydov Cc: Pavel Tikhomirov --- Changes in v2: - do not crash if both skb->dev and skb->sk turn out to be NULL for some reason - just print a warning and accept the packet net/ipv4/netfilter/ip_vzprivnet.c | 7 ++- net/ipv6/netfilter/ip6_vzprivnet.c | 7 ++- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/net/ipv4/netfilter/ip_vzprivnet.c b/net/ipv4/netfilter/ip_vzprivnet.c index 293b2802b4c2..6e2bbe2d42ef 100644 --- a/net/ipv4/netfilter/ip_vzprivnet.c +++ b/net/ipv4/netfilter/ip_vzprivnet.c @@ -250,8 +250,13 @@ static unsigned int vzprivnet_hook(struct sk_buff *skb, int can_be_bridge) { struct dst_entry *dst; unsigned int pmark = VZPRIV_MARK_UNKNOWN; + struct net *src_net; - if (!ve_is_super(skb->dev->nd_net->owner_ve)) + if (WARN_ON_ONCE(!skb->dev && !skb->sk)) + return NF_ACCEPT; + + src_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk); + if (!ve_is_super(src_net->owner_ve)) return NF_ACCEPT; dst = skb_dst(skb); diff --git a/net/ipv6/netfilter/ip6_vzprivnet.c b/net/ipv6/netfilter/ip6_vzprivnet.c index ee9c3c972637..0a491afd81b6 100644 --- a/net/ipv6/netfilter/ip6_vzprivnet.c +++ b/net/ipv6/netfilter/ip6_vzprivnet.c @@ -484,8 +484,13 @@ static unsigned int vzprivnet6_hook(struct sk_buff *skb, int can_be_bridge) int verdict = NF_DROP; struct vzprivnet *dst, *src; struct ipv6hdr *hdr; + struct net *src_net; - if (!ve_is_super(skb->dev->nd_net->owner_ve)) + if (WARN_ON_ONCE(!skb->dev && !skb->sk)) + return NF_ACCEPT; + + src_net = skb->dev ? 
dev_net(skb->dev) : sock_net(skb->sk); + if (!ve_is_super(src_net->owner_ve)) return NF_ACCEPT; hdr = ipv6_hdr(skb); -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] vzprivnet: vzprivnet_hook: fix crash if skb->dev == NULL
For the sake of Docker, we only call vzprivnet rules if skb comes from the host [1]. To check that, we look at skb->dev->nd_net->owner_ve. This works fine when skb is retransmitted by a device (as it is the case in case of a bridged network), but this results in KP when skb is sent directly to a veth or venet device (via sendto) provided sysctl net.vzpriv_filter_host is enabled: BUG: unable to handle kernel NULL pointer dereference at 03e8 IP: [] vzprivnet_hook+0x9/0xb0 [ip_vzprivnet] Oops: [#1] SMP CPU: 0 PID: 3669 Comm: sendmail ve: 2ee1e66b-d1d4-4cb9-b65e-56af4cdd60b7 Not tainted 3.10.0-327.28.2.vz7.17.1 #1 17.1 task: 880039e94c20 ti: 880036428000 task.ti: 880036428000 RIP: 0010:[] [] vzprivnet_hook+0x9/0xb0 [ip_vzprivnet] RSP: 0018:88003642ba78 EFLAGS: 00010202 RAX: RBX: 88003642bb08 RCX: 88003a451000 RDX: 0001 RSI: 0001 RDI: 88000509d000 RBP: 88003642ba80 R08: 88003642bb08 R09: 0024 R10: 880035f78360 R11: 0006 R12: 88003642bad0 R13: 88000509d000 R14: 81a6c1f0 R15: FS: 7fcc702ca840() GS:88003de0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 03e8 CR3: 39c1d000 CR4: 06f0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Stack: a03059f2 88003642bac0 81562950 a0308680 43838e8f 88000509d000 88003642bb08 88000509d000 0024 88003642baf8 81562a38 a0308680 Call Trace: [] ? vzprivnet_out_hook+0x32/0x40 [ip_vzprivnet] [] nf_iterate+0x70/0xb0 [] nf_hook_slow+0xa8/0x110 [] __ip_local_out_sk+0xee/0x100 [] ? ip_make_skb+0x22/0x120 [] ? ip_forward_options+0x1c0/0x1c0 [] ip_local_out_sk+0x1b/0x40 [] ip_send_skb+0x16/0x50 [] udp_send_skb+0x170/0x380 [] ? ip_copy_metadata+0x170/0x170 [] udp_sendmsg+0x2f7/0x9d0 [] ? link_path_walk+0x81/0x860 [] inet_sendmsg+0x64/0xb0 [] ? radix_tree_lookup_slot+0x22/0x50 [] sock_sendmsg+0x87/0xc0 [] ? unlock_page+0x2b/0x30 [] SYSC_sendto+0x121/0x1c0 [] ? __do_page_fault+0x164/0x450 [] ? 
do_page_fault+0x23/0x80 [] SyS_sendto+0xe/0x10 [] system_call_fastpath+0x16/0x1b This happens, because in this case skb->dev is NULL (there's no device this skb is arrived to). To avoid crash there, let's take owner_ve from the net namespace which the socket is assigned to. https://jira.sw.ru/browse/PSBM-51041 Fixes: 32efdd408fad ("vzprivnet: Do not execute vzprivnet_hook inside CT") [1] Signed-off-by: Vladimir Davydov Cc: Pavel Tikhomirov --- net/ipv4/netfilter/ip_vzprivnet.c | 3 ++- net/ipv6/netfilter/ip6_vzprivnet.c | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/net/ipv4/netfilter/ip_vzprivnet.c b/net/ipv4/netfilter/ip_vzprivnet.c index 293b2802b4c2..798a3ced3ef3 100644 --- a/net/ipv4/netfilter/ip_vzprivnet.c +++ b/net/ipv4/netfilter/ip_vzprivnet.c @@ -250,8 +250,9 @@ static unsigned int vzprivnet_hook(struct sk_buff *skb, int can_be_bridge) { struct dst_entry *dst; unsigned int pmark = VZPRIV_MARK_UNKNOWN; + struct net *src_net = skb->dev ? skb->dev->nd_net : sock_net(skb->sk); - if (!ve_is_super(skb->dev->nd_net->owner_ve)) + if (!ve_is_super(src_net->owner_ve)) return NF_ACCEPT; dst = skb_dst(skb); diff --git a/net/ipv6/netfilter/ip6_vzprivnet.c b/net/ipv6/netfilter/ip6_vzprivnet.c index ee9c3c972637..36cc1d4c5aa0 100644 --- a/net/ipv6/netfilter/ip6_vzprivnet.c +++ b/net/ipv6/netfilter/ip6_vzprivnet.c @@ -484,8 +484,9 @@ static unsigned int vzprivnet6_hook(struct sk_buff *skb, int can_be_bridge) int verdict = NF_DROP; struct vzprivnet *dst, *src; struct ipv6hdr *hdr; + struct net *src_net = skb->dev ? skb->dev->nd_net : sock_net(skb->sk); - if (!ve_is_super(skb->dev->nd_net->owner_ve)) + if (!ve_is_super(src_net->owner_ve)) return NF_ACCEPT; hdr = ipv6_hdr(skb); -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] mm: memcontrol: add memory.numa_migrate file
The new file is supposed to be used for migrating pages accounted to a memory cgroup to a particular set of numa nodes. The reason to add it is that currently there's no API for migrating unmapped file pages used for storing page cache (neither migrate_pages syscall nor cpuset subsys doesn't provide this functionality). The file is added to the memory cgroup and has the following format: NODELIST[ MAX_SCAN] where NODELIST is a comma-separated list of ranges N1-N2 specifying the set of nodes to migrate pages of this cgroup to, and the optional MAX_SCAN imposes a limit on the number of pages that can be migrated in one go. The call may be interrupted by a signal, in which case -EINTR is returned. https://jira.sw.ru/browse/PSBM-50875 Signed-off-by: Vladimir Davydov Cc: Andrey Ryabinin Cc: Igor Redko Cc: Konstantin Neumoin --- mm/memcontrol.c | 223 1 file changed, 223 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e3a16b99ccc6..8c6c4fb9c153 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -54,6 +54,7 @@ #include #include #include +#include #include "internal.h" #include #include @@ -5697,6 +5698,223 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, seq_putc(m, '\n'); return 0; } + +/* + * memcg_numa_migrate_new_page() private argument. @target_nodes specifies the + * set of nodes to allocate pages from. @current_node is the current preferable + * node, it gets rotated after each allocation. + */ +struct memcg_numa_migrate_struct { + nodemask_t *target_nodes; + int current_node; +}; + +/* + * Used as an argument for migrate_pages(). Allocated pages are spread evenly + * among destination nodes. 
+ */ +static struct page *memcg_numa_migrate_new_page(struct page *page, + unsigned long private, int **result) +{ + struct memcg_numa_migrate_struct *ms = (void *)private; + gfp_t gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_NORETRY | __GFP_NOWARN; + + ms->current_node = next_node(ms->current_node, *ms->target_nodes); + if (ms->current_node >= MAX_NUMNODES) { + ms->current_node = first_node(*ms->target_nodes); + BUG_ON(ms->current_node >= MAX_NUMNODES); + } + + return __alloc_pages_nodemask(gfp_mask, 0, + node_zonelist(ms->current_node, gfp_mask), + ms->target_nodes); +} + +/* + * Isolate at most @nr_to_scan pages from @lruvec for further migration and + * store them in @dst. Returns the number of pages scanned. Return value of 0 + * means that @lruved is empty. + */ +static long memcg_numa_isolate_pages(struct lruvec *lruvec, enum lru_list lru, +long nr_to_scan, struct list_head *dst) +{ + struct list_head *src = &lruvec->lists[lru]; + struct zone *zone = lruvec_zone(lruvec); + long scanned = 0, taken = 0; + + spin_lock_irq(&zone->lru_lock); + while (!list_empty(src) && scanned < nr_to_scan && taken < nr_to_scan) { + struct page *page = list_last_entry(src, struct page, lru); + int nr_pages; + + VM_BUG_ON_PAGE(!PageLRU(page), page); + + scanned++; + + switch (__isolate_lru_page(page, ISOLATE_ASYNC_MIGRATE)) { + case 0: + nr_pages = hpage_nr_pages(page); + mem_cgroup_update_lru_size(lruvec, lru, -nr_pages); + list_move(&page->lru, dst); + taken += nr_pages; + break; + + case -EBUSY: + list_move(&page->lru, src); + continue; + + default: + BUG(); + } + } + __mod_zone_page_state(zone, NR_LRU_BASE + lru, -taken); + __mod_zone_page_state(zone, NR_ISOLATED_ANON + is_file_lru(lru), taken); + spin_unlock_irq(&zone->lru_lock); + + return scanned; +} + +static long __memcg_numa_migrate_pages(struct lruvec *lruvec, enum lru_list lru, + nodemask_t *target_nodes, long nr_to_scan) +{ + struct memcg_numa_migrate_struct ms = { + .target_nodes = target_nodes, + .current_node = -1, + 
}; + LIST_HEAD(pages); + long total_scanned = 0; + + /* +* If no limit on the maximal number of migrated pages is specified, +* assume the caller wants to migrate them all. +*/ + if (nr_to_scan < 0) + nr_to_scan = mem_cgroup_get_lru_size(lruvec, lru); + + while (total_scanned < nr_to_scan) { + int ret; + long scanned; + + scanned = memcg_numa_isolate
Re: [Devel] [PATCH 4/4] x86/arch_prctl/vdso: add ARCH_MAP_VDSO_*
On Tue, Jul 26, 2016 at 05:25:02PM +0300, Dmitry Safonov wrote:
> Add API to change vdso blob type with arch_prctl.
> As this is usefull only by needs of CRIU, expose
> this interface under CONFIG_CHECKPOINT_RESTORE.
>
> Cc: Andy Lutomirski
> Cc: Ingo Molnar
> Cc: Thomas Gleixner
> Cc: "H. Peter Anvin"
>
> [Differences to vanilla patches:
>  o API only for 32-bit vDSO mapping
>  o unmap previous vdso just by mm->context.vdso pointer]
> Signed-off-by: Dmitry Safonov

Reviewed-by: Vladimir Davydov
[Devel] [PATCH rh7] sched: make load balancing more aggressive
Currently, we only pull tasks if the destination cpu group load is below the average over the domain being rebalanced. This sounds reasonable, but only as long as there's no pinned tasks, otherwise we can get an unfair task distribution. For instance, suppose the host has 16 cores and there's a container pinned to two of the cores (either strictly by using cpumask or indirectly by setting cpulimit). If we start 16 tasks in the container, then the average load will be 1, so that even if 15 tasks turn out to run on the same cpu (out of 2), no tasks will be pulled, which is wrong. To overcome this issue, let's port the following patches from PCS6: diff-sched-balance-even-if-load-is-greater-than-average diff-sched-always-try-to-equalize-load-between-this-and-busiest-cpus-when-balancing They make the balance procedure pull tasks even if the destination is above average, by setting the imbalance value to be (source_load - destination_load) / 2 instead of (average_load - destination_load) / 2 This implies decreasing the convergence speed of the balancing procedure, but PCS6 has worked like that for quite a while, so it should be fine. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c | 9 + 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index cedd178f963c..685517597a30 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6618,7 +6618,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s /* How much load to actually move to equalise the imbalance */ env->imbalance = min( max_pull * busiest->group_power, - (sds->avg_load - this->avg_load) * this->group_power + (busiest->avg_load - this->avg_load) * this->group_power ) / SCHED_POWER_SCALE; /* @@ -6695,13 +6695,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env) if (this->avg_load >= busiest->avg_load) goto out_balanced; - /* -* Don't pull any tasks if this group is already above the domain -* average load. 
-*/ - if (this->avg_load >= sds.avg_load) - goto out_balanced; - if (env->idle == CPU_IDLE) { /* * This cpu is idle. If the busiest group load doesn't -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
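The effect of the imbalance change above can be illustrated with a small user-space model. This is only a sketch: the function name, the flat load units, and the omission of group_power scaling are all simplifications invented for illustration, not the kernel's actual code. It shows why pulling half of (busiest_load - this_load) keeps making progress even when the destination is already above the domain average, whereas the old (average_load - this_load) formula would yield zero.

```c
#include <assert.h>

/* Simplified model of the imbalance computation after the patch:
 * move half the gap between the busiest group and the destination,
 * capped by max_pull, so repeated balancing converges toward equal
 * load even if the destination exceeds the domain average. */
static unsigned long calc_imbalance(unsigned long busiest_load,
                                    unsigned long this_load,
                                    unsigned long max_pull)
{
    unsigned long gap;

    if (busiest_load <= this_load)
        return 0;
    gap = (busiest_load - this_load) / 2;
    return gap < max_pull ? gap : max_pull;
}
```

With 15 tasks on one cpu and 1 on the other (busiest_load 15, this_load 1), the old formula with a domain average of 8 and this_load already at 8 would pull nothing; the model above pulls 7, equalizing the pair in one step.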
[Devel] [PATCH rh7] sched: fair: fix dst pinned not set when failing migration due to cpulimit restriction
When failing task migration due to cpulimit restriction, we should set the dst pinned flag so that the load balancing procedure will proceed to the next cpu, just as in the case of a migration failure due to the affinity mask (see can_migrate_task). We have not done that since the rebase to 3.10.0-327.18.2.el7. Fix that. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e39ed4c17464..cedd178f963c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5446,15 +5446,17 @@ static inline int can_migrate_task_cpulimit(struct task_struct *p, struct lb_env schedstat_inc(p, se.statistics.nr_failed_migrations_cpulimit); + env->flags |= LBF_SOME_PINNED; + if (check_cpulimit_spread(tg, env->src_cpu) != 0) return 0; - if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED)) + if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED)) return 0; for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) { if (cfs_rq_active(tg->cfs_rq[cpu])) { - env->flags |= LBF_SOME_PINNED; + env->flags |= LBF_DST_PINNED; env->new_dst_cpu = cpu; break; } -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] sched: debug: show nr_failed_migrations_cpulimit
This is our (non-mainstream) counter counting how many times task migration failed due to cpulimit restriction. For some reason, we don't show it in proc although it might be helpful for debugging. Let's fix that. Signed-off-by: Vladimir Davydov --- kernel/sched/debug.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 55ac5fb78e29..6cf0c2ceedfe 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -594,6 +594,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m) P(se.statistics.nr_migrations_cold); P(se.statistics.nr_failed_migrations_affine); P(se.statistics.nr_failed_migrations_running); + P(se.statistics.nr_failed_migrations_cpulimit); P(se.statistics.nr_failed_migrations_hot); P(se.statistics.nr_forced_migrations); P(se.statistics.nr_wakeups); -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] sched: add WARN_ON's to debug task boosting
This patch ports * diff-sched-add-WARN_ONs-to-debug-task-boosting Added to 042stab114_2 Assert that we never have a boosted entity under a throttled hierarchy. Also, do not panic if, on set_next_entity, we find a boosted entity that is not on the list - just warn and carry on as if nothing happened. https://jira.sw.ru/browse/PSBM-44475 https://jira.sw.ru/browse/PSBM-50077 Signed-off-by: Vladimir Davydov Reviewed-by: Kirill Tkhai --- kernel/sched/fair.c | 22 +- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 515685f77217..e39ed4c17464 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -909,9 +909,10 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se) #ifdef CONFIG_CFS_BANDWIDTH static inline void update_entity_boost(struct sched_entity *se) { - if (!entity_is_task(se)) + if (!entity_is_task(se)) { se->boosted = cfs_rq_has_boosted_entities(group_cfs_rq(se)); - else { + WARN_ON(se->boosted && cfs_rq_throttled(group_cfs_rq(se))); + } else { struct task_struct *p = task_of(se); if (unlikely(p != current)) @@ -943,6 +944,8 @@ static inline void __enqueue_boosted_entity(struct cfs_rq *cfs_rq, static inline void __dequeue_boosted_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) { + if (WARN_ON(se->boost_node.next == LIST_POISON1)) + return; list_del(&se->boost_node); } @@ -953,8 +956,11 @@ static int enqueue_boosted_entity(struct cfs_rq *cfs_rq, if (se != cfs_rq->curr) __enqueue_boosted_entity(cfs_rq, se); se->boosted = 1; + WARN_ON(!entity_is_task(se) && + cfs_rq_throttled(group_cfs_rq(se))); return 1; - } + } else + WARN_ON(cfs_rq_throttled(group_cfs_rq(se))); return 0; } @@ -3847,6 +3853,8 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) */ static void check_enqueue_throttle(struct cfs_rq *cfs_rq) { + WARN_ON(cfs_rq_has_boosted_entities(cfs_rq)); + if (!cfs_bandwidth_used()) return; @@ -4150,8 +4158,10 @@ enqueue_task_fair(struct rq *rq, struct
task_struct *p, int flags) } else if (boost) { for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); - if (!enqueue_boosted_entity(cfs_rq, se)) + if (!enqueue_boosted_entity(cfs_rq, se)) { + WARN_ON(throttled_hierarchy(cfs_rq)); break; + } if (cfs_rq_throttled(cfs_rq)) unthrottle_cfs_rq(cfs_rq); } @@ -4213,8 +4223,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) cfs_rq = cfs_rq_of(se); cfs_rq->h_nr_running--; - if (cfs_rq_throttled(cfs_rq)) + if (cfs_rq_throttled(cfs_rq)) { + WARN_ON(boosted); break; + } if (boosted) boosted = dequeue_boosted_entity(cfs_rq, se); -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ploop: release plo->ctl_mutex for thaw_bdev in PLOOP_IOC_THAW handler
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.25 --> commit b81bf870a6c314b664977ee9b6747cf93e78bcc3 Author: Vladimir Davydov Date: Fri Jul 15 13:34:36 2016 +0400 ploop: release plo->ctl_mutex for thaw_bdev in PLOOP_IOC_THAW handler Recent patch to ploop 91a74e3b91a ("ploop: add PLOOP_IOC_FREEZE and PLOOP_IOC_THAW ioctls") introduced the following deadlock: Thread 1: [] sleep_on_buffer+0xe/0x20 [] __sync_dirty_buffer+0xb8/0xe0 [] sync_dirty_buffer+0x13/0x20 [] ext4_commit_super+0x1b0/0x240 [ext4] [] ext4_unfreeze+0x2d/0x40 [ext4] [] thaw_super+0x3f/0xb0 [] thaw_bdev+0x65/0x80 [] ploop_ioctl+0x6d0/0x29f0 [ploop] [] blkdev_ioctl+0x2df/0x770 [] block_ioctl+0x41/0x50 [] do_vfs_ioctl+0x255/0x4f0 [] SyS_ioctl+0x54/0xa0 [] system_call_fastpath+0x16/0x1b [] 0x Thread 2: [] ploop_pb_get_pending+0x163/0x290 [ploop] [] ploop_push_backup_io_get.isra.26+0x81/0x1b0 [ploop] [] ploop_push_backup_io+0x15b/0x260 [ploop] [] ploop_ioctl+0xe96/0x29f0 [ploop] [] blkdev_ioctl+0x2df/0x770 [] block_ioctl+0x41/0x50 [] do_vfs_ioctl+0x255/0x4f0 [] SyS_ioctl+0x54/0xa0 [] system_call_fastpath+0x16/0x1b [] 0x Here, thread 1 is thawing ploop with the PLOOP_IOC_THAW ioctl, which holds plo->ctl_mutex during its work. To thaw itself, ext4 has to commit some data. This commit triggers a push backup out-of-order request, which must be processed and acked by userspace to be completed. But userspace can't process it, because ploop_pb_get_pending() wants the same mutex. Thus, deadlock. Fix the deadlock by releasing the mutex before calling thaw_bdev and reacquiring it after thaw_bdev is done.
https://jira.sw.ru/browse/PSBM-49699 Reported-by: Pavel Borzenkov Signed-off-by: Vladimir Davydov Cc: Maxim Patlasov --- drivers/block/ploop/dev.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c index d52975eaaa36..3dc94ca5c393 100644 --- a/drivers/block/ploop/dev.c +++ b/drivers/block/ploop/dev.c @@ -4839,11 +4839,12 @@ static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev) if (!test_bit(PLOOP_S_FROZEN, &plo->state)) return 0; + plo->sb = NULL; + clear_bit(PLOOP_S_FROZEN, &plo->state); + + mutex_unlock(&plo->ctl_mutex); err = thaw_bdev(bdev, sb); - if (!err) { - plo->sb = NULL; - clear_bit(PLOOP_S_FROZEN, &plo->state); - } + mutex_lock(&plo->ctl_mutex); return err; } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] ploop: release plo->ctl_mutex for thaw_bdev in PLOOP_IOC_THAW handler
Recent patch to ploop 91a74e3b91a ("ploop: add PLOOP_IOC_FREEZE and PLOOP_IOC_THAW ioctls") introduced the following deadlock: Thread 1: [] sleep_on_buffer+0xe/0x20 [] __sync_dirty_buffer+0xb8/0xe0 [] sync_dirty_buffer+0x13/0x20 [] ext4_commit_super+0x1b0/0x240 [ext4] [] ext4_unfreeze+0x2d/0x40 [ext4] [] thaw_super+0x3f/0xb0 [] thaw_bdev+0x65/0x80 [] ploop_ioctl+0x6d0/0x29f0 [ploop] [] blkdev_ioctl+0x2df/0x770 [] block_ioctl+0x41/0x50 [] do_vfs_ioctl+0x255/0x4f0 [] SyS_ioctl+0x54/0xa0 [] system_call_fastpath+0x16/0x1b [] 0x Thread 2: [] ploop_pb_get_pending+0x163/0x290 [ploop] [] ploop_push_backup_io_get.isra.26+0x81/0x1b0 [ploop] [] ploop_push_backup_io+0x15b/0x260 [ploop] [] ploop_ioctl+0xe96/0x29f0 [ploop] [] blkdev_ioctl+0x2df/0x770 [] block_ioctl+0x41/0x50 [] do_vfs_ioctl+0x255/0x4f0 [] SyS_ioctl+0x54/0xa0 [] system_call_fastpath+0x16/0x1b [] 0x Here, thread 1 is thawing ploop with the PLOOP_IOC_THAW ioctl, which holds plo->ctl_mutex during its work. To thaw itself, ext4 has to commit some data. This commit triggers a push backup out-of-order request, which must be processed and acked by userspace to be completed. But userspace can't process it, because ploop_pb_get_pending() wants the same mutex. Thus, deadlock. Fix the deadlock by releasing the mutex before calling thaw_bdev and reacquiring it after thaw_bdev is done.
https://jira.sw.ru/browse/PSBM-49699 Reported-by: Pavel Borzenkov Signed-off-by: Vladimir Davydov Cc: Maxim Patlasov --- drivers/block/ploop/dev.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c index d52975eaaa36..3dc94ca5c393 100644 --- a/drivers/block/ploop/dev.c +++ b/drivers/block/ploop/dev.c @@ -4839,11 +4839,12 @@ static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev) if (!test_bit(PLOOP_S_FROZEN, &plo->state)) return 0; + plo->sb = NULL; + clear_bit(PLOOP_S_FROZEN, &plo->state); + + mutex_unlock(&plo->ctl_mutex); err = thaw_bdev(bdev, sb); - if (!err) { - plo->sb = NULL; - clear_bit(PLOOP_S_FROZEN, &plo->state); - } + mutex_lock(&plo->ctl_mutex); return err; } -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
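The shape of the fix - update device state while still holding the mutex, then drop it around the blocking call - can be modeled in a few lines of user-space C. This is a hedged sketch: the names ploop_thaw_model, ctl_lock, and blocking_thaw are invented stand-ins for the kernel's plo->ctl_mutex, plo->state bits, and thaw_bdev(); a flag replaces the real mutex so the invariant can be asserted.

```c
#include <assert.h>

static int ctl_locked;  /* stands in for plo->ctl_mutex */
static int frozen = 1;  /* stands in for PLOOP_S_FROZEN in plo->state */

static void ctl_lock(void)   { assert(!ctl_locked); ctl_locked = 1; }
static void ctl_unlock(void) { assert(ctl_locked);  ctl_locked = 0; }

static int blocking_thaw(void)
{
    /* thaw_bdev() can only complete if other ioctls may run, i.e.
     * if ctl_mutex is NOT held here -- this assert encodes the fix */
    assert(!ctl_locked);
    return 0;
}

/* Caller holds ctl_mutex, mirroring the ioctl handler. */
static int ploop_thaw_model(void)
{
    int err;

    if (!frozen)
        return 0;
    frozen = 0;          /* update state while still locked */
    ctl_unlock();        /* release around the blocking call */
    err = blocking_thaw();
    ctl_lock();          /* reacquire for the caller */
    return err;
}
```

Clearing the frozen state before dropping the lock is what keeps the model (and the patch) safe against a concurrent second thaw: a racer entering under the mutex sees frozen == 0 and returns early.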
[Devel] [PATCH RHEL7 COMMIT] sched: use topmost limited ancestor for cpulimit balancing
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.24 --> commit 0754d59aeddb72bc69ec10ebbe70dc41f03c16ab Author: Vladimir Davydov Date: Thu Jul 14 20:46:46 2016 +0400 sched: use topmost limited ancestor for cpulimit balancing We want to keep all processes of a container's cgroup packed on the minimal allowed number of cpus, which is set by the cpulimit. Doing this properly when deep hierarchies are used is tricky if not impossible w/o introducing tremendous overhead, so initially we implemented this feature exclusively for top-level cgroups. Now this isn't enough, as containers can be created in machine.slice. So in this patch we make cpulimit balancing work for the topmost cgroups that have a cpu limit set. This way, whether containers are created under the root or in machine.slice, cpulimit balancing will always be applied to the container's cgroup, as machine.slice isn't supposed to have a cpu limit set.
https://jira.sw.ru/browse/PSBM-49203 Signed-off-by: Vladimir Davydov --- kernel/sched/core.c | 62 kernel/sched/fair.c | 36 +- kernel/sched/sched.h | 2 ++ 3 files changed, 69 insertions(+), 31 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 94deef41f05a..657b8e4ba8d8 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7557,6 +7557,10 @@ void __init sched_init(void) #endif /* CONFIG_CGROUP_SCHED */ +#ifdef CONFIG_CFS_CPULIMIT + root_task_group.topmost_limited_ancestor = &root_task_group; +#endif + for_each_possible_cpu(i) { struct rq *rq; @@ -7882,6 +7886,8 @@ err: return ERR_PTR(-ENOMEM); } +static void tg_update_topmost_limited_ancestor(struct task_group *tg); + void sched_online_group(struct task_group *tg, struct task_group *parent) { unsigned long flags; @@ -7894,6 +7900,9 @@ void sched_online_group(struct task_group *tg, struct task_group *parent) tg->parent = parent; INIT_LIST_HEAD(&tg->children); list_add_rcu(&tg->siblings, &parent->children); + + tg_update_topmost_limited_ancestor(tg); + spin_unlock_irqrestore(&task_group_lock, flags); } @@ -8428,6 +8437,8 @@ const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime); +static void tg_limit_toggled(struct task_group *tg); + /* call with cfs_constraints_mutex held */ static int __tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) { @@ -8485,6 +8496,8 @@ static int __tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) unthrottle_cfs_rq(cfs_rq); raw_spin_unlock_irq(&rq->lock); } + if (runtime_enabled != runtime_was_enabled) + tg_limit_toggled(tg); return ret; } @@ -8662,6 +8675,49 @@ static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft, } #ifdef CONFIG_CFS_CPULIMIT +static int __tg_update_topmost_limited_ancestor(struct task_group *tg, void *unused) +{ + struct task_group *parent = tg->parent; + + /* +* Parent and none of its ancestors is limited?
The task group should +become a topmost limited ancestor then, provided it has a limit set. +Otherwise inherit topmost limited ancestor from the parent. +*/ + if (parent->topmost_limited_ancestor == parent && + parent->cfs_bandwidth.quota == RUNTIME_INF) + tg->topmost_limited_ancestor = tg; + else + tg->topmost_limited_ancestor = parent->topmost_limited_ancestor; + return 0; +} + +static void tg_update_topmost_limited_ancestor(struct task_group *tg) +{ + __tg_update_topmost_limited_ancestor(tg, NULL); +} + +static void tg_limit_toggled(struct task_group *tg) +{ + if (tg->topmost_limited_ancestor != tg) { + /* +* This task group is not a topmost limited ancestor, so both +* it and all its children must already point to their topmost +* limited ancestor, and we have nothing to do. +*/ + return; + } + + /* +* This task group is a topmost limited ancestor. Walk over all its +* children and update their pointers to the topmost limited ancestor. +*/ + + spin_lock_irq(&task_group_lock); + walk_tg_tree_from(tg, __tg_update_topmost_limited_ancestor, tg_nop, NULL); + spin_unlock_irq(&task_group_lock); +} + static void tg_update_cpu_limit(struct task_group *tg) { long quota, period; @@ -8736,6 +8792,12 @@ static int nr_cpus_wri
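The propagation rule in __tg_update_topmost_limited_ancestor() can be checked with a tiny user-space model. This is a sketch under stated assumptions: the struct tg below is a cut-down stand-in for task_group (only the three fields the rule touches), and update_topmost mirrors the patch's logic, not the kernel's actual data structures or locking.

```c
#include <assert.h>

#define RUNTIME_INF (~0ULL)

/* Minimal stand-in for task_group: just the fields the rule reads. */
struct tg {
    struct tg *parent;
    struct tg *topmost_limited_ancestor;
    unsigned long long quota;
};

/* Mirror of the patch's rule: if the parent is its own topmost limited
 * ancestor and is itself unlimited, this group becomes its own topmost
 * limited ancestor; otherwise it inherits the parent's pointer. */
static void update_topmost(struct tg *tg)
{
    struct tg *parent = tg->parent;

    if (parent->topmost_limited_ancestor == parent &&
        parent->quota == RUNTIME_INF)
        tg->topmost_limited_ancestor = tg;
    else
        tg->topmost_limited_ancestor = parent->topmost_limited_ancestor;
}
```

Walking a hierarchy root -> machine.slice (unlimited) -> CT (limited) -> child shows the intended behavior: the CT cgroup becomes the topmost limited ancestor for itself and everything below it, even though it is not a top-level cgroup.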
[Devel] [PATCH RHEL7 COMMIT] sched: account task_group->nr_cpus_active for all cgroups
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.24 --> commit 2d2b5ae36a8af86450d7cdae8dcc981e3430b06d Author: Vladimir Davydov Date: Thu Jul 14 20:46:41 2016 +0400 sched: account task_group->nr_cpus_active for all cgroups Currently nr_cpus_active is only accounted for top-level cgroups, because container cgroups, which are the only users of this counter, could only be created under the root cgroup. Now things have changed, and containers can reside either under the root or under machine.slice or in any other cgroup depending on the host's config. So we can't preserve this little optimization anymore. Remove it. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fd25e1e8ae5b..70a5861d4166 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3039,7 +3039,7 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq); static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { - if (is_top_cfs_rq(cfs_rq) && !cfs_rq->load.weight) + if (!cfs_rq->load.weight) inc_nr_active_cfs_rqs(cfs_rq); /* @@ -3163,7 +3163,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) update_min_vruntime(cfs_rq); update_cfs_shares(cfs_rq); - if (is_top_cfs_rq(cfs_rq) && !cfs_rq->load.weight) + if (!cfs_rq->load.weight) dec_nr_active_cfs_rqs(cfs_rq, flags & DEQUEUE_TASK_SLEEP); } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] sched: make check_cpulimit_spread accept tg instead of cfs_rq
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.24 --> commit 7ae55375b703b6f386447e223df63dc1f217cc6f Author: Vladimir Davydov Date: Thu Jul 14 20:46:43 2016 +0400 sched: make check_cpulimit_spread accept tg instead of cfs_rq It only needs cfs_rq->tg, so let's pass it directly. This eases further modifications. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c | 57 - 1 file changed, 26 insertions(+), 31 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 70a5861d4166..52365f6a4e36 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -542,9 +542,8 @@ static enum hrtimer_restart sched_cfs_active_timer(struct hrtimer *timer) return HRTIMER_NORESTART; } -static inline int check_cpulimit_spread(struct cfs_rq *cfs_rq, int target_cpu) +static inline int check_cpulimit_spread(struct task_group *tg, int target_cpu) { - struct task_group *tg = cfs_rq->tg; int nr_cpus_active = atomic_read(&tg->nr_cpus_active); int nr_cpus_limit = DIV_ROUND_UP(tg->cpu_rate, MAX_CPU_RATE); @@ -579,7 +578,7 @@ static inline enum hrtimer_restart sched_cfs_active_timer(struct hrtimer *timer) return 0; } -static inline int check_cpulimit_spread(struct cfs_rq *cfs_rq, int target_cpu) +static inline int check_cpulimit_spread(struct task_group *tg, int target_cpu) { return 1; } @@ -4717,18 +4716,15 @@ done: static inline bool select_runnable_cpu(struct task_struct *p, int *new_cpu) { - struct cfs_rq *cfs_rq; struct task_group *tg; struct sched_domain *sd; int prev_cpu = task_cpu(p); int cpu; - cfs_rq = top_cfs_rq_of(&p->se); - if (check_cpulimit_spread(cfs_rq, *new_cpu) > 0) + tg = top_cfs_rq_of(&p->se)->tg; + if (check_cpulimit_spread(tg, *new_cpu) > 0) return false; - tg = cfs_rq->tg; - if (cfs_rq_active(tg->cfs_rq[*new_cpu])) return true; @@ -5084,7 +5080,7 @@ static int cpulimit_balance_cpu_stop(void *data); static inline void 
trigger_cpulimit_balance(struct task_struct *p) { struct rq *this_rq; - struct cfs_rq *cfs_rq; + struct task_group *tg; int this_cpu, cpu, target_cpu = -1; struct sched_domain *sd; @@ -5094,8 +5090,8 @@ static inline void trigger_cpulimit_balance(struct task_struct *p) if (!p->se.on_rq || this_rq->active_balance) return; - cfs_rq = top_cfs_rq_of(&p->se); - if (check_cpulimit_spread(cfs_rq, this_cpu) >= 0) + tg = top_cfs_rq_of(&p->se)->tg; + if (check_cpulimit_spread(tg, this_cpu) >= 0) return; rcu_read_lock(); @@ -5105,7 +5101,7 @@ static inline void trigger_cpulimit_balance(struct task_struct *p) for_each_cpu_and(cpu, sched_domain_span(sd), tsk_cpus_allowed(p)) { if (cpu != this_cpu && - cfs_rq_active(cfs_rq->tg->cfs_rq[cpu])) { + cfs_rq_active(tg->cfs_rq[cpu])) { target_cpu = cpu; goto unlock; } @@ -5471,22 +5467,22 @@ static inline bool migrate_degrades_locality(struct task_struct *p, static int can_migrate_task(struct task_struct *p, struct lb_env *env) { - struct cfs_rq *cfs_rq = top_cfs_rq_of(&p->se); + struct task_group *tg = top_cfs_rq_of(&p->se)->tg; int tsk_cache_hot = 0; - if (check_cpulimit_spread(cfs_rq, env->dst_cpu) < 0) { + if (check_cpulimit_spread(tg, env->dst_cpu) < 0) { int cpu; schedstat_inc(p, se.statistics.nr_failed_migrations_cpulimit); - if (check_cpulimit_spread(cfs_rq, env->src_cpu) != 0) + if (check_cpulimit_spread(tg, env->src_cpu) != 0) return 0; if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED)) return 0; for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) { - if (cfs_rq_active(cfs_rq->tg->cfs_rq[cpu])) { + if (cfs_rq_active(tg->cfs_rq[cpu])) { env->flags |= LBF_SOME_PINNED; env->new_dst_cpu = cpu; break; @@ -5719,7 +5715,7 @@ static int move_task_group(struct cfs_rq *cfs_rq, struct lb_env *env) static int move_task_groups(struct lb_env *env) { - struct cfs_rq *cfs_rq, *top_cfs_rq; + struct cfs_rq *cfs_rq; struct
[Devel] [PATCH RHEL7 COMMIT] sched: cleanup !CFS_CPULIMIT code
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.24 --> commit 5e6211be102fd95bb3bd1c807cf91a235cc10a77 Author: Vladimir Davydov Date: Thu Jul 14 20:46:44 2016 +0400 sched: cleanup !CFS_CPULIMIT code Let's move all CFS_CPULIMIT related functions under CFS_CPULIMIT ifdef. This will ease further patching. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c | 39 ++- 1 file changed, 18 insertions(+), 21 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 52365f6a4e36..2ff38fc1d600 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -560,11 +560,6 @@ static inline int check_cpulimit_spread(struct task_group *tg, int target_cpu) return cfs_rq_active(tg->cfs_rq[target_cpu]) ? 0 : -1; } #else /* !CONFIG_CFS_CPULIMIT */ -static inline int cfs_rq_active(struct cfs_rq *cfs_rq) -{ - return 1; -} - static inline void inc_nr_active_cfs_rqs(struct cfs_rq *cfs_rq) { } @@ -572,16 +567,6 @@ static inline void inc_nr_active_cfs_rqs(struct cfs_rq *cfs_rq) static inline void dec_nr_active_cfs_rqs(struct cfs_rq *cfs_rq, int postpone) { } - -static inline enum hrtimer_restart sched_cfs_active_timer(struct hrtimer *timer) -{ - return 0; -} - -static inline int check_cpulimit_spread(struct task_group *tg, int target_cpu) -{ - return 1; -} #endif /* CONFIG_CFS_CPULIMIT */ static __always_inline @@ -4716,6 +4701,7 @@ done: static inline bool select_runnable_cpu(struct task_struct *p, int *new_cpu) { +#ifdef CONFIG_CFS_CPULIMIT struct task_group *tg; struct sched_domain *sd; int prev_cpu = task_cpu(p); @@ -4741,6 +4727,7 @@ static inline bool select_runnable_cpu(struct task_struct *p, int *new_cpu) } } } +#endif return false; } @@ -5461,14 +5448,10 @@ static inline bool migrate_degrades_locality(struct task_struct *p, } #endif -/* - * can_migrate_task - may task p from runqueue rq be migrated to this_cpu? 
- */ -static -int can_migrate_task(struct task_struct *p, struct lb_env *env) +static inline int can_migrate_task_cpulimit(struct task_struct *p, struct lb_env *env) { +#ifdef CONFIG_CFS_CPULIMIT struct task_group *tg = top_cfs_rq_of(&p->se)->tg; - int tsk_cache_hot = 0; if (check_cpulimit_spread(tg, env->dst_cpu) < 0) { int cpu; @@ -5490,6 +5473,20 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) } return 0; } +#endif + return 1; +} + +/* + * can_migrate_task - may task p from runqueue rq be migrated to this_cpu? + */ +static +int can_migrate_task(struct task_struct *p, struct lb_env *env) +{ + int tsk_cache_hot = 0; + + if (!can_migrate_task_cpulimit(p, env)) + return 0; /* * We do not migrate tasks that are: ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] arch: x86: perf_event_intel: do not taint kernel when irq loop is stuck
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.24 --> commit 098b10ef2ff6088c2e3df8949045a094ab37bf52 Author: Vladimir Davydov Date: Thu Jul 14 20:42:36 2016 +0400 arch: x86: perf_event_intel: do not taint kernel when irq loop is stuck Presumably, this happens when a perf counter overflows. This might be a hardware bug, which needs a workaround. We don't have enough knowledge to fix it or investigate further. Since the issue is rare and can't lead to a system crash, we can turn a blind eye to it. Nevertheless, it taints the kernel, which results in test failure. To avoid that, let's replace WARN with pr_warn. https://jira.sw.ru/browse/PSBM-49258 Signed-off-by: Vladimir Davydov --- arch/x86/kernel/cpu/perf_event_intel.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index 5106e8378d96..9f2a12e5e553 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -1579,8 +1579,13 @@ static int intel_pmu_handle_irq(struct pt_regs *regs) again: intel_pmu_ack_status(status); if (++loops > 100) { - WARN_ONCE(1, "perfevents: irq loop stuck!\n"); - perf_event_print_debug(); + static bool warned = false; + if (!warned) { + pr_warn("perfevents: irq loop stuck!\n"); + dump_stack(); + perf_event_print_debug(); + warned = true; + } intel_pmu_reset(); goto done; } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] arch: x86: perf_event_intel: do not taint kernel when irq loop is stuck
Presumably, this happens when a perf counter overflows. This might be a hardware bug, which needs a workaround. We don't have enough knowledge to fix it or investigate further. Since the issue is rare and can't lead to a system crash, we can turn a blind eye to it. Nevertheless, it taints the kernel, which results in test failure. To avoid that, let's replace WARN with pr_warn. https://jira.sw.ru/browse/PSBM-49258 Signed-off-by: Vladimir Davydov --- arch/x86/kernel/cpu/perf_event_intel.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index 5106e8378d96..9f2a12e5e553 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -1579,8 +1579,13 @@ static int intel_pmu_handle_irq(struct pt_regs *regs) again: intel_pmu_ack_status(status); if (++loops > 100) { - WARN_ONCE(1, "perfevents: irq loop stuck!\n"); - perf_event_print_debug(); + static bool warned = false; + if (!warned) { + pr_warn("perfevents: irq loop stuck!\n"); + dump_stack(); + perf_event_print_debug(); + warned = true; + } intel_pmu_reset(); goto done; } -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
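The replacement keeps WARN_ONCE's fire-exactly-once behavior via a function-local static flag, just without the taint. A minimal user-space model of that pattern, with warn_count standing in for the pr_warn()/dump_stack()/perf_event_print_debug() side effects (names here are illustrative, not kernel API):

```c
#include <assert.h>

static int warn_count; /* counts how many times the report fired */

static void report_stuck_irq_loop(void)
{
    static int warned; /* mirrors "static bool warned" in the patch */

    if (!warned) {
        warned = 1;
        warn_count++; /* would be pr_warn() + dump_stack() in the kernel */
    }
}
```

However many times the stuck-loop path is hit, the diagnostics are emitted once, which is all the original WARN_ONCE provided apart from the taint.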
[Devel] [PATCH RHEL7 COMMIT] mm: default collapse huge pages if there's at least 1/4th ptes mapped
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.24 --> commit ee04e199550d1a1b517306e729aaa56d9c045399 Author: Vladimir Davydov Date: Wed Jul 13 20:52:40 2016 +0400 mm: default collapse huge pages if there's at least 1/4th ptes mapped A huge page may be collapsed by khugepaged if there's not more than khugepaged_max_ptes_none unmapped ptes (configured via sysfs). The latter equals 511 (HPAGE_PMD_NR - 1) by default, which results in noticeable growth in memory footprint if a process has a sparse address space. Experiments have shown (see bug-id below) that decreasing the threshold down to 384 (3/4*HPAGE_PMD_NR) results in no performance degradation for VMs and CTs and at the same time improves test results for VMs (because qemu has a sparse heap). So let's set it by default. https://jira.sw.ru/browse/PSBM-48885 Signed-off-by: Vladimir Davydov --- mm/huge_memory.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7543156e8d39..3c23df1d3392 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -58,11 +58,10 @@ static DEFINE_MUTEX(khugepaged_mutex); static DEFINE_SPINLOCK(khugepaged_mm_lock); static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait); /* - * default collapse hugepages if there is at least one pte mapped like - * it would have happened if the vma was large enough during page - * fault. + * default collapse hugepages if there is at least 1/4th ptes mapped + * to avoid memory footprint growth due to fragmentation */ -static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1; +static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR*3/4; static int khugepaged(void *none); static int khugepaged_slab_init(void); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
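The arithmetic behind the new default is easy to verify. Assuming the usual x86-64 configuration (2M huge pages over 4K base pages), HPAGE_PMD_NR is 512; the helper below is a simplified illustration of khugepaged's threshold check, not the kernel's actual code path.

```c
#include <assert.h>

#define HPAGE_PMD_NR 512 /* ptes per huge page: 2M / 4K */

/* A candidate region may be collapsed into a huge page if the number
 * of unmapped (none) ptes does not exceed max_ptes_none. */
static int may_collapse(int ptes_none, unsigned int max_ptes_none)
{
    return ptes_none <= (int)max_ptes_none;
}
```

With the old default of 511, a region with a single mapped pte (511 none) still collapsed; with 384 = HPAGE_PMD_NR*3/4, at least 128 ptes (a quarter) must be mapped, which is what limits footprint growth for sparse address spaces.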
[Devel] [PATCH RHEL7 COMMIT] fs: make overlayfs disabled in CT by default
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.24 --> commit af1bf9e1a067c1186501ef5415acdb62a33e8c22 Author: Maxim Patlasov Date: Wed Jul 13 20:52:37 2016 +0400 fs: make overlayfs disabled in CT by default Overlayfs is in "TECH PREVIEW" state right now. If we let CT users freely mount and exercise overlayfs, we risk crashing the whole node. Let's disable it for CT users by default. Customers who need it (e.g. to run Docker in CT) may enable it like this: # echo 1 > /proc/sys/fs/experimental_fs_enable The patch is a temporary (awkward) workaround until we make overlayfs production-ready. Then we'll roll back the patch. https://jira.sw.ru/browse/PSBM-49629 Signed-off-by: Maxim Patlasov Reviewed-by: Vladimir Davydov --- fs/filesystems.c | 8 +++- fs/overlayfs/super.c | 2 +- include/linux/fs.h | 4 kernel/sysctl.c | 7 +++ 4 files changed, 19 insertions(+), 2 deletions(-) diff --git a/fs/filesystems.c b/fs/filesystems.c index beaba560979f..670d228e9c56 100644 --- a/fs/filesystems.c +++ b/fs/filesystems.c @@ -16,6 +16,9 @@ #include #include +/* Affects ability of CT users to mount fs marked as FS_EXPERIMENTAL */ +int sysctl_experimental_fs_enable; + /* * Handling of filesystem drivers list.
* Rules: @@ -219,7 +222,10 @@ int __init get_filesystem_list(char *buf) static inline bool filesystem_permitted(const struct file_system_type *fs) { - return ve_is_super(get_exec_env()) || (fs->fs_flags & FS_VIRTUALIZED); + return ve_is_super(get_exec_env()) || + (fs->fs_flags & FS_VIRTUALIZED) || + ((fs->fs_flags & FS_EXPERIMENTAL) && +sysctl_experimental_fs_enable); } #ifdef CONFIG_PROC_FS diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c index c20cfe977cdf..d5c57b4b5983 100644 --- a/fs/overlayfs/super.c +++ b/fs/overlayfs/super.c @@ -1129,7 +1129,7 @@ static struct file_system_type ovl_fs_type = { .name = "overlay", .mount = ovl_mount, .kill_sb= kill_anon_super, - .fs_flags = FS_VIRTUALIZED, + .fs_flags = FS_EXPERIMENTAL, }; MODULE_ALIAS_FS("overlay"); diff --git a/include/linux/fs.h b/include/linux/fs.h index 7203dbadbbf9..f1c3d5be60d8 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -59,6 +59,8 @@ extern struct inodes_stat_t inodes_stat; extern int leases_enable, lease_break_time; extern int sysctl_protected_symlinks; extern int sysctl_protected_hardlinks; +extern int sysctl_experimental_fs_enable; + struct buffer_head; typedef int (get_block_t)(struct inode *inode, sector_t iblock, @@ -2108,6 +2110,8 @@ struct file_system_type { #define FS_USERNS_MOUNT8 /* Can be mounted by userns root */ #define FS_USERNS_DEV_MOUNT16 /* A userns mount does not imply MNT_NODEV */ #define FS_VIRTUALIZED 64 /* Can mount this fstype inside ve */ +#define FS_EXPERIMENTAL128 /* Ability to mount this fstype inside ve +* is governed by experimental_fs_enable */ #define FS_HAS_RM_XQUOTA 256 /* KABI: fs has the rm_xquota quota op */ #define FS_HAS_INVALIDATE_RANGE512 /* FS has new ->invalidatepage with length arg */ #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. 
*/ diff --git a/kernel/sysctl.c b/kernel/sysctl.c index c8f7bc34c590..e59dd3be92dd 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1781,6 +1781,13 @@ static struct ctl_table fs_table[] = { .proc_handler = &pipe_proc_fn, .extra1 = &pipe_min_size, }, + { + .procname = "experimental_fs_enable", + .data = &sysctl_experimental_fs_enable, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, { } }; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
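The gating logic in filesystem_permitted() boils down to a three-way OR, which a small user-space model makes explicit. This is a sketch: fs_permitted and in_host are invented stand-ins for the kernel's filesystem_permitted() and ve_is_super(get_exec_env()); the flag values mirror the patch.

```c
#include <assert.h>

#define FS_VIRTUALIZED  64  /* can always be mounted inside a ve/CT */
#define FS_EXPERIMENTAL 128 /* mountable in CT only if sysctl enabled */

static int sysctl_experimental_fs_enable; /* /proc/sys/fs knob, default 0 */

/* Host is always permitted; CT users get virtualized filesystems, plus
 * experimental ones when the admin has opted in via the sysctl. */
static int fs_permitted(int in_host, int fs_flags)
{
    return in_host ||
           (fs_flags & FS_VIRTUALIZED) ||
           ((fs_flags & FS_EXPERIMENTAL) && sysctl_experimental_fs_enable);
}
```

Since the patch also changes overlayfs from FS_VIRTUALIZED to FS_EXPERIMENTAL, a CT mount of overlayfs fails until the admin writes 1 to the sysctl, while the host is unaffected.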
[Devel] [PATCH RHEL7 COMMIT] ploop: add PLOOP_IOC_FREEZE and PLOOP_IOC_THAW ioctls
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.24 --> commit 91a74e3b91ab9aa0d4be64b01f86a14c36b3279e Author: Maxim Patlasov Date: Wed Jul 13 20:52:28 2016 +0400 ploop: add PLOOP_IOC_FREEZE and PLOOP_IOC_THAW ioctls The ioctls simply freeze and thaw ploop bdev. If no fs is mounted over ploop bdev being frozen, then the freeze ioctl just increments bd_fsfreeze_count, which prevents the ploop from being mounted until it is thawed. https://jira.sw.ru/browse/PSBM-49091 Caveats: 1) No nested freeze: many PLOOP_IOC_FREEZE ioctls have the same effect as one. 2) The same for thaw. [vdavydov@: allow to freeze unmounted ploop] Signed-off-by: Maxim Patlasov Signed-off-by: Vladimir Davydov Cc: Pavel Borzenkov --- drivers/block/ploop/dev.c | 39 +++ include/linux/ploop/ploop.h| 2 ++ include/linux/ploop/ploop_if.h | 6 ++ 3 files changed, 47 insertions(+) diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c index e5f010b9aeba..d52975eaaa36 100644 --- a/drivers/block/ploop/dev.c +++ b/drivers/block/ploop/dev.c @@ -4815,6 +4815,39 @@ static int ploop_push_backup_stop(struct ploop_device *plo, unsigned long arg) return copy_to_user((void*)arg, &ctl, sizeof(ctl)); } +static int ploop_freeze(struct ploop_device *plo, struct block_device *bdev) +{ + struct super_block *sb = plo->sb; + + if (test_bit(PLOOP_S_FROZEN, &plo->state)) + return 0; + + sb = freeze_bdev(bdev); + if (sb && IS_ERR(sb)) + return PTR_ERR(sb); + + plo->sb = sb; + set_bit(PLOOP_S_FROZEN, &plo->state); + return 0; +} + +static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev) +{ + struct super_block *sb = plo->sb; + int err; + + if (!test_bit(PLOOP_S_FROZEN, &plo->state)) + return 0; + + err = thaw_bdev(bdev, sb); + if (!err) { + plo->sb = NULL; + clear_bit(PLOOP_S_FROZEN, &plo->state); + } + + return err; +} + static int ploop_ioctl(struct block_device *bdev, fmode_t fmode, 
unsigned int cmd, unsigned long arg) { @@ -4928,6 +4961,12 @@ static int ploop_ioctl(struct block_device *bdev, fmode_t fmode, unsigned int cm case PLOOP_IOC_PUSH_BACKUP_STOP: err = ploop_push_backup_stop(plo, arg); break; + case PLOOP_IOC_FREEZE: + err = ploop_freeze(plo, bdev); + break; + case PLOOP_IOC_THAW: + err = ploop_thaw(plo, bdev); + break; default: err = -EINVAL; } diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h index deee8a78cc96..7864edf17f19 100644 --- a/include/linux/ploop/ploop.h +++ b/include/linux/ploop/ploop.h @@ -61,6 +61,7 @@ enum { (for minor mgmt only) */ PLOOP_S_ONCE, /* An event (e.g. printk once) happened */ PLOOP_S_PUSH_BACKUP,/* Push_backup is in progress */ + PLOOP_S_FROZEN /* Frozen PLOOP_IOC_FREEZE */ }; struct ploop_snapdata @@ -409,6 +410,7 @@ struct ploop_device struct block_device *bdev; struct request_queue*queue; struct task_struct *thread; + struct super_block *sb; struct rb_node link; /* someone who wants to quiesce state-machine waits diff --git a/include/linux/ploop/ploop_if.h b/include/linux/ploop/ploop_if.h index a098ca9d0ef0..302ace984a5a 100644 --- a/include/linux/ploop/ploop_if.h +++ b/include/linux/ploop/ploop_if.h @@ -352,6 +352,12 @@ struct ploop_track_extent /* Stop push backup */ #define PLOOP_IOC_PUSH_BACKUP_STOP _IOR(PLOOPCTLTYPE, 31, struct ploop_push_backup_stop_ctl) +/* Freeze FS mounted over ploop */ +#define PLOOP_IOC_FREEZE _IO(PLOOPCTLTYPE, 32) + +/* Unfreeze FS mounted over ploop */ +#define PLOOP_IOC_THAW _IO(PLOOPCTLTYPE, 33) + /* Events exposed via /sys/block/ploopN/pstate/event */ #define PLOOP_EVENT_ABORTED1 #define PLOOP_EVENT_STOPPED2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 3/4] sched: cleanup !CFS_CPULIMIT code
Let's move all CFS_CPULIMIT related functions under CFS_CPULIMIT ifdef. This will ease further patching. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c | 39 ++- 1 file changed, 18 insertions(+), 21 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 52365f6a4e36..2ff38fc1d600 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -560,11 +560,6 @@ static inline int check_cpulimit_spread(struct task_group *tg, int target_cpu) return cfs_rq_active(tg->cfs_rq[target_cpu]) ? 0 : -1; } #else /* !CONFIG_CFS_CPULIMIT */ -static inline int cfs_rq_active(struct cfs_rq *cfs_rq) -{ - return 1; -} - static inline void inc_nr_active_cfs_rqs(struct cfs_rq *cfs_rq) { } @@ -572,16 +567,6 @@ static inline void inc_nr_active_cfs_rqs(struct cfs_rq *cfs_rq) static inline void dec_nr_active_cfs_rqs(struct cfs_rq *cfs_rq, int postpone) { } - -static inline enum hrtimer_restart sched_cfs_active_timer(struct hrtimer *timer) -{ - return 0; -} - -static inline int check_cpulimit_spread(struct task_group *tg, int target_cpu) -{ - return 1; -} #endif /* CONFIG_CFS_CPULIMIT */ static __always_inline @@ -4716,6 +4701,7 @@ done: static inline bool select_runnable_cpu(struct task_struct *p, int *new_cpu) { +#ifdef CONFIG_CFS_CPULIMIT struct task_group *tg; struct sched_domain *sd; int prev_cpu = task_cpu(p); @@ -4741,6 +4727,7 @@ static inline bool select_runnable_cpu(struct task_struct *p, int *new_cpu) } } } +#endif return false; } @@ -5461,14 +5448,10 @@ static inline bool migrate_degrades_locality(struct task_struct *p, } #endif -/* - * can_migrate_task - may task p from runqueue rq be migrated to this_cpu? 
- */ -static -int can_migrate_task(struct task_struct *p, struct lb_env *env) +static inline int can_migrate_task_cpulimit(struct task_struct *p, struct lb_env *env) { +#ifdef CONFIG_CFS_CPULIMIT struct task_group *tg = top_cfs_rq_of(&p->se)->tg; - int tsk_cache_hot = 0; if (check_cpulimit_spread(tg, env->dst_cpu) < 0) { int cpu; @@ -5490,6 +5473,20 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) } return 0; } +#endif + return 1; +} + +/* + * can_migrate_task - may task p from runqueue rq be migrated to this_cpu? + */ +static +int can_migrate_task(struct task_struct *p, struct lb_env *env) +{ + int tsk_cache_hot = 0; + + if (!can_migrate_task_cpulimit(p, env)) + return 0; /* * We do not migrate tasks that are: -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 4/4] sched: use topmost limited ancestor for cpulimit balancing
We want to keep all processes of a container's cgroup packed on the minimal allowed number of cpus, which is set by the cpulimit. Doing this properly when deep hierarchies are used is tricky if not impossible w/o introducing tremendous overhead, so initially we implemented this feature exclusively for top-level cgroups. Now this isn't enough, as containers can be created in machine.slice. So in this patch we make cpulimit balancing work for topmost cgroups that have a cpu limit set. This way, no matter if containers are created under the root or in machine.slice, cpulimit balancing will always be applied to the container's cgroup, as machine.slice isn't supposed to have a cpu limit set. https://jira.sw.ru/browse/PSBM-49203 Signed-off-by: Vladimir Davydov --- kernel/sched/core.c | 62 kernel/sched/fair.c | 36 +- kernel/sched/sched.h | 2 ++ 3 files changed, 69 insertions(+), 31 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 94deef41f05a..657b8e4ba8d8 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7557,6 +7557,10 @@ void __init sched_init(void) #endif /* CONFIG_CGROUP_SCHED */ +#ifdef CONFIG_CFS_CPULIMIT + root_task_group.topmost_limited_ancestor = &root_task_group; +#endif + for_each_possible_cpu(i) { struct rq *rq; @@ -7882,6 +7886,8 @@ err: return ERR_PTR(-ENOMEM); } +static void tg_update_topmost_limited_ancestor(struct task_group *tg); + void sched_online_group(struct task_group *tg, struct task_group *parent) { unsigned long flags; @@ -7894,6 +7900,9 @@ void sched_online_group(struct task_group *tg, struct task_group *parent) tg->parent = parent; INIT_LIST_HEAD(&tg->children); list_add_rcu(&tg->siblings, &parent->children); + + tg_update_topmost_limited_ancestor(tg); + spin_unlock_irqrestore(&task_group_lock, flags); } @@ -8428,6 +8437,8 @@ const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime); +static void tg_limit_toggled(struct task_group 
*tg); + /* call with cfs_constraints_mutex held */ static int __tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) { @@ -8485,6 +8496,8 @@ static int __tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) unthrottle_cfs_rq(cfs_rq); raw_spin_unlock_irq(&rq->lock); } + if (runtime_enabled != runtime_was_enabled) + tg_limit_toggled(tg); return ret; } @@ -8662,6 +8675,49 @@ static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft, } #ifdef CONFIG_CFS_CPULIMIT +static int __tg_update_topmost_limited_ancestor(struct task_group *tg, void *unused) +{ + struct task_group *parent = tg->parent; + + /* +* Parent and none of its ancestors is limited? The task group should +* become a topmost limited ancestor then, provided it has a limit set. +* Otherwise inherit topmost limited ancestor from the parent. +*/ + if (parent->topmost_limited_ancestor == parent && + parent->cfs_bandwidth.quota == RUNTIME_INF) + tg->topmost_limited_ancestor = tg; + else + tg->topmost_limited_ancestor = parent->topmost_limited_ancestor; + return 0; +} + +static void tg_update_topmost_limited_ancestor(struct task_group *tg) +{ + __tg_update_topmost_limited_ancestor(tg, NULL); +} + +static void tg_limit_toggled(struct task_group *tg) +{ + if (tg->topmost_limited_ancestor != tg) { + /* +* This task group is not a topmost limited ancestor, so both +* it and all its children must already point to their topmost +* limited ancestor, and we have nothing to do. +*/ + return; + } + + /* +* This task group is a topmost limited ancestor. Walk over all its +* children and update their pointers to the topmost limited ancestor. 
+*/ + + spin_lock_irq(&task_group_lock); + walk_tg_tree_from(tg, __tg_update_topmost_limited_ancestor, tg_nop, NULL); + spin_unlock_irq(&task_group_lock); +} + static void tg_update_cpu_limit(struct task_group *tg) { long quota, period; @@ -8736,6 +8792,12 @@ static int nr_cpus_write_u64(struct cgroup *cgrp, struct cftype *cftype, return tg_set_cpu_limit(tg, tg->cpu_rate, nr_cpus); } #else +static void tg_update_topmost_limited_ancestor(struct task_group *tg) +{ +} +static void tg_limit_toggled(struct task_group *tg) +{ +} static void tg_update_cpu_limit(struct task_group *tg) { } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2ff38fc1d600..515685f77217 100644 -
[Devel] [PATCH rh7 2/4] sched: make check_cpulimit_spread accept tg instead of cfs_rq
It only needs cfs_rq->tg, so let's pass it directly. This eases further modifications. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c | 57 - 1 file changed, 26 insertions(+), 31 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 70a5861d4166..52365f6a4e36 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -542,9 +542,8 @@ static enum hrtimer_restart sched_cfs_active_timer(struct hrtimer *timer) return HRTIMER_NORESTART; } -static inline int check_cpulimit_spread(struct cfs_rq *cfs_rq, int target_cpu) +static inline int check_cpulimit_spread(struct task_group *tg, int target_cpu) { - struct task_group *tg = cfs_rq->tg; int nr_cpus_active = atomic_read(&tg->nr_cpus_active); int nr_cpus_limit = DIV_ROUND_UP(tg->cpu_rate, MAX_CPU_RATE); @@ -579,7 +578,7 @@ static inline enum hrtimer_restart sched_cfs_active_timer(struct hrtimer *timer) return 0; } -static inline int check_cpulimit_spread(struct cfs_rq *cfs_rq, int target_cpu) +static inline int check_cpulimit_spread(struct task_group *tg, int target_cpu) { return 1; } @@ -4717,18 +4716,15 @@ done: static inline bool select_runnable_cpu(struct task_struct *p, int *new_cpu) { - struct cfs_rq *cfs_rq; struct task_group *tg; struct sched_domain *sd; int prev_cpu = task_cpu(p); int cpu; - cfs_rq = top_cfs_rq_of(&p->se); - if (check_cpulimit_spread(cfs_rq, *new_cpu) > 0) + tg = top_cfs_rq_of(&p->se)->tg; + if (check_cpulimit_spread(tg, *new_cpu) > 0) return false; - tg = cfs_rq->tg; - if (cfs_rq_active(tg->cfs_rq[*new_cpu])) return true; @@ -5084,7 +5080,7 @@ static int cpulimit_balance_cpu_stop(void *data); static inline void trigger_cpulimit_balance(struct task_struct *p) { struct rq *this_rq; - struct cfs_rq *cfs_rq; + struct task_group *tg; int this_cpu, cpu, target_cpu = -1; struct sched_domain *sd; @@ -5094,8 +5090,8 @@ static inline void trigger_cpulimit_balance(struct task_struct *p) if (!p->se.on_rq || this_rq->active_balance) return; - cfs_rq = top_cfs_rq_of(&p->se); - if 
(check_cpulimit_spread(cfs_rq, this_cpu) >= 0) + tg = top_cfs_rq_of(&p->se)->tg; + if (check_cpulimit_spread(tg, this_cpu) >= 0) return; rcu_read_lock(); @@ -5105,7 +5101,7 @@ static inline void trigger_cpulimit_balance(struct task_struct *p) for_each_cpu_and(cpu, sched_domain_span(sd), tsk_cpus_allowed(p)) { if (cpu != this_cpu && - cfs_rq_active(cfs_rq->tg->cfs_rq[cpu])) { + cfs_rq_active(tg->cfs_rq[cpu])) { target_cpu = cpu; goto unlock; } @@ -5471,22 +5467,22 @@ static inline bool migrate_degrades_locality(struct task_struct *p, static int can_migrate_task(struct task_struct *p, struct lb_env *env) { - struct cfs_rq *cfs_rq = top_cfs_rq_of(&p->se); + struct task_group *tg = top_cfs_rq_of(&p->se)->tg; int tsk_cache_hot = 0; - if (check_cpulimit_spread(cfs_rq, env->dst_cpu) < 0) { + if (check_cpulimit_spread(tg, env->dst_cpu) < 0) { int cpu; schedstat_inc(p, se.statistics.nr_failed_migrations_cpulimit); - if (check_cpulimit_spread(cfs_rq, env->src_cpu) != 0) + if (check_cpulimit_spread(tg, env->src_cpu) != 0) return 0; if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED)) return 0; for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) { - if (cfs_rq_active(cfs_rq->tg->cfs_rq[cpu])) { + if (cfs_rq_active(tg->cfs_rq[cpu])) { env->flags |= LBF_SOME_PINNED; env->new_dst_cpu = cpu; break; @@ -5719,7 +5715,7 @@ static int move_task_group(struct cfs_rq *cfs_rq, struct lb_env *env) static int move_task_groups(struct lb_env *env) { - struct cfs_rq *cfs_rq, *top_cfs_rq; + struct cfs_rq *cfs_rq; struct task_group *tg; unsigned long load; int cur_pulled, pulled = 0; @@ -5728,8 +5724,7 @@ static int move_task_groups(struct lb_env *env) return 0; for_each_leaf_cfs_rq(env->src_rq, cfs_rq) { - tg = cfs_rq->tg; - if (tg == &root_task_group) + if (cfs_rq->tg == &root_task_group)
[Devel] [PATCH rh7 1/4] sched: account task_group->nr_cpus_active for all cgroups
Currently nr_cpus_active is only accounted for top-level cgroups, because container cgroups, which are the only users of this counter, could only be created under the root cgroup. Now things have changed, and containers can reside either under the root or under machine.slice or in any other cgroup depending on the host's config. So we can't preserve this little optimization anymore. Remove it. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fd25e1e8ae5b..70a5861d4166 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3039,7 +3039,7 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq); static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { - if (is_top_cfs_rq(cfs_rq) && !cfs_rq->load.weight) + if (!cfs_rq->load.weight) inc_nr_active_cfs_rqs(cfs_rq); /* @@ -3163,7 +3163,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) update_min_vruntime(cfs_rq); update_cfs_shares(cfs_rq); - if (is_top_cfs_rq(cfs_rq) && !cfs_rq->load.weight) + if (!cfs_rq->load.weight) dec_nr_active_cfs_rqs(cfs_rq, flags & DEQUEUE_TASK_SLEEP); } -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 0/4] sched: fix degradation caused by moving containers to machine.slice
We have a hack in the scheduler that makes containers' processes run on the minimal allowed number of cpus, which dramatically improves performance of some tests if a container has 1 or 2 cpus. This hack depends on the fact that containers are located under the root cgroup, so moving them to machine.slice broke it. This patch set fixes it. https://jira.sw.ru/browse/PSBM-49203 Vladimir Davydov (4): sched: account task_group->nr_cpus_active for all cgroups sched: make check_cpulimit_spread accept tg instead of cfs_rq sched: cleanup !CFS_CPULIMIT code sched: use topmost limited ancestor for cpulimit balancing kernel/sched/core.c | 62 ++ kernel/sched/fair.c | 120 ++- kernel/sched/sched.h | 2 + 3 files changed, 107 insertions(+), 77 deletions(-) -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 v2] ploop: add PLOOP_IOC_FREEZE and PLOOP_IOC_THAW ioctls
From: Maxim Patlasov The ioctls simply freeze and thaw ploop bdev. If no fs is mounted over ploop bdev being frozen, then the freeze ioctl just increments bd_fsfreeze_count, which prevents the ploop from being mounted until it is thawed. https://jira.sw.ru/browse/PSBM-49091 Caveats: 1) No nested freeze: many PLOOP_IOC_FREEZE ioctls have the same effect as one. 2) The same for thaw. [vdavydov@: allow to freeze unmounted ploop] Signed-off-by: Maxim Patlasov Signed-off-by: Vladimir Davydov Cc: Pavel Borzenkov --- Changes in v2: - avoid patching generic code drivers/block/ploop/dev.c | 39 +++ include/linux/ploop/ploop.h| 2 ++ include/linux/ploop/ploop_if.h | 6 ++ 3 files changed, 47 insertions(+) diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c index e5f010b9aeba..d52975eaaa36 100644 --- a/drivers/block/ploop/dev.c +++ b/drivers/block/ploop/dev.c @@ -4815,6 +4815,39 @@ static int ploop_push_backup_stop(struct ploop_device *plo, unsigned long arg) return copy_to_user((void*)arg, &ctl, sizeof(ctl)); } +static int ploop_freeze(struct ploop_device *plo, struct block_device *bdev) +{ + struct super_block *sb = plo->sb; + + if (test_bit(PLOOP_S_FROZEN, &plo->state)) + return 0; + + sb = freeze_bdev(bdev); + if (sb && IS_ERR(sb)) + return PTR_ERR(sb); + + plo->sb = sb; + set_bit(PLOOP_S_FROZEN, &plo->state); + return 0; +} + +static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev) +{ + struct super_block *sb = plo->sb; + int err; + + if (!test_bit(PLOOP_S_FROZEN, &plo->state)) + return 0; + + err = thaw_bdev(bdev, sb); + if (!err) { + plo->sb = NULL; + clear_bit(PLOOP_S_FROZEN, &plo->state); + } + + return err; +} + static int ploop_ioctl(struct block_device *bdev, fmode_t fmode, unsigned int cmd, unsigned long arg) { @@ -4928,6 +4961,12 @@ static int ploop_ioctl(struct block_device *bdev, fmode_t fmode, unsigned int cm case PLOOP_IOC_PUSH_BACKUP_STOP: err = ploop_push_backup_stop(plo, arg); break; + case PLOOP_IOC_FREEZE: + err = 
ploop_freeze(plo, bdev); + break; + case PLOOP_IOC_THAW: + err = ploop_thaw(plo, bdev); + break; default: err = -EINVAL; } diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h index deee8a78cc96..7864edf17f19 100644 --- a/include/linux/ploop/ploop.h +++ b/include/linux/ploop/ploop.h @@ -61,6 +61,7 @@ enum { (for minor mgmt only) */ PLOOP_S_ONCE, /* An event (e.g. printk once) happened */ PLOOP_S_PUSH_BACKUP,/* Push_backup is in progress */ + PLOOP_S_FROZEN /* Frozen PLOOP_IOC_FREEZE */ }; struct ploop_snapdata @@ -409,6 +410,7 @@ struct ploop_device struct block_device *bdev; struct request_queue*queue; struct task_struct *thread; + struct super_block *sb; struct rb_node link; /* someone who wants to quiesce state-machine waits diff --git a/include/linux/ploop/ploop_if.h b/include/linux/ploop/ploop_if.h index a098ca9d0ef0..302ace984a5a 100644 --- a/include/linux/ploop/ploop_if.h +++ b/include/linux/ploop/ploop_if.h @@ -352,6 +352,12 @@ struct ploop_track_extent /* Stop push backup */ #define PLOOP_IOC_PUSH_BACKUP_STOP _IOR(PLOOPCTLTYPE, 31, struct ploop_push_backup_stop_ctl) +/* Freeze FS mounted over ploop */ +#define PLOOP_IOC_FREEZE _IO(PLOOPCTLTYPE, 32) + +/* Unfreeze FS mounted over ploop */ +#define PLOOP_IOC_THAW _IO(PLOOPCTLTYPE, 33) + /* Events exposed via /sys/block/ploopN/pstate/event */ #define PLOOP_EVENT_ABORTED1 #define PLOOP_EVENT_STOPPED2 -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7 1/4] fs: do not fail on double freeze bdev w/o sb
On Tue, Jul 12, 2016 at 03:02:11PM -0700, Maxim Patlasov wrote: > Let's keep flies and cutlets separately. It seems we can easily satisfy > push-backup needs by implementing freeze/thaw ploop ioctls without tackling > generic code at all, see a patch in attachment (unless I missed something > obvious). And apart from these ploop/push-backup stuff, if you think your > changes for freeze_bdev() and thaw_bdev() are useful, send them upstream, so > we'll back-port them later, when they are accepted upstream (unless I missed > some scenario for which those changes matter for us). In the other words, I > think we have to keep our vz7 generic code base closer to ms, unless we have > good reason to deviate. Agree. Generally, I like your patch more than mine, but I've a concern about it - see below. > > On 07/12/2016 03:04 AM, Vladimir Davydov wrote: > >It's possible to freeze a bdev which is not mounted. In this case > >freeze_bdev() only increments bd_fsfrozen_count in order to prevent the > >bdev from being mounted and does nothing else. A second freeze attempt > >on the same device is supposed to increment bd_fsfrozen_count again, but > >it results in NULL ptr dereference, because freeze_bdev() doesn't check > >the return value of get_super(). Fix that. > > > >Signed-off-by: Vladimir Davydov > >--- > > fs/block_dev.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > >diff --git a/fs/block_dev.c b/fs/block_dev.c > >index 4575c62d8b0b..325ee7161fbf 100644 > >--- a/fs/block_dev.c > >+++ b/fs/block_dev.c > >@@ -227,7 +227,8 @@ struct super_block *freeze_bdev(struct block_device > >*bdev) > > * thaw_bdev drops it. > > */ > > sb = get_super(bdev); > >-drop_super(sb); > >+if (sb) > >+drop_super(sb); > > mutex_unlock(&bdev->bd_fsfreeze_mutex); > > return sb; > > } > > The ioctls simply freeze and thaw ploop bdev. > > Caveats: > > 1) If no fs mounted, the ioctls have no effect. > 2) No nested freeze: many PLOOP_IOC_FREEZE ioctls have the same effect as one. 
> 3) The same for thaw. I think #2 and #3 are OK. But regarding #1 - what if we want to make a backup of a secondary ploop which is not mounted? So we try to freeze it and succeed, but it isn't actually frozen, so it can be mounted and modified while we're backing it up, which is incorrect AFAIU. What about something like this on top of your patch? diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c index 9a9cc8b0b934..d52975eaaa36 100644 --- a/drivers/block/ploop/dev.c +++ b/drivers/block/ploop/dev.c @@ -4819,16 +4819,15 @@ static int ploop_freeze(struct ploop_device *plo, struct block_device *bdev) { struct super_block *sb = plo->sb; - if (sb) + if (test_bit(PLOOP_S_FROZEN, &plo->state)) return 0; sb = freeze_bdev(bdev); if (sb && IS_ERR(sb)) return PTR_ERR(sb); - if (!sb) - thaw_bdev(bdev, sb); plo->sb = sb; + set_bit(PLOOP_S_FROZEN, &plo->state); return 0; } @@ -4837,12 +4836,14 @@ static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev) struct super_block *sb = plo->sb; int err; - if (!sb) + if (!test_bit(PLOOP_S_FROZEN, &plo->state)) return 0; err = thaw_bdev(bdev, sb); - if (!err) + if (!err) { plo->sb = NULL; + clear_bit(PLOOP_S_FROZEN, &plo->state); + } return err; } diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h index 6ae96c4486fe..7864edf17f19 100644 --- a/include/linux/ploop/ploop.h +++ b/include/linux/ploop/ploop.h @@ -61,6 +61,7 @@ enum { (for minor mgmt only) */ PLOOP_S_ONCE, /* An event (e.g. printk once) happened */ PLOOP_S_PUSH_BACKUP,/* Push_backup is in progress */ + PLOOP_S_FROZEN /* Frozen PLOOP_IOC_FREEZE */ }; struct ploop_snapdata ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] mm: default collapse huge pages if there's at least 1/4th ptes mapped
A huge page may be collapsed by khugepaged if there's not more than khugepaged_max_ptes_none unmapped ptes (configured via sysfs). The latter equals 511 (HPAGE_PMD_NR - 1) by default, which results in noticeable growth in memory footprint if a process has a sparse address space. Experiments have shown (see bug-id below) that decreasing the threshold down to 384 (3/4*HPAGE_PMD_NR) results in no performance degradation for VMs and CTs and at the same time improves test results for VMs (because qemu has a sparse heap). So let's set it by default. https://jira.sw.ru/browse/PSBM-48885 Signed-off-by: Vladimir Davydov --- mm/huge_memory.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7543156e8d39..3c23df1d3392 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -58,11 +58,10 @@ static DEFINE_MUTEX(khugepaged_mutex); static DEFINE_SPINLOCK(khugepaged_mm_lock); static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait); /* - * default collapse hugepages if there is at least one pte mapped like - * it would have happened if the vma was large enough during page - * fault. + * default collapse hugepages if there is at least 1/4th ptes mapped + * to avoid memory footprint growth due to fragmentation */ -static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1; +static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR*3/4; static int khugepaged(void *none); static int khugepaged_slab_init(void); -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 3/4] fs: export get_active_super
It is required by the next patch. Signed-off-by: Vladimir Davydov --- fs/super.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/super.c b/fs/super.c index 50ac29391b96..e52b7db23c8f 100644 --- a/fs/super.c +++ b/fs/super.c @@ -680,6 +680,7 @@ restart: spin_unlock(&sb_lock); return NULL; } +EXPORT_SYMBOL(get_active_super); struct super_block *user_get_super(dev_t dev) { -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 4/4] ploop: add PLOOP_IOC_FREEZE and PLOOP_IOC_THAW ioctls
These ioctls simply freeze and thaw ploop bdev, respectively (i.e. FS mounted over ploop device). They are required by ploop push backup for freezing secondary ploops mounted inside containers. The point is that these mount points are not shown in host's /proc/mounts due to mount namespace, so there's no easy way for push backup process to get the mount point given a device name in order to call FIFREEZE ioctl. (Actually, there's a way - using /proc/PID/mounts and /proc/PID/root where PID is the pid of a container's process, but it's cumbersome). https://jira.sw.ru/browse/PSBM-49091 Signed-off-by: Vladimir Davydov Cc: Pavel Borzenkov --- drivers/block/ploop/dev.c | 28 include/linux/ploop/ploop_if.h | 6 ++ 2 files changed, 34 insertions(+) diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c index e5f010b9aeba..d2b3c9fd9176 100644 --- a/drivers/block/ploop/dev.c +++ b/drivers/block/ploop/dev.c @@ -4815,6 +4815,28 @@ static int ploop_push_backup_stop(struct ploop_device *plo, unsigned long arg) return copy_to_user((void*)arg, &ctl, sizeof(ctl)); } +static int ploop_freeze(struct block_device *bdev) +{ + struct super_block *sb; + + sb = freeze_bdev(bdev); + if (sb && IS_ERR(sb)) + return PTR_ERR(sb); + return 0; +} + +static int ploop_thaw(struct block_device *bdev) +{ + struct super_block *sb; + int err; + + sb = get_active_super(bdev); + err = thaw_bdev(bdev, sb); + if (sb) + deactivate_super(sb); + return err; +} + static int ploop_ioctl(struct block_device *bdev, fmode_t fmode, unsigned int cmd, unsigned long arg) { @@ -4928,6 +4950,12 @@ static int ploop_ioctl(struct block_device *bdev, fmode_t fmode, unsigned int cm case PLOOP_IOC_PUSH_BACKUP_STOP: err = ploop_push_backup_stop(plo, arg); break; + case PLOOP_IOC_FREEZE: + err = ploop_freeze(bdev); + break; + case PLOOP_IOC_THAW: + err = ploop_thaw(bdev); + break; default: err = -EINVAL; } diff --git a/include/linux/ploop/ploop_if.h b/include/linux/ploop/ploop_if.h index a098ca9d0ef0..302ace984a5a 
100644 --- a/include/linux/ploop/ploop_if.h +++ b/include/linux/ploop/ploop_if.h @@ -352,6 +352,12 @@ struct ploop_track_extent /* Stop push backup */ #define PLOOP_IOC_PUSH_BACKUP_STOP _IOR(PLOOPCTLTYPE, 31, struct ploop_push_backup_stop_ctl) +/* Freeze FS mounted over ploop */ +#define PLOOP_IOC_FREEZE _IO(PLOOPCTLTYPE, 32) + +/* Unfreeze FS mounted over ploop */ +#define PLOOP_IOC_THAW _IO(PLOOPCTLTYPE, 33) + /* Events exposed via /sys/block/ploopN/pstate/event */ #define PLOOP_EVENT_ABORTED1 #define PLOOP_EVENT_STOPPED2 -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 2/4] fs: fix thaw_bdev return value in case bdev is not frozen
We should return -EINVAL in this case, but instead return 0. Also, remove a duplicate code block, while we're here. Signed-off-by: Vladimir Davydov --- fs/block_dev.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 325ee7161fbf..0310d6402cf5 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -274,14 +274,11 @@ int thaw_bdev(struct block_device *bdev, struct super_block *sb) goto out; error = thaw_super(sb); - if (error) { + if (error) bdev->bd_fsfreeze_count++; - mutex_unlock(&bdev->bd_fsfreeze_mutex); - return error; - } out: mutex_unlock(&bdev->bd_fsfreeze_mutex); - return 0; + return error; } EXPORT_SYMBOL(thaw_bdev); -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 1/4] fs: do not fail on double freeze bdev w/o sb
It's possible to freeze a bdev which is not mounted. In this case freeze_bdev() only increments bd_fsfrozen_count in order to prevent the bdev from being mounted and does nothing else. A second freeze attempt on the same device is supposed to increment bd_fsfrozen_count again, but it results in NULL ptr dereference, because freeze_bdev() doesn't check the return value of get_super(). Fix that. Signed-off-by: Vladimir Davydov --- fs/block_dev.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 4575c62d8b0b..325ee7161fbf 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -227,7 +227,8 @@ struct super_block *freeze_bdev(struct block_device *bdev) * thaw_bdev drops it. */ sb = get_super(bdev); - drop_super(sb); + if (sb) + drop_super(sb); mutex_unlock(&bdev->bd_fsfreeze_mutex); return sb; } -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default (v2)
On Thu, Jul 07, 2016 at 01:00:36PM -0700, Maxim Patlasov wrote: > Overlayfs is in "TECH PREVIEW" state right now. Letting CT users to freely > mount and exercise overlayfs, we risk to have the whole node crashed. > > Let's disable it for CT users by default. Customers who need it (e.g. to > run Docker in CT) may enable it like this: > > # echo 1 > /proc/sys/fs/experimental_fs_enable > > The patch is a temporary (awkward) workaround until we make overlayfs > production-ready. Then we'll roll back the patch. > > Changed in v2: > - let's only leave system-wide sysctl for permitting overlayfs; the sysctl >is "rw" in ve0, but "ro" inside CT. > > https://jira.sw.ru/browse/PSBM-47981 Reviewed-by: Vladimir Davydov ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] mm: memcontrol: carefully check for user charges while reparenting
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.23 --> commit 50feb34e21530c4b27eae2836c4630d2b104476f Author: Vladimir Davydov Date: Thu Jul 7 18:33:25 2016 +0400 mm: memcontrol: carefully check for user charges while reparenting kmem is uncharged before res, therefore when checking if there are still user charges in a memory cgroup, we should read res before kmem, otherwise a kmem uncharge can get in-between two reads, leading to false-positive res <= kmem. Add smp_rmb() to guarantee this never happens. Note, since x86 doesn't reorder reads, this patch doesn't actually introduce any functional changes - it just clarifies the code. Fixes: 35c0d2a992aa ("mm: memcontrol: fix race between kmem uncharge and charge reparenting") Signed-off-by: Vladimir Davydov --- mm/memcontrol.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1f525f27e481..8151d4259c6b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4814,7 +4814,7 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg, static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg) { int node, zid; - u64 usage; + u64 res, kmem; do { /* This is for making all *used* pages to be on LRU. */ @@ -4845,10 +4845,17 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg) * so the lru seemed empty but the page could have been added * right after the check. RES_USAGE should be safe as we always * charge before adding to the LRU. +* +* Note, we must read memcg->res strictly before memcg->kmem, +* because otherwise a kmem charge might get uncharged in +* between the two reads leading to res <= kmem, even though +* there are still user pages charged to this cgroup out there. 
+* (see also comment in memcg_charge_kmem()) */ - usage = res_counter_read_u64(&memcg->res, RES_USAGE) - - res_counter_read_u64(&memcg->kmem, RES_USAGE); - } while (usage > 0); + res = res_counter_read_u64(&memcg->res, RES_USAGE); + smp_rmb(); + kmem = res_counter_read_u64(&memcg->kmem, RES_USAGE); + } while (res > kmem); } /* ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
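The commit's ordering argument can be replayed deterministically by making the racy interleaving explicit: a kmem uncharge lands between the two counter reads. Reading res first stays conservative (res > kmem, so the loop retries); the reverse order produces the false-positive res <= kmem. A user-space sketch, where the explicit interleave stands in for what smp_rmb() guarantees in the kernel:

```c
#include <assert.h>

/* One user page and one kmem page charged; kmem is accounted into res. */
static long res, kmem;

static void kmem_uncharge(void)
{
    res -= 1;
    kmem -= 1;
}

/*
 * Read the two counters with a kmem uncharge landing in between, in either
 * order. Returns 1 if the check still sees the remaining user page
 * (res > kmem), 0 on the false-positive described in the commit message.
 */
static int sees_user_pages(int read_res_first)
{
    long r, k;

    res = 2;
    kmem = 1;
    if (read_res_first) {
        r = res;            /* patched order; smp_rmb() sits between reads */
        kmem_uncharge();    /* the racy interleaving, made explicit */
        k = kmem;
    } else {
        k = kmem;           /* broken order */
        kmem_uncharge();
        r = res;
    }
    return r > k;
}
```

In the broken order the stale kmem value (1) equals the freshly decremented res (1), so the loop would exit even though one user page is still charged.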
[Devel] [PATCH RHEL7 COMMIT] dcache: fix dentry leak when shrink races with kill
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.23 --> commit 88b8ca0fe3055767d9b92495b3e28f587b923d04 Author: Vladimir Davydov Date: Thu Jul 7 18:33:13 2016 +0400 dcache: fix dentry leak when shrink races with kill dentry_kill() does not free a dentry in case it is on a shrink list - see __dentry_kill() -> d_free(). Instead it just marks it DCACHE_MAY_FREE, which will make the shrinker free it when it's done with it. This is required to avoid use after free in shrink_dentry_list(). This logic was back-ported by commit e33cae748d1a ("ms/dcache: dentry_kill(): don't try to remove from shrink list"). When back-porting this commit I accidentally missed a hunk for shrink_dentry_list(). The hunk makes shrink_dentry_list() more carefully check dentry->d_lockref.count, i.e. instead of merely checking if it's 0 or not, it makes it check if it's strictly greater than 0. Without this check a dentry might leak if shrink races with kill, because before trying to free a dentry, dentry_kill() first calls lockref_mark_dead(&dentry->d_lockref), which sets d_lockref.count to -128, so that shrink_dentry_list() will silently skip the dentry instead of freeing it. This patch resurrects the missing hunk. https://jira.sw.ru/browse/PSBM-49321 Fixes: e33cae748d1a ("ms/dcache: dentry_kill(): don't try to remove from shrink list") Signed-off-by: Vladimir Davydov --- fs/dcache.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/dcache.c b/fs/dcache.c index 09ed486c9f1d..6433814a02d2 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -874,7 +874,7 @@ static void shrink_dentry_list(struct list_head *list) * We found an inuse dentry which was not removed from * the LRU because of laziness during lookup. Do not free it.
*/ - if (dentry->d_lockref.count) { + if ((int)dentry->d_lockref.count > 0) { spin_unlock(&dentry->d_lock); if (parent) spin_unlock(&parent->d_lock); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
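The one-character-looking change hides a signedness point: lockref_mark_dead() stores -128 in the count, and since the field is an unsigned int, a mere non-zero check makes a dead dentry look "in use" and skips freeing it. A minimal sketch, assuming a plain unsigned int stand-in for d_lockref.count:

```c
#include <assert.h>

#define LOCKREF_DEAD (-128)  /* what lockref_mark_dead() stores in the count */

/* Unpatched check: any non-zero count means "in use, skip freeing". */
static int skips_dentry_old(unsigned int count)
{
    return count != 0;
}

/* Patched check: only a strictly positive count means "in use". */
static int skips_dentry_new(unsigned int count)
{
    return (int)count > 0;
}
```

For a live reference (count 1) and a free dentry (count 0) the two checks agree; only for the dead sentinel do they diverge, and that divergence is exactly the leak the patch closes.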
[Devel] [PATCH RHEL7 COMMIT] propogate_mnt: Handle the first propogated copy being a slave
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.23 --> commit 529c728822cb387ac470cd10969c5b1458467a94 Author: Eric W. Biederman Date: Thu Jul 7 18:33:19 2016 +0400 propogate_mnt: Handle the first propogated copy being a slave When the first propgated copy was a slave the following oops would result: > BUG: unable to handle kernel NULL pointer dereference at 0010 > IP: [] propagate_one+0xbe/0x1c0 > PGD bacd4067 PUD bac66067 PMD 0 > Oops: [#1] SMP > Modules linked in: > CPU: 1 PID: 824 Comm: mount Not tainted 4.6.0-rc5userns+ #1523 > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 > task: 8800bb0a8000 ti: 8800bac3c000 task.ti: 8800bac3c000 > RIP: 0010:[] [] propagate_one+0xbe/0x1c0 > RSP: 0018:8800bac3fd38 EFLAGS: 00010283 > RAX: RBX: 8800bb77ec00 RCX: 0010 > RDX: RSI: 8800bb58c000 RDI: 8800bb58c480 > RBP: 8800bac3fd48 R08: 0001 R09: > R10: 1ca1 R11: 1c9d R12: > R13: 8800ba713800 R14: 8800bac3fda0 R15: 8800bb77ec00 > FS: 7f3c0cd9b7e0() GS:8800bfb0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 0010 CR3: bb79d000 CR4: 06e0 > Stack: > 8800bb77ec00 8800bac3fd88 811fbf85 > 8800bac3fd98 8800bb77f080 8800ba713800 8800bb262b40 > 8800bac3fdd8 811f1da0 > Call Trace: > [] propagate_mnt+0x105/0x140 > [] attach_recursive_mnt+0x120/0x1e0 > [] graft_tree+0x63/0x70 > [] do_add_mount+0x9b/0x100 > [] do_mount+0x2aa/0xdf0 > [] ? strndup_user+0x4e/0x70 > [] SyS_mount+0x75/0xc0 > [] do_syscall_64+0x4b/0xa0 > [] entry_SYSCALL64_slow_path+0x25/0x25 > Code: 00 00 75 ec 48 89 0d 02 22 22 01 8b 89 10 01 00 00 48 89 05 fd 21 22 01 39 8e 10 01 00 00 0f 84 e0 00 00 00 48 8b 80 d8 00 00 00 <48> 8b 50 10 48 89 05 df 21 22 01 48 89 15 d0 21 22 01 8b 53 30 > RIP [] propagate_one+0xbe/0x1c0 > RSP > CR2: 0010 > ---[ end trace 2725ecd95164f217 ]--- This oops happens with the namespace_sem held and can be triggered by non-root users. An all around not pleasant experience. 
To avoid this scenario when finding the appropriate source mount to copy stop the walk up the mnt_master chain when the first source mount is encountered. Further rewrite the walk up the last_source mnt_master chain so that it is clear what is going on. The reason why the first source mount is special is that it it's mnt_parent is not a mount in the dest_mnt propagation tree, and as such termination conditions based up on the dest_mnt mount propgation tree do not make sense. To avoid other kinds of confusion last_dest is not changed when computing last_source. last_dest is only used once in propagate_one and that is above the point of the code being modified, so changing the global variable is meaningless and confusing. Cc: sta...@vger.kernel.org fixes: f2ebb3a921c1ca1e2ddd9242e95a1989a50c4c68 ("smarter propagate_mnt()") Reported-by: Tycho Andersen Reviewed-by: Seth Forshee Tested-by: Seth Forshee Signed-off-by: "Eric W. Biederman" (cherry picked from commit 5ec0811d30378ae104f250bfc9b3640242d81e3f) Signed-off-by: Vladimir Davydov Fixes: CVE-2016-4581 Conflicts: fs/pnode.c --- fs/pnode.c | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/fs/pnode.c b/fs/pnode.c index 74f10c0e7e00..cc9ac074ba00 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -198,7 +198,7 @@ static struct mount *next_group(struct mount *m, struct mount *origin) /* all accesses are serialized by namespace_sem */ static struct user_namespace *user_ns; -static struct mount *last_dest, *last_source, *dest_master; +static struct mount *last_dest, *first_source, *last_source, *dest_master; static struct mountpoint *mp; static struct list_head *list; @@ -216,22 +216,23 @@ static int propagate_one(struct mount *m) type = CL_MAKE_SHARED; } else { struct mount *n, *p; + bool done; for (n = m; ; n = p) { p = n->mnt_master; - if (p == dest_master || IS_MNT_MARKED(p)) { - while (last_dest->mnt_master !
[Devel] [PATCH RHEL7 COMMIT] mm: memcontrol: fix race between user memory reparent and charge
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.23 --> commit 9c3a30958ef74ed70bcc92ecf3727d36dbe463a0 Author: Vladimir Davydov Date: Thu Jul 7 18:33:30 2016 +0400 mm: memcontrol: fix race between user memory reparent and charge When a memory cgroup is destroyed (via rmdir), user memory pages accounted to it get recharged to the parent cgroup - see mem_cgroup_css_offline() and mem_cgroup_reparent_charges(). If, for some reason, a page is left charged to the destroyed cgroup after mem_cgroup_reparent_charges() was done, we might get a use-after-free, because user memory charges do not hold a reference to the cgroup. And it seems to be possible in case reparenting races with __mem_cgroup_try_charge() as follows: __mem_cgroup_try_charge get memcg from mm, inc ref mem_cgroup_css_offline mem_cgroup_reparent_charges charge page to memcg put ref to memcg To fix this issue, let's make __mem_cgroup_try_charge() cancel the charge if it detects that the cgroup was destroyed. https://jira.sw.ru/browse/PSBM-49117 Signed-off-by: Vladimir Davydov --- mm/memcontrol.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8151d4259c6b..e3a16b99ccc6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -314,6 +314,7 @@ struct mem_cgroup { * Should the accounting and control be hierarchical, per subtree? */ bool use_hierarchy; + bool is_offline; unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */ bool oom_lock; @@ -2952,6 +2953,23 @@ again: } } while (ret != CHARGE_OK); + /* +* Cancel charge in case this cgroup was destroyed while we were here, +* otherwise we can get a pending user memory charge to an offline +* cgroup, which might result in use-after-free after the cgroup gets +* released (see also mem_cgroup_css_offline()).
+* +* Note, no need to issue an explicit barrier here, because a +* successful charge implies full memory barrier. +*/ + if (unlikely(memcg->is_offline)) { + res_counter_uncharge(&memcg->res, batch * PAGE_SIZE); + if (do_swap_account) + res_counter_uncharge(&memcg->memsw, batch * PAGE_SIZE); + css_put(&memcg->css); + goto bypass; + } + if (batch > nr_pages) refill_stock(memcg, batch - nr_pages); @@ -6657,6 +6675,15 @@ static void mem_cgroup_css_offline(struct cgroup *cont) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + /* +* Mark memory cgroup as offline before going to reparent charges. +* This guarantees that __mem_cgroup_try_charge() either charges before +* reparenting starts or doesn't charge at all, hence we won't have +* pending user memory charges after reparenting is done. +*/ + memcg->is_offline = true; + smp_mb(); + memcg_deactivate_kmem(memcg); mem_cgroup_invalidate_reclaim_iterators(memcg); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
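The handshake added by the patch has two cooperating sides: the offline path publishes is_offline before reparenting, and the charge path charges first and only then re-checks the flag, cancelling on a hit, so a charge can never stay pending on a dead cgroup. A single-threaded stand-in model (not the memcg code; the explicit ordering here plays the role of the barriers):

```c
#include <assert.h>

/* A stand-in model of the memcg offline/charge handshake, not kernel code. */
struct memcg_sim {
    int  is_offline;
    long res;                        /* outstanding charges */
};

/* Charge side: charge optimistically, then re-check the offline flag. */
static int try_charge(struct memcg_sim *m, long pages)
{
    m->res += pages;
    if (m->is_offline) {             /* cgroup was destroyed under us */
        m->res -= pages;             /* cancel: uncharge what we took */
        return -1;                   /* caller bypasses the dead memcg */
    }
    return 0;
}

/* Offline side: publish the flag first, then reparent what is left. */
static void css_offline(struct memcg_sim *m)
{
    m->is_offline = 1;               /* smp_mb() would follow in the kernel */
    m->res = 0;                      /* stand-in for reparent_charges() */
}

/* Full sequence: returns 1 iff no charge is left pending after offline. */
static int no_pending_after_offline(void)
{
    struct memcg_sim m = { 0, 0 };

    if (try_charge(&m, 1) != 0)      /* charge while online succeeds */
        return 0;
    css_offline(&m);
    if (try_charge(&m, 1) != -1)     /* late charge must be cancelled */
        return 0;
    return m.res == 0;
}
```

Because the flag is set before reparenting and checked after charging, a charge either lands before the drain (and gets reparented) or sees the flag and backs out; there is no third outcome.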
[Devel] [PATCH RHEL7 COMMIT] overlayfs: verify upper dentry before unlink and rename
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.23 --> commit 4791f46af35069ef00b582a7f229974f6b16e5bd Author: Maxim Patlasov Date: Thu Jul 7 18:32:38 2016 +0400 overlayfs: verify upper dentry before unlink and rename Without this patch it is easy to crash node by fiddling with overlayfs dirs. Backport commit 11f37104 from ms: From: Miklos Szeredi ovl: verify upper dentry before unlink and rename Unlink and rename in overlayfs checked the upper dentry for staleness by verifying upper->d_parent against upperdir. However the dentry can go stale also by being unhashed, for example. Expand the verification to actually look up the name again (under parent lock) and check if it matches the upper dentry. This matches what the VFS does before passing the dentry to filesytem's unlink/rename methods, which excludes any inconsistency caused by overlayfs. Signed-off-by: Miklos Szeredi https://jira.sw.ru/browse/PSBM-47981 --- fs/overlayfs/dir.c | 59 +++--- 1 file changed, 38 insertions(+), 21 deletions(-) diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c index 33c47712d16f..229b9e4be3bd 100644 --- a/fs/overlayfs/dir.c +++ b/fs/overlayfs/dir.c @@ -596,21 +596,25 @@ static int ovl_remove_upper(struct dentry *dentry, bool is_dir) { struct dentry *upperdir = ovl_dentry_upper(dentry->d_parent); struct inode *dir = upperdir->d_inode; - struct dentry *upper = ovl_dentry_upper(dentry); + struct dentry *upper; int err; mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT); + upper = lookup_one_len(dentry->d_name.name, upperdir, + dentry->d_name.len); + err = PTR_ERR(upper); + if (IS_ERR(upper)) + goto out_unlock; + err = -ESTALE; - if (upper->d_parent == upperdir) { - /* Don't let d_delete() think it can reset d_inode */ - dget(upper); + if (upper == ovl_dentry_upper(dentry)) { if (is_dir) err = vfs_rmdir(dir, upper); else err = vfs_unlink(dir, upper, NULL); - dput(upper); 
ovl_dentry_version_inc(dentry->d_parent); } + dput(upper); /* * Keeping this dentry hashed would mean having to release @@ -619,6 +623,7 @@ static int ovl_remove_upper(struct dentry *dentry, bool is_dir) * now. */ d_drop(dentry); +out_unlock: mutex_unlock(&dir->i_mutex); return err; @@ -839,29 +844,39 @@ static int ovl_rename2(struct inode *olddir, struct dentry *old, trap = lock_rename(new_upperdir, old_upperdir); - olddentry = ovl_dentry_upper(old); - newdentry = ovl_dentry_upper(new); - if (newdentry) { + + olddentry = lookup_one_len(old->d_name.name, old_upperdir, + old->d_name.len); + err = PTR_ERR(olddentry); + if (IS_ERR(olddentry)) + goto out_unlock; + + err = -ESTALE; + if (olddentry != ovl_dentry_upper(old)) + goto out_dput_old; + + newdentry = lookup_one_len(new->d_name.name, new_upperdir, + new->d_name.len); + err = PTR_ERR(newdentry); + if (IS_ERR(newdentry)) + goto out_dput_old; + + err = -ESTALE; + if (ovl_dentry_upper(new)) { if (opaquedir) { - newdentry = opaquedir; - opaquedir = NULL; + if (newdentry != opaquedir) + goto out_dput; } else { - dget(newdentry); + if (newdentry != ovl_dentry_upper(new)) + goto out_dput; } } else { new_create = true; - newdentry = lookup_one_len(new->d_name.name, new_upperdir, - new->d_name.len); - err = PTR_ERR(newdentry); - if (IS_ERR(newdentry)) - goto out_unlock; + if (!d_is_negative(newdentry) && + (!new_opaque || !ovl_is_whiteout(newdentry))) + goto out_dput; } - err = -ESTALE; - if (olddentry->d_parent != old_upperdir) - goto out_dput; - if (newdentry->d_parent != new_upperdir) - goto out_dput; if (olddentry == trap) goto out_dput; if (newdentry == trap) @@ -917,6 +932,8 @@ static int ovl_rename2(struct inode *olddir, struct dentry *old, out_dput: dput(newdentry); +out_dput_old: + dput(olddentry); out_unlock:
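Both converted paths follow one pattern: under the parent lock, look the name up again and insist that the fresh result is the very dentry the overlay has cached; anything else means the tree changed behind our back and the operation fails with -ESTALE. A schematic version with hypothetical stand-ins (not the VFS API):

```c
#include <assert.h>
#include <stddef.h>

#define ESTALE 116

/* A toy dentry; only identity matters for the check. */
struct dentry_sim { const char *name; };

static struct dentry_sim cached_upper = { "file" };
static struct dentry_sim other_upper  = { "file" };

/* Stand-in for lookup_one_len(): whatever the name resolves to right now. */
static struct dentry_sim *lookup_now(struct dentry_sim *current_entry)
{
    return current_entry;
}

/*
 * Mirrors the verification added to ovl_remove_upper()/ovl_rename2():
 * under the parent lock, look the name up again and only proceed when the
 * fresh result is the very dentry the overlay cached.
 */
static int unlink_verified(struct dentry_sim *cached,
                           struct dentry_sim *current_entry)
{
    struct dentry_sim *upper = lookup_now(current_entry);

    if (upper != cached)
        return -ESTALE;     /* renamed or unhashed behind our back */
    return 0;               /* safe: vfs_unlink(upper) would run here */
}
```

Note the comparison is by dentry identity, not by name: two dentries with the same name (as above) still fail the check, which is exactly what the old upper->d_parent test missed.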
[Devel] [PATCH rh7] mm: memcontrol: fix race between user memory reparent and charge
When a memory cgroup is destroyed (via rmdir), user memory pages accounted to it get recharged to the parent cgroup - see mem_cgroup_css_offline() and mem_cgroup_reparent_charges(). If, for some reason, a page is left charged to the destroyed cgroup after mem_cgroup_reparent_charges() was done, we might get a use-after-free, because user memory charges do not hold a reference to the cgroup. And it seems to be possible in case reparenting races with __mem_cgroup_try_charge() as follows: __mem_cgroup_try_charge get memcg from mm, inc ref mem_cgroup_css_offline mem_cgroup_reparent_charges charge page to memcg put ref to memcg To fix this issue, let's make __mem_cgroup_try_charge() cancel the charge if it detects that the cgroup was destroyed. https://jira.sw.ru/browse/PSBM-49117 Signed-off-by: Vladimir Davydov --- mm/memcontrol.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8151d4259c6b..e3a16b99ccc6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -314,6 +314,7 @@ struct mem_cgroup { * Should the accounting and control be hierarchical, per subtree? */ bool use_hierarchy; + bool is_offline; unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */ bool oom_lock; @@ -2952,6 +2953,23 @@ again: } } while (ret != CHARGE_OK); + /* +* Cancel charge in case this cgroup was destroyed while we were here, +* otherwise we can get a pending user memory charge to an offline +* cgroup, which might result in use-after-free after the cgroup gets +* released (see also mem_cgroup_css_offline()). +* +* Note, no need to issue an explicit barrier here, because a +* successful charge implies full memory barrier.
+*/ + if (unlikely(memcg->is_offline)) { + res_counter_uncharge(&memcg->res, batch * PAGE_SIZE); + if (do_swap_account) + res_counter_uncharge(&memcg->memsw, batch * PAGE_SIZE); + css_put(&memcg->css); + goto bypass; + } + if (batch > nr_pages) refill_stock(memcg, batch - nr_pages); @@ -6657,6 +6675,15 @@ static void mem_cgroup_css_offline(struct cgroup *cont) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + /* +* Mark memory cgroup as offline before going to reparent charges. +* This guarantees that __mem_cgroup_try_charge() either charges before +* reparenting starts or doesn't charge at all, hence we won't have +* pending user memory charges after reparenting is done. +*/ + memcg->is_offline = true; + smp_mb(); + memcg_deactivate_kmem(memcg); mem_cgroup_invalidate_reclaim_iterators(memcg); -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] mm: memcontrol: carefully check for user charges while reparenting
kmem is uncharged before res, therefore when checking if there are still user charges in a memory cgroup, we should read res before kmem, otherwise a kmem uncharge can get in-between two reads, leading to false-positive res <= kmem. Add smp_rmb() to guarantee this never happens. Note, since x86 doesn't reorder reads, this patch doesn't actually introduce any functional changes - it just clarifies the code. Fixes: 35c0d2a992aa ("mm: memcontrol: fix race between kmem uncharge and charge reparenting") Signed-off-by: Vladimir Davydov --- mm/memcontrol.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1f525f27e481..8151d4259c6b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4814,7 +4814,7 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg, static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg) { int node, zid; - u64 usage; + u64 res, kmem; do { /* This is for making all *used* pages to be on LRU. */ @@ -4845,10 +4845,17 @@ static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg) * so the lru seemed empty but the page could have been added * right after the check. RES_USAGE should be safe as we always * charge before adding to the LRU. +* +* Note, we must read memcg->res strictly before memcg->kmem, +* because otherwise a kmem charge might get uncharged in +* between the two reads leading to res <= kmem, even though +* there are still user pages charged to this cgroup out there. +* (see also comment in memcg_charge_kmem()) */ - usage = res_counter_read_u64(&memcg->res, RES_USAGE) - - res_counter_read_u64(&memcg->kmem, RES_USAGE); - } while (usage > 0); + res = res_counter_read_u64(&memcg->res, RES_USAGE); + smp_rmb(); + kmem = res_counter_read_u64(&memcg->kmem, RES_USAGE); + } while (res > kmem); } /* -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default
On Wed, Jul 06, 2016 at 10:33:07AM -0700, Maxim Patlasov wrote: > On 07/06/2016 02:26 AM, Vladimir Davydov wrote: > > >On Tue, Jul 05, 2016 at 04:45:10PM -0700, Maxim Patlasov wrote: > >>Vova, > >> > >> > >>On 07/04/2016 11:03 AM, Maxim Patlasov wrote: > >>>On 07/04/2016 08:53 AM, Vladimir Davydov wrote: > >>> > >>>>On Tue, Jun 28, 2016 at 03:48:54PM -0700, Maxim Patlasov wrote: > >>>>... > >>>>>@@ -643,6 +643,7 @@ static struct cgroup_subsys_state > >>>>>*ve_create(struct cgroup *cg) > >>>>>ve->odirect_enable = 2; > >>>>> ve->fsync_enable = 2; > >>>>>+ve->experimental_fs_enable = 2; > >>>>For odirect_enable and fsync_enable, 2 means follow the host's config, 1 > >>>>means enable unconditionally, and 0 means disable unconditionally. But > >>>>we don't want to allow a user inside a CT to enable this feature, right? > >>>I thought it's OK to allow user inside CT to enable it if host sysadmin is > >>>OK about it. The same logic as for odirect: by default > >>>ve0->experimental_fs_enable = 0, so whatever user inside CT writes to this > >>>knob, the feature is disabled. If sysadmin writes '1' to ve0->..., the > >>>feature becomes enabled. If an user wants to voluntarily disable it inside > >>>CT, that's OK too. > >>> > >>>>This is confusing. May be, we'd better add a new VE_FEATURE for the > >>>>purpose? > >>>Not sure right now. I'll look at it and let you know later. > >>Technically, it is very easy to implement new VE_FEATURE for overlayfs. But > >>this approach is less flexible because we return EPERM from ve_write_64 if > >>CT is running, and we'll need to involve userspace team to make the feature > >>configurable and (possibly) persistent. Do you think it's worthy for > >>something we'll get rid of soon anyway (I mean as soon as PSBM-47981 > >>resolved)? > >Fair enough, not much point in introducing yet another feature for the > >purpose, at least right now, sysctl should do for the beginning. 
> > > >Come to think of it, do we really need this sysctl inside containers? I > >mean, by enabling this sysctl on the host we open a possible system-wide > >security hole, which a CT admin won't be able to mitigate by disabling > >overlayfs inside her CT. So what would she need it for? To prevent > >non-privileged CT users from mounting overlayfs inside a user ns? But > >overlayfs is not permitted to be mounted by a userns root anyway AFAICS. > >Maybe just drop the in-CT sysctl then? > > Currently, anyone who can login into CT as root may mount overlayfs, then > try to exploit its weak sides. This is a problem. > > Until we ensure that overlayfs is production-ready (at least does not have > obvious breaches), let's disable it by default (of course, if ve != ve0). > Those who want to play with overlayfs at their own risk will enable it by > turning on some knob on host system (ve == ve0). > > I don't think that mixing trusted (overlayfs-enabled) CTs and not trusted > (overlayfs-disabled) CTs on the same physical node is important use-case for > now. So, any simple system-wide knob must work. > Essentially, the same scheme > with odirect: by default it is '0' in ve0 and the root inside CT cannot turn > it on; and if it is manually set to '1' in ve0, the behavior will depend on > per-CT root willing. No, that's not how it works. AFAICS (see may_use_odirect),

ve0 sysctl   ve sysctl   odirect allowed in ve?
    x            0                 0
    x            1                 1
    x            2                 x

i.e. system-wide sysctl can't be used to disallow odirect inside a VE, while you want a different behavior AFAIU - you want to enable overlayfs if both ve0 sysctl and ve sysctl are set. That's why the patch looks confusing to me. Let's only leave system-wide sysctl for permitting overlayfs. VE sysctl doesn't make any sense - only root user is allowed to mount overlayfs inside a CT and she can set this sysctl anyway.
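The table reduces to: the per-VE value decides, with 2 meaning "inherit from ve0". A sketch of that decision (hypothetical helper mirroring the described may_use_odirect semantics, not the actual source):

```c
#include <assert.h>

/*
 * Per-VE sysctl semantics from the table: 0 = off, 1 = on,
 * 2 = inherit the host (ve0) setting. Hypothetical helper, not the
 * actual may_use_odirect() source.
 */
static int may_use(int ve0_val, int ve_val)
{
    if (ve_val == 2)
        return ve0_val != 0;
    return ve_val != 0;
}
```

The may_use(0, 1) row is Vladimir's point: once the CT root writes 1, the host setting no longer matters, so the host cannot use its sysctl to forbid the feature.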
[Devel] [PATCH] propogate_mnt: Handle the first propogated copy being a slave
From: "Eric W. Biederman" When the first propgated copy was a slave the following oops would result: > BUG: unable to handle kernel NULL pointer dereference at 0010 > IP: [] propagate_one+0xbe/0x1c0 > PGD bacd4067 PUD bac66067 PMD 0 > Oops: [#1] SMP > Modules linked in: > CPU: 1 PID: 824 Comm: mount Not tainted 4.6.0-rc5userns+ #1523 > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 > task: 8800bb0a8000 ti: 8800bac3c000 task.ti: 8800bac3c000 > RIP: 0010:[] [] propagate_one+0xbe/0x1c0 > RSP: 0018:8800bac3fd38 EFLAGS: 00010283 > RAX: RBX: 8800bb77ec00 RCX: 0010 > RDX: RSI: 8800bb58c000 RDI: 8800bb58c480 > RBP: 8800bac3fd48 R08: 0001 R09: > R10: 1ca1 R11: 1c9d R12: > R13: 8800ba713800 R14: 8800bac3fda0 R15: 8800bb77ec00 > FS: 7f3c0cd9b7e0() GS:8800bfb0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 0010 CR3: bb79d000 CR4: 06e0 > Stack: > 8800bb77ec00 8800bac3fd88 811fbf85 > 8800bac3fd98 8800bb77f080 8800ba713800 8800bb262b40 > 8800bac3fdd8 811f1da0 > Call Trace: > [] propagate_mnt+0x105/0x140 > [] attach_recursive_mnt+0x120/0x1e0 > [] graft_tree+0x63/0x70 > [] do_add_mount+0x9b/0x100 > [] do_mount+0x2aa/0xdf0 > [] ? strndup_user+0x4e/0x70 > [] SyS_mount+0x75/0xc0 > [] do_syscall_64+0x4b/0xa0 > [] entry_SYSCALL64_slow_path+0x25/0x25 > Code: 00 00 75 ec 48 89 0d 02 22 22 01 8b 89 10 01 00 00 48 89 05 fd 21 22 01 > 39 8e 10 01 00 00 0f 84 e0 00 00 00 48 8b 80 d8 00 00 00 <48> 8b 50 10 48 89 > 05 df 21 22 01 48 89 15 d0 21 22 01 8b 53 30 > RIP [] propagate_one+0xbe/0x1c0 > RSP > CR2: 0010 > ---[ end trace 2725ecd95164f217 ]--- This oops happens with the namespace_sem held and can be triggered by non-root users. An all around not pleasant experience. To avoid this scenario when finding the appropriate source mount to copy stop the walk up the mnt_master chain when the first source mount is encountered. Further rewrite the walk up the last_source mnt_master chain so that it is clear what is going on. 
The reason why the first source mount is special is that it it's mnt_parent is not a mount in the dest_mnt propagation tree, and as such termination conditions based up on the dest_mnt mount propgation tree do not make sense. To avoid other kinds of confusion last_dest is not changed when computing last_source. last_dest is only used once in propagate_one and that is above the point of the code being modified, so changing the global variable is meaningless and confusing. Cc: sta...@vger.kernel.org fixes: f2ebb3a921c1ca1e2ddd9242e95a1989a50c4c68 ("smarter propagate_mnt()") Reported-by: Tycho Andersen Reviewed-by: Seth Forshee Tested-by: Seth Forshee Signed-off-by: "Eric W. Biederman" (cherry picked from commit 5ec0811d30378ae104f250bfc9b3640242d81e3f) Signed-off-by: Vladimir Davydov Fixes: CVE-2016-4581 Conflicts: fs/pnode.c --- fs/pnode.c | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/fs/pnode.c b/fs/pnode.c index 74f10c0e7e00..cc9ac074ba00 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -198,7 +198,7 @@ static struct mount *next_group(struct mount *m, struct mount *origin) /* all accesses are serialized by namespace_sem */ static struct user_namespace *user_ns; -static struct mount *last_dest, *last_source, *dest_master; +static struct mount *last_dest, *first_source, *last_source, *dest_master; static struct mountpoint *mp; static struct list_head *list; @@ -216,22 +216,23 @@ static int propagate_one(struct mount *m) type = CL_MAKE_SHARED; } else { struct mount *n, *p; + bool done; for (n = m; ; n = p) { p = n->mnt_master; - if (p == dest_master || IS_MNT_MARKED(p)) { - while (last_dest->mnt_master != p) { - last_source = last_source->mnt_master; - last_dest = last_source->mnt_parent; - } - if (n->mnt_group_id != last_dest->mnt_group_id || - (!n->mnt_group_id && -!last_dest->mnt_group_id)) { - last_source = last_source->mnt_master; - last_dest = last_source->mnt_parent; - } + if (p ==
Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default
On Tue, Jul 05, 2016 at 04:45:10PM -0700, Maxim Patlasov wrote: > Vova, > > > On 07/04/2016 11:03 AM, Maxim Patlasov wrote: > >On 07/04/2016 08:53 AM, Vladimir Davydov wrote: > > > >>On Tue, Jun 28, 2016 at 03:48:54PM -0700, Maxim Patlasov wrote: > >>... > >>>@@ -643,6 +643,7 @@ static struct cgroup_subsys_state > >>>*ve_create(struct cgroup *cg) > >>>ve->odirect_enable = 2; > >>> ve->fsync_enable = 2; > >>>+ve->experimental_fs_enable = 2; > >>For odirect_enable and fsync_enable, 2 means follow the host's config, 1 > >>means enable unconditionally, and 0 means disable unconditionally. But > >>we don't want to allow a user inside a CT to enable this feature, right? > > > >I thought it's OK to allow user inside CT to enable it if host sysadmin is > >OK about it. The same logic as for odirect: by default > >ve0->experimental_fs_enable = 0, so whatever user inside CT writes to this > >knob, the feature is disabled. If sysadmin writes '1' to ve0->..., the > >feature becomes enabled. If an user wants to voluntarily disable it inside > >CT, that's OK too. > > > >>This is confusing. May be, we'd better add a new VE_FEATURE for the > >>purpose? > > > >Not sure right now. I'll look at it and let you know later. > > Technically, it is very easy to implement new VE_FEATURE for overlayfs. But > this approach is less flexible because we return EPERM from ve_write_64 if > CT is running, and we'll need to involve userspace team to make the feature > configurable and (possibly) persistent. Do you think it's worthy for > something we'll get rid of soon anyway (I mean as soon as PSBM-47981 > resolved)? Fair enough, not much point in introducing yet another feature for the purpose, at least right now, sysctl should do for the beginning. Come to think of it, do we really need this sysctl inside containers? I mean, by enabling this sysctl on the host we open a possible system-wide security hole, which a CT admin won't be able to mitigate by disabling overlayfs inside her CT. 
So what would she need it for? To prevent non-privileged CT users from mounting overlayfs inside a user ns? But overlayfs is not permitted to be mounted by a userns root anyway AFAICS. Maybe just drop the in-CT sysctl then?
[Devel] [PATCH rh7] dcache: fix dentry leak when shrink races with kill
dentry_kill() does not free a dentry in case it is on a shrink list - see __dentry_kill() -> d_free(). Instead it just marks it DCACHE_MAY_FREE, which will make the shrinker free it when it's done with it. This is required to avoid use after free in shrink_dentry_list(). This logic was back-ported by commit e33cae748d1a ("ms/dcache: dentry_kill(): don't try to remove from shrink list"). When back-porting this commit I accidentally missed a hunk for shrink_dentry_list(). The hunk makes shrink_dentry_list() more carefully check dentry->d_lockref.count, i.e. instead of merely checking if it's 0 or not, it makes it check if it's strictly greater than 0. Without this check a dentry might leak if shrink races with kill, because before trying to free a dentry, dentry_kill() first calls lockref_mark_dead(&dentry->d_lockref), which sets d_lockref.count to -128, so that shrink_dentry_list() will silently skip the dentry instead of freeing it. This patch resurrects the missing hunk. https://jira.sw.ru/browse/PSBM-49321 Fixes: e33cae748d1a ("ms/dcache: dentry_kill(): don't try to remove from shrink list") Signed-off-by: Vladimir Davydov --- fs/dcache.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/dcache.c b/fs/dcache.c index 09ed486c9f1d..6433814a02d2 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -874,7 +874,7 @@ static void shrink_dentry_list(struct list_head *list) * We found an inuse dentry which was not removed from * the LRU because of laziness during lookup. Do not free it. */ - if (dentry->d_lockref.count) { + if ((int)dentry->d_lockref.count > 0) { spin_unlock(&dentry->d_lock); if (parent) spin_unlock(&parent->d_lock); -- 2.1.4
Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default
On Tue, Jun 28, 2016 at 03:48:54PM -0700, Maxim Patlasov wrote: ... > @@ -643,6 +643,7 @@ static struct cgroup_subsys_state *ve_create(struct > cgroup *cg) > > ve->odirect_enable = 2; > ve->fsync_enable = 2; > + ve->experimental_fs_enable = 2; For odirect_enable and fsync_enable, 2 means follow the host's config, 1 means enable unconditionally, and 0 means disable unconditionally. But we don't want to allow a user inside a CT to enable this feature, right? This is confusing. Maybe we'd better add a new VE_FEATURE for the purpose? > > #ifdef CONFIG_VE_IPTABLES > ve->ipt_mask = ve_setup_iptables_mask(VE_IP_DEFAULT); > ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] config: disable numa balancing by default
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.22 --> commit 4699d7601d1d8c567c3f59dcaf4e3b1bde327710 Author: Vladimir Davydov Date: Mon Jul 4 18:11:21 2016 +0400 config: disable numa balancing by default It results in LAMP DVD store benchmark degradation. https://jira.sw.ru/browse/PSBM-49131 Signed-off-by: Vladimir Davydov --- configs/kernel-3.10.0-x86_64-debug.config | 2 ++ configs/kernel-3.10.0-x86_64.config | 2 ++ 2 files changed, 4 insertions(+) diff --git a/configs/kernel-3.10.0-x86_64-debug.config b/configs/kernel-3.10.0-x86_64-debug.config index 4142b41946ce..d65f0ea5ea17 100644 --- a/configs/kernel-3.10.0-x86_64-debug.config +++ b/configs/kernel-3.10.0-x86_64-debug.config @@ -5470,6 +5470,8 @@ CONFIG_QUOTA_COMPAT=y CONFIG_BLK_DEV_NBD=m +CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=n + # disabled in debug RHEL7 config by default CONFIG_PANIC_ON_OOPS=y CONFIG_PANIC_ON_OOPS_VALUE=1 diff --git a/configs/kernel-3.10.0-x86_64.config b/configs/kernel-3.10.0-x86_64.config index 3be34a6bcea5..53e103bf438c 100644 --- a/configs/kernel-3.10.0-x86_64.config +++ b/configs/kernel-3.10.0-x86_64.config @@ -5442,6 +5442,8 @@ CONFIG_QUOTA_COMPAT=y CONFIG_BLK_DEV_NBD=m +CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=n + # # OpenVZ # ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ploop: reloc vs extent_conversion race fix
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.22 --> commit bbc7ec7ef8492f736fa0cce85f4a609fb7cb80df Author: Dmitry Monakhov Date: Sun Jul 3 21:38:12 2016 +0400 ploop: reloc vs extent_conversion race fix We have fixed most relocation bugs while working on https://jira.sw.ru/browse/PSBM-47107 Currently reloc_a looks as follows: 1->read_data_from_old_pos 2->write_to_new_pos ->submit_alloc ->submit_pad ->post_submit->convert_unwritten 3->update_index ->write_page with FLUSH|FUA 4->nullify_old_pos 5->issue_flush But at step 3 the extent conversion is not yet stable because it belongs to an uncommitted transaction. We MUST call ->fsync inside ->post_submit as we do for REQ_FUA requests. Let's tag relocation requests as FUA from the very beginning in order to assert sync semantics. https://jira.sw.ru/browse/PSBM-49143 Signed-off-by: Dmitry Monakhov --- drivers/block/ploop/dev.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c index 40768b6ef2c5..e5f010b9aeba 100644 --- a/drivers/block/ploop/dev.c +++ b/drivers/block/ploop/dev.c @@ -4097,7 +4097,7 @@ static void ploop_relocate(struct ploop_device * plo) preq->bl.tail = preq->bl.head = NULL; preq->req_cluster = 0; preq->req_size = 0; - preq->req_rw = WRITE_SYNC; + preq->req_rw = WRITE_SYNC|REQ_FUA; preq->eng_state = PLOOP_E_ENTRY; preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A); preq->error = 0; @@ -4401,7 +4401,7 @@ static void ploop_relocblks_process(struct ploop_device *plo) preq->bl.tail = preq->bl.head = NULL; preq->req_cluster = ~0U; /* uninitialized */ preq->req_size = 0; - preq->req_rw = WRITE_SYNC; + preq->req_rw = WRITE_SYNC|REQ_FUA; preq->eng_state = PLOOP_E_ENTRY; preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S); preq->error = 0; ___ Devel mailing list Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 2/2] Revert "ve/vmscan: do not throttle kthreads due to too_many_isolated"
This reverts commit 5ce7561a6b0a517fcf4fbcd8a1b00dab0ddd4222. Not needed any longer as the previous patch fixed the issue in a different way. Signed-off-by: Vladimir Davydov --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 3ac08ddf50b8..06ff6972ef22 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1415,7 +1415,7 @@ static int __too_many_isolated(struct zone *zone, int file, static int too_many_isolated(struct zone *zone, int file, struct scan_control *sc) { - if (current->flags & PF_KTHREAD) + if (current_is_kswapd()) return 0; if (!global_reclaim(sc)) -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 1/2] mm: vmscan: never wait on writeback pages
Currently, if memcg reclaim encounters a page under writeback it waits for the writeback to finish. This is done in order to avoid hitting OOM when there are a lot of potentially reclaimable pages under writeback, as memcg lacks dirty pages limit. Although it saves us from premature OOM, this technique is deadlock prone if writeback is supposed to be done by a process that might need to allocate memory, like in case of vstorage. If the process responsible for writeback tries to allocate a page it might get stuck in too_many_isolated() loop waiting for processes performing memcg reclaim to put isolated pages back to the LRU, but memcg reclaim might be stuck waiting for writeback to complete, resulting in a deadlock. To avoid this kind of deadlock, let's, instead of waiting for page writeback directly, call congestion_wait() after returning isolated pages to the LRU in case writeback pages are recycled through the LRU before IO can complete. This should still prevent premature memcg OOM while rendering the deadlock described above impossible. https://jira.sw.ru/browse/PSBM-48115 Signed-off-by: Vladimir Davydov --- mm/vmscan.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 3f6ce18df3ed..3ac08ddf50b8 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -929,11 +929,11 @@ static unsigned long shrink_page_list(struct list_head *page_list, *__GFP_IO|__GFP_FS for this reason); but more thought *would probably show more reasons. * -* 3) memcg encounters a page that is not already marked +* 3) memcg encounters a page that is already marked *PageReclaim. memcg does not have any dirty pages *throttling so we could easily OOM just because too many *pages are in writeback and there is nothing else to -*reclaim. Wait for the writeback to complete. +*reclaim. Stall memcg reclaim then. */ if (PageWriteback(page)) { /* Case 1 above */ @@ -954,7 +954,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, * enough to care. 
What we do want is for this * page to have PageReclaim set next time memcg * reclaim reaches the tests above, so it will -* then wait_on_page_writeback() to avoid OOM; +* then stall to avoid OOM; * and it's also appropriate in global reclaim. */ SetPageReclaim(page); @@ -964,7 +964,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, /* Case 3 above */ } else { - wait_on_page_writeback(page); + nr_immediate++; + goto keep_locked; } } @@ -1586,10 +1587,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, if (nr_writeback && nr_writeback == nr_taken) zone_set_flag(zone, ZONE_WRITEBACK); - /* -* memcg will stall in page writeback so only consider forcibly -* stalling for global reclaim -*/ + if (!global_reclaim(sc) && nr_immediate) + congestion_wait(BLK_RW_ASYNC, HZ/10); + if (global_reclaim(sc)) { /* * Tag a zone as congested if all the dirty pages scanned were -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] ub: account memory overcommit failures in UB_PRIVVMPAGES.failcnt
On Mon, Jun 27, 2016 at 12:50:10PM +0300, Andrey Ryabinin wrote: > If an allocation fails due to memory overcommit, the failcounters don't change. > This contradicts userspace expectations. > With this patch, such failures will be accounted in the failcounter of > UB_PRIVVMPAGES. > > https://jira.sw.ru/browse/PSBM-48891 > > Signed-off-by: Andrey Ryabinin Reviewed-by: Vladimir Davydov ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] Resurrect proc fairsched files
They are still required by userspace, which checks for their presence. Leave them empty. https://jira.sw.ru/browse/PSBM-48824 Signed-off-by: Vladimir Davydov --- kernel/ve/veowner.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/kernel/ve/veowner.c b/kernel/ve/veowner.c index 86065072a9ca..757dde99ef0f 100644 --- a/kernel/ve/veowner.c +++ b/kernel/ve/veowner.c @@ -36,12 +36,38 @@ struct proc_dir_entry *proc_vz_dir; EXPORT_SYMBOL(proc_vz_dir); +static int proc_fairsched_open(struct inode *inode, struct file *file) +{ + return 0; +} + +static ssize_t proc_fairsched_read(struct file *file, char __user *buf, + size_t size, loff_t *ppos) +{ + return 0; +} + +static struct file_operations proc_fairsched_operations = { + .open = proc_fairsched_open, + .read = proc_fairsched_read, + .llseek = noop_llseek, +}; + static void prepare_proc(void) { proc_vz_dir = proc_mkdir_mode("vz", S_ISVTX | S_IRUGO | S_IXUGO, NULL); if (!proc_vz_dir) panic("Can't create /proc/vz dir\n"); + + /* Legacy files. They are not really needed and should be removed +* sooner or later, but leave the stubs for now as they may be required +* by userspace */ + proc_mkdir_mode("container", 0, proc_vz_dir); + proc_mkdir_mode("fairsched", 0, proc_vz_dir); + + proc_create("fairsched", S_ISVTX, NULL, &proc_fairsched_operations); + proc_create("fairsched2", S_ISVTX, NULL, &proc_fairsched_operations); } #endif -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] ve/fs: namespace -- Don't fail on permissions if @ve->devmnt_list is empty
On Thu, Jun 23, 2016 at 01:34:18PM +0300, Cyrill Gorcunov wrote: > In commit 7eeb5b4afa8db5a2f2e1e47ab6b84e55fc8c5661 I addressed > the first half of the problem, but I happened to work with a dirty copy > of libvzctl where the mount_opts cgroup had been c/r'ed manually, > so I missed the case where @devmnt_list is empty on restore > (just like it is in vanilla libvzctl). So fix the second half. > > https://jira.sw.ru/browse/PSBM-48188 > > Reported-by: Igor Sukhih > Signed-off-by: Cyrill Gorcunov > CC: Vladimir Davydov > CC: Konstantin Khorenko Reviewed-by: Vladimir Davydov ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 3/4] net: ipip: fix crash on newlink if VE_FEATURE_IPIP is disabled
In this case net_generic returns NULL. We must handle this gracefully. Signed-off-by: Vladimir Davydov --- net/ipv4/ipip.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c index 7842dcb2fd65..b1004fb7539c 100644 --- a/net/ipv4/ipip.c +++ b/net/ipv4/ipip.c @@ -357,6 +357,9 @@ static int ipip_newlink(struct net *src_net, struct net_device *dev, { struct ip_tunnel_parm p; + if (net_generic(dev_net(dev), ipip_net_id) == NULL) + return -EACCES; + ipip_netlink_parms(data, &p); return ip_tunnel_newlink(dev, tb, &p); } -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 4/4] net: sit: fix crash on newlink if VE_FEATURE_SIT is disabled
In this case net_generic returns NULL. We must handle this gracefully. Signed-off-by: Vladimir Davydov --- net/ipv6/sit.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index 2a73b520d3bf..6b1ae3b06be9 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -1441,6 +1441,9 @@ static int ipip6_newlink(struct net *src_net, struct net_device *dev, #endif int err; + if (net_generic(net, sit_net_id) == NULL) + return -EACCES; + nt = netdev_priv(dev); ipip6_netlink_parms(data, &nt->parms); -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 1/4] net: ipip: enable in container
Currently, we fail to init ipip per-net in a ve, because it has neither NETIF_F_VIRTUAL nor NETIF_F_NETNS_LOCAL: ipip_init_net ip_tunnel_init_net __ip_tunnel_create register_netdevice ve_is_dev_movable In PCS6 ipip has NETIF_F_NETNS_LOCAL, so everything works fine there, but this restriction was removed in RH7 kernel, so we fail to start a container if ipip is loaded (or load ipip if there are containers running). Mark ipip as NETIF_F_VIRTUAL to fix this issue. https://jira.sw.ru/browse/PSBM-48608 Signed-off-by: Vladimir Davydov --- net/ipv4/ipip.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c index e556a1df5a57..7842dcb2fd65 100644 --- a/net/ipv4/ipip.c +++ b/net/ipv4/ipip.c @@ -301,6 +301,7 @@ static void ipip_tunnel_setup(struct net_device *dev) netif_keep_dst(dev); dev->features |= IPIP_FEATURES; + dev->features |= NETIF_F_VIRTUAL; dev->hw_features|= IPIP_FEATURES; ip_tunnel_setup(dev, ipip_net_id); } -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 2/4] net: ip_vti: skip per net init in ve
ip_vti devices lack NETIF_F_VIRTUAL, so they can't be created inside a container. The problem is that a device of this kind is created on netns init if the module is loaded; as a result, container start fails with EPERM. We could allow ip_vti inside a container (as well as other net devices, which I would really like to do), but this is insecure and might break migration, so let's keep it disabled and fix the issue by silently skipping ip_vti per net init if running inside a ve. https://jira.sw.ru/browse/PSBM-48698 Signed-off-by: Vladimir Davydov --- net/ipv4/ip_vti.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c index ce80a9a1be9d..3158100646ed 100644 --- a/net/ipv4/ip_vti.c +++ b/net/ipv4/ip_vti.c @@ -58,6 +58,9 @@ static int vti_input(struct sk_buff *skb, int nexthdr, __be32 spi, struct net *net = dev_net(skb->dev); struct ip_tunnel_net *itn = net_generic(net, vti_net_id); + if (itn == NULL) + return -EINVAL; + tunnel = ip_tunnel_lookup(itn, skb->dev->ifindex, TUNNEL_NO_KEY, iph->saddr, iph->daddr, 0); if (tunnel != NULL) { @@ -256,6 +259,9 @@ static int vti4_err(struct sk_buff *skb, u32 info) int protocol = iph->protocol; struct ip_tunnel_net *itn = net_generic(net, vti_net_id); + if (itn == NULL) + return -1; + tunnel = ip_tunnel_lookup(itn, skb->dev->ifindex, TUNNEL_NO_KEY, iph->daddr, iph->saddr, 0); if (!tunnel) @@ -413,6 +419,9 @@ static int __net_init vti_init_net(struct net *net) int err; struct ip_tunnel_net *itn; + if (!ve_is_super(net->owner_ve)) + return net_assign_generic(net, vti_net_id, NULL); + err = ip_tunnel_init_net(net, vti_net_id, &vti_link_ops, "ip_vti0"); if (err) return err; @@ -424,6 +433,9 @@ static void __net_exit vti_exit_net(struct net *net) { struct ip_tunnel_net *itn = net_generic(net, vti_net_id); + + if (itn == NULL) + return; ip_tunnel_delete_net(itn, &vti_link_ops); } @@ -473,6 +485,9 @@ static int vti_newlink(struct net *src_net, struct
net_device *dev, { struct ip_tunnel_parm parms; + if (net_generic(dev_net(dev), vti_net_id) == NULL) + return -EACCES; + vti_netlink_parms(data, &parms); return ip_tunnel_newlink(dev, tb, &parms); } -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] ve/cpustat: don't try to update vcpustats for root_task_group
On Wed, Jun 22, 2016 at 03:59:05PM +0300, Andrey Ryabinin wrote: > root_task_group doesn't have vcpu stats. An attempt to update those leads > to a NULL-ptr deref: > > BUG: unable to handle kernel NULL pointer dereference at > (null) > IP: [] cpu_cgroup_update_vcpustat+0x13c/0x620 > ... > Call Trace: >[] cpu_cgroup_get_stat+0x7b/0x180 >[] ve_get_cpu_stat+0x27/0x70 >[] fill_cpu_stat+0x91/0x1e0 [vzmon] >[] vzcalls_ioctl+0x2bb/0x430 [vzmon] >[] vzctl_ioctl+0x45/0x60 [vzdev] >[] do_vfs_ioctl+0x255/0x4f0 >[] SyS_ioctl+0x54/0xa0 >[] system_call_fastpath+0x16/0x1b > > So, return -ENOENT if we are asked for vcpu stats of root_task_group. > > https://jira.sw.ru/browse/PSBM-48721 > > Signed-off-by: Andrey Ryabinin Reviewed-by: Vladimir Davydov ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 2/2] cgroup: un-export cgroup_kernel_* and zap cgroup_kernel_remove
After fairsched's gone, cgroup_kernel_remove is not used any more, so drop it. cgroup_kernel_* family of functions are now used only by beancounters, which is a part of the kernel, so un-export them. Signed-off-by: Vladimir Davydov --- include/linux/cgroup.h | 1 - kernel/cgroup.c| 26 -- 2 files changed, 27 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 730ca9091bfb..b34239dcdb52 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -55,7 +55,6 @@ struct cgroup *cgroup_kernel_lookup(struct vfsmount *mnt, const char *pathname); struct cgroup *cgroup_kernel_open(struct cgroup *parent, enum cgroup_open_flags flags, const char *name); -int cgroup_kernel_remove(struct cgroup *parent, const char *name); int cgroup_kernel_attach(struct cgroup *cgrp, struct task_struct *tsk); void cgroup_kernel_close(struct cgroup *cgrp); diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 581924e7af9e..1c047b9bb1fb 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -5669,13 +5669,11 @@ struct vfsmount *cgroup_kernel_mount(struct cgroup_sb_opts *opts) { return kern_mount_data(&cgroup_fs_type, opts); } -EXPORT_SYMBOL(cgroup_kernel_mount); struct cgroup *cgroup_get_root(struct vfsmount *mnt) { return mnt->mnt_root->d_fsdata; } -EXPORT_SYMBOL(cgroup_get_root); struct cgroup *cgroup_kernel_lookup(struct vfsmount *mnt, const char *pathname) @@ -5698,7 +5696,6 @@ struct cgroup *cgroup_kernel_lookup(struct vfsmount *mnt, path_put(&path); return cgrp; } -EXPORT_SYMBOL(cgroup_kernel_lookup); struct cgroup *cgroup_kernel_open(struct cgroup *parent, enum cgroup_open_flags flags, const char *name) @@ -5729,27 +5726,6 @@ out: mutex_unlock(&parent->dentry->d_inode->i_mutex); return cgrp; } -EXPORT_SYMBOL(cgroup_kernel_open); - -int cgroup_kernel_remove(struct cgroup *parent, const char *name) -{ - struct dentry *dentry; - int ret; - - mutex_lock_nested(&parent->dentry->d_inode->i_mutex, I_MUTEX_PARENT); - dentry = lookup_one_len(name, parent->dentry, 
strlen(name)); - ret = PTR_ERR(dentry); - if (IS_ERR(dentry)) - goto out; - ret = -ENOENT; - if (dentry->d_inode) - ret = vfs_rmdir(parent->dentry->d_inode, dentry); - dput(dentry); -out: - mutex_unlock(&parent->dentry->d_inode->i_mutex); - return ret; -} -EXPORT_SYMBOL(cgroup_kernel_remove); int cgroup_kernel_attach(struct cgroup *cgrp, struct task_struct *tsk) { @@ -5761,7 +5737,6 @@ int cgroup_kernel_attach(struct cgroup *cgrp, struct task_struct *tsk) mutex_unlock(&cgroup_mutex); return ret; } -EXPORT_SYMBOL(cgroup_kernel_attach); void cgroup_kernel_close(struct cgroup *cgrp) { @@ -5770,4 +5745,3 @@ void cgroup_kernel_close(struct cgroup *cgrp) check_for_release(cgrp); } } -EXPORT_SYMBOL(cgroup_kernel_close); -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] Remove container and beancounter directories from /proc/vz
In PCS6, cgroups were mounted there. Now they are unused, as all cgroups are supposed to be mounted by systemd under /sys/fs/cgroup. Signed-off-by: Vladimir Davydov --- kernel/bc/proc.c| 1 - kernel/ve/veowner.c | 1 - 2 files changed, 2 deletions(-) diff --git a/kernel/bc/proc.c b/kernel/bc/proc.c index 3a3b4e3f28c8..9f60d9991e0a 100644 --- a/kernel/bc/proc.c +++ b/kernel/bc/proc.c @@ -754,7 +754,6 @@ static int __init ub_init_proc(void) entry = proc_create("user_beancounters", S_IRUSR|S_ISVTX, NULL, &ub_file_operations); proc_create("vswap", S_IRUSR, proc_vz_dir, &ub_vswap_fops); - proc_mkdir_mode("beancounter", 0, proc_vz_dir); return 0; } diff --git a/kernel/ve/veowner.c b/kernel/ve/veowner.c index 86065072a9ca..7642191bf517 100644 --- a/kernel/ve/veowner.c +++ b/kernel/ve/veowner.c @@ -41,7 +41,6 @@ static void prepare_proc(void) proc_vz_dir = proc_mkdir_mode("vz", S_ISVTX | S_IRUGO | S_IXUGO, NULL); if (!proc_vz_dir) panic("Can't create /proc/vz dir\n"); - proc_mkdir_mode("container", 0, proc_vz_dir); } #endif -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 1/2] ve: drop ve_cgroup_open and ve_cgroup_remove
Fairsched was the last user of these functions. After it's gone, we don't need them any longer. Signed-off-by: Vladimir Davydov --- include/linux/ve_proto.h | 2 -- kernel/ve/ve.c | 21 - 2 files changed, 23 deletions(-) diff --git a/include/linux/ve_proto.h b/include/linux/ve_proto.h index 8cc7fe3ba2a3..d2dc12d2f2c2 100644 --- a/include/linux/ve_proto.h +++ b/include/linux/ve_proto.h @@ -50,8 +50,6 @@ extern struct list_head ve_list_head; #define for_each_ve(ve)list_for_each_entry((ve), &ve_list_head, ve_list) extern struct mutex ve_list_lock; extern struct ve_struct *get_ve_by_id(envid_t); -extern struct cgroup *ve_cgroup_open(struct cgroup *root, int flags, envid_t veid); -extern int ve_cgroup_remove(struct cgroup *root, envid_t veid); extern int nr_threads_ve(struct ve_struct *ve); diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c index 2459cb53a665..9995dbcd1623 100644 --- a/kernel/ve/ve.c +++ b/kernel/ve/ve.c @@ -156,27 +156,6 @@ const char *ve_name(struct ve_struct *ve) } EXPORT_SYMBOL(ve_name); -/* Cgroup must be closed with cgroup_kernel_close */ -struct cgroup *ve_cgroup_open(struct cgroup *root, int flags, envid_t veid) -{ - char name[16]; - struct cgroup *cgrp; - - snprintf(name, sizeof(name), "%u", veid); - cgrp = cgroup_kernel_open(root, flags, name); - return cgrp ? cgrp : ERR_PTR(-ENOENT); -} -EXPORT_SYMBOL(ve_cgroup_open); - -int ve_cgroup_remove(struct cgroup *root, envid_t veid) -{ - char name[16]; - - snprintf(name, sizeof(name), "%u", veid); - return cgroup_kernel_remove(root, name); -} -EXPORT_SYMBOL(ve_cgroup_remove); - /* under rcu_read_lock if task != current */ const char *task_ve_name(struct task_struct *task) { -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] Drop CAP_VE_ADMIN and CAP_VE_NET_ADMIN
Not needed anymore as we use user ns for capability checking. Also, move capable_setveid() helper to ve.h so as not to pollute generic headers. Signed-off-by: Vladimir Davydov --- include/linux/ve.h | 3 +++ include/uapi/linux/capability.h | 55 - 2 files changed, 3 insertions(+), 55 deletions(-) diff --git a/include/linux/ve.h b/include/linux/ve.h index cea3a87cb9c0..247cadb78c06 100644 --- a/include/linux/ve.h +++ b/include/linux/ve.h @@ -138,6 +138,9 @@ struct ve_devmnt { #define VE_MEMINFO_DEFAULT 1 /* default behaviour */ #define VE_MEMINFO_SYSTEM 0 /* disable meminfo virtualization */ +#define capable_setveid() \ + (ve_is_super(get_exec_env()) && capable(CAP_SYS_ADMIN)) + extern int nr_ve; extern struct proc_dir_entry *proc_vz_dir; extern struct cgroup_subsys ve_subsys; diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h index cadbfe6109e8..b3d37bb108b8 100644 --- a/include/uapi/linux/capability.h +++ b/include/uapi/linux/capability.h @@ -307,61 +307,6 @@ struct vfs_cap_data { #define CAP_SETFCAP 31 -#ifdef __KERNEL__ -/* - * Important note: VZ capabilities do intersect with CAP_AUDIT - * this is due to compatibility reasons. Nothing bad. - * Both VZ and Audit/SELinux caps are disabled in VPSs. - */ - -/* Allow access to all information. In the other case some structures will be - * hiding to ensure different Virtual Environment non-interaction on the same - * node (NOW OBSOLETED) - */ -#define CAP_SETVEID 29 - -#define capable_setveid() ({ \ - ve_is_super(get_exec_env()) && \ - (capable(CAP_SYS_ADMIN) || \ -capable(CAP_VE_ADMIN));\ - }) - -/* - * coinsides with CAP_AUDIT_CONTROL but we don't care, since - * audit is disabled in Virtuozzo - */ -#define CAP_VE_ADMIN30 - -#ifdef CONFIG_VE - -/* Replacement for CAP_NET_ADMIN: - delegated rights to the Virtual environment of its network administration. 
- For now the following rights have been delegated: - - Allow setting arbitrary process / process group ownership on sockets - Allow interface configuration - */ -#define CAP_VE_NET_ADMIN CAP_VE_ADMIN - -/* Replacement for CAP_SYS_ADMIN: - delegated rights to the Virtual environment of its administration. - For now the following rights have been delegated: - */ -/* Allow mount/umount/remount */ -/* Allow examination and configuration of disk quotas */ -/* Allow removing semaphores */ -/* Used instead of CAP_CHOWN to "chown" IPC message queues, semaphores - and shared memory */ -/* Allow locking/unlocking of shared memory segment */ -/* Allow forged pids on socket credentials passing */ - -#define CAP_VE_SYS_ADMIN CAP_VE_ADMIN -#else -#define CAP_VE_NET_ADMIN CAP_NET_ADMIN -#define CAP_VE_SYS_ADMIN CAP_SYS_ADMIN -#endif -#endif - /* Override MAC access. The base kernel enforces no MAC policy. An LSM may enforce a MAC policy, and if it does and it chooses -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] mm: memcontrol: reclaim when shrinking memory.high below usage
From: Johannes Weiner When setting memory.high below usage, nothing happens until the next charge comes along, and then it will only reclaim its own charge and not the now potentially huge excess of the new memory.high. This can cause groups to stay in excess of their memory.high indefinitely. To fix that, when shrinking memory.high, kick off a reclaim cycle that goes after the delta. https://jira.sw.ru/browse/PSBM-48546 Signed-off-by: Johannes Weiner Acked-by: Michal Hocko Cc: Vladimir Davydov Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit 588083bb37a3cea8533c392370a554417c8f29cb) Signed-off-by: Vladimir Davydov Conflicts: mm/memcontrol.c --- mm/memcontrol.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index de7c36295515..1f525f27e481 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5314,7 +5314,7 @@ static int mem_cgroup_high_write(struct cgroup *cont, struct cftype *cft, const char *buffer) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); - unsigned long long val; + unsigned long long val, usage; int ret; ret = res_counter_memparse_write_strategy(buffer, &val); @@ -5322,6 +5322,12 @@ static int mem_cgroup_high_write(struct cgroup *cont, struct cftype *cft, return ret; memcg->high = val; + + usage = res_counter_read_u64(&memcg->res, RES_USAGE); + if (usage > val) + try_to_free_mem_cgroup_pages(memcg, +(usage - val) >> PAGE_SHIFT, +GFP_KERNEL, false); return 0; } -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] cgroup: fix path mangling for ve cgroups
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.16 --> commit 79fa6ee2446a3efe9791378cf9b582bbee0ef7ec Author: Vladimir Davydov Date: Mon Jun 20 21:07:58 2016 +0400 cgroup: fix path mangling for ve cgroups Presently, we just cut the first component off the cgroup path when inside a VE, because all VE cgroups are located at the top level of the cgroup hierarchy. However, this is going to change - the cgroups are going to move to machine.slice - so we should introduce a more generic way of mangling cgroup paths. This patch does the trick. On VE start it marks all cgroups the init task of the VE resides in with a special flag (CGRP_VE_ROOT). Cgroups marked this way will be treated as root if looked at from inside a VE. As long as we don't have nested VEs, this should work fine. Note, we don't need to clear these flags on VE destruction, because vzctl always creates new cgroups on VE start.
https://jira.sw.ru/browse/PSBM-48629 Signed-off-by: Vladimir Davydov --- include/linux/cgroup.h | 3 +++ kernel/cgroup.c| 27 --- kernel/ve/ve.c | 4 3 files changed, 27 insertions(+), 7 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index aad06e8e0258..730ca9091bfb 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -175,6 +175,9 @@ enum { CGRP_CPUSET_CLONE_CHILDREN, /* see the comment above CGRP_ROOT_SANE_BEHAVIOR for details */ CGRP_SANE_BEHAVIOR, + + /* The cgroup is root in a VE */ + CGRP_VE_ROOT, }; struct cgroup_name { diff --git a/kernel/cgroup.c b/kernel/cgroup.c index dd548853e2eb..581924e7af9e 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1791,6 +1791,21 @@ static struct file_system_type cgroup_fs_type = { static struct kobject *cgroup_kobj; +#ifdef CONFIG_VE +void cgroup_mark_ve_root(struct ve_struct *ve) +{ + struct cgroup *cgrp; + struct cgroupfs_root *root; + + mutex_lock(&cgroup_mutex); + for_each_active_root(root) { + cgrp = task_cgroup_from_root(ve->init_task, root); + set_bit(CGRP_VE_ROOT, &cgrp->flags); + } + mutex_unlock(&cgroup_mutex); +} +#endif + /** * cgroup_path - generate the path of a cgroup * @cgrp: the cgroup in question @@ -1804,7 +1819,8 @@ static struct kobject *cgroup_kobj; * inode's i_mutex, while on the other hand cgroup_path() can be called * with some irq-safe spinlocks held. */ -int __cgroup_path(const struct cgroup *cgrp, char *buf, int buflen, bool virt) +static int __cgroup_path(const struct cgroup *cgrp, char *buf, int buflen, +bool virt) { int ret = -ENAMETOOLONG; char *start; @@ -1824,14 +1840,11 @@ int __cgroup_path(const struct cgroup *cgrp, char *buf, int buflen, bool virt) int len; #ifdef CONFIG_VE - if (virt && cgrp->parent && !cgrp->parent->parent) { + if (virt && test_bit(CGRP_VE_ROOT, &cgrp->flags)) { /* * Containers cgroups are bind-mounted from node * so they are like '/' from inside, thus we have -* to mangle cgroup path output. 
Effectively it is -* enough to remove two topmost cgroups from path. -* e.g. in ct 101: /101/test.slice/test.scope -> -* /test.slice/test.scope +* to mangle cgroup path output. */ if (*start != '/') { if (--start < buf) @@ -2391,7 +2404,7 @@ static ssize_t cgroup_file_write(struct file *file, const char __user *buf, * inside a container FS. */ if (!ve_is_super(get_exec_env()) - && (!cgrp->parent || !cgrp->parent->parent) + && test_bit(CGRP_VE_ROOT, &cgrp->flags) && !get_exec_env()->is_pseudosuper && !(cft->flags & CFTYPE_VE_WRITABLE)) return -EPERM; diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c index 9904a4ae130e..2459cb53a665 100644 --- a/kernel/ve/ve.c +++ b/kernel/ve/ve.c @@ -452,6 +452,8 @@ static void ve_drop_context(struct ve_struct *ve) static const struct timespec zero_time = { }; +extern void cgroup_mark_ve_root(struct ve_struct *ve); + /* under ve->op_sem write-lock */ static int ve_start_container(struct ve_struct *ve) { @@ -499,6 +501,8 @@ static int ve_start_container(struct ve_struct *ve) if (err < 0) goto err_iterate; + cgroup_mark_ve_root(ve); +
[Devel] [PATCH RHEL7 COMMIT] Drop vz_compat boot param
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.16 --> commit f8b72e7837625c7de569fefcf3bba05ac2ef6b5e Author: Vladimir Davydov Date: Mon Jun 20 21:01:36 2016 +0400 Drop vz_compat boot param It was introduced by commit d7b23ae8a314f ("ve/cgroups: use cgroup subsystem names only if in vz compat mode") in order to provide a way of running pcs6 environment along with vz7 kernel. Turned out, this is not needed, so drop the option altogether. Signed-off-by: Vladimir Davydov --- include/linux/ve.h | 4 kernel/bc/beancounter.c | 2 -- kernel/fairsched.c | 1 - kernel/ve/ve.c | 10 -- kernel/ve/vecalls.c | 1 - 5 files changed, 18 deletions(-) diff --git a/include/linux/ve.h b/include/linux/ve.h index 813f16d5e825..182a63899a0b 100644 --- a/include/linux/ve.h +++ b/include/linux/ve.h @@ -153,8 +153,6 @@ extern __u64 ve_setup_iptables_mask(__u64 init_mask); #ifdef CONFIG_VE #define ve_uevent_seqnum (get_exec_env()->_uevent_seqnum) -extern int vz_compat; - extern struct kobj_ns_type_operations ve_ns_type_operations; extern struct kobject * kobject_create_and_add_ve(const char *name, struct kobject *parent); @@ -247,8 +245,6 @@ static inline void ve_mount_nr_dec(void) #define ve_uevent_seqnum uevent_seqnum -#define vz_compat (0) - static inline int vz_security_family_check(struct net *net, int family) { return 0; } static inline int vz_security_protocol_check(struct net *net, int protocol) { return 0; } diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c index f8a397269152..d35ddb3499d4 100644 --- a/kernel/bc/beancounter.c +++ b/kernel/bc/beancounter.c @@ -33,7 +33,6 @@ #include #include #include -#include #include #include @@ -1179,7 +1178,6 @@ void __init ub_init_late(void) int __init ub_init_cgroup(void) { struct cgroup_sb_opts blkio_opts = { - .name = vz_compat ? 
"beancounter" : NULL, .subsys_mask= (1ul << blkio_subsys_id), }; struct cgroup_sb_opts mem_opts = { diff --git a/kernel/fairsched.c b/kernel/fairsched.c index 959c19f4d7fc..e015cff87a97 100644 --- a/kernel/fairsched.c +++ b/kernel/fairsched.c @@ -796,7 +796,6 @@ int __init fairsched_init(void) { struct vfsmount *cpu_mnt, *cpuset_mnt; struct cgroup_sb_opts cpu_opts = { - .name = vz_compat ? "fairsched" : NULL, .subsys_mask= (1ul << cpu_cgroup_subsys_id) | (1ul << cpuacct_subsys_id), diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c index 22df66e1b257..d811d4818fa6 100644 --- a/kernel/ve/ve.c +++ b/kernel/ve/ve.c @@ -87,18 +87,8 @@ DEFINE_MUTEX(ve_list_lock); int nr_ve = 1; /* One VE always exists. Compatibility with vestat */ EXPORT_SYMBOL(nr_ve); -int vz_compat; -EXPORT_SYMBOL(vz_compat); - static DEFINE_IDR(ve_idr); -static int __init vz_compat_setup(char *arg) -{ - get_option(&arg, &vz_compat); - return 0; -} -early_param("vz_compat", vz_compat_setup); - struct ve_struct *get_ve(struct ve_struct *ve) { if (ve) diff --git a/kernel/ve/vecalls.c b/kernel/ve/vecalls.c index 5aa9722d692d..2b8b27998f07 100644 --- a/kernel/ve/vecalls.c +++ b/kernel/ve/vecalls.c @@ -309,7 +309,6 @@ static struct vfsmount *ve_cgroup_mnt, *devices_cgroup_mnt; static int __init init_vecalls_cgroups(void) { struct cgroup_sb_opts devices_opts = { - .name = vz_compat ? "container" : NULL, .subsys_mask= (1ul << devices_subsys_id), }; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] timers should not get negative argument
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.16 --> commit 3788c76811b2b04318c3f4b240f1e83245ad15e5 Author: Vasily Averin Date: Mon Jun 20 20:58:56 2016 +0400 timers should not get negative argument This patch fixes a 25-second delay on login into systemd-based containers. A userspace application can set a timer for a time in the past and expect the timer to expire immediately. This may not work as expected inside migrated containers: the translated argument provided to the timer can become negative, and the corresponding timer will then sleep for a very long time. https://jira.sw.ru/browse/PSBM-48475 CC: Vladimir Davydov CC: Konstantin Khorenko Signed-off-by: Vasily Averin Acked-by: Cyrill Gorcunov --- kernel/posix-timers.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c index b98cfe429d9b..8ebf01827ee6 100644 --- a/kernel/posix-timers.c +++ b/kernel/posix-timers.c @@ -133,6 +133,8 @@ static struct k_clock posix_clocks[MAX_CLOCKS]; (which_clock) == CLOCK_MONOTONIC_COARSE) #ifdef CONFIG_VE +static struct timespec zero_time; + void monotonic_abs_to_ve(clockid_t which_clock, struct timespec *tp) { struct ve_struct *ve = get_exec_env(); @@ -151,6 +153,10 @@ void monotonic_ve_to_abs(clockid_t which_clock, struct timespec *tp) set_normalized_timespec(tp, tp->tv_sec + ve->start_timespec.tv_sec, tp->tv_nsec + ve->start_timespec.tv_nsec); + if (timespec_compare(tp, &zero_time) <= 0) { + tp->tv_sec = 0; + tp->tv_nsec = 1; + } } #endif
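The clamp added by this patch can be modeled in a few lines of userspace C. This is a minimal sketch, not the kernel code: normalize_ts() stands in for the kernel's set_normalized_timespec(), and ve_to_abs() for monotonic_ve_to_abs(); both names are illustrative.

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Stand-in for set_normalized_timespec(): bring tv_nsec into
 * [0, NSEC_PER_SEC), carrying into tv_sec as needed. */
static void normalize_ts(struct timespec *ts)
{
	while (ts->tv_nsec >= NSEC_PER_SEC) {
		ts->tv_sec++;
		ts->tv_nsec -= NSEC_PER_SEC;
	}
	while (ts->tv_nsec < 0) {
		ts->tv_sec--;
		ts->tv_nsec += NSEC_PER_SEC;
	}
}

/* Translate a container-relative expiry to an absolute one. For a
 * migrated container, ve_start can be negative, so the sum may land
 * in the past; the fix clamps it to the smallest positive timespec
 * so the timer fires immediately instead of sleeping for ages. */
static void ve_to_abs(struct timespec *tp, const struct timespec *ve_start)
{
	tp->tv_sec += ve_start->tv_sec;
	tp->tv_nsec += ve_start->tv_nsec;
	normalize_ts(tp);
	if (tp->tv_sec < 0 || (tp->tv_sec == 0 && tp->tv_nsec <= 0)) {
		tp->tv_sec = 0;
		tp->tv_nsec = 1;
	}
}
```

For a container migrated to a host with less uptime, ve_start can be negative; a timer armed for VE time {5, 0} with ve_start {-100, 0} would otherwise translate to {-95, 0} and sleep for a very long time, but with the clamp it becomes {0, 1} and fires at once.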
[Devel] [PATCH rh7 0/6] Support containers in machine.slice
The following problems have to be solved if we want to move containers to machine.slice: - CPU stats reporting. Currently, we just open cgroup by name when we need stats corresponding to a VE. This is addressed by patch 3. - setdevperms ioctl. The same problem as in case 1. Addressed by patch 3 as well. - cgroup path mangling (/proc/self/cgroup, mountinfo). This is fixed by patches 5 and 6. With containers moved to machine.slice fairsched syscalls and VZCTL_ENV_CREATE ioctl get broken and can't be easily fixed, so we just drop them (patches 1, 2, 4). This should be fine, because libvzctl switched to the cgroup interface long ago. https://jira.sw.ru/browse/PSBM-48629 Vladimir Davydov (6): Drop vz_compat boot param Drop VZCTL_ENV_CREATE Use ve init task's css instead of opening cgroup via vfs Drop fairsched syscalls cgroup: use cgroup_path_ve helper in cgroup_show_path cgroup: fix path mangling for ve cgroups arch/powerpc/include/asm/systbl.h | 16 +- arch/powerpc/include/uapi/asm/unistd.h| 8 - arch/x86/syscalls/syscall_32.tbl | 9 - arch/x86/syscalls/syscall_64.tbl | 8 - configs/kernel-3.10.0-x86_64-debug.config | 1 - configs/kernel-3.10.0-x86_64.config | 1 - fs/proc/loadavg.c | 3 +- fs/proc/stat.c| 3 +- fs/proc/uptime.c | 15 +- include/linux/cgroup.h| 3 + include/linux/cpuset.h| 5 - include/linux/device_cgroup.h | 6 +- include/linux/fairsched.h | 88 include/linux/sched.h | 21 - include/linux/ve.h| 30 +- include/linux/ve_proto.h | 4 - include/uapi/linux/Kbuild | 1 - include/uapi/linux/fairsched.h| 8 - init/Kconfig | 20 +- kernel/Makefile | 1 - kernel/bc/beancounter.c | 2 - kernel/cgroup.c | 66 ++- kernel/cpuset.c | 26 - kernel/fairsched.c| 829 -- kernel/sched/core.c | 69 +-- kernel/sched/cpuacct.h| 2 + kernel/sys_ni.c | 10 - kernel/ve/ve.c| 104 +++- kernel/ve/vecalls.c | 505 +- security/device_cgroup.c | 65 +-- 30 files changed, 191 insertions(+), 1738 deletions(-) delete mode 100644 include/linux/fairsched.h delete mode 100644 include/uapi/linux/fairsched.h delete mode
100644 kernel/fairsched.c -- 2.1.4
[Devel] [PATCH RHEL7 COMMIT] Use ve init task's css instead of opening cgroup via vfs
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.16 --> commit 083ecd8a5051975639669e3349a17e07d299c299 Author: Vladimir Davydov Date: Mon Jun 20 19:40:13 2016 +0300 Use ve init task's css instead of opening cgroup via vfs Currently, whenever we need to get the cpu or devices cgroup corresponding to a ve, we open it using cgroup_kernel_open(). This is inflexible, because it relies on all container cgroups living at a fixed location (the top level) that can never change. Since we want to move container cgroups to machine.slice, we need to rework this. This patch does the trick. It makes each ve remember its init task at container start and use the css corresponding to the init task whenever we need the corresponding cgroup. Note that after this patch is applied, we don't need to mount the cpu and devices cgroups in the kernel. https://jira.sw.ru/browse/PSBM-48629 Signed-off-by: Vladimir Davydov --- fs/proc/loadavg.c | 3 +- fs/proc/stat.c| 3 +- fs/proc/uptime.c | 15 include/linux/device_cgroup.h | 5 ++- include/linux/fairsched.h | 23 include/linux/ve.h| 18 ++ kernel/fairsched.c| 61 kernel/ve/ve.c| 82 ++- kernel/ve/vecalls.c | 67 --- security/device_cgroup.c | 19 +- 10 files changed, 126 insertions(+), 170 deletions(-) diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c index 4cbdeef1aa71..40d8a90b0f13 100644 --- a/fs/proc/loadavg.c +++ b/fs/proc/loadavg.c @@ -6,7 +6,6 @@ #include #include #include -#include #include #define LOAD_INT(x) ((x) >> FSHIFT) @@ -20,7 +19,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v) ve = get_exec_env(); if (!ve_is_super(ve)) { int ret; - ret = fairsched_show_loadavg(ve_name(ve), m); + ret = ve_show_loadavg(ve, m); if (ret != -ENOSYS) return ret; } diff --git a/fs/proc/stat.c b/fs/proc/stat.c index e9991db527e0..7f7e87c855e4 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -10,7 +10,6 @@ #include
#include #include -#include #include #include #include @@ -98,7 +97,7 @@ static int show_stat(struct seq_file *p, void *v) ve = get_exec_env(); if (!ve_is_super(ve)) { int ret; - ret = fairsched_show_stat(ve_name(ve), p); + ret = ve_show_cpu_stat(ve, p); if (ret != -ENOSYS) return ret; } diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c index 6fd56831c796..8fa578e8a553 100644 --- a/fs/proc/uptime.c +++ b/fs/proc/uptime.c @@ -5,7 +5,6 @@ #include #include #include -#include #include #include @@ -25,11 +24,11 @@ static inline void get_ve0_idle(struct timespec *idle) idle->tv_nsec = rem; } -static inline void get_veX_idle(struct timespec *idle, struct cgroup* cgrp) +static inline void get_veX_idle(struct ve_struct *ve, struct timespec *idle) { struct kernel_cpustat kstat; - cpu_cgroup_get_stat(cgrp, &kstat); + ve_get_cpu_stat(ve, &kstat); cputime_to_timespec(kstat.cpustat[CPUTIME_IDLE], idle); } @@ -37,14 +36,12 @@ static int uptime_proc_show(struct seq_file *m, void *v) { struct timespec uptime; struct timespec idle; + struct ve_struct *ve = get_exec_env(); - if (ve_is_super(get_exec_env())) + if (ve_is_super(ve)) get_ve0_idle(&idle); - else { - rcu_read_lock(); - get_veX_idle(&idle, task_cgroup(current, cpu_cgroup_subsys_id)); - rcu_read_unlock(); - } + else + get_veX_idle(ve, &idle); do_posix_clock_monotonic_gettime(&uptime); monotonic_to_bootbased(&uptime); diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h index 64c2da27278c..25ea2270aabe 100644 --- a/include/linux/device_cgroup.h +++ b/include/linux/device_cgroup.h @@ -16,10 +16,9 @@ extern int devcgroup_device_permission(umode_t mode, dev_t dev, int mask); extern int devcgroup_device_visible(umode_t mode, int major, int start_minor, int nr_minors); -struct cgroup; -int devcgroup_set_perms_ve(struct cgroup *cgroup, unsigned, dev_t, unsigned); struct ve_struct; -int devcgroup_seq_show_ve(struct cgroup *devices_root, struct ve_struct *ve, struct seq_file *m); +int 
devcgroup_set_perms_ve(struct ve_struct *, unsigned, dev_t, unsigned); +int devcgroup_seq_show_ve(struct ve_struct *, struct seq_file *); #else static inline int de
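The design change in this commit, capturing a reference at container start instead of resolving the cgroup by a well-known path each time, can be sketched abstractly. All types and helpers below (cgroup_like, ve_like, ve_start_capture) are hypothetical stand-ins, not the kernel API:

```c
#include <stddef.h>

/* Hypothetical sketch of the idea: instead of opening a cgroup by a
 * fixed path every time stats are needed (which breaks as soon as the
 * hierarchy layout changes, e.g. containers move to machine.slice),
 * capture a pointer at container start and use it afterwards. */
struct cgroup_like { long cpu_usage; };

struct ve_like {
	struct cgroup_like *cpu_css;	/* captured at container start */
};

/* Was (conceptually): a by-path lookup on every request. */
static void ve_start_capture(struct ve_like *ve, struct cgroup_like *css)
{
	ve->cpu_css = css;
}

static long ve_cpu_usage(const struct ve_like *ve)
{
	return ve->cpu_css ? ve->cpu_css->cpu_usage : 0;
}
```

The lookup-free accessor keeps working no matter where the container's cgroup sits in the hierarchy, which is the point of the patch.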
[Devel] [PATCH RHEL7 COMMIT] cgroup: use cgroup_path_ve helper in cgroup_show_path
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.16 --> commit df0243406fc27e4af78ca6d9111a0bd30fea00a3 Author: Vladimir Davydov Date: Mon Jun 20 21:07:48 2016 +0400 cgroup: use cgroup_path_ve helper in cgroup_show_path Presently, it basically duplicates the code used for mangling the cgroup path shown inside a ve, which is already present in cgroup_path_ve. Let's reuse it. Signed-off-by: Vladimir Davydov --- kernel/cgroup.c | 39 +-- 1 file changed, 9 insertions(+), 30 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 5c012f6e94e5..dd548853e2eb 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1373,41 +1373,20 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data) } #ifdef CONFIG_VE -int cgroup_show_path(struct seq_file *m, struct dentry *dentry) +static int cgroup_show_path(struct seq_file *m, struct dentry *dentry) { - char *buf; + struct cgroup *cgrp = __d_cgrp(dentry); + char *buf, *end; size_t size = seq_get_buf(m, &buf); - int res = -1, err = 0; - - if (size) { - char *p = dentry_path(dentry, buf, size); - if (!IS_ERR(p)) { - char *end; - if (!ve_is_super(get_exec_env())) { - while (*++p != '/') { - /* -* Mangle one level when showing -* cgroup mount source in container -* e.g.: "/111" -> "/", -* "/111/test.slice/test.scope" -> -* "/test.slice/test.scope" -*/ - if (*p == '\0') { - *--p = '/'; - break; - } - } - } - end = mangle_path(buf, p, " \t\n\\"); - if (end) - res = end - buf; - } else { - err = PTR_ERR(p); - } + int res = -1; + + if (size > 0 && cgroup_path_ve(cgrp, buf, size) == 0) { + end = mangle_path(buf, buf, " \t\n\\"); + res = end - buf; } seq_commit(m, res); - return err; + return 0; } #endif
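The one-level mangling that both the old cgroup_show_path() and cgroup_path_ve implement ("/111" -> "/", "/111/test.slice/test.scope" -> "/test.slice/test.scope") can be modeled in userspace C; strip_first_component() below is a hypothetical illustration, not the kernel helper:

```c
#include <string.h>

/* Hypothetical userspace model of the one-level cgroup path mangling
 * done for containers: the leading per-VE component is hidden. */
static const char *strip_first_component(const char *path)
{
	const char *p;

	if (path[0] != '/')
		return path;		/* not absolute: leave untouched */
	p = strchr(path + 1, '/');
	return p ? p : "/";	/* a bare "/111" collapses to "/" */
}
```

A path consisting only of the per-VE component collapses to "/", matching the example in the removed code comment.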
[Devel] [PATCH RHEL7 COMMIT] Drop fairsched syscalls
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.16 --> commit 13985cb1990d71a321504c58daa16b50ac9a0ec7 Author: Vladimir Davydov Date: Mon Jun 20 19:40:14 2016 +0300 Drop fairsched syscalls Everything that can be configured via fairsched syscalls is accessible via cpu cgroup. Since it's getting difficult to maintain the syscalls due to the upcoming move of containers to machine.slice, drop them. Also, drop all functions from sched and cpuset which were used only by fairsched syscalls. Note, I make CFS_BANDWIDTH select CFS_CPULIMIT config option. This is, because otherwise it won't get selected, because its only user was VZ_FAIRSCHED config option dropped by this patch. I think we need to merge this option with CFS_BANDWIDTH eventually, but let's leave it as is for now. Signed-off-by: Vladimir Davydov --- arch/powerpc/include/asm/systbl.h | 16 +- arch/powerpc/include/uapi/asm/unistd.h| 8 - arch/x86/syscalls/syscall_32.tbl | 9 - arch/x86/syscalls/syscall_64.tbl | 8 - configs/kernel-3.10.0-x86_64-debug.config | 1 - configs/kernel-3.10.0-x86_64.config | 1 - include/linux/cpuset.h| 5 - include/linux/fairsched.h | 58 --- include/linux/sched.h | 20 - include/uapi/linux/Kbuild | 1 - include/uapi/linux/fairsched.h| 8 - init/Kconfig | 20 +- kernel/Makefile | 1 - kernel/cpuset.c | 26 -- kernel/fairsched.c| 705 -- kernel/sched/core.c | 69 +-- kernel/sched/cpuacct.h| 2 + kernel/sys_ni.c | 10 - 18 files changed, 25 insertions(+), 943 deletions(-) diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h index ce9d2d7977e5..8a44bbd2bee6 100644 --- a/arch/powerpc/include/asm/systbl.h +++ b/arch/powerpc/include/asm/systbl.h @@ -374,14 +374,14 @@ SYSCALL(ni_syscall) SYSCALL(ni_syscall) SYSCALL(ni_syscall) SYSCALL(ni_syscall) -SYSCALL(fairsched_mknod) -SYSCALL(fairsched_rmnod) -SYSCALL(fairsched_chwt) -SYSCALL(fairsched_mvpr) 
-SYSCALL(fairsched_rate) -SYSCALL(fairsched_vcpus) -SYSCALL(fairsched_cpumask) -SYSCALL(fairsched_nodemask) +SYSCALL(ni_syscall) +SYSCALL(ni_syscall) +SYSCALL(ni_syscall) +SYSCALL(ni_syscall) +SYSCALL(ni_syscall) +SYSCALL(ni_syscall) +SYSCALL(ni_syscall) +SYSCALL(ni_syscall) SYSCALL(getluid) SYSCALL(setluid) SYSCALL(setublimit) diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h index e90207158a12..41fc69c6822b 100644 --- a/arch/powerpc/include/uapi/asm/unistd.h +++ b/arch/powerpc/include/uapi/asm/unistd.h @@ -387,14 +387,6 @@ #define __NR_execveat 362 #define __NR_switch_endian 363 -#define __NR_fairsched_mknod 360 -#define __NR_fairsched_rmnod 361 -#define __NR_fairsched_chwt362 -#define __NR_fairsched_mvpr363 -#define __NR_fairsched_rate364 -#define __NR_fairsched_vcpus 365 -#define __NR_fairsched_cpumask 366 -#define __NR_fairsched_nodemask367 #define __NR_getluid 368 #define __NR_setluid 369 #define __NR_setublimit370 diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl index e60fd32ebba3..f8ed67d66913 100644 --- a/arch/x86/syscalls/syscall_32.tbl +++ b/arch/x86/syscalls/syscall_32.tbl @@ -360,15 +360,6 @@ 356i386memfd_createsys_memfd_create 374i386userfaultfd sys_userfaultfd -500i386fairsched_mknod sys_fairsched_mknod -501i386fairsched_rmnod sys_fairsched_rmnod -502i386fairsched_chwt sys_fairsched_chwt -503i386fairsched_mvpr sys_fairsched_mvpr -504i386fairsched_rate sys_fairsched_rate -505i386fairsched_vcpus sys_fairsched_vcpus -506i386fairsched_cpumask sys_fairsched_cpumask -507i386fairsched_nodemask sys_fairsched_nodemask - 510i386getluid sys_getluid 511i386setluid sys_setluid 512i386setublimit sys_setublimit compat_sys_setublimit diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl index 846183e5a9f0..7f009985158e 100644 --- a/arch/x86/syscalls/syscall_64.tbl +++ b/arch/x86/syscalls/syscall_64.tbl @@ -325,18 +325,10 @@ 320common kexec_file_load 
sys_kexec_file_load 323common userfaultfd sys_userfaultfd -49764 fairsched_nodemask sys_fairsched_nodemask -49864 fairsched_cpumask sys_
[Devel] [PATCH RHEL7 COMMIT] Drop VZCTL_ENV_CREATE
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.16 --> commit 8d46dca70d92147cf928633f279b9c36deb234c2 Author: Vladimir Davydov Date: Mon Jun 20 19:40:12 2016 +0300 Drop VZCTL_ENV_CREATE It's getting too difficult to support it. Since we've been using cgroup interface for creating VE for quite a while, let's drop it. Signed-off-by: Vladimir Davydov --- include/linux/device_cgroup.h | 1 - include/linux/fairsched.h | 7 - include/linux/sched.h | 1 - include/linux/ve.h| 8 - include/linux/ve_proto.h | 4 - kernel/fairsched.c| 64 +-- kernel/ve/ve.c| 8 +- kernel/ve/vecalls.c | 437 +- security/device_cgroup.c | 46 - 9 files changed, 5 insertions(+), 571 deletions(-) diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h index 32588bb8fb4e..64c2da27278c 100644 --- a/include/linux/device_cgroup.h +++ b/include/linux/device_cgroup.h @@ -17,7 +17,6 @@ extern int devcgroup_device_visible(umode_t mode, int major, int start_minor, int nr_minors); struct cgroup; -int devcgroup_default_perms_ve(struct cgroup *cgroup); int devcgroup_set_perms_ve(struct cgroup *cgroup, unsigned, dev_t, unsigned); struct ve_struct; int devcgroup_seq_show_ve(struct cgroup *devices_root, struct ve_struct *ve, struct seq_file *m); diff --git a/include/linux/fairsched.h b/include/linux/fairsched.h index f3dede236945..b73f51eadabc 100644 --- a/include/linux/fairsched.h +++ b/include/linux/fairsched.h @@ -51,10 +51,6 @@ asmlinkage long sys_fairsched_cpumask(unsigned int id, unsigned int len, asmlinkage long sys_fairsched_nodemask(unsigned int id, unsigned int len, unsigned long __user *user_mask_ptr); -int fairsched_new_node(int id, unsigned int vcpus); -int fairsched_move_task(int id, struct task_struct *tsk); -void fairsched_drop_node(int id, int leave); - int fairsched_get_cpu_stat(const char *name, struct kernel_cpustat *kstat); int cpu_cgroup_get_avenrun(struct cgroup 
*cgrp, unsigned long *avenrun); @@ -71,9 +67,6 @@ int fairsched_show_loadavg(const char *name, struct seq_file *p); #else /* CONFIG_VZ_FAIRSCHED */ -static inline int fairsched_new_node(int id, unsigned int vcpus) { return 0; } -static inline int fairsched_move_task(int id, struct task_struct *tsk) { return 0; } -static inline void fairsched_drop_node(int id, int leave) { } static inline int fairsched_show_stat(const char *name, struct seq_file *p) { return -ENOSYS; } static inline int fairsched_show_loadavg(const char *name, struct seq_file *p) { return -ENOSYS; } static inline int fairsched_get_cpu_avenrun(const char *name, unsigned long *avenrun) { return -ENOSYS; } diff --git a/include/linux/sched.h b/include/linux/sched.h index 21775a21f8ab..84a9888b2483 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1241,7 +1241,6 @@ struct task_struct { unsigned in_execve:1; /* Tell the LSMs that the process is doing an * execve */ unsigned in_iowait:1; - unsigned did_ve_enter:1; unsigned no_new_privs:1; /* task may not gain privileges */ unsigned may_throttle:1; diff --git a/include/linux/ve.h b/include/linux/ve.h index 182a63899a0b..459c8bc581d9 100644 --- a/include/linux/ve.h +++ b/include/linux/ve.h @@ -41,13 +41,10 @@ struct ve_struct { struct list_headve_list; envid_t veid; - boollegacy; /* created using the legacy API - (vzctl ioctl - see do_env_create) */ unsigned intclass_id; struct rw_semaphore op_sem; int is_running; - int is_locked; int is_pseudosuper; atomic_tsuspend; /* see vzcalluser.h for VE_FEATURE_XXX definitions */ @@ -146,10 +143,6 @@ extern struct cgroup_subsys ve_subsys; extern unsigned int sysctl_ve_mount_nr; -#ifdef CONFIG_VE_IPTABLES -extern __u64 ve_setup_iptables_mask(__u64 init_mask); -#endif - #ifdef CONFIG_VE #define ve_uevent_seqnum (get_exec_env()->_uevent_seqnum) @@ -209,7 +202,6 @@ extern void monotonic_ve_to_abs(clockid_t which_clock, struct timespec *tp); void ve_stop_ns(struct pid_namespace *ns); void ve_exit_ns(struct 
pid_namespace *ns); -int ve_start_container(struct ve_struct *ve); extern bool current_user_ns_initial(void); struct user_namespace *ve_init_user_ns(void); diff --git a/include/linux/ve_proto.h b/include/linux/ve_proto.h index 61d80190d0f1..8cc7fe3ba2a3 100644 --- a/include/linux/ve_proto.h +++ b/include/linux/ve_proto.h @@ -53,10 +53,6 @@ extern
[Devel] [PATCH RHEL7 COMMIT] mm: memcontrol: fix race between kmem uncharge and charge reparenting
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-327.18.2.vz7.14.16 --> commit 35c0d2a992aaa399cccaee2fc9f3ed6879840dd4 Author: Vladimir Davydov Date: Mon Jun 20 20:59:38 2016 +0400 mm: memcontrol: fix race between kmem uncharge and charge reparenting When a cgroup is destroyed, all user memory pages get recharged to the parent cgroup. Recharging is done by mem_cgroup_reparent_charges, which keeps looping until res <= kmem. This is supposed to guarantee that by the time the cgroup gets released, no pages are charged to it. However, the guarantee might be violated if mem_cgroup_reparent_charges races with a kmem charge or uncharge. Currently, kmem is charged before res and uncharged after. As a result, kmem might become greater than res for a short period of time even if there are still user memory pages charged to the cgroup. In this case mem_cgroup_reparent_charges will give up prematurely, and the cgroup might be released even though there are still pages charged to it.
Uncharge of such a page will trigger kernel panic: general protection fault: [#1] SMP CPU: 0 PID: 972445 Comm: httpd ve: 0 Tainted: G OE 3.10.0-427.10.1.lve1.4.9.el7.x86_64 #1 12.14 task: 88065d53d8d0 ti: 880224f34000 task.ti: 880224f34000 RIP: 0010:[] [] mem_cgroup_charge_statistics.isra.16+0x13/0x60 RSP: 0018:880224f37a80 EFLAGS: 00010202 RAX: RBX: 8807b26f0110 RCX: RDX: 79726f6765746163 RSI: ea000c9c0440 RDI: 8806a55662f8 RBP: 880224f37a80 R08: R09: 03808000 R10: 00b8 R11: ea001eaa8980 R12: ea000c9c0440 R13: 0001 R14: R15: 8806a5566000 FS: () GS:8807d400() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7f54289bd74c CR3: 0006638b1000 CR4: 06f0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Stack: 880224f37ac0 811e9ddf 88060001 ea000c9c0440 0001 037d1000 880224f37c78 0380 880224f37ad0 811ee99a 880224f37b08 811b9ec9 Call Trace: [] __mem_cgroup_uncharge_common+0xcf/0x320 [] mem_cgroup_uncharge_page+0x2a/0x30 [] page_remove_rmap+0xb9/0x160 [] ? res_counter_uncharge+0x13/0x20 [] unmap_page_range+0x460/0x870 [] unmap_single_vma+0x81/0xf0 [] unmap_vmas+0x49/0x90 [] exit_mmap+0xac/0x1a0 [] mmput+0x6b/0x140 [] flush_old_exec+0x467/0x8d0 [] load_elf_binary+0x33c/0xde0 [] ? get_user_pages+0x52/0x60 [] ? load_elf_library+0x220/0x220 [] search_binary_handler+0xd5/0x300 [] do_execve_common.isra.26+0x657/0x720 [] SyS_execve+0x29/0x30 [] stub_execve+0x69/0xa0 To prevent this from happening, let's always charge kmem after res and uncharge before res. 
https://bugs.openvz.org/browse/OVZ-6756 Reported-by: Anatoly Stepanov Signed-off-by: Vladimir Davydov Reviewed-by: Kirill Tkhai --- mm/memcontrol.c | 44 1 file changed, 36 insertions(+), 8 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1c3fbb2d2c48..de7c36295515 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3163,10 +3163,6 @@ int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size) int ret = 0; bool may_oom; - ret = res_counter_charge(&memcg->kmem, size, &fail_res); - if (ret) - return ret; - /* * Conditions under which we can wait for the oom_killer. Those are * the same conditions tested by the core page allocator @@ -3198,8 +3194,33 @@ int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size) res_counter_charge_nofail(&memcg->memsw, size, &fail_res); ret = 0; - } else if (ret) - res_counter_uncharge(&memcg->kmem, size); + } + + if (ret) + return ret; + + /* +* When a cgroup is destroyed, all user memory pages get recharged to +* the parent cgroup. Recharging is done by mem_cgroup_reparent_charges +* which keeps looping until res <= kmem. This is supposed to guarantee +* that by the time cgroup gets released, no pages is charged to it. +* +* If kmem were charged before res or uncharged after, kmem might +* become grea
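The ordering argument from the commit message can be illustrated with a toy, single-threaded counter model in C. It is purely illustrative (real res_counter operations are atomic and can fail), but it shows the invariant the patch establishes: with kmem charged after res and uncharged before res, kmem never exceeds res, so observing res <= kmem really does mean only kernel memory is left.

```c
/* Toy model: charge kmem after res, uncharge kmem before res.
 * Then kmem_charged <= res_charged holds at every intermediate step,
 * so a reparenting loop waiting for res <= kmem can only exit once
 * no user pages remain. Names are illustrative, not kernel API. */
static long res_charged, kmem_charged;

static void charge_kmem(long size)
{
	res_charged += size;	/* res first ... */
	kmem_charged += size;	/* ... then kmem: kmem never overshoots res */
}

static void uncharge_kmem(long size)
{
	kmem_charged -= size;	/* kmem first ... */
	res_charged -= size;	/* ... then res */
}

/* User (non-kmem) pages touch only res. */
static void charge_user(long size)   { res_charged += size; }
static void uncharge_user(long size) { res_charged -= size; }

/* The condition mem_cgroup_reparent_charges polls. */
static int only_kmem_left(void)
{
	return res_charged <= kmem_charged;
}
```

With the old ordering (kmem before res on charge, after res on uncharge), only_kmem_left() could momentarily return true while user pages were still charged, which is exactly the premature exit the patch closes.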
[Devel] [PATCH rh7 4/6] Drop fairsched syscalls
Everything that can be configured via fairsched syscalls is accessible via cpu cgroup. Since it's getting difficult to maintain the syscalls due to the upcoming move of containers to machine.slice, drop them. Also, drop all functions from sched and cpuset which were used only by fairsched syscalls. Note, I make CFS_BANDWIDTH select CFS_CPULIMIT config option. This is, because otherwise it won't get selected, because its only user was VZ_FAIRSCHED config option dropped by this patch. I think we need to merge this option with CFS_BANDWIDTH eventually, but let's leave it as is for now. Signed-off-by: Vladimir Davydov --- arch/powerpc/include/asm/systbl.h | 16 +- arch/powerpc/include/uapi/asm/unistd.h| 8 - arch/x86/syscalls/syscall_32.tbl | 9 - arch/x86/syscalls/syscall_64.tbl | 8 - configs/kernel-3.10.0-x86_64-debug.config | 1 - configs/kernel-3.10.0-x86_64.config | 1 - include/linux/cpuset.h| 5 - include/linux/fairsched.h | 58 --- include/linux/sched.h | 20 - include/uapi/linux/Kbuild | 1 - include/uapi/linux/fairsched.h| 8 - init/Kconfig | 20 +- kernel/Makefile | 1 - kernel/cpuset.c | 26 -- kernel/fairsched.c| 705 -- kernel/sched/core.c | 69 +-- kernel/sched/cpuacct.h| 2 + kernel/sys_ni.c | 10 - 18 files changed, 25 insertions(+), 943 deletions(-) delete mode 100644 include/linux/fairsched.h delete mode 100644 include/uapi/linux/fairsched.h delete mode 100644 kernel/fairsched.c diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h index ce9d2d7977e5..8a44bbd2bee6 100644 --- a/arch/powerpc/include/asm/systbl.h +++ b/arch/powerpc/include/asm/systbl.h @@ -374,14 +374,14 @@ SYSCALL(ni_syscall) SYSCALL(ni_syscall) SYSCALL(ni_syscall) SYSCALL(ni_syscall) -SYSCALL(fairsched_mknod) -SYSCALL(fairsched_rmnod) -SYSCALL(fairsched_chwt) -SYSCALL(fairsched_mvpr) -SYSCALL(fairsched_rate) -SYSCALL(fairsched_vcpus) -SYSCALL(fairsched_cpumask) -SYSCALL(fairsched_nodemask) +SYSCALL(ni_syscall) +SYSCALL(ni_syscall) +SYSCALL(ni_syscall) +SYSCALL(ni_syscall) 
+SYSCALL(ni_syscall) +SYSCALL(ni_syscall) +SYSCALL(ni_syscall) +SYSCALL(ni_syscall) SYSCALL(getluid) SYSCALL(setluid) SYSCALL(setublimit) diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h index e90207158a12..41fc69c6822b 100644 --- a/arch/powerpc/include/uapi/asm/unistd.h +++ b/arch/powerpc/include/uapi/asm/unistd.h @@ -387,14 +387,6 @@ #define __NR_execveat 362 #define __NR_switch_endian 363 -#define __NR_fairsched_mknod 360 -#define __NR_fairsched_rmnod 361 -#define __NR_fairsched_chwt362 -#define __NR_fairsched_mvpr363 -#define __NR_fairsched_rate364 -#define __NR_fairsched_vcpus 365 -#define __NR_fairsched_cpumask 366 -#define __NR_fairsched_nodemask367 #define __NR_getluid 368 #define __NR_setluid 369 #define __NR_setublimit370 diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl index e60fd32ebba3..f8ed67d66913 100644 --- a/arch/x86/syscalls/syscall_32.tbl +++ b/arch/x86/syscalls/syscall_32.tbl @@ -360,15 +360,6 @@ 356i386memfd_createsys_memfd_create 374i386userfaultfd sys_userfaultfd -500i386fairsched_mknod sys_fairsched_mknod -501i386fairsched_rmnod sys_fairsched_rmnod -502i386fairsched_chwt sys_fairsched_chwt -503i386fairsched_mvpr sys_fairsched_mvpr -504i386fairsched_rate sys_fairsched_rate -505i386fairsched_vcpus sys_fairsched_vcpus -506i386fairsched_cpumask sys_fairsched_cpumask -507i386fairsched_nodemask sys_fairsched_nodemask - 510i386getluid sys_getluid 511i386setluid sys_setluid 512i386setublimit sys_setublimit compat_sys_setublimit diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl index 846183e5a9f0..7f009985158e 100644 --- a/arch/x86/syscalls/syscall_64.tbl +++ b/arch/x86/syscalls/syscall_64.tbl @@ -325,18 +325,10 @@ 320common kexec_file_load sys_kexec_file_load 323common userfaultfd sys_userfaultfd -49764 fairsched_nodemask sys_fairsched_nodemask -49864 fairsched_cpumask sys_fairsched_cpumask -49964 fairsched_vcpus sys_fairsched_vcpus 50064 
getluid sys_getluid 50164 setluid sys_setluid 50264 setublimit sys_setublimit 503
[Devel] [PATCH rh7 1/6] Drop vz_compat boot param
It was introduced by commit d7b23ae8a314f ("ve/cgroups: use cgroup subsystem names only if in vz compat mode") in order to provide a way of running pcs6 environment along with vz7 kernel. Turned out, this is not needed, so drop the option altogether. Signed-off-by: Vladimir Davydov --- include/linux/ve.h | 4 kernel/bc/beancounter.c | 2 -- kernel/fairsched.c | 1 - kernel/ve/ve.c | 10 -- kernel/ve/vecalls.c | 1 - 5 files changed, 18 deletions(-) diff --git a/include/linux/ve.h b/include/linux/ve.h index 2d0c19ee2d98..a40e219c8bce 100644 --- a/include/linux/ve.h +++ b/include/linux/ve.h @@ -155,8 +155,6 @@ extern __u64 ve_setup_iptables_mask(__u64 init_mask); #ifdef CONFIG_VE #define ve_uevent_seqnum (get_exec_env()->_uevent_seqnum) -extern int vz_compat; - extern struct kobj_ns_type_operations ve_ns_type_operations; extern struct kobject * kobject_create_and_add_ve(const char *name, struct kobject *parent); @@ -249,8 +247,6 @@ static inline void ve_mount_nr_dec(void) #define ve_uevent_seqnum uevent_seqnum -#define vz_compat (0) - static inline int vz_security_family_check(struct net *net, int family) { return 0; } static inline int vz_security_protocol_check(struct net *net, int protocol) { return 0; } diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c index b26d292e2881..935ca517e1f4 100644 --- a/kernel/bc/beancounter.c +++ b/kernel/bc/beancounter.c @@ -35,7 +35,6 @@ #include #include #include -#include #include #include @@ -1181,7 +1180,6 @@ void __init ub_init_late(void) int __init ub_init_cgroup(void) { struct cgroup_sb_opts blkio_opts = { - .name = vz_compat ? "beancounter" : NULL, .subsys_mask= (1ul << blkio_subsys_id), }; struct cgroup_sb_opts mem_opts = { diff --git a/kernel/fairsched.c b/kernel/fairsched.c index d3d17126a85c..8149076c8cb8 100644 --- a/kernel/fairsched.c +++ b/kernel/fairsched.c @@ -796,7 +796,6 @@ int __init fairsched_init(void) { struct vfsmount *cpu_mnt, *cpuset_mnt; struct cgroup_sb_opts cpu_opts = { - .name = vz_compat ? 
"fairsched" : NULL, .subsys_mask= (1ul << cpu_cgroup_subsys_id) | (1ul << cpuacct_subsys_id), diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c index 53fa12dca238..703f97c03cb2 100644 --- a/kernel/ve/ve.c +++ b/kernel/ve/ve.c @@ -89,18 +89,8 @@ DEFINE_MUTEX(ve_list_lock); int nr_ve = 1; /* One VE always exists. Compatibility with vestat */ EXPORT_SYMBOL(nr_ve); -int vz_compat; -EXPORT_SYMBOL(vz_compat); - static DEFINE_IDR(ve_idr); -static int __init vz_compat_setup(char *arg) -{ - get_option(&arg, &vz_compat); - return 0; -} -early_param("vz_compat", vz_compat_setup); - struct ve_struct *get_ve(struct ve_struct *ve) { if (ve) diff --git a/kernel/ve/vecalls.c b/kernel/ve/vecalls.c index 537fc4aa964b..a690a8faabba 100644 --- a/kernel/ve/vecalls.c +++ b/kernel/ve/vecalls.c @@ -309,7 +309,6 @@ static struct vfsmount *ve_cgroup_mnt, *devices_cgroup_mnt; static int __init init_vecalls_cgroups(void) { struct cgroup_sb_opts devices_opts = { - .name = vz_compat ? "container" : NULL, .subsys_mask= (1ul << devices_subsys_id), }; -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 6/6] cgroup: fix path mangling for ve cgroups
Presently, we just cut first component off cgroup path when inside a VE, because all VE cgroups are located at the top level of the cgroup hierarchy. However, this is going to change - the cgroups are going to move to machine.slice - so we should introduce a more generic way of mangling cgroup paths. This patch does the trick. On a VE start it marks all cgroups the init task of the VE resides in with a special flag (CGRP_VE_ROOT). Cgroups marked this way will be treated as root if looked at from inside a VE. As long as we don't have nested VEs, this should work fine. Note, we don't need to clear these flags on VE destruction, because vzctl always creates new cgroups on VE start. https://jira.sw.ru/browse/PSBM-48629 Signed-off-by: Vladimir Davydov --- include/linux/cgroup.h | 3 +++ kernel/cgroup.c| 27 --- kernel/ve/ve.c | 4 3 files changed, 27 insertions(+), 7 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index aad06e8e0258..730ca9091bfb 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -175,6 +175,9 @@ enum { CGRP_CPUSET_CLONE_CHILDREN, /* see the comment above CGRP_ROOT_SANE_BEHAVIOR for details */ CGRP_SANE_BEHAVIOR, + + /* The cgroup is root in a VE */ + CGRP_VE_ROOT, }; struct cgroup_name { diff --git a/kernel/cgroup.c b/kernel/cgroup.c index dd548853e2eb..581924e7af9e 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1791,6 +1791,21 @@ static struct file_system_type cgroup_fs_type = { static struct kobject *cgroup_kobj; +#ifdef CONFIG_VE +void cgroup_mark_ve_root(struct ve_struct *ve) +{ + struct cgroup *cgrp; + struct cgroupfs_root *root; + + mutex_lock(&cgroup_mutex); + for_each_active_root(root) { + cgrp = task_cgroup_from_root(ve->init_task, root); + set_bit(CGRP_VE_ROOT, &cgrp->flags); + } + mutex_unlock(&cgroup_mutex); +} +#endif + /** * cgroup_path - generate the path of a cgroup * @cgrp: the cgroup in question @@ -1804,7 +1819,8 @@ static struct kobject *cgroup_kobj; * inode's i_mutex, while on the 
other hand cgroup_path() can be called * with some irq-safe spinlocks held. */ -int __cgroup_path(const struct cgroup *cgrp, char *buf, int buflen, bool virt) +static int __cgroup_path(const struct cgroup *cgrp, char *buf, int buflen, +bool virt) { int ret = -ENAMETOOLONG; char *start; @@ -1824,14 +1840,11 @@ int __cgroup_path(const struct cgroup *cgrp, char *buf, int buflen, bool virt) int len; #ifdef CONFIG_VE - if (virt && cgrp->parent && !cgrp->parent->parent) { + if (virt && test_bit(CGRP_VE_ROOT, &cgrp->flags)) { /* * Containers cgroups are bind-mounted from node * so they are like '/' from inside, thus we have -* to mangle cgroup path output. Effectively it is -* enough to remove two topmost cgroups from path. -* e.g. in ct 101: /101/test.slice/test.scope -> -* /test.slice/test.scope +* to mangle cgroup path output. */ if (*start != '/') { if (--start < buf) @@ -2391,7 +2404,7 @@ static ssize_t cgroup_file_write(struct file *file, const char __user *buf, * inside a container FS. */ if (!ve_is_super(get_exec_env()) - && (!cgrp->parent || !cgrp->parent->parent) + && test_bit(CGRP_VE_ROOT, &cgrp->flags) && !get_exec_env()->is_pseudosuper && !(cft->flags & CFTYPE_VE_WRITABLE)) return -EPERM; diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c index 08a15fc02e21..e65130f18bb4 100644 --- a/kernel/ve/ve.c +++ b/kernel/ve/ve.c @@ -454,6 +454,8 @@ static void ve_drop_context(struct ve_struct *ve) static const struct timespec zero_time = { }; +extern void cgroup_mark_ve_root(struct ve_struct *ve); + /* under ve->op_sem write-lock */ static int ve_start_container(struct ve_struct *ve) { @@ -501,6 +503,8 @@ static int ve_start_container(struct ve_struct *ve) if (err < 0) goto err_iterate; + cgroup_mark_ve_root(ve); + ve->is_running = 1; printk(KERN_INFO "CT: %s: started\n", ve_name(ve)); -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 5/6] cgroup: use cgroup_path_ve helper in cgroup_show_path
Presently, it basically duplicates the code used for mangling cgroup path shown inside ve, which is already present in cgroup_path_ve. Let's reuse it. Signed-off-by: Vladimir Davydov --- kernel/cgroup.c | 39 +-- 1 file changed, 9 insertions(+), 30 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 5c012f6e94e5..dd548853e2eb 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1373,41 +1373,20 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data) } #ifdef CONFIG_VE -int cgroup_show_path(struct seq_file *m, struct dentry *dentry) +static int cgroup_show_path(struct seq_file *m, struct dentry *dentry) { - char *buf; + struct cgroup *cgrp = __d_cgrp(dentry); + char *buf, *end; size_t size = seq_get_buf(m, &buf); - int res = -1, err = 0; - - if (size) { - char *p = dentry_path(dentry, buf, size); - if (!IS_ERR(p)) { - char *end; - if (!ve_is_super(get_exec_env())) { - while (*++p != '/') { - /* -* Mangle one level when showing -* cgroup mount source in container -* e.g.: "/111" -> "/", -* "/111/test.slice/test.scope" -> -* "/test.slice/test.scope" -*/ - if (*p == '\0') { - *--p = '/'; - break; - } - } - } - end = mangle_path(buf, p, " \t\n\\"); - if (end) - res = end - buf; - } else { - err = PTR_ERR(p); - } + int res = -1; + + if (size > 0 && cgroup_path_ve(cgrp, buf, size) == 0) { + end = mangle_path(buf, buf, " \t\n\\"); + res = end - buf; } seq_commit(m, res); - return err; + return 0; } #endif -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 2/6] Drop VZCTL_ENV_CREATE
It's getting too difficult to support it. Since we've been using cgroup interface for creating VE for quite a while, let's drop it. Signed-off-by: Vladimir Davydov --- include/linux/device_cgroup.h | 1 - include/linux/fairsched.h | 7 - include/linux/sched.h | 1 - include/linux/ve.h| 8 - include/linux/ve_proto.h | 4 - kernel/fairsched.c| 64 +-- kernel/ve/ve.c| 8 +- kernel/ve/vecalls.c | 437 +- security/device_cgroup.c | 46 - 9 files changed, 5 insertions(+), 571 deletions(-) diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h index 32588bb8fb4e..64c2da27278c 100644 --- a/include/linux/device_cgroup.h +++ b/include/linux/device_cgroup.h @@ -17,7 +17,6 @@ extern int devcgroup_device_visible(umode_t mode, int major, int start_minor, int nr_minors); struct cgroup; -int devcgroup_default_perms_ve(struct cgroup *cgroup); int devcgroup_set_perms_ve(struct cgroup *cgroup, unsigned, dev_t, unsigned); struct ve_struct; int devcgroup_seq_show_ve(struct cgroup *devices_root, struct ve_struct *ve, struct seq_file *m); diff --git a/include/linux/fairsched.h b/include/linux/fairsched.h index e242c0d4c065..615e88928e25 100644 --- a/include/linux/fairsched.h +++ b/include/linux/fairsched.h @@ -51,10 +51,6 @@ asmlinkage long sys_fairsched_cpumask(unsigned int id, unsigned int len, asmlinkage long sys_fairsched_nodemask(unsigned int id, unsigned int len, unsigned long __user *user_mask_ptr); -int fairsched_new_node(int id, unsigned int vcpus); -int fairsched_move_task(int id, struct task_struct *tsk); -void fairsched_drop_node(int id, int leave); - int fairsched_get_cpu_stat(const char *name, struct kernel_cpustat *kstat); int cpu_cgroup_get_avenrun(struct cgroup *cgrp, unsigned long *avenrun); @@ -71,9 +67,6 @@ int fairsched_show_loadavg(const char *name, struct seq_file *p); #else /* CONFIG_VZ_FAIRSCHED */ -static inline int fairsched_new_node(int id, unsigned int vcpus) { return 0; } -static inline int fairsched_move_task(int id, struct task_struct *tsk) { 
return 0; } -static inline void fairsched_drop_node(int id, int leave) { } static inline int fairsched_show_stat(const char *name, struct seq_file *p) { return -ENOSYS; } static inline int fairsched_show_loadavg(const char *name, struct seq_file *p) { return -ENOSYS; } static inline int fairsched_get_cpu_avenrun(const char *name, unsigned long *avenrun) { return -ENOSYS; } diff --git a/include/linux/sched.h b/include/linux/sched.h index 21775a21f8ab..84a9888b2483 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1241,7 +1241,6 @@ struct task_struct { unsigned in_execve:1; /* Tell the LSMs that the process is doing an * execve */ unsigned in_iowait:1; - unsigned did_ve_enter:1; unsigned no_new_privs:1; /* task may not gain privileges */ unsigned may_throttle:1; diff --git a/include/linux/ve.h b/include/linux/ve.h index a40e219c8bce..878ca284a6ba 100644 --- a/include/linux/ve.h +++ b/include/linux/ve.h @@ -43,13 +43,10 @@ struct ve_struct { struct list_headve_list; envid_t veid; - boollegacy; /* created using the legacy API - (vzctl ioctl - see do_env_create) */ unsigned intclass_id; struct rw_semaphore op_sem; int is_running; - int is_locked; int is_pseudosuper; atomic_tsuspend; /* see vzcalluser.h for VE_FEATURE_XXX definitions */ @@ -148,10 +145,6 @@ extern struct cgroup_subsys ve_subsys; extern unsigned int sysctl_ve_mount_nr; -#ifdef CONFIG_VE_IPTABLES -extern __u64 ve_setup_iptables_mask(__u64 init_mask); -#endif - #ifdef CONFIG_VE #define ve_uevent_seqnum (get_exec_env()->_uevent_seqnum) @@ -211,7 +204,6 @@ extern void monotonic_ve_to_abs(clockid_t which_clock, struct timespec *tp); void ve_stop_ns(struct pid_namespace *ns); void ve_exit_ns(struct pid_namespace *ns); -int ve_start_container(struct ve_struct *ve); extern bool current_user_ns_initial(void); struct user_namespace *ve_init_user_ns(void); diff --git a/include/linux/ve_proto.h b/include/linux/ve_proto.h index 153f18bd19b1..5787afe275ce 100644 --- a/include/linux/ve_proto.h +++ 
b/include/linux/ve_proto.h @@ -55,10 +55,6 @@ extern struct ve_struct *get_ve_by_id(envid_t); extern struct cgroup *ve_cgroup_open(struct cgroup *root, int flags, envid_t veid); extern int ve_cgroup_remove(struct cgroup *root, envid_t veid); -struct env_create_param3; -extern int real_env_create(envid_t veid, unsigned flags, u32 class_id, - struct env_create_param3
[Devel] [PATCH rh7 3/6] Use ve init task's css instead of opening cgroup via vfs
Currently, whenever we need to get cpu or devices cgroup corresponding to a ve, we open it using cgroup_kernel_open(). This is inflexible, because it relies on the fact that all container cgroups are located at a specific location which can never change (at the top level). Since we want to move container cgroups to machine.slice, we need to rework this. This patch does the trick. It makes each ve remember its init task at container start, and use css corresponding to init task whenever we need to get a corresponding cgroup. Note, that after this patch is applied, we don't need to mount cpu and devices cgroup in kernel. https://jira.sw.ru/browse/PSBM-48629 Signed-off-by: Vladimir Davydov --- fs/proc/loadavg.c | 3 +- fs/proc/stat.c| 3 +- fs/proc/uptime.c | 15 include/linux/device_cgroup.h | 5 ++- include/linux/fairsched.h | 23 include/linux/ve.h| 18 ++ kernel/fairsched.c| 61 kernel/ve/ve.c| 82 ++- kernel/ve/vecalls.c | 67 --- security/device_cgroup.c | 19 +- 10 files changed, 126 insertions(+), 170 deletions(-) diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c index 4cbdeef1aa71..40d8a90b0f13 100644 --- a/fs/proc/loadavg.c +++ b/fs/proc/loadavg.c @@ -6,7 +6,6 @@ #include #include #include -#include #include #define LOAD_INT(x) ((x) >> FSHIFT) @@ -20,7 +19,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v) ve = get_exec_env(); if (!ve_is_super(ve)) { int ret; - ret = fairsched_show_loadavg(ve_name(ve), m); + ret = ve_show_loadavg(ve, m); if (ret != -ENOSYS) return ret; } diff --git a/fs/proc/stat.c b/fs/proc/stat.c index e9991db527e0..7f7e87c855e4 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -10,7 +10,6 @@ #include #include #include -#include #include #include #include @@ -98,7 +97,7 @@ static int show_stat(struct seq_file *p, void *v) ve = get_exec_env(); if (!ve_is_super(ve)) { int ret; - ret = fairsched_show_stat(ve_name(ve), p); + ret = ve_show_cpu_stat(ve, p); if (ret != -ENOSYS) return ret; } diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c 
index 6fd56831c796..8fa578e8a553 100644 --- a/fs/proc/uptime.c +++ b/fs/proc/uptime.c @@ -5,7 +5,6 @@ #include #include #include -#include #include #include @@ -25,11 +24,11 @@ static inline void get_ve0_idle(struct timespec *idle) idle->tv_nsec = rem; } -static inline void get_veX_idle(struct timespec *idle, struct cgroup* cgrp) +static inline void get_veX_idle(struct ve_struct *ve, struct timespec *idle) { struct kernel_cpustat kstat; - cpu_cgroup_get_stat(cgrp, &kstat); + ve_get_cpu_stat(ve, &kstat); cputime_to_timespec(kstat.cpustat[CPUTIME_IDLE], idle); } @@ -37,14 +36,12 @@ static int uptime_proc_show(struct seq_file *m, void *v) { struct timespec uptime; struct timespec idle; + struct ve_struct *ve = get_exec_env(); - if (ve_is_super(get_exec_env())) + if (ve_is_super(ve)) get_ve0_idle(&idle); - else { - rcu_read_lock(); - get_veX_idle(&idle, task_cgroup(current, cpu_cgroup_subsys_id)); - rcu_read_unlock(); - } + else + get_veX_idle(ve, &idle); do_posix_clock_monotonic_gettime(&uptime); monotonic_to_bootbased(&uptime); diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h index 64c2da27278c..25ea2270aabe 100644 --- a/include/linux/device_cgroup.h +++ b/include/linux/device_cgroup.h @@ -16,10 +16,9 @@ extern int devcgroup_device_permission(umode_t mode, dev_t dev, int mask); extern int devcgroup_device_visible(umode_t mode, int major, int start_minor, int nr_minors); -struct cgroup; -int devcgroup_set_perms_ve(struct cgroup *cgroup, unsigned, dev_t, unsigned); struct ve_struct; -int devcgroup_seq_show_ve(struct cgroup *devices_root, struct ve_struct *ve, struct seq_file *m); +int devcgroup_set_perms_ve(struct ve_struct *, unsigned, dev_t, unsigned); +int devcgroup_seq_show_ve(struct ve_struct *, struct seq_file *); #else static inline int devcgroup_inode_permission(struct inode *inode, int mask) diff --git a/include/linux/fairsched.h b/include/linux/fairsched.h index 615e88928e25..b779d2e85b12 100644 --- a/include/linux/fairsched.h +++ 
b/include/linux/fairsched.h @@ -51,31 +51,8 @@ asmlinkage long sys_fairsched_cpumask(unsigned int id, unsigned int len, asmlinkage long sys_fairsched_nodemask(unsigned int id, unsigned int len,
[Devel] [PATCH rh7] mm: memcontrol: fix race between kmem uncharge and charge reparenting
When a cgroup is destroyed, all user memory pages get recharged to the parent cgroup. Recharging is done by mem_cgroup_reparent_charges which keeps looping until res <= kmem. This is supposed to guarantee that by the time cgroup gets released, no pages is charged to it. However, the guarantee might be violated in case mem_cgroup_reparent_charges races with kmem charge or uncharge. Currently, kmem is charged before res and uncharged after. As a result, kmem might become greater than res for a short period of time even if there are still user memory pages charged to the cgroup. In this case mem_cgroup_reparent_charges will give up prematurely, and the cgroup might be released though there are still pages charged to it. Uncharge of such a page will trigger kernel panic: general protection fault: [#1] SMP CPU: 0 PID: 972445 Comm: httpd ve: 0 Tainted: G OE 3.10.0-427.10.1.lve1.4.9.el7.x86_64 #1 12.14 task: 88065d53d8d0 ti: 880224f34000 task.ti: 880224f34000 RIP: 0010:[] [] mem_cgroup_charge_statistics.isra.16+0x13/0x60 RSP: 0018:880224f37a80 EFLAGS: 00010202 RAX: RBX: 8807b26f0110 RCX: RDX: 79726f6765746163 RSI: ea000c9c0440 RDI: 8806a55662f8 RBP: 880224f37a80 R08: R09: 03808000 R10: 00b8 R11: ea001eaa8980 R12: ea000c9c0440 R13: 0001 R14: R15: 8806a5566000 FS: () GS:8807d400() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7f54289bd74c CR3: 0006638b1000 CR4: 06f0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Stack: 880224f37ac0 811e9ddf 88060001 ea000c9c0440 0001 037d1000 880224f37c78 0380 880224f37ad0 811ee99a 880224f37b08 811b9ec9 Call Trace: [] __mem_cgroup_uncharge_common+0xcf/0x320 [] mem_cgroup_uncharge_page+0x2a/0x30 [] page_remove_rmap+0xb9/0x160 [] ? res_counter_uncharge+0x13/0x20 [] unmap_page_range+0x460/0x870 [] unmap_single_vma+0x81/0xf0 [] unmap_vmas+0x49/0x90 [] exit_mmap+0xac/0x1a0 [] mmput+0x6b/0x140 [] flush_old_exec+0x467/0x8d0 [] load_elf_binary+0x33c/0xde0 [] ? get_user_pages+0x52/0x60 [] ? 
load_elf_library+0x220/0x220 [] search_binary_handler+0xd5/0x300 [] do_execve_common.isra.26+0x657/0x720 [] SyS_execve+0x29/0x30 [] stub_execve+0x69/0xa0 To prevent this from happening, let's always charge kmem after res and uncharge before res. https://bugs.openvz.org/browse/OVZ-6756 Reported-by: Anatoly Stepanov Signed-off-by: Vladimir Davydov --- mm/memcontrol.c | 44 1 file changed, 36 insertions(+), 8 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1c3fbb2d2c48..de7c36295515 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3163,10 +3163,6 @@ int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size) int ret = 0; bool may_oom; - ret = res_counter_charge(&memcg->kmem, size, &fail_res); - if (ret) - return ret; - /* * Conditions under which we can wait for the oom_killer. Those are * the same conditions tested by the core page allocator @@ -3198,8 +3194,33 @@ int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size) res_counter_charge_nofail(&memcg->memsw, size, &fail_res); ret = 0; - } else if (ret) - res_counter_uncharge(&memcg->kmem, size); + } + + if (ret) + return ret; + + /* +* When a cgroup is destroyed, all user memory pages get recharged to +* the parent cgroup. Recharging is done by mem_cgroup_reparent_charges +* which keeps looping until res <= kmem. This is supposed to guarantee +* that by the time cgroup gets released, no pages is charged to it. +* +* If kmem were charged before res or uncharged after, kmem might +* become greater than res for a short period of time even if there +* were still user memory pages charged to the cgroup. In this case +* mem_cgroup_reparent_charges would give up prematurely, and the +* cgroup could be released though there were still pages charged to +* it. Uncharge of such a page would trigger kernel panic. +* +* To prevent this from happening, kmem must be charged after res and +* uncharged before res. 
+*/ + ret = res_counter_charge(&memcg->kmem, size, &fail_res); + if (ret) { + res_counter_uncharge(&
Re: [Devel] memcg: mem_cgroup_uncharge_page() kernel panic/lockup
Hi, Thanks for the report. Could you please - file a bug to bugzilla.openvz.org - upload the vmcore at rsync://fe.sw.ru/f837d67c8e2ade8cee3367cb0f880268/ On Mon, Jun 13, 2016 at 09:24:33AM +0300, Anatoly Stepanov wrote: > Hello everyone! > > We encounter an issue with mem_cgroup_uncharge_page() function, > it appears quite often on our clients servers. > > Basically the issue sometimes leads to hard-lockup, sometimes to GP fault. > > Based on bug reports from clients, the problem shows up when a user > process calls "execve" or "exit" syscalls. > As we know in those cases kernel invokes "uncharging" for every page > when its unmapped from all the mm's. > > Kernel dump analysis shows that at the moment of > mem_cgroup_uncharge_page() "memcg" pointer > (taken from page_cgroup) seems to be pointing to some random memory area. > > On the other hand, if we look at current->mm->css, then memcg instance > exists and is "online". > > This led me to a thought that "page_cgroup->memcg" may be changed by > some part of memcg code in parallel. > As far as i understand, the only option here is "reclaim code path" > (may be i'm wrong) > > So, i suppose there might be a race between "memcg uncharge code" and > "memcg reclaim code". 
> > Please, give me your thoughts about it > thanks > > P.S.: > > Additional info: > > Kernel: rh7-3.10.0-327.10.1.vz7.12.14 > > *1st > BT > > PID: 972445 TASK: 88065d53d8d0 CPU: 0 COMMAND: "httpd" > #0 [880224f37818] machine_kexec at 8105249b > #1 [880224f37878] crash_kexec at 81103532 > #2 [880224f37948] oops_end at 81641628 > #3 [880224f37970] die at 810184cb > #4 [880224f379a0] do_general_protection at 81640f24 > #5 [880224f379d0] general_protection at 81640768 > [exception RIP: mem_cgroup_charge_statistics+19] > RIP: 811e7733 RSP: 880224f37a80 RFLAGS: 00010202 > RAX: RBX: 8807b26f0110 RCX: > RDX: 79726f6765746163 RSI: ea000c9c0440 RDI: 8806a55662f8 > RBP: 880224f37a80 R8: R9: 03808000 > R10: 00b8 R11: ea001eaa8980 R12: ea000c9c0440 > R13: 0001 R14: R15: 8806a5566000 > ORIG_RAX: CS: 0010 SS: 0018 > #6 [880224f37a88] __mem_cgroup_uncharge_common at 811e9ddf > #7 [880224f37ac8] mem_cgroup_uncharge_page at 811ee99a > #8 [880224f37ad8] page_remove_rmap at 811b9ec9 > #9 [880224f37b10] unmap_page_range at 811ab580 > #10 [880224f37bf8] unmap_single_vma at 811aba11 > #11 [880224f37c30] unmap_vmas at 811ace79 > #12 [880224f37c68] exit_mmap at 811b663c > #13 [880224f37d18] mmput at 8107853b > #14 [880224f37d38] flush_old_exec at 81202547 > #15 [880224f37d88] load_elf_binary at 8125883c > #16 [880224f37e58] search_binary_handler at 81201c25 > #17 [880224f37ea0] do_execve_common at 812032b7 > #18 [880224f37f30] sys_execve at 81203619 > #19 [880224f37f50] stub_execve at 81649369 > RIP: 7f54284b3287 RSP: 7ffda57a0698 RFLAGS: 0297 > RAX: 003b RBX: 037c5fe8 RCX: > RDX: 037cf3f8 RSI: 037ce5f8 RDI: 7f5425fcabf1 > RBP: 7ffda57a0750 R8: 0001 R9: > > > ***2nd > BT**: > > PID: 168440 TASK: 88001e31cc20 CPU: 18 COMMAND: "httpd" > #0 [88007255f838] machine_kexec at 8105249b > #1 [88007255f898] crash_kexec at 81103532 > #2 [88007255f968] oops_end at 81641628 > #3 [88007255f990] no_context at 8163222b > #4 [88007255f9e0] __bad_area_nosemaphore at 816322c1 > #5 [88007255fa30] 
bad_area_nosemaphore at 8163244a > #6 [88007255fa40] __do_page_fault at 8164443e > #7 [88007255faa0] trace_do_page_fault at 81644673 > #8 [88007255fad8] do_async_page_fault at 81643d59 > #9 [88007255faf0] async_page_fault at 816407f8 > [exception RIP: memcg_check_events+435] > RIP: 811e9b53 RSP: 88007255fba0 RFLAGS: 00010246 > RAX: f81ef81e RBX: 8802106d5000 RCX: > RDX: f81e RSI: 0002 RDI: 8807aa2642e8 > RBP: 88007255fbf0 R8: 0202 R9: > R10: 0010 R11: 88007255ffd8 R12: 8807aa2642e0 > R13: 0410 R14: 8802073de700 R15: 8802106d5000 > ORIG_RAX: CS: 0010 SS: 0018 > #10 [88007255fbf8] __mem_cgroup_uncharge_common at 811
Re: [Devel] [PATCH rh7 v3] vtty: Don't free console mapping until no clients left
On Tue, Jun 14, 2016 at 12:20:17PM +0300, Cyrill Gorcunov wrote: > Currently on container's stop we free vtty mapping in a force way > so that if there is active console hooked from the node it become > unusable since then. It was easier to work with when we've been > reworking virtual console code. > > Now lets make console fully functional as it was in pcs6: > when opened it must survice container start/stop cycle > and checkpoint/restore as well. > > For this sake we: > > - drop ve_hook code, it no longer needed > - free console @map on final close of the last tty opened > > https://jira.sw.ru/browse/PSBM-39463 > > Signed-off-by: Cyrill Gorcunov > CC: Vladimir Davydov > CC: Konstantin Khorenko > CC: Igor Sukhih > CC: Pavel Emelyanov Reviewed-by: Vladimir Davydov ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7 v2] vtty: Don't free console mapping until no clients left
On Sat, Jun 11, 2016 at 12:35:13PM +0300, Cyrill Gorcunov wrote: ... > @@ -939,6 +938,7 @@ static vtty_map_t *vtty_map_alloc(envid_ > lockdep_assert_held(&tty_mutex); > if (map) { > map->veid = veid; > + init_completion(&map->work); Stale hunk? > veid = idr_alloc(&vtty_idr, map, veid, veid + 1, GFP_KERNEL); > if (veid < 0) { > kfree(map); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7 v2] vtty: Allow to wait until container's console appear
On Fri, Jun 10, 2016 at 04:34:34PM +0300, Cyrill Gorcunov wrote: > After tty code redesing we've been requiring container to start > first before be able to connect into it via vzctl console command. > Here we rather allow userspace tool to wait until container brought > to life and proceed connecting into console. > > https://jira.sw.ru/browse/PSBM-39463 > > Note: when someone tried to open several consoles for offline > mode (say vzctl console 300 1 and vzctl console 300 2) simultaneously > only one is allowed until VE is up, the second vzctl command will > exit with -EBUSY. > > v2: > - move everything into vtty code > > Signed-off-by: Cyrill Gorcunov > CC: Vladimir Davydov > CC: Konstantin Khorenko > CC: Igor Sukhih > CC: Pavel Emelyanov > --- > drivers/tty/pty.c | 67 > > include/linux/ve.h |2 + > kernel/ve/ve.c |5 +++ > kernel/ve/vecalls.c |6 ++-- > 4 files changed, 77 insertions(+), 3 deletions(-) > > Index: linux-pcs7.git/drivers/tty/pty.c > === > --- linux-pcs7.git.orig/drivers/tty/pty.c > +++ linux-pcs7.git/drivers/tty/pty.c > @@ -1284,8 +1284,64 @@ static int __init vtty_init(void) > return 0; > } > > +static DECLARE_RWSEM(vtty_console_sem); > +static DEFINE_IDR(vtty_idr_console); We already have vtty_idr, may be reuse it? > + > +static struct ve_struct *vtty_get_ve_by_id(envid_t veid) > +{ > + DECLARE_COMPLETION_ONSTACK(console_work); > + struct ve_struct *ve; > + int ret; > + > + down_write(&vtty_console_sem); > + ve = get_ve_by_id(veid); > + if (ve) { > + up_write(&vtty_console_sem); > + return ve; > + } > + > + if (idr_find(&vtty_idr_console, veid)) { > + up_write(&vtty_console_sem); > + return ERR_PTR(-EBUSY); > + } This block is useless - it's handled by ENOSPC check below. 
> + > + ret = idr_alloc(&vtty_idr_console, &console_work, veid, veid + 1, > GFP_KERNEL); > + if (ret < 0) { > + if (ret == -ENOSPC) > + ret = -EBUSY; > + } else > + ret = 0; > + up_write(&vtty_console_sem); > + > + if (!ret) > + ret = wait_for_completion_interruptible(&console_work); > + > + if (!ret) > + ve = get_ve_by_id(veid); > + else > + ve = ERR_PTR(ret); > + > + down_write(&vtty_console_sem); > + if (!ret) > + idr_remove(&vtty_idr_console, veid); > + up_write(&vtty_console_sem); > + return ve; > +} > + > +void vtty_console_notify(struct ve_struct *ve) > +{ > + struct completion *console_work; > + > + down_read(&vtty_console_sem); > + console_work = idr_find(&vtty_idr_console, ve->veid); > + if (console_work) > + complete(console_work); > + up_read(&vtty_console_sem); > +} > + > int vtty_open_master(envid_t veid, int idx) > { > + struct ve_struct *ve = NULL; > struct tty_struct *tty; > struct file *file; > char devname[64]; > @@ -1298,6 +1354,16 @@ int vtty_open_master(envid_t veid, int i > if (fd < 0) > return fd; > > + ve = vtty_get_ve_by_id(veid); > + if (IS_ERR_OR_NULL(ve)) { > + if (IS_ERR(ve)) > + ret = PTR_ERR(ve); > + else > + ret = -ENOENT; > + ve = NULL; > + goto err_put_unused_fd; > + } > + Come to think of it, is this really necessary? Can't we just allocate vtty_map in vtty_open_master and return master tty w/o open slave? Any write/read will put the caller to sleep anyway. 
> snprintf(devname, sizeof(devname), "v%utty%d", veid, idx); > file = anon_inode_getfile(devname, &vtty_fops, NULL, O_RDWR); > if (IS_ERR(file)) { > @@ -1364,6 +1430,7 @@ int vtty_open_master(envid_t veid, int i > mutex_unlock(&tty_mutex); > ret = fd; > out: > + put_ve(ve); > return ret; > > err_install: > Index: linux-pcs7.git/include/linux/ve.h > === > --- linux-pcs7.git.orig/include/linux/ve.h > +++ linux-pcs7.git/include/linux/ve.h > @@ -215,6 +215,8 @@ void ve_stop_ns(struct pid_namespace *ns > void ve_exit_ns(struct pid_namespace *ns); > int ve_start_container(struct ve_struct *ve); > > +void vtty_console_notify(struct ve_struct *ve); > + > extern b
Re: [Devel] [PATCH rh7] vtty: Don't free console mapping until no clients left
On Tue, Jun 07, 2016 at 06:18:38PM +0300, Cyrill Gorcunov wrote: > Currently on container stop we free the vtty mapping forcibly, > so that if there is an active console hooked from the node it becomes > unusable from then on. That was easier to work with while we were > reworking the virtual console code. > > Now let's make the console fully functional as it was in pcs6: > once opened it must survive container start/stop cycles > and checkpoint/restore as well. > > For this sake we: > > - drop the ve_hook code, it is no longer needed > - free the console @map on the final close of the last opened tty > > https://jira.sw.ru/browse/PSBM-39463 > > Signed-off-by: Cyrill Gorcunov > CC: Vladimir Davydov > CC: Konstantin Khorenko > CC: Igor Sukhih > CC: Pavel Emelyanov Reviewed-by: Vladimir Davydov ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] vtty: Allow to wait until container's console appear
On Mon, Jun 06, 2016 at 07:26:57PM +0300, Cyrill Gorcunov wrote: > After the tty code redesign we've been requiring the container to start > first before being able to connect to it via the vzctl console command. > Here we instead allow the userspace tool to wait until the container is brought > to life and then proceed with connecting to the console. > > https://jira.sw.ru/browse/PSBM-39463 > > Signed-off-by: Cyrill Gorcunov > CC: Vladimir Davydov > CC: Konstantin Khorenko > CC: Igor Sukhih > CC: Pavel Emelyanov > --- > include/linux/ve.h |2 ++ > kernel/ve/ve.c | 48 > kernel/ve/vecalls.c | 23 +-- > 3 files changed, 71 insertions(+), 2 deletions(-) > > Index: linux-pcs7.git/include/linux/ve.h > === > --- linux-pcs7.git.orig/include/linux/ve.h > +++ linux-pcs7.git/include/linux/ve.h > @@ -215,6 +215,8 @@ void ve_stop_ns(struct pid_namespace *ns > void ve_exit_ns(struct pid_namespace *ns); > int ve_start_container(struct ve_struct *ve); > > +int ve_console_wait(envid_t veid); > + > extern bool current_user_ns_initial(void); > struct user_namespace *ve_init_user_ns(void); > > Index: linux-pcs7.git/kernel/ve/ve.c > === > --- linux-pcs7.git.orig/kernel/ve/ve.c > +++ linux-pcs7.git/kernel/ve/ve.c > @@ -260,6 +260,49 @@ struct user_namespace *ve_init_user_ns(v > } > EXPORT_SYMBOL(ve_init_user_ns); > > +static DEFINE_IDR(ve_idr_console); > +static DECLARE_RWSEM(ve_console_sem); > + > +int ve_console_wait(envid_t veid) > +{ > + DECLARE_COMPLETION_ONSTACK(console_work); > + int ret; > + > + down_write(&ve_console_sem); > + if (idr_find(&ve_idr_console, veid)) { > + up_write(&ve_console_sem); > + return -EEXIST; > + } > + > + ret = idr_alloc(&ve_idr_console, &console_work, veid, veid + 1, > GFP_KERNEL); > + if (ret < 0) { > + if (ret == -ENOSPC) > + ret = -EEXIST; > + } else > + ret = 0; > + downgrade_write(&ve_console_sem); > + > + if (!ret) { > + ret = wait_for_completion_interruptible(&console_work); > + idr_remove(&ve_idr_console, veid); > + } > + > + up_read(&ve_console_sem); > + return ret; > }
+EXPORT_SYMBOL(ve_console_wait); > + > +static void ve_console_notify(struct ve_struct *ve) > +{ > + struct completion *console_work; > + > + down_read(&ve_console_sem); > + console_work = idr_find(&ve_idr_console, ve->veid); > + if (console_work) > + complete(console_work); > + up_read(&ve_console_sem); > +} > + > int nr_threads_ve(struct ve_struct *ve) > { > return cgroup_task_count(ve->css.cgroup); > @@ -494,6 +537,11 @@ int ve_start_container(struct ve_struct > > get_ve(ve); /* for ve_exit_ns() */ > > + /* > + * Console waiter are to be notified at the very > + * end when everything else is ready. > + */ > + ve_console_notify(ve); > return 0; > > err_iterate: > Index: linux-pcs7.git/kernel/ve/vecalls.c > === > --- linux-pcs7.git.orig/kernel/ve/vecalls.c > +++ linux-pcs7.git/kernel/ve/vecalls.c > @@ -991,8 +991,27 @@ static int ve_configure(envid_t veid, un > int err = -ENOKEY; > > ve = get_ve_by_id(veid); > - if (!ve) > - return -EINVAL; > + if (!ve) { > + > + if (key != VE_CONFIGURE_OPEN_TTY) > + return -EINVAL; > + /* > + * Offline console management: > + * wait until ve is up and proceed. > + */ What if a VE is created right here, before we call ve_console_wait()? Looks like the caller will hang forever... > + err = ve_console_wait(veid); > + if (err) > + return err; > + > + /* > + * A container should not exit immediately once > + * started but if it does, for any reason, simply > + * exit out gracefully. > + */ > + ve = get_ve_by_id(veid); > + if (!ve) > + return -ENOENT; > + } Can't we fold this into vtty_open_master()? The latter doesn't need ve object, it only needs veid, which is known here. > > switch(key) { > case VE_CONFIGURE_OS_RELEASE: ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] sched/core/cfs: don't reset nr_cpus while setting cpu limits
On Tue, Jun 07, 2016 at 04:50:38PM +0300, Andrey Ryabinin wrote: > Setting cpu limits resets the number of cpus: > # echo 2 >/sys/fs/cgroup/cpu,cpuacct/101/cpu.nr_cpus > # vzctl exec 101 cat /proc/cpuinfo |grep -c processor > 2 > # echo 16 >/sys/fs/cgroup/cpu,cpuacct/101/cpu.cfs_quota_us > # vzctl exec 101 cat /proc/cpuinfo |grep -c processor > 4 > # cat /sys/fs/cgroup/cpu,cpuacct/101/cpu.nr_cpus > 0 > > tg_update_cpu_limit() does that without any apparent reason, > so let's fix it. > > https://jira.sw.ru/browse/PSBM-48061 > > Signed-off-by: Andrey Ryabinin > --- > kernel/sched/core.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 2c147c8..51ebed2 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -8696,7 +8696,6 @@ static void tg_update_cpu_limit(struct task_group *tg) > } > > tg->cpu_rate = rate; > - tg->nr_cpus = 0; This is incorrect. Suppose nr_cpus = 2 and you set cfs_quota to 4 * cfs_period. If you don't reset nr_cpus, you'll get cpu limit equal to 400, although it should be min(nr_cpus * 100, cpu_rate) = 200. > } > > static int tg_set_cpu_limit(struct task_group *tg, ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] net: packet: rework rx/tx ring pages accounting
To account tx/rx ring pages to kmemcg, we allocate them with __GFP_ACCOUNT. After commit 1265d3474391 ("mm: charge/uncharge kmemcg from generic page allocator paths") this implies that these pages have PAGE_KMEMCG_MAPCOUNT_VALUE stored in page->_mapcount. This is incorrect as these pages are supposed to be mapped to userspace: BUG: Bad page map in process packet_sock_mma pte:800241837025 pmd:2428aa067 page:ea0009060dc0 count:2 mapcount:-255 mapping: (null) index:0x0 page flags: 0x2f0004(referenced) page dumped because: bad pte addr:7f16c9a8c000 vm_flags:18100073 anon_vma: (null) mapping:880210caed80 index:0 vma->vm_ops->fault: (null) vma->vm_file->f_op->mmap: sock_mmap+0x0/0x20 CPU: 2 PID: 6141 Comm: packet_sock_mma ve: e7eccd35-3ea1-4dc1-9a04-dba948120299 Not tainted 3.10.0-327.18.2.vz7.14.10 #1 14.10 Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011 ea0009060dc0 7be30e48 88024235ba68 81633548 88024235bab0 811a908f 800241837025 8802428aa460 ea0009060dc0 7f16c9a8c000 88024235bc20 Call Trace: [] dump_stack+0x19/0x1b [] print_bad_pte+0x1af/0x250 [] unmap_page_range+0x76b/0x870 [] unmap_single_vma+0x81/0xf0 [] unmap_vmas+0x49/0x90 [] exit_mmap+0xac/0x1a0 [] mmput+0x6b/0x140 [] do_exit+0x2ac/0xb10 [] ? plist_del+0x46/0x70 [] ? __unqueue_futex+0x32/0x70 [] ? futex_wait+0x11d/0x280 [] do_group_exit+0x3f/0xa0 [] get_signal_to_deliver+0x1d0/0x6d0 [] do_signal+0x57/0x6c0 [] ? do_futex+0x15b/0x600 [] do_notify_resume+0x5f/0xb0 [] int_signal+0x12/0x17 To fix that, let's charge these pages directly using memcg_charge_kmem() to the cgroup the packet socket is accounted to (via ->sk_cgrp). 
https://jira.sw.ru/browse/PSBM-47873 Signed-off-by: Vladimir Davydov --- net/packet/af_packet.c | 20 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index ecb5464c5622..2a1b15a85928 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -3712,7 +3712,7 @@ static void free_pg_vec(struct pgv *pg_vec, unsigned int order, static char *alloc_one_pg_vec_page(unsigned long order) { char *buffer = NULL; - gfp_t gfp_flags = GFP_KERNEL_ACCOUNT | __GFP_COMP | + gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY; buffer = (char *) __get_free_pages(gfp_flags, order); @@ -3723,7 +3723,7 @@ static char *alloc_one_pg_vec_page(unsigned long order) /* * __get_free_pages failed, fall back to vmalloc */ - buffer = vzalloc_account((1 << order) * PAGE_SIZE); + buffer = vzalloc((1 << order) * PAGE_SIZE); if (buffer) return buffer; @@ -3770,6 +3770,7 @@ out_free_pgvec: static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, int closing, int tx_ring) { + struct packet_sk_charge *psc = (struct packet_sk_charge *)sk->sk_cgrp; struct pgv *pg_vec = NULL; struct packet_sock *po = pkt_sk(sk); int was_running, order = 0; @@ -3839,9 +3840,16 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, err = -ENOMEM; order = get_order(req->tp_block_size); + if (psc && memcg_charge_kmem(psc->memcg, GFP_KERNEL, + (PAGE_SIZE << order) * req->tp_block_nr)) + goto out; pg_vec = alloc_pg_vec(req, order); - if (unlikely(!pg_vec)) + if (unlikely(!pg_vec)) { + if (psc) + memcg_uncharge_kmem(psc->memcg, + (PAGE_SIZE << order) * req->tp_block_nr); goto out; + } switch (po->tp_version) { case TPACKET_V3: /* Transmit path is not supported. 
We checked @@ -3912,8 +3920,12 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, } release_sock(sk); - if (pg_vec) + if (pg_vec) { + if (psc) + memcg_uncharge_kmem(psc->memcg, + (PAGE_SIZE << order) * req->tp_block_nr); free_pg_vec(pg_vec, order, req->tp_block_nr); + } out: return err; } -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] mm: fix PAGE_KMEMCG_MAPCOUNT_VALUE
It should be -512; -256 is already used for balloon pages. Fixes: 1265d3474391 ("mm: charge/uncharge kmemcg from generic page allocator paths") Signed-off-by: Vladimir Davydov --- include/linux/page-flags.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index d15d20d84142..731a76613ea4 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -523,7 +523,7 @@ static inline void __ClearPageBalloon(struct page *page) atomic_set(&page->_mapcount, -1); } -#define PAGE_KMEMCG_MAPCOUNT_VALUE (-256) +#define PAGE_KMEMCG_MAPCOUNT_VALUE (-512) static inline int PageKmemcg(struct page *page) { -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] pfcache: abort ext4_pfcache_open if inode already has peer installed
Calling ioctl(FS_IOC_PFCACHE_OPEN) on an inode that already has a pfcache peer installed results in i_peer_list corruption: WARNING: at lib/list_debug.c:36 __list_add+0x8a/0xc0() list_add double add: new=88009c525d40, prev=880088a5bac0, next=88009c525d40. CPU: 5 PID: 1429 Comm: pfcached ve: 0 Not tainted 3.10.0-327.18.2.vz7.14.9 #1 14.9 0024 85f7231d 88008f153c80 81632bb7 88008f153cb8 8107b460 88009c525d40 88009c525d40 880088a5bac0 88009c525cc8 88009c525c90 88008f153d20 Call Trace: [] dump_stack+0x19/0x1b [] warn_slowpath_common+0x70/0xb0 [] warn_slowpath_fmt+0x5c/0x80 [] __list_add+0x8a/0xc0 [] open_mapping_peer+0x15c/0x1f0 [] ext4_open_pfcache+0x155/0x1b0 [ext4] [] ext4_ioctl+0xa9/0x15f0 [ext4] [] ? handle_mm_fault+0x5b4/0xf50 [] ? do_filp_open+0x4b/0xb0 [] do_vfs_ioctl+0x255/0x4f0 [] ? __do_page_fault+0x164/0x450 [] SyS_ioctl+0x54/0xa0 [] system_call_fastpath+0x16/0x1b https://jira.sw.ru/browse/PSBM-47806 Signed-off-by: Vladimir Davydov --- fs/ext4/pfcache.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/ext4/pfcache.c b/fs/ext4/pfcache.c index fe1296f27eb2..ab2f20c243d1 100644 --- a/fs/ext4/pfcache.c +++ b/fs/ext4/pfcache.c @@ -43,6 +43,9 @@ int ext4_open_pfcache(struct inode *inode) struct path root, path; int ret; + if (inode->i_mapping->i_peer_file) + return -EBUSY; + if (!(ext4_test_inode_state(inode, EXT4_STATE_PFCACHE_CSUM) && EXT4_I(inode)->i_data_csum_end < 0)) return -ENODATA; -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 5/7] af_unix: charge buffers to kmemcg
Unix sockets can consume a significant amount of system memory, hence they should be accounted to kmemcg. Since unix socket buffers are always allocated from process context, all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in sock->sk_allocation mask. https://jira.sw.ru/browse/PSBM-34562 Signed-off-by: Vladimir Davydov --- net/unix/af_unix.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 0e629f509cd0..1da93a400145 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -761,6 +761,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock) lockdep_set_class(&sk->sk_receive_queue.lock, &af_unix_sk_receive_queue_lock_key); + sk->sk_allocation = GFP_KERNEL_ACCOUNT; sk->sk_write_space = unix_write_space; sk->sk_max_ack_backlog = net->unx.sysctl_max_dgram_qlen; sk->sk_destruct = unix_sock_destructor; -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 1/7] Drop alloc_kmem_pages and friends
These functions work exactly like alloc_pages and others except they will charge allocated page to current memcg if __GFP_ACCOUNT is passed. In the next patch I'm going to move charge/uncharge to generic allocation paths, so that these special helpers won't be necessary. Signed-off-by: Vladimir Davydov --- arch/x86/include/asm/pgalloc.h| 14 - arch/x86/kernel/ldt.c | 6 ++-- arch/x86/mm/pgtable.c | 19 +--- fs/pipe.c | 11 +++ include/linux/gfp.h | 8 - kernel/fork.c | 6 ++-- mm/memcontrol.c | 1 - mm/page_alloc.c | 65 --- mm/slab_common.c | 2 +- mm/slub.c | 4 +-- mm/vmalloc.c | 6 ++-- net/netfilter/nf_conntrack_core.c | 6 ++-- net/packet/af_packet.c| 8 ++--- 13 files changed, 38 insertions(+), 118 deletions(-) diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h index f5897582b88c..58e45671d127 100644 --- a/arch/x86/include/asm/pgalloc.h +++ b/arch/x86/include/asm/pgalloc.h @@ -48,7 +48,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) static inline void pte_free(struct mm_struct *mm, struct page *pte) { pgtable_page_dtor(pte); - __free_kmem_pages(pte, 0); + __free_page(pte); } extern void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte); @@ -81,11 +81,11 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) { struct page *page; - page = alloc_kmem_pages(GFP_KERNEL_ACCOUNT | __GFP_REPEAT | __GFP_ZERO, 0); + page = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_REPEAT | __GFP_ZERO, 0); if (!page) return NULL; if (!pgtable_pmd_page_ctor(page)) { - __free_kmem_pages(page, 0); + __free_page(page); return NULL; } return (pmd_t *)page_address(page); @@ -95,7 +95,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) { BUG_ON((unsigned long)pmd & (PAGE_SIZE-1)); pgtable_pmd_page_dtor(virt_to_page(pmd)); - free_kmem_pages((unsigned long)pmd, 0); + free_page((unsigned long)pmd); } extern void ___pmd_free_tlb(struct mmu_gather 
*tlb, pmd_t *pmd); @@ -125,14 +125,14 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud) static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr) { - return (pud_t *)__get_free_kmem_pages(GFP_KERNEL_ACCOUNT|__GFP_REPEAT| - __GFP_ZERO, 0); + return (pud_t *)__get_free_page(GFP_KERNEL_ACCOUNT|__GFP_REPEAT| + __GFP_ZERO); } static inline void pud_free(struct mm_struct *mm, pud_t *pud) { BUG_ON((unsigned long)pud & (PAGE_SIZE-1)); - free_kmem_pages((unsigned long)pud, 0); + free_page((unsigned long)pud); } extern void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud); diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index 4a6c8fee47f2..942b0a4e40d5 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -44,7 +44,7 @@ static int alloc_ldt(mm_context_t *pc, int mincount, int reload) if (mincount * LDT_ENTRY_SIZE > PAGE_SIZE) newldt = vmalloc_account(mincount * LDT_ENTRY_SIZE); else - newldt = (void *)__get_free_kmem_pages(GFP_KERNEL_ACCOUNT, 0); + newldt = (void *)__get_free_page(GFP_KERNEL_ACCOUNT); if (!newldt) return -ENOMEM; @@ -83,7 +83,7 @@ static int alloc_ldt(mm_context_t *pc, int mincount, int reload) if (oldsize * LDT_ENTRY_SIZE > PAGE_SIZE) vfree(oldldt); else - __free_kmem_pages(virt_to_page(oldldt), 0); + __free_page(virt_to_page(oldldt)); } return 0; } @@ -138,7 +138,7 @@ void destroy_context(struct mm_struct *mm) if (mm->context.size * LDT_ENTRY_SIZE > PAGE_SIZE) vfree(mm->context.ldt); else - __free_kmem_pages(virt_to_page(mm->context.ldt), 0); + __free_page(virt_to_page(mm->context.ldt)); mm->context.size = 0; } } diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 02ec6243372d..ba13ef8e651a 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -25,11 +25,11 @@ pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address) { struct page *pte; - pte = alloc_kmem_pages(__userpte_alloc_gfp, 0); + pte = alloc_pages(__userpte_alloc_gfp, 0); if (!pte) r
[Devel] [PATCH rh7 4/7] mm: charge/uncharge kmemcg from generic page allocator paths
Currently, to charge a non-slab allocation to kmemcg one has to use alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with this helper should finally be freed using free_kmem_pages, otherwise it won't be uncharged. This API suits its current users fine, but it turns out to be impossible to use along with page reference counting, i.e. when an allocation is supposed to be freed with put_page, as it is the case with pipe or unix socket buffers. To overcome this limitation, this patch moves charging/uncharging to generic page allocator paths, i.e. to __alloc_pages_nodemask and free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way, one can use any of the available page allocation functions to get the allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT, just like in case of kmalloc and friends. A charged page will be automatically uncharged on free. To make it possible, we need to mark pages charged to kmemcg somehow. To avoid introducing a new page flag, we make use of page->_mapcount for marking such pages. Since pages charged to kmemcg are not supposed to be mapped to userspace, it should work just fine. There are other (ab)users of page->_mapcount - buddy and balloon pages - but we don't conflict with them. In case kmemcg is compiled out or not used at runtime, this patch introduces no overhead to generic page allocator paths. If kmemcg is used, it will be plus one gfp flags check on alloc and plus one page->_mapcount check on free, which shouldn't hurt performance, because the data accessed are hot. 
Signed-off-by: Vladimir Davydov --- include/linux/memcontrol.h | 3 ++- include/linux/page-flags.h | 19 +++ mm/memcontrol.c| 4 mm/page_alloc.c| 4 4 files changed, 29 insertions(+), 1 deletion(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d26adf10eaa7..48bf2caa008d 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -23,6 +23,7 @@ #include #include #include +#include struct mem_cgroup; struct page_cgroup; @@ -617,7 +618,7 @@ memcg_kmem_newpage_charge(struct page *page, gfp_t gfp, int order) static inline void memcg_kmem_uncharge_pages(struct page *page, int order) { - if (memcg_kmem_enabled()) + if (memcg_kmem_enabled() && PageKmemcg(page)) __memcg_kmem_uncharge_pages(page, order); } diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index cdf83ecac8f3..d15d20d84142 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -523,6 +523,25 @@ static inline void __ClearPageBalloon(struct page *page) atomic_set(&page->_mapcount, -1); } +#define PAGE_KMEMCG_MAPCOUNT_VALUE (-256) + +static inline int PageKmemcg(struct page *page) +{ + return atomic_read(&page->_mapcount) == PAGE_KMEMCG_MAPCOUNT_VALUE; +} + +static inline void __SetPageKmemcg(struct page *page) +{ + VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page); + atomic_set(&page->_mapcount, PAGE_KMEMCG_MAPCOUNT_VALUE); +} + +static inline void __ClearPageKmemcg(struct page *page) +{ + VM_BUG_ON_PAGE(!PageKmemcg(page), page); + atomic_set(&page->_mapcount, -1); +} + /* * If network-based swap is enabled, sl*b must keep track of whether pages * were allocated from pfmemalloc reserves. 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8eb48071ea22..1c3fbb2d2c48 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3555,6 +3555,8 @@ __memcg_kmem_newpage_charge(struct page *page, gfp_t gfp, int order) SetPageCgroupUsed(pc); unlock_page_cgroup(pc); + __SetPageKmemcg(page); + return true; } @@ -3588,6 +3590,8 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order) VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page); memcg_uncharge_kmem(memcg, PAGE_SIZE << order); + + __ClearPageKmemcg(page); } struct mem_cgroup *__mem_cgroup_from_kmem(void *ptr) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9f02d8013add..2b04f36ea016 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -748,6 +748,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order) if (PageAnon(page)) page->mapping = NULL; + memcg_kmem_uncharge_pages(page, order); for (i = 0; i < (1 << order); i++) { bad += free_pages_check(page + i); if (static_key_false(&zero_free_pages)) @@ -2804,6 +2805,9 @@ out: if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; + if (page && !memcg_kmem_newpage_charge(page, gfp_mask, order)) + __free_pages(page, order); + return page; } EXPORT_SYMBOL(__alloc_pages_nodemask); -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 0/7] Some kmemcg related fixes
This patch set backports some changes from the following patch set submitted upstream: lkml.kernel.org/r/cover.1464079537.git.vdavy...@virtuozzo.com [hasn't been merged yet] namely: - move kmemcg charge/uncharge to generic allocator paths - fix pipe buffer stealing - avoid charging kernel page tables - account unix socket buffers to kmemcg (PSBM-34562) Vladimir Davydov (7): Drop alloc_kmem_pages and friends mm: memcontrol: drop memcg_kmem_commit_charge Move PageBalloon and PageBuddy helpers to page-flags.h mm: charge/uncharge kmemcg from generic page allocator paths af_unix: charge buffers to kmemcg pipe: uncharge page on ->steal arch: x86: don't charge kernel page tables to kmemcg arch/x86/include/asm/pgalloc.h| 22 + arch/x86/kernel/ldt.c | 6 ++-- arch/x86/mm/pgtable.c | 32 +- fs/pipe.c | 28 +++- include/linux/gfp.h | 8 - include/linux/memcontrol.h| 39 -- include/linux/mm.h| 47 -- include/linux/page-flags.h| 66 + kernel/fork.c | 6 ++-- mm/memcontrol.c | 31 ++ mm/page_alloc.c | 69 +++ mm/slab_common.c | 2 +- mm/slub.c | 4 +-- mm/vmalloc.c | 6 ++-- net/netfilter/nf_conntrack_core.c | 6 ++-- net/packet/af_packet.c| 8 ++--- net/unix/af_unix.c| 1 + 17 files changed, 157 insertions(+), 224 deletions(-) -- 2.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel