Re: [Devel] [PATCH rhel7] procfs: always expose /proc/<pid>/map_files/ and make it readable
Acked-by: Andrey Vagin

On Mon, May 16, 2016 at 11:28:51AM +0300, Cyrill Gorcunov wrote:
> This is a backport of commit
>
> ML: bdb4d100afe9818aebd1d98ced575c5ef143456c
>
> From: Calvin Owens
>
> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
> only exposed if CONFIG_CHECKPOINT_RESTORE is set.
>
> Each mapped file region gets a symlink in /proc/<pid>/map_files/
> corresponding to the virtual address range at which it is mapped. The
> symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them
> to the backing file even if that backing file has been unlinked.
>
> Currently, files which are mapped, unlinked, and closed are impossible to
> stat() from userspace. Exposing /proc/<pid>/map_files/ closes this
> functionality "hole".
>
> Not being able to stat() such files makes noticing and explicitly
> accounting for the space they use on the filesystem impossible. You can
> work around this by summing up the space used by every file in the
> filesystem and subtracting that total from what statfs() tells you, but
> that obviously isn't great, and it becomes unworkable once your filesystem
> becomes large enough.
>
> This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
> adjusts the permissions enforced on it as follows:
>
> * proc_map_files_lookup()
> * proc_map_files_readdir()
> * map_files_d_revalidate()
>
>	Remove the CAP_SYS_ADMIN restriction, leaving only the current
>	restriction requiring PTRACE_MODE_READ. The information made
>	available to userspace by these three functions is already
>	available in /proc/PID/maps with MODE_READ, so I don't see any
>	reason to limit them any further (see below for more detail).
>
> * proc_map_files_follow_link()
>
>	This stub has been added, and requires that the user have
>	CAP_SYS_ADMIN in order to follow the links in map_files/,
>	since there was concern on LKML both about the potential for
>	bypassing permissions on ancestor directories in the path to
>	files pointed to, and about what happens with more exotic
>	memory mappings created by some drivers (ie dma-buf).
>
> In older versions of this patch, I changed every permission check in
> the four functions above to enforce MODE_ATTACH instead of MODE_READ.
> This was an oversight on my part, and after revisiting the discussion
> it seems that nobody was concerned about anything outside of what is
> made possible by ->follow_link(). So in this version, I've left the
> checks for PTRACE_MODE_READ as-is.
>
> [a...@linux-foundation.org: catch up with concurrent proc_pid_follow_link()
> changes]
> Signed-off-by: Calvin Owens
> Reviewed-by: Kees Cook
> Cc: Andy Lutomirski
> Cc: Cyrill Gorcunov
> Cc: Joe Perches
> Cc: Kirill A. Shutemov
> Signed-off-by: Andrew Morton
> Signed-off-by: Linus Torvalds
> Signed-off-by: Cyrill Gorcunov
> ---
>
> Kostya, please wait for Ack from Andrew. The patch on its own is not
> tied to any of the bugs we're working on now, but it is useful in general
> and will probably help us with renaming of memfd restored memory
> in criu (we use memfd to be able to restore anonymous shared memory
> in the userns case, but memfd mangles the backend name; we haven't found
> any problem with it yet, but I've been talking to Andrew and he agreed
> that we might need to do something about this problem, and this patch
> is a first step).
>
>  fs/proc/base.c | 44 +++-
>  1 file changed, 23 insertions(+), 21 deletions(-)
>
> Index: linux-pcs7.git/fs/proc/base.c
> ===
> --- linux-pcs7.git.orig/fs/proc/base.c
> +++ linux-pcs7.git/fs/proc/base.c
> @@ -1925,8 +1925,6 @@ end_instantiate:
>  	return filldir(dirent, name, len, filp->f_pos, ino, type);
>  }
>
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> -
>  /*
>   * dname_to_vma_addr - maps a dentry name into two unsigned longs
>   * which represent vma start and end addresses.
> @@ -1953,11 +1951,6 @@ static int map_files_d_revalidate(struct
>  	if (flags & LOOKUP_RCU)
>  		return -ECHILD;
>
> -	if (!capable(CAP_SYS_ADMIN)) {
> -		status = -EPERM;
> -		goto out_notask;
> -	}
> -
>  	inode = dentry->d_inode;
>  	task = get_proc_task(inode);
>  	if (!task)
> @@ -2048,6 +2041,28 @@ struct map_files_info {
>  	unsigned char name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
>  };
>
> +/*
> + * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
> + * symlinks may be used to bypass permissions on ancestor directories in the
> + * path to the file in question.
> + */
> +static void *proc_map_files_follow_link(struct
[Devel] [NEW KERNEL] 3.10.0-327.18.2.vz7.14.3 (rhel7)
Changelog:

OpenVZ kernel rh7-3.10.0-327.18.2.vz7.14.3

* technical rebuild of vz7.14.1 kernel

Generated changelog:

* Mon May 16 2016 Konstantin Khorenko [3.10.0-327.18.2.vz7.14.3]

Built packages: http://kojistorage.eng.sw.ru/packages/vzkernel/3.10.0/327.18.2.vz7.14.3/

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [NEW KERNEL] 3.10.0-327.18.2.vz7.14.2 (rhel7)
Changelog:

OpenVZ kernel rh7-3.10.0-327.18.2.vz7.14.2

* technical rebuild of vz7.14.1

Generated changelog:

* Mon May 16 2016 Konstantin Khorenko [3.10.0-327.18.2.vz7.14.2]

Built packages: http://kojistorage.eng.sw.ru/packages/vzkernel/3.10.0/327.18.2.vz7.14.2/
[Devel] [PATCH 2/6] e4defrag2: [TP case] force defrag for very low populated clusters
If a cluster has only a small number of blocks in use, it is reasonable to
relocate those blocks regardless of the inode's quality and free the whole
cluster.

https://jira.sw.ru/browse/PSBM-46563

Signed-off-by: Dmitry Monakhov
---
 misc/e4defrag2.c | 54 +-
 1 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 797a342..9206c89 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -279,6 +279,7 @@ enum spext_flags
 	SP_FL_DIRLOCAL = 0x20,
 	SP_FL_CSUM = 0x40,
 	SP_FL_FMAP = 0x80,
+	SP_FL_TP_RELOC = 0x100,
 };
 
 struct rb_fhandle
@@ -383,6 +384,7 @@ struct defrag_context
 	unsigned	cluster_size;
 	unsigned	ief_reloc_cluster;
 	unsigned	weight_scale;
+	unsigned	tp_weight_scale;
 	unsigned	extents_quality;
 };
 
@@ -1098,6 +1100,7 @@ static int scan_inode_pass3(struct defrag_context *dfx, int fd,
 	int is_old = 0;
 	int is_rdonly = 0;
 	__u64 ief_blocks = 0;
+	__u64 tp_blocks = 0;
 	__u32 ino_flags = 0;
 	__u64 size_blk = dfx_sz2b(dfx, stat->st_size);
 	__u64 used_blk = dfx_sz2b(dfx, stat->st_blocks << 9);
@@ -1158,13 +1161,16 @@ static int scan_inode_pass3(struct defrag_context *dfx, int fd,
 		}
 		if (se->flags & SP_FL_IEF_RELOC)
 			ief_blocks += fec->fec_map[i].len;
+		if (se->flags & SP_FL_TP_RELOC)
+			tp_blocks += fec->fec_map[i].len;
+
 		fmap_csum_ext(fec->fec_map + i, );
 	}
 	if (fest.local_ex == fec->fec_extents)
 		ino_flags |= SP_FL_LOCAL;
 
-	if (ief_blocks) {
+	if (ief_blocks || tp_blocks) {
 		/*
 		 * Even if some extents belong to IEF cluster, it is not a good
 		 * idea to relocate the whole file. From other point of view,
@@ -1182,6 +1188,13 @@ static int scan_inode_pass3(struct defrag_context *dfx, int fd,
 				       "size_blk:%lld used_blk:%lld\n",
 				       __func__, stat->st_ino, ief_blocks,
 				       size_blk, used_blk);
+		} else if (tp_blocks * 4 > size_blk) {
+			ino_flags |= SP_FL_IEF_RELOC | SP_FL_TP_RELOC;
+			if (debug_flag & DBG_SCAN && ief_blocks != size_blk)
+				printf("%s Force add %lu to IEF/TP set ief:%lld "
+				       "size_blk:%lld used_blk:%lld\n",
+				       __func__, stat->st_ino, ief_blocks,
+				       size_blk, used_blk);
 		} else if (debug_flag & DBG_SCAN) {
 			printf("%s Reject %lu from IEF set ief:%lld "
 			       "size_blk:%lld used_blk:%lld\n",
@@ -1592,6 +1605,7 @@ static void pass3_prep(struct defrag_context *dfx)
 	unsigned good = 0;
 	unsigned count = 0;
 	unsigned ief_ok = 0;
+	unsigned force_reloc = 0;
 
 	if (verbose)
 		printf("Pass3_prep: Scan and rate cached extents\n");
@@ -1610,18 +1624,29 @@ static void pass3_prep(struct defrag_context *dfx)
 			print_spex("\t\t\t", ex);
 
 		if (prev_cluster != cluster) {
-			ief_ok = 0;
+			force_reloc = ief_ok = 0;
+			/* Is cluster has enough RO(good) data blocks ?*/
 			if (dfx->cluster_size >= used * dfx->weight_scale &&
-			    good * 1000 >= count * dfx->extents_quality &&
-			    cluster_node) {
+			    good * 1000 >= count * dfx->extents_quality)
+				ief_ok = 1;
+
+			/* Thin provision corner case: If cluster has low number
+			 * of data blocks it should be relocated regardless to
+			 * block's quality in order to improve space efficency */
+			if (dfx->cluster_size >= used * dfx->tp_weight_scale) {
+				ief_ok = 1;
+				force_reloc = 1;
+			}
+
+			if (ief_ok && cluster_node) {
 				while (cluster_node != node) {
 					struct spextent *se = node_to_spextent(cluster_node);
-					ief_ok = 1;
 					se->flags |= SP_FL_IEF_RELOC;
+					if (force_reloc)
+						se->flags |= SP_FL_TP_RELOC;
 					if (debug_flag & DBG_TREE)
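A note on the thresholds in this patch: the cluster-level and inode-level rules can be restated outside the diff. What follows is only a sketch, not code from e4defrag2.c. The struct, the function names and the scale values used in the example are made up for illustration; the comparisons themselves are copied from the pass3_prep() and scan_inode_pass3() hunks. In the patch the inode-level test is an else-branch of the existing IEF check, which this sketch ignores.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical condensed view of the per-cluster counters in pass3_prep(). */
struct cluster_stats {
	unsigned cluster_size;	/* blocks per cluster */
	unsigned used;		/* blocks currently in use */
	unsigned good;		/* extents rated "good" (read-only data) */
	unsigned count;		/* extents seen in the cluster */
};

struct reloc_decision {
	bool ief_ok;		/* cluster qualifies for IEF relocation */
	bool force_reloc;	/* thin-provision case: ignore extent quality */
};

/* Mirrors the two cluster-level tests added by the patch. */
struct reloc_decision rate_cluster(const struct cluster_stats *c,
				   unsigned weight_scale,
				   unsigned tp_weight_scale,
				   unsigned extents_quality)
{
	struct reloc_decision d = { false, false };

	/* Regular rule: sparse enough AND enough good extents. */
	if (c->cluster_size >= c->used * weight_scale &&
	    c->good * 1000 >= c->count * extents_quality)
		d.ief_ok = true;

	/* Thin-provision rule: very sparse, quality is not consulted. */
	if (c->cluster_size >= c->used * tp_weight_scale) {
		d.ief_ok = true;
		d.force_reloc = true;
	}
	return d;
}

/* Inode-level rule: more than a quarter of the file sits in TP clusters. */
bool inode_forced_by_tp(uint64_t tp_blocks, uint64_t size_blk)
{
	return tp_blocks * 4 > size_blk;
}
```

With cluster_size = 1024 and tp_weight_scale = 64 (both hypothetical values), a cluster with 10 used blocks is force-relocated because 1024 >= 10 * 64, even if none of its extents is rated good.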
[Devel] [PATCH 1/6] e4defrag2: improve debugging
Dump donor rejection reason.

Signed-off-by: Dmitry Monakhov
---
 misc/e4defrag2.c | 14 ++
 1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 8ecae16..797a342 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -217,6 +217,7 @@ enum debug_flags {
 	DBG_FS = 0x10,
 	DBG_FIEMAP = 0x20,
 	DBG_BITMAP = 0x40,
+	DBG_ERR = 0x80,
 };
 
 /* The following macro is used for ioctl FS_IOC_FIEMAP
@@ -1740,10 +1741,14 @@ static int do_alloc_donor_space(struct defrag_context *dfx, dgrp_t group,
 		goto err;
 	}
 	TODO: Checks are sufficient for good donor?
-	if (force_local && donor->fest.local_ex != fec->fec_extents)
+	if (force_local && donor->fest.local_ex != fec->fec_extents) {
+		ret = -2;
 		goto err;
-	if (donor->fest.frag > max_frag)
+	}
+	if (donor->fest.frag > max_frag) {
+		ret = -3;
 		goto err;
+	}
 	if (debug_flag & DBG_FS)
 		printf("%s: Create donor file is_local:%d blocks:%lld\n", __func__,
@@ -1754,11 +1759,12 @@ static int do_alloc_donor_space(struct defrag_context *dfx, dgrp_t group,
 	donor->fec = fec;
 	return 0;
 err:
-	if (debug_flag & DBG_RT)
+	if (debug_flag & DBG_ERR)
 		printf("%s:%d REJECT donor grp:%u donor_fd:%d blocks:%llu local:%d frag:%u ret:%d\n",
-		       __func__, __LINE__, group, donor->fd, blocks, force_local, max_frag, -1);
+		       __func__, __LINE__, group, donor->fd, blocks, force_local, max_frag, ret);
 	free(fec);
+
 	return -1;
 }
-- 
1.7.1
[Devel] [PATCH 4/6] ext4defrag2: add on/off forcelocal option
Signed-off-by: Dmitry Monakhov
---
 misc/e4defrag2.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 0ca7a63..771ee51 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -2516,7 +2516,7 @@ int main(int argc, char *argv[])
 	add_error_table(_ext2_error_table);
 	gettimeofday(_start, 0);
 
-	while ((c = getopt(argc, argv, "a:C:c:d:fF:hlmnt:s:S:T:vq:")) != EOF) {
+	while ((c = getopt(argc, argv, "a:C:c:d:fF:hl:mnt:s:S:T:vq:")) != EOF) {
 		switch (c) {
 		case 'a':
 			min_frag_size = strtoul(optarg, , 0);
@@ -2572,7 +2572,7 @@ int main(int argc, char *argv[])
 			usage();
 			break;
 		case 'l':
-			dfx.ief_force_local = 1;
+			dfx.ief_force_local = !!strtoul(optarg, , 0);
 			break;
 
 		case 'n':
-- 
1.7.1
[Devel] [PATCH 3/6] ext4defrag2: improve statistics configuration
Signed-off-by: Dmitry Monakhov
---
 misc/e4defrag2.c | 85 +
 1 files changed, 65 insertions(+), 20 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 9206c89..0ca7a63 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -218,6 +218,10 @@ enum debug_flags {
 	DBG_FIEMAP = 0x20,
 	DBG_BITMAP = 0x40,
 	DBG_ERR = 0x80,
+	DBG_CLUSTER = 0x100,
+	DBG_TAG = 0x200,
+	DBG_IAF = 0x400,
+	DBG_IEF = 0x800,
 };
 
 /* The following macro is used for ioctl FS_IOC_FIEMAP
@@ -903,7 +907,7 @@ static int group_add_ief_candidate(struct defrag_context *dfx, int dirfd, const
 	fhp->handle_bytes = dfx->root_fhp->handle_bytes;
 	ret = name_to_handle_at(dirfd, name, fhp, , 0);
 	if (ret) {
-		if (debug_flag & DBG_SCAN)
+		if (debug_flag & (DBG_SCAN|DBG_IEF))
 			fprintf(stderr, "Unexpected result from name_to_handle_at()\n");
 		goto free_fh;
 	}
@@ -916,7 +920,7 @@ static int group_add_ief_candidate(struct defrag_context *dfx, int dirfd, const
 
 	if (insert_fhandle(>group[group]->fh_root, >node)) {
 		/* Inode is already in the list, likely nlink > 1 */
-		if (debug_flag & DBG_SCAN)
+		if (debug_flag & (DBG_SCAN|DBG_IEF))
 			fprintf(stderr, "File is already in the list, nlink > 1,"
 				" Not an error\n");
 		ext2fs_free_mem();
@@ -1127,8 +1131,10 @@ static int scan_inode_pass3(struct defrag_context *dfx, int fd,
 			goto out;
 
 		group_add_dircache(dfx, dirfd, , ".");
-		do_iaf_defrag_one(dfx, dirfd, name, stat, fec, );
-		goto out;
+		ret = do_iaf_defrag_one(dfx, dirfd, name, stat, fec, );
+		if (!ret)
+			goto out;
+
 	}
 
 	if (stat->st_mtime < older_than)
@@ -1171,6 +1177,12 @@ static int scan_inode_pass3(struct defrag_context *dfx, int fd,
 		ino_flags |= SP_FL_LOCAL;
 
 	if (ief_blocks || tp_blocks) {
+		if (debug_flag & DBG_SCAN && ief_blocks != size_blk)
+			printf("%s ENTER %lu to IEF set ief:%lld "
+			       "size_blk:%lld used_blk:%lld\n",
+			       __func__, stat->st_ino, ief_blocks,
+			       size_blk, used_blk);
+
 		/*
 		 * Even if some extents belong to IEF cluster, it is not a good
 		 * idea to relocate the whole file. From other point of view,
@@ -1201,11 +1213,17 @@ static int scan_inode_pass3(struct defrag_context *dfx, int fd,
 			       __func__, stat->st_ino, ief_blocks,
 			       size_blk, used_blk);
 		}
+		if (debug_flag & DBG_SCAN && ief_blocks != size_blk)
+			printf("%s ENTER %lu to IEF set ief:%lld "
+			       "size_blk:%lld used_blk:%lld fl:%lx\n",
+			       __func__, stat->st_ino, ief_blocks,
+			       size_blk, used_blk, ino_flags);
+
 	}
 	if (ino_flags & SP_FL_IEF_RELOC) {
 		struct stat dst;
-		struct rb_fhandle *rbfh;
+		struct rb_fhandle *rbfh = NULL;
 		/* FIXME: Is it any better way to find directory inode num? */
 		ret = fstat(dirfd, );
 		if (!ret && ino_grp == e4d_group_of_ino(dfx, dst.st_ino))
@@ -1456,7 +1474,7 @@ static int ief_defrag_prep_one(struct defrag_context *dfx, dgrp_t group,
 	if (fhandle->flags & SP_FL_LOCAL)
 		dfx->group[group]->ief_local++;
 
-	if (debug_flag & DBG_SCAN)
+	if (debug_flag & (DBG_SCAN | DBG_IEF))
 		printf("%s Check inode %lu flags:%x, OK...\n",
 		       __func__, stat->st_ino, fhandle->flags);
 
@@ -1603,7 +1621,9 @@ static void pass3_prep(struct defrag_context *dfx)
 	__u64 clusters_to_move = 0;
 	unsigned used = 0;
 	unsigned good = 0;
+	unsigned mdata = 0;
 	unsigned count = 0;
+	unsigned found = 0;
 	unsigned ief_ok = 0;
 	unsigned force_reloc = 0;
 
@@ -1620,7 +1640,7 @@ static void pass3_prep(struct defrag_context *dfx)
 			ex->flags |= SP_FL_FULL;
 
 		cluster = (ex->start + ex->count) & cluster_mask;
-		if (debug_flag & DBG_TREE)
+		if (debug_flag & DBG_CLUSTER)
 			print_spex("\t\t\t", ex);
 
 		if (prev_cluster != cluster) {
@@ -1645,7 +1665,7 @@ static void pass3_prep(struct defrag_context *dfx)
 					se->flags |= SP_FL_IEF_RELOC;
 					if (force_reloc)
 						se->flags |=
[Devel] [PATCH 6/6] e4defrag2: fix collapse inode index tree issue
Signed-off-by: Dmitry Monakhov
---
 misc/e4defrag2.c | 68 +++--
 1 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 7aab2b4..d351965 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -242,6 +242,7 @@ struct fmap_extent_cache
 {
 	unsigned fec_size;	/* map array size */
 	unsigned fec_extents;	/* number of valid entries */
+	struct fmap_extent *fec_xattr;
 	struct fmap_extent fec_map[];
 };
 
@@ -252,6 +253,9 @@ struct fmap_extent_stat
 	unsigned group;		/* Number of groups, counter is speculative */
 	unsigned local_ex;	/* Number of extents from the same group as inode */
 	unsigned local_sz;	/* Total len of local extents */
+	unsigned nr_idx;	/* Number of index blocks */
+	__u64	xattr;		/* xattr phys block */
+
 };
 
 /* Used space and integral inode usage stats */
@@ -750,9 +754,10 @@ static int __get_inode_fiemap(struct defrag_context *dfx, int fd,
 		(*fec)->fec_size = DEFAULT_FMAP_CACHE_SZ;
 		(*fec)->fec_extents = 0;
 	}
-	if (fest)
+	if (fest) {
 		memset(fest, 0 , sizeof(*fest));
-
+		fest->nr_idx = st->st_blocks >> (blksz_log - 9);
+	}
 	ext_buf = fiemap_buf->fm_extents;
 	memset(fiemap_buf, 0, fie_buf_size);
 	fiemap_buf->fm_length = FIEMAP_MAX_OFFSET;
@@ -791,6 +796,12 @@ static int __get_inode_fiemap(struct defrag_context *dfx, int fd,
 				fest->group++;
 				prev_blk_grp = blk_grp;
 			}
+			/* We are work on livefs so race is possible */
+			if (fest->nr_idx < len) {
+				ret = -1;
+				goto out;
+			}
+			fest->nr_idx -= len;
 		}
 
 		if ((*fec)->fec_extents && lblk == lblk_last && pblk == pblk_last) {
@@ -834,12 +845,36 @@ static int __get_inode_fiemap(struct defrag_context *dfx, int fd,
 		 */
 	} while (fiemap_buf->fm_mapped_extents == EXTENT_MAX_COUNT &&
 		 !(ext_buf[EXTENT_MAX_COUNT-1].fe_flags & FIEMAP_EXTENT_LAST));
+
+	/* get xattr block */
+	fiemap_buf->fm_flags |= FIEMAP_FLAG_XATTR;
+	fiemap_buf->fm_start = 0;
+	memset(ext_buf, 0, ext_buf_size);
+	ret = ioctl(fd, FS_IOC_FIEMAP, fiemap_buf);
+	if (ret < 0 || fiemap_buf->fm_mapped_extents == 0) {
+		if (debug_flag & DBG_FIEMAP) {
+			fprintf(stderr, "%s: Can't get xattr info for"
+				" inode:%ld ret:%d mapped:%d\n",
+				__func__, st->st_ino, ret,
+				fiemap_buf->fm_mapped_extents);
+		}
+		goto out;
+	}
+	if (!(ext_buf[0].fe_flags & FIEMAP_EXTENT_DATA_INLINE)) {
+		fest->xattr = ext_buf[i].fe_physical >> blksz_log;
+		if (fest->nr_idx)
+			ret = -1;
+
+		fest->nr_idx--;
+	}
 out:
 	/FIXME:DEBUG
-	if (debug_flag & DBG_FIEMAP && fest)
-		printf("%s fmap stat ino:%ld hole:%d frag:%d local_ex:%d local_sz:%d group:%d\n",
+	if ((debug_flag & DBG_FIEMAP) && fest)
+		printf("%s fmap stat ino:%ld hole:%d frag:%d local_ex:%d "
+		       "local_sz:%d group:%d nr_idx:%u xattr:%lld ret:%d\n",
 		       __func__, st->st_ino, fest->hole, fest->frag,
-		       fest->local_ex, fest->local_sz, fest->group);
+		       fest->local_ex, fest->local_sz, fest->group, fest->nr_idx,
+		       fest->xattr, ret);
 
 	free(fiemap_buf);
 
@@ -1134,7 +1169,6 @@ static int scan_inode_pass3(struct defrag_context *dfx, int fd,
 		ret = do_iaf_defrag_one(dfx, dirfd, name, stat, fec, );
 		if (!ret)
 			goto out;
-
 	}
 
 	if (stat->st_mtime < older_than)
@@ -1916,7 +1950,7 @@ static int prepare_donor(struct defrag_context *dfx, dgrp_t group,
 		printf("%s grp:%u donor_fd:%d blocks:%llu frag:%u\n",
 		       __func__, group, donor->fd, blocks, max_frag);
 	}
-	assert(blocks);
+	assert(blocks && max_frag);
 
 	/* First try to reuse existing donor if available */
 	if (donor->fd != -1) {
@@ -1954,23 +1988,28 @@ static int check_iaf(struct defrag_context *dfx, struct stat64 *stat,
 	__u64 eof_lblk;
 	FIXME free_space_average should be tunable
 	__u64 free_space_average = 64;
+	__u32 meta_blocks;
 	int ret = 1;
 
 	if (!S_ISREG(stat->st_mode))
 		ret = 0;
-
[Devel] [PATCH 5/6] e4defrag2: prevent aggressive donor lookup
It was a bad idea to try all dirs from all groups when looking for a donor,
especially on big filesystems. Let's scan only local ones.

Signed-off-by: Dmitry Monakhov
---
 misc/e4defrag2.c | 5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 771ee51..7aab2b4 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -1830,6 +1830,7 @@ static int do_find_donor(struct defrag_context *dfx, dgrp_t group,
 	int dir, i, ret = 0;
 	struct stat64 st;
 	dgrp_t donor_grp;
+	int dir_retries = 3;
 	unsigned char *raw_fh = dfx->group[group]->dir_rawh;
 	const char *dfname = ".e4defrag2_donor.tmp";
 
@@ -1896,7 +1897,7 @@ static int do_find_donor(struct defrag_context *dfx, dgrp_t group,
 try_next:
 		close(dir);
 		close_donor(donor);
-		if (ret)
+		if (ret || !dir_retries--)
 			return -1;
 	}
 
@@ -1934,7 +1935,7 @@ static int prepare_donor(struct defrag_context *dfx, dgrp_t group,
 		return -1;
 
 	/* Sequentially search groups and create first available */
-	for (i = 0; i < nr_groups; i++) {
+	for (i = 1; i < 16; i++) {
 		if (dfx->group[(group + i) % nr_groups]) {
 			ret = do_find_donor(dfx, (group + i) % nr_groups, donor,
 					    blocks, 0, max_frag);
-- 
1.7.1
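To make the new search bound concrete, here is a small sketch of the loop shape from the prepare_donor() hunk. It is not the e4defrag2 code: find_donor_group(), the callback type and accept_only() are hypothetical stand-ins for do_find_donor() and its arguments. The parts taken from the patch are the start at i = 1, the fixed limit of 16 and the (group + i) % nr_groups wrap-around.

```c
#include <stddef.h>

#define DONOR_SEARCH_LIMIT 16	/* from the patch: for (i = 1; i < 16; i++) */

/* Hypothetical callback: returns 0 if 'group' can supply a donor. */
typedef int (*try_group_fn)(unsigned group, void *ctx);

/*
 * Try at most DONOR_SEARCH_LIMIT - 1 groups that follow 'group',
 * wrapping at nr_groups. Returns the first group that works, or -1.
 */
int find_donor_group(unsigned group, unsigned nr_groups,
		     try_group_fn try_group, void *ctx)
{
	unsigned i;

	if (nr_groups == 0)
		return -1;

	for (i = 1; i < DONOR_SEARCH_LIMIT; i++) {
		unsigned candidate = (group + i) % nr_groups;

		if (try_group(candidate, ctx) == 0)
			return (int)candidate;
	}
	return -1;
}

/* Example callback for experiments: accept exactly one group number. */
int accept_only(unsigned group, void *ctx)
{
	return group == *(const unsigned *)ctx ? 0 : -1;
}
```

With 100 groups and a start group of 5, only groups 6 through 20 are examined, so the cost of a failed lookup no longer grows with the size of the filesystem.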
[Devel] [PATCH rh7] mm: writeback: do not check dirty limits for ub0
It's just a waste of time, because ub0 has no ub-specific dirty limits.
balance_dirty_pages handles ub0 case anyway.

Signed-off-by: Vladimir Davydov
---
 mm/page-writeback.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 9940d5fe7dcb..ba5f93a84fca 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1403,9 +1403,11 @@ static void balance_dirty_pages_ub(struct address_space *mapping,
 	unsigned long pages_written = 0;
 	unsigned long pause = 1;
 	struct user_beancounter *ub = get_io_ub();
-
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
+	if (ub == get_ub0())
+		return;
+
 	for (;;) {
 		unsigned long nr_to_write = write_chunk - pages_written;
 
-- 
2.1.4
[Devel] [PATCH rhel7] procfs: always expose /proc/<pid>/map_files/ and make it readable
This is a backport of commit

ML: bdb4d100afe9818aebd1d98ced575c5ef143456c

From: Calvin Owens

Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
only exposed if CONFIG_CHECKPOINT_RESTORE is set.

Each mapped file region gets a symlink in /proc/<pid>/map_files/
corresponding to the virtual address range at which it is mapped. The
symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them
to the backing file even if that backing file has been unlinked.

Currently, files which are mapped, unlinked, and closed are impossible to
stat() from userspace. Exposing /proc/<pid>/map_files/ closes this
functionality "hole".

Not being able to stat() such files makes noticing and explicitly
accounting for the space they use on the filesystem impossible. You can
work around this by summing up the space used by every file in the
filesystem and subtracting that total from what statfs() tells you, but
that obviously isn't great, and it becomes unworkable once your filesystem
becomes large enough.

This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
adjusts the permissions enforced on it as follows:

* proc_map_files_lookup()
* proc_map_files_readdir()
* map_files_d_revalidate()

	Remove the CAP_SYS_ADMIN restriction, leaving only the current
	restriction requiring PTRACE_MODE_READ. The information made
	available to userspace by these three functions is already
	available in /proc/PID/maps with MODE_READ, so I don't see any
	reason to limit them any further (see below for more detail).

* proc_map_files_follow_link()

	This stub has been added, and requires that the user have
	CAP_SYS_ADMIN in order to follow the links in map_files/,
	since there was concern on LKML both about the potential for
	bypassing permissions on ancestor directories in the path to
	files pointed to, and about what happens with more exotic
	memory mappings created by some drivers (ie dma-buf).

In older versions of this patch, I changed every permission check in
the four functions above to enforce MODE_ATTACH instead of MODE_READ.
This was an oversight on my part, and after revisiting the discussion
it seems that nobody was concerned about anything outside of what is
made possible by ->follow_link(). So in this version, I've left the
checks for PTRACE_MODE_READ as-is.

[a...@linux-foundation.org: catch up with concurrent proc_pid_follow_link()
changes]
Signed-off-by: Calvin Owens
Reviewed-by: Kees Cook
Cc: Andy Lutomirski
Cc: Cyrill Gorcunov
Cc: Joe Perches
Cc: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Cyrill Gorcunov
---

Kostya, please wait for Ack from Andrew. The patch on its own is not
tied to any of the bugs we're working on now, but it is useful in general
and will probably help us with renaming of memfd restored memory
in criu (we use memfd to be able to restore anonymous shared memory
in the userns case, but memfd mangles the backend name; we haven't found
any problem with it yet, but I've been talking to Andrew and he agreed
that we might need to do something about this problem, and this patch
is a first step).

 fs/proc/base.c | 44 +++-
 1 file changed, 23 insertions(+), 21 deletions(-)

Index: linux-pcs7.git/fs/proc/base.c
===
--- linux-pcs7.git.orig/fs/proc/base.c
+++ linux-pcs7.git/fs/proc/base.c
@@ -1925,8 +1925,6 @@ end_instantiate:
 	return filldir(dirent, name, len, filp->f_pos, ino, type);
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
-
 /*
  * dname_to_vma_addr - maps a dentry name into two unsigned longs
  * which represent vma start and end addresses.
@@ -1953,11 +1951,6 @@ static int map_files_d_revalidate(struct
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
 
-	if (!capable(CAP_SYS_ADMIN)) {
-		status = -EPERM;
-		goto out_notask;
-	}
-
 	inode = dentry->d_inode;
 	task = get_proc_task(inode);
 	if (!task)
@@ -2048,6 +2041,28 @@ struct map_files_info {
 	unsigned char name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
 };
 
+/*
+ * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
+ * symlinks may be used to bypass permissions on ancestor directories in the
+ * path to the file in question.
+ */
+static void *proc_map_files_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	return proc_pid_follow_link(dentry, nd);
+}
+
+/*
+ * Identical to proc_pid_link_inode_operations except for follow_link()
+ */
+static const struct inode_operations
[Devel] [RH6 PATCH] [MS] ext4: collapse a single extent tree block into the inode if possible
Backport ecb94f5fdf4b72547fca022421a9dca1672bddd4

This patch is required for a sane defragmentation procedure.

https://jira.sw.ru/browse/PSBM-46563

#ORIG_MSG:
[PATCH] ext4: collapse a single extent tree block into the inode if possible

If an inode has more than 4 extents, but then later some of the
extents are merged together, we can optimize the file system by moving
the extents up into the inode, and discarding the extent tree block.
This is important, because if there are a large number of inodes with
an external extent tree blocks where the contents could fit in the
inode, this can significantly increase the fsck time of the file
system.

Google-Bug-Id: 6801242

Signed-off-by: "Theodore Ts'o"
Signed-off-by: Dmitry Monakhov

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 85c4d4e..5eba717 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1668,10 +1668,54 @@ static int ext4_ext_try_to_merge_right(struct inode *inode,
 }
 
 /*
+ * This function does a very simple check to see if we can collapse
+ * an extent tree with a single extent tree leaf block into the inode.
+ */
+static void ext4_ext_try_to_merge_up(handle_t *handle,
+				     struct inode *inode,
+				     struct ext4_ext_path *path)
+{
+	size_t s;
+	unsigned max_root = ext4_ext_space_root(inode, 0);
+	ext4_fsblk_t blk;
+
+	if ((path[0].p_depth != 1) ||
+	    (le16_to_cpu(path[0].p_hdr->eh_entries) != 1) ||
+	    (le16_to_cpu(path[1].p_hdr->eh_entries) > max_root))
+		return;
+
+	/*
+	 * We need to modify the block allocation bitmap and the block
+	 * group descriptor to release the extent tree block. If we
+	 * can't get the journal credits, give up.
+	 */
+	if (ext4_journal_extend(handle, 2))
+		return;
+
+	/*
+	 * Copy the extent data up to the inode
+	 */
+	blk = ext4_idx_pblock(path[0].p_idx);
+	s = le16_to_cpu(path[1].p_hdr->eh_entries) *
+		sizeof(struct ext4_extent_idx);
+	s += sizeof(struct ext4_extent_header);
+
+	memcpy(path[0].p_hdr, path[1].p_hdr, s);
+	path[0].p_depth = 0;
+	path[0].p_ext = EXT_FIRST_EXTENT(path[0].p_hdr) +
+		(path[1].p_ext - EXT_FIRST_EXTENT(path[1].p_hdr));
+	path[0].p_hdr->eh_max = cpu_to_le16(max_root);
+
+	brelse(path[1].p_bh);
+	ext4_free_blocks(handle, inode, blk, 1, EXT4_FREE_BLOCKS_METADATA);
+}
+
+/*
  * This function tries to merge the @ex extent to neighbours in the tree.
  * return 1 if merge left else 0.
  */
-static int ext4_ext_try_to_merge(struct inode *inode,
+static int ext4_ext_try_to_merge(handle_t *handle,
+				  struct inode *inode,
 				  struct ext4_ext_path *path,
 				  struct ext4_extent *ex)
 {
 	struct ext4_extent_header *eh;
@@ -1687,8 +1731,9 @@ static int ext4_ext_try_to_merge(struct inode *inode,
 		merge_done = ext4_ext_try_to_merge_right(inode, path, ex - 1);
 
 	if (!merge_done)
-		ret = ext4_ext_try_to_merge_right(inode, path, ex);
+		ret = ext4_ext_try_to_merge_right(inode, path, ex);
 
+	ext4_ext_try_to_merge_up(handle, inode, path);
 	return ret;
 }
 
@@ -1897,7 +1942,7 @@ has_space:
 merge:
 	/* try to merge extents to the right */
 	if (!(flag & EXT4_GET_BLOCKS_DIO))
-		ext4_ext_try_to_merge(inode, path, nearex);
+		ext4_ext_try_to_merge(handle, inode, path, nearex);
 
 	/* try to merge extents to the left */
 
@@ -1906,7 +1951,7 @@ merge:
 	if (err)
 		goto cleanup;
 
-	err = ext4_ext_dirty(handle, inode, path + depth);
+	err = ext4_ext_dirty(handle, inode, path + path->p_depth);
 
 cleanup:
 	if (npath) {
@@ -2878,9 +2923,9 @@ static int ext4_split_extent_at(handle_t *handle,
 		ext4_ext_mark_initialized(ex);
 
 		if (!(flags & EXT4_GET_BLOCKS_DIO))
-			ext4_ext_try_to_merge(inode, path, ex);
+			ext4_ext_try_to_merge(handle, inode, path, ex);
 
-		err = ext4_ext_dirty(handle, inode, path + depth);
+		err = ext4_ext_dirty(handle, inode, path + path->p_depth);
 		goto out;
 	}
 
@@ -2894,7 +2939,7 @@ static int ext4_split_extent_at(handle_t *handle,
 	 * path may lead to new leaf, not to original leaf any more
 	 * after ext4_ext_insert_extent() returns,
 	 */
-	err = ext4_ext_dirty(handle, inode, path + depth);
+	err = ext4_ext_dirty(handle, inode, path + path->p_depth);
 	if (err)
 		goto fix_extent_len;
 
@@ -2912,8 +2957,8 @@ static int ext4_split_extent_at(handle_t *handle,
 		goto fix_extent_len;
 
 	/* update the extent length and mark as
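As a footnote to the "more than 4 extents" wording in the commit message, the arithmetic behind ext4_ext_try_to_merge_up() can be checked in userspace. This is a sketch with names of my own choosing, not kernel code. It assumes the usual on-disk sizes: a 60-byte i_block area in the inode, a 12-byte extent header, and 12 bytes per extent or index entry.

```c
#include <stdbool.h>
#include <stddef.h>

enum {
	I_BLOCK_BYTES    = 60,	/* i_block area inside the inode */
	EXT_HEADER_BYTES = 12,	/* sizeof(struct ext4_extent_header) */
	EXT_ENTRY_BYTES  = 12,	/* sizeof(struct ext4_extent), same for _idx */
};

/* Entries that fit in the inode next to the header. */
unsigned max_root_entries(void)
{
	return (I_BLOCK_BYTES - EXT_HEADER_BYTES) / EXT_ENTRY_BYTES;
}

/*
 * Same three conditions as the early return in the patch, inverted:
 * the tree has depth 1, the root holds a single index entry, and the
 * only leaf is small enough to fit in the inode.
 */
bool can_collapse(unsigned root_depth, unsigned root_entries,
		  unsigned leaf_entries)
{
	return root_depth == 1 && root_entries == 1 &&
	       leaf_entries <= max_root_entries();
}

/* Bytes copied from the leaf block into the inode (header + entries). */
size_t collapse_copy_bytes(unsigned leaf_entries)
{
	return EXT_HEADER_BYTES + (size_t)leaf_entries * EXT_ENTRY_BYTES;
}
```

Under these assumptions the limit comes out at 4 entries, so a leaf that has been merged down to 4 extents copies exactly 60 bytes and fills the i_block area, after which the external block can be freed.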