Re: [Devel] [PATCH rhel7] procfs: always expose /proc/<pid>/map_files/ and make it readable

2016-05-16 Thread Andrey Vagin
Acked-by: Andrey Vagin 

On Mon, May 16, 2016 at 11:28:51AM +0300, Cyrill Gorcunov wrote:
> This is a backport of commit
> 
> ML: bdb4d100afe9818aebd1d98ced575c5ef143456c
> 
> From: Calvin Owens 
> 
> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
> only exposed if CONFIG_CHECKPOINT_RESTORE is set.
> 
> Each mapped file region gets a symlink in /proc/<pid>/map_files/
> corresponding to the virtual address range at which it is mapped.  The
> symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them
> to the backing file even if that backing file has been unlinked.
> 
> Currently, files which are mapped, unlinked, and closed are impossible to
> stat() from userspace.  Exposing /proc/<pid>/map_files/ closes this
> functionality "hole".
> 
> Not being able to stat() such files makes noticing and explicitly
> accounting for the space they use on the filesystem impossible.  You can
> work around this by summing up the space used by every file in the
> filesystem and subtracting that total from what statfs() tells you, but
> that obviously isn't great, and it becomes unworkable once your filesystem
> becomes large enough.
> 
> This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
> adjusts the permissions enforced on it as follows:
> 
> * proc_map_files_lookup()
> * proc_map_files_readdir()
> * map_files_d_revalidate()
> 
>   Remove the CAP_SYS_ADMIN restriction, leaving only the current
>   restriction requiring PTRACE_MODE_READ. The information made
>   available to userspace by these three functions is already
>   available in /proc/PID/maps with MODE_READ, so I don't see any
>   reason to limit them any further (see below for more detail).
> 
> * proc_map_files_follow_link()
> 
>   This stub has been added, and requires that the user have
>   CAP_SYS_ADMIN in order to follow the links in map_files/,
>   since there was concern on LKML both about the potential for
>   bypassing permissions on ancestor directories in the path to
>   files pointed to, and about what happens with more exotic
>   memory mappings created by some drivers (ie dma-buf).
> 
> In older versions of this patch, I changed every permission check in
> the four functions above to enforce MODE_ATTACH instead of MODE_READ.
> This was an oversight on my part, and after revisiting the discussion
> it seems that nobody was concerned about anything outside of what is
> made possible by ->follow_link(). So in this version, I've left the
> checks for PTRACE_MODE_READ as-is.
> 
> [a...@linux-foundation.org: catch up with concurrent proc_pid_follow_link() changes]
> Signed-off-by: Calvin Owens 
> Reviewed-by: Kees Cook 
> Cc: Andy Lutomirski 
> Cc: Cyrill Gorcunov 
> Cc: Joe Perches 
> Cc: Kirill A. Shutemov 
> Signed-off-by: Andrew Morton 
> Signed-off-by: Linus Torvalds 
> Signed-off-by: Cyrill Gorcunov 
> ---
> 
> Kostya, please wait for an Ack from Andrew. The patch on its own is not
> tied to any of the bugs we're working on right now, but it is useful in
> general and will probably help us with renaming of memfd-restored memory
> in criu (we use memfd to be able to restore anonymous shared memory in
> the userns case, but memfd mangles the backend name; we haven't found
> any problem with it yet, but I've been talking to Andrew and he agreed
> that we might need to do something about this, and this patch is the
> first step).
> 
>  fs/proc/base.c |   44 +++-
>  1 file changed, 23 insertions(+), 21 deletions(-)
> 
> Index: linux-pcs7.git/fs/proc/base.c
> ===
> --- linux-pcs7.git.orig/fs/proc/base.c
> +++ linux-pcs7.git/fs/proc/base.c
> @@ -1925,8 +1925,6 @@ end_instantiate:
>   return filldir(dirent, name, len, filp->f_pos, ino, type);
>  }
>  
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> -
>  /*
>   * dname_to_vma_addr - maps a dentry name into two unsigned longs
>   * which represent vma start and end addresses.
> @@ -1953,11 +1951,6 @@ static int map_files_d_revalidate(struct
>   if (flags & LOOKUP_RCU)
>   return -ECHILD;
>  
> - if (!capable(CAP_SYS_ADMIN)) {
> - status = -EPERM;
> - goto out_notask;
> - }
> -
>   inode = dentry->d_inode;
>   task = get_proc_task(inode);
>   if (!task)
> @@ -2048,6 +2041,28 @@ struct map_files_info {
>   unsigned char   name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
>  };
>  
> +/*
> + * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
> + * symlinks may be used to bypass permissions on ancestor directories in the
> + * path to the file in question.
> + */
> +static void *proc_map_files_follow_link(struct 

[Devel] [NEW KERNEL] 3.10.0-327.18.2.vz7.14.3 (rhel7)

2016-05-16 Thread builder
Changelog:

OpenVZ kernel rh7-3.10.0-327.18.2.vz7.14.3

* technical rebuild of vz7.14.1 kernel


Generated changelog:

* Mon May 16 2016 Konstantin Khorenko  
[3.10.0-327.18.2.vz7.14.3]


Built packages: 
http://kojistorage.eng.sw.ru/packages/vzkernel/3.10.0/327.18.2.vz7.14.3/


[Devel] [NEW KERNEL] 3.10.0-327.18.2.vz7.14.2 (rhel7)

2016-05-16 Thread builder
Changelog:

OpenVZ kernel rh7-3.10.0-327.18.2.vz7.14.2

* technical rebuild of vz7.14.1


Generated changelog:

* Mon May 16 2016 Konstantin Khorenko  
[3.10.0-327.18.2.vz7.14.2]


Built packages: 
http://kojistorage.eng.sw.ru/packages/vzkernel/3.10.0/327.18.2.vz7.14.2/


[Devel] [PATCH 2/6] e4defrag2: [TP case] force defrag for very low populated clusters

2016-05-16 Thread Dmitry Monakhov
If a cluster has only a small number of blocks in use, it is reasonable to
relocate those blocks regardless of the inode's quality and free the whole cluster.
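
For illustration only (hypothetical names, not part of the patch), the
criterion boils down to comparing a cluster's used block count against a
scaled threshold:

/*
 * Sketch of the thin-provisioning heuristic: force a cluster into the
 * relocation set when its used block count is far below the cluster
 * size, scaled by a tunable weight.
 */
static int tp_force_relocate(unsigned cluster_size, unsigned used_blocks,
                             unsigned tp_weight_scale)
{
        /* e.g. tp_weight_scale == 16 means "relocate if under 1/16 full" */
        return cluster_size >= used_blocks * tp_weight_scale;
}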


https://jira.sw.ru/browse/PSBM-46563

Signed-off-by: Dmitry Monakhov 
---
 misc/e4defrag2.c |   54 +-
 1 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 797a342..9206c89 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -279,6 +279,7 @@ enum spext_flags
SP_FL_DIRLOCAL = 0x20,
SP_FL_CSUM = 0x40,
SP_FL_FMAP = 0x80,
+   SP_FL_TP_RELOC = 0x100,
 };
 
 struct rb_fhandle
@@ -383,6 +384,7 @@ struct defrag_context
unsignedcluster_size;
unsignedief_reloc_cluster;
unsignedweight_scale;
+   unsignedtp_weight_scale;
unsignedextents_quality;
 };
 
@@ -1098,6 +1100,7 @@ static int scan_inode_pass3(struct defrag_context *dfx, 
int fd,
int is_old = 0;
int is_rdonly = 0;
__u64 ief_blocks = 0;
+   __u64 tp_blocks = 0;
__u32 ino_flags = 0;
__u64 size_blk = dfx_sz2b(dfx, stat->st_size);
__u64 used_blk = dfx_sz2b(dfx, stat->st_blocks << 9);
@@ -1158,13 +1161,16 @@ static int scan_inode_pass3(struct defrag_context *dfx, 
int fd,
}
if (se->flags & SP_FL_IEF_RELOC)
ief_blocks += fec->fec_map[i].len;
+   if (se->flags & SP_FL_TP_RELOC)
+   tp_blocks += fec->fec_map[i].len;
+
fmap_csum_ext(fec->fec_map + i, );
}
 
if (fest.local_ex == fec->fec_extents)
ino_flags |= SP_FL_LOCAL;
 
-   if (ief_blocks) {
+   if (ief_blocks || tp_blocks) {
/*
 * Even if some extents belong to IEF cluster, it is not a good
 * idea to relocate the whole file. From other point of view,
@@ -1182,6 +1188,13 @@ static int scan_inode_pass3(struct defrag_context *dfx, 
int fd,
   "size_blk:%lld used_blk:%lld\n",
   __func__, stat->st_ino, ief_blocks,
   size_blk, used_blk);
+   } else if (tp_blocks * 4 > size_blk) {
+   ino_flags |= SP_FL_IEF_RELOC | SP_FL_TP_RELOC;
+   if (debug_flag & DBG_SCAN && ief_blocks != size_blk)
+   printf("%s Force add %lu to IEF/TP set ief:%lld 
"
+  "size_blk:%lld used_blk:%lld\n",
+  __func__, stat->st_ino, ief_blocks,
+  size_blk, used_blk);
} else if (debug_flag & DBG_SCAN) {
printf("%s Reject %lu from IEF set ief:%lld "
   "size_blk:%lld used_blk:%lld\n",
@@ -1592,6 +1605,7 @@ static void pass3_prep(struct defrag_context *dfx)
unsigned good = 0;
unsigned count = 0;
unsigned ief_ok = 0;
+   unsigned force_reloc = 0;
 
if (verbose)
printf("Pass3_prep:  Scan and rate cached extents\n");
@@ -1610,18 +1624,29 @@ static void pass3_prep(struct defrag_context *dfx)
print_spex("\t\t\t", ex);
 
if (prev_cluster != cluster) {
-   ief_ok = 0;
+   force_reloc = ief_ok = 0;
+   /* Is cluster has enough RO(good) data blocks ?*/
if (dfx->cluster_size  >= used * dfx->weight_scale &&
-   good * 1000 >= count * dfx->extents_quality &&
-   cluster_node) {
+   good * 1000 >= count * dfx->extents_quality)
+   ief_ok = 1;
+
+   /* Thin provision corner case: If cluster has low number
+* of data blocks it should be relocated regardless to
+* block's quality in order to improve space efficency 
*/
+   if (dfx->cluster_size  >= used * dfx->tp_weight_scale) {
+   ief_ok = 1;
+   force_reloc = 1;
+   }
+
+   if (ief_ok && cluster_node) {
while (cluster_node != node) {
struct spextent *se =
node_to_spextent(cluster_node);
-   ief_ok = 1;
se->flags |= SP_FL_IEF_RELOC;
+   if (force_reloc)
+   se->flags |= SP_FL_TP_RELOC;
if (debug_flag & DBG_TREE)
   

[Devel] [PATCH 1/6] e4defrag2: improve debugging

2016-05-16 Thread Dmitry Monakhov
Dump donor rejection reason.

Signed-off-by: Dmitry Monakhov 
---
 misc/e4defrag2.c |   14 ++
 1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 8ecae16..797a342 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -217,6 +217,7 @@ enum debug_flags {
DBG_FS = 0x10,
DBG_FIEMAP = 0x20,
DBG_BITMAP = 0x40,
+   DBG_ERR = 0x80,
 };
 
 /* The following macro is used for ioctl FS_IOC_FIEMAP
@@ -1740,10 +1741,14 @@ static int do_alloc_donor_space(struct defrag_context 
*dfx, dgrp_t group,
goto err;
}
TODO:  Checks are sufficient for good donor?
-   if (force_local && donor->fest.local_ex != fec->fec_extents)
+   if (force_local && donor->fest.local_ex != fec->fec_extents) {
+   ret = -2;
goto err;
-   if (donor->fest.frag > max_frag)
+   }
+   if (donor->fest.frag > max_frag) {
+   ret = -3;
goto err;
+   }
 
if (debug_flag & DBG_FS)
printf("%s: Create donor file is_local:%d blocks:%lld\n", 
__func__,
@@ -1754,11 +1759,12 @@ static int do_alloc_donor_space(struct defrag_context 
*dfx, dgrp_t group,
donor->fec = fec;
return 0;
 err:
-   if (debug_flag & DBG_RT)
+   if (debug_flag & DBG_ERR)
printf("%s:%d REJECT donor grp:%u donor_fd:%d blocks:%llu 
local:%d frag:%u ret:%d\n",
-  __func__, __LINE__,  group, donor->fd, blocks, 
force_local, max_frag, -1);
+  __func__, __LINE__,  group, donor->fd, blocks, 
force_local, max_frag, ret);
 
free(fec);
+
return -1;
 }
 
-- 
1.7.1



[Devel] [PATCH 4/6] ext4defrag2: add on/off forcelocal option

2016-05-16 Thread Dmitry Monakhov

Signed-off-by: Dmitry Monakhov 
---
 misc/e4defrag2.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 0ca7a63..771ee51 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -2516,7 +2516,7 @@ int main(int argc, char *argv[])
add_error_table(_ext2_error_table);
gettimeofday(_start, 0);
 
-   while ((c = getopt(argc, argv, "a:C:c:d:fF:hlmnt:s:S:T:vq:")) != EOF) {
+   while ((c = getopt(argc, argv, "a:C:c:d:fF:hl:mnt:s:S:T:vq:")) != EOF) {
switch (c) {
case 'a':
min_frag_size = strtoul(optarg, , 0);
@@ -2572,7 +2572,7 @@ int main(int argc, char *argv[])
usage();
break;
case 'l':
-   dfx.ief_force_local = 1;
+   dfx.ief_force_local = !!strtoul(optarg, , 0);
break;
 
case 'n':
-- 
1.7.1



[Devel] [PATCH 3/6] ext4defrag2: improve statistics configuration

2016-05-16 Thread Dmitry Monakhov

Signed-off-by: Dmitry Monakhov 
---
 misc/e4defrag2.c |   85 +
 1 files changed, 65 insertions(+), 20 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 9206c89..0ca7a63 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -218,6 +218,10 @@ enum debug_flags {
DBG_FIEMAP = 0x20,
DBG_BITMAP = 0x40,
DBG_ERR = 0x80,
+   DBG_CLUSTER = 0x100,
+   DBG_TAG = 0x200,
+   DBG_IAF = 0x400,
+   DBG_IEF = 0x800,
 };
 
 /* The following macro is used for ioctl FS_IOC_FIEMAP
@@ -903,7 +907,7 @@ static int group_add_ief_candidate(struct defrag_context 
*dfx, int dirfd, const
fhp->handle_bytes = dfx->root_fhp->handle_bytes;
ret = name_to_handle_at(dirfd, name, fhp, , 0);
if (ret) {
-   if (debug_flag & DBG_SCAN)
+   if (debug_flag & (DBG_SCAN|DBG_IEF))
fprintf(stderr, "Unexpected result from 
name_to_handle_at()\n");
goto free_fh;
}
@@ -916,7 +920,7 @@ static int group_add_ief_candidate(struct defrag_context 
*dfx, int dirfd, const
 
if (insert_fhandle(>group[group]->fh_root, >node)) {
/* Inode is already in the list, likely nlink > 1 */
-   if (debug_flag & DBG_SCAN)
+   if (debug_flag & (DBG_SCAN|DBG_IEF))
fprintf(stderr, "File is already in the list, nlink > 
1,"
" Not an error\n");
ext2fs_free_mem();
@@ -1127,8 +1131,10 @@ static int scan_inode_pass3(struct defrag_context *dfx, 
int fd,
goto out;
 
group_add_dircache(dfx, dirfd, , ".");
-   do_iaf_defrag_one(dfx, dirfd, name, stat, fec, );
-   goto out;
+   ret = do_iaf_defrag_one(dfx, dirfd, name, stat, fec, );
+   if (!ret)
+   goto out;
+   
}
 
if (stat->st_mtime  < older_than)
@@ -1171,6 +1177,12 @@ static int scan_inode_pass3(struct defrag_context *dfx, 
int fd,
ino_flags |= SP_FL_LOCAL;
 
if (ief_blocks || tp_blocks) {
+   if (debug_flag & DBG_SCAN && ief_blocks != size_blk)
+   printf("%s ENTER %lu to IEF set ief:%lld "
+  "size_blk:%lld used_blk:%lld\n",
+  __func__, stat->st_ino, ief_blocks,
+  size_blk, used_blk);
+
/*
 * Even if some extents belong to IEF cluster, it is not a good
 * idea to relocate the whole file. From other point of view,
@@ -1201,11 +1213,17 @@ static int scan_inode_pass3(struct defrag_context *dfx, 
int fd,
   __func__, stat->st_ino, ief_blocks,
   size_blk, used_blk);
}
+   if (debug_flag & DBG_SCAN && ief_blocks != size_blk)
+   printf("%s ENTER %lu to IEF set ief:%lld "
+  "size_blk:%lld used_blk:%lld fl:%lx\n",
+  __func__, stat->st_ino, ief_blocks,
+  size_blk, used_blk, ino_flags);
+
}
 
if (ino_flags & SP_FL_IEF_RELOC) {
struct stat dst;
-   struct rb_fhandle *rbfh;
+   struct rb_fhandle *rbfh = NULL;
/* FIXME: Is it any better way to find directory inode num? */
ret = fstat(dirfd, );
if (!ret && ino_grp ==  e4d_group_of_ino(dfx, dst.st_ino))
@@ -1456,7 +1474,7 @@ static int ief_defrag_prep_one(struct defrag_context 
*dfx, dgrp_t group,
if (fhandle->flags & SP_FL_LOCAL)
dfx->group[group]->ief_local++;
 
-   if (debug_flag & DBG_SCAN)
+   if (debug_flag & (DBG_SCAN | DBG_IEF))
printf("%s Check inode %lu flags:%x, OK...\n",
   __func__, stat->st_ino, fhandle->flags);
 
@@ -1603,7 +1621,9 @@ static void pass3_prep(struct defrag_context *dfx)
__u64 clusters_to_move = 0;
unsigned used = 0;
unsigned good = 0;
+   unsigned mdata = 0;
unsigned count = 0;
+   unsigned found = 0;
unsigned ief_ok = 0;
unsigned force_reloc = 0;
 
@@ -1620,7 +1640,7 @@ static void pass3_prep(struct defrag_context *dfx)
ex->flags |= SP_FL_FULL;
cluster = (ex->start + ex->count) & cluster_mask;
 
-   if (debug_flag & DBG_TREE)
+   if (debug_flag & DBG_CLUSTER)
print_spex("\t\t\t", ex);
 
if (prev_cluster != cluster) {
@@ -1645,7 +1665,7 @@ static void pass3_prep(struct defrag_context *dfx)
se->flags |= SP_FL_IEF_RELOC;
if (force_reloc)
se->flags |= 

[Devel] [PATCH 6/6] e4defrag2: fix collapse inode index tree issue

2016-05-16 Thread Dmitry Monakhov

Signed-off-by: Dmitry Monakhov 
---
 misc/e4defrag2.c |   68 +++--
 1 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 7aab2b4..d351965 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -242,6 +242,7 @@ struct fmap_extent_cache
 {
unsigned fec_size;  /* map array size */
unsigned fec_extents;   /* number of valid entries */
+   struct fmap_extent *fec_xattr;
struct fmap_extent fec_map[];
 };
 
@@ -252,6 +253,9 @@ struct fmap_extent_stat
unsigned group; /* Number of groups, counter is speculative */
unsigned local_ex; /* Number of extents from  the same group as inode */
unsigned local_sz; /* Total len of local extents */
+   unsigned nr_idx; /* Number of index blocks */
+   __u64xattr; /* xattr phys block */
+
 };
 
 /* Used space and integral inode usage stats */
@@ -750,9 +754,10 @@ static int __get_inode_fiemap(struct defrag_context *dfx, 
int fd,
(*fec)->fec_size = DEFAULT_FMAP_CACHE_SZ;
(*fec)->fec_extents = 0;
}
-   if (fest)
+   if (fest) {
memset(fest, 0 , sizeof(*fest));
-
+   fest->nr_idx = st->st_blocks >> (blksz_log - 9);
+   }
ext_buf = fiemap_buf->fm_extents;
memset(fiemap_buf, 0, fie_buf_size);
fiemap_buf->fm_length = FIEMAP_MAX_OFFSET;
@@ -791,6 +796,12 @@ static int __get_inode_fiemap(struct defrag_context *dfx, 
int fd,
fest->group++;
prev_blk_grp = blk_grp;
}
+   /* We are work on livefs so race is possible */
+   if (fest->nr_idx < len) {
+   ret = -1;
+   goto out;
+   }
+   fest->nr_idx -= len;
}
 
if ((*fec)->fec_extents && lblk == lblk_last && pblk == 
pblk_last) {
@@ -834,12 +845,36 @@ static int __get_inode_fiemap(struct defrag_context *dfx, 
int fd,
 */
} while (fiemap_buf->fm_mapped_extents == EXTENT_MAX_COUNT &&
 !(ext_buf[EXTENT_MAX_COUNT-1].fe_flags & FIEMAP_EXTENT_LAST));
+
+   /* get xattr block */
+   fiemap_buf->fm_flags |= FIEMAP_FLAG_XATTR;
+   fiemap_buf->fm_start = 0;
+   memset(ext_buf, 0, ext_buf_size);
+   ret = ioctl(fd, FS_IOC_FIEMAP, fiemap_buf);
+   if (ret < 0 || fiemap_buf->fm_mapped_extents == 0) {
+   if (debug_flag & DBG_FIEMAP) {
+   fprintf(stderr, "%s: Can't get xattr info for"
+   " inode:%ld ret:%d mapped:%d\n",
+   __func__, st->st_ino, ret,
+   fiemap_buf->fm_mapped_extents);
+   }
+   goto out;
+   }
+   if (!(ext_buf[0].fe_flags & FIEMAP_EXTENT_DATA_INLINE)) {
+   fest->xattr = ext_buf[i].fe_physical >> blksz_log;
+   if (fest->nr_idx)
+   ret = -1;
+
+   fest->nr_idx--;
+   }
 out:
/FIXME:DEBUG
-   if (debug_flag & DBG_FIEMAP && fest)
-   printf("%s fmap stat ino:%ld hole:%d frag:%d local_ex:%d 
local_sz:%d group:%d\n",
+   if ((debug_flag & DBG_FIEMAP) && fest)
+   printf("%s fmap stat ino:%ld hole:%d frag:%d local_ex:%d "
+  "local_sz:%d group:%d nr_idx:%u xattr:%lld ret:%d\n",
   __func__, st->st_ino, fest->hole, fest->frag,
-  fest->local_ex, fest->local_sz, fest->group);
+  fest->local_ex, fest->local_sz, fest->group, 
fest->nr_idx,
+  fest->xattr, ret);
 
free(fiemap_buf);
 
@@ -1134,7 +1169,6 @@ static int scan_inode_pass3(struct defrag_context *dfx, 
int fd,
ret = do_iaf_defrag_one(dfx, dirfd, name, stat, fec, );
if (!ret)
goto out;
-   
}
 
if (stat->st_mtime  < older_than)
@@ -1916,7 +1950,7 @@ static int prepare_donor(struct defrag_context *dfx, 
dgrp_t group,
printf("%s grp:%u donor_fd:%d blocks:%llu frag:%u\n",
   __func__, group, donor->fd, blocks, max_frag);
}
-   assert(blocks);
+   assert(blocks && max_frag);
 
/* First try to reuse existing donor if available */
if (donor->fd != -1) {
@@ -1954,23 +1988,28 @@ static int check_iaf(struct defrag_context *dfx, struct 
stat64 *stat,
__u64 eof_lblk;
 FIXME free_space_average should be tunable
__u64 free_space_average = 64;
+   __u32 meta_blocks;
int ret  = 1;
 
if (!S_ISREG(stat->st_mode))
ret = 0;
- 

[Devel] [PATCH 5/6] e4defrag2: prevent aggressive donor lookup

2016-05-16 Thread Dmitry Monakhov
It was a bad idea to try all directories from all groups when looking for a
donor, especially on big filesystems.
Let's scan only the local ones.
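
For illustration only (hypothetical names, not the e4defrag2 code), the idea
is to probe a small window of neighbouring groups instead of every group in
the filesystem:

/* Sketch of a bounded donor search over neighbouring block groups. */
static int find_donor_near(unsigned group, unsigned nr_groups,
                           int (*try_group)(unsigned grp))
{
        unsigned i;

        /* probe a bounded window of neighbouring groups only */
        for (i = 1; i < 16; i++) {
                if (try_group((group + i) % nr_groups) == 0)
                        return 0;       /* donor found */
        }
        return -1;                      /* give up rather than scan all groups */
}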

Signed-off-by: Dmitry Monakhov 
---
 misc/e4defrag2.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/misc/e4defrag2.c b/misc/e4defrag2.c
index 771ee51..7aab2b4 100644
--- a/misc/e4defrag2.c
+++ b/misc/e4defrag2.c
@@ -1830,6 +1830,7 @@ static int do_find_donor(struct defrag_context *dfx, 
dgrp_t group,
int dir, i, ret = 0;
struct stat64 st;
dgrp_t donor_grp;
+   int dir_retries = 3;
unsigned char *raw_fh = dfx->group[group]->dir_rawh;
const char *dfname = ".e4defrag2_donor.tmp";
 
@@ -1896,7 +1897,7 @@ static int do_find_donor(struct defrag_context *dfx, 
dgrp_t group,
try_next:
close(dir);
close_donor(donor);
-   if (ret)
+   if (ret || !dir_retries--)
return -1;
}
 
@@ -1934,7 +1935,7 @@ static int prepare_donor(struct defrag_context *dfx, 
dgrp_t group,
return -1;
 
/* Sequentially search groups and create first available */
-   for (i = 0; i < nr_groups; i++) {
+   for (i = 1; i < 16; i++) {
if (dfx->group[(group + i) % nr_groups]) {
ret = do_find_donor(dfx, (group + i) % nr_groups,
donor, blocks, 0, max_frag);
-- 
1.7.1



[Devel] [PATCH rh7] mm: writeback: do not check dirty limits for ub0

2016-05-16 Thread Vladimir Davydov
It's just a waste of time, because ub0 has no ub-specific dirty limits;
balance_dirty_pages() handles the ub0 case anyway.

Signed-off-by: Vladimir Davydov 
---
 mm/page-writeback.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 9940d5fe7dcb..ba5f93a84fca 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1403,9 +1403,11 @@ static void balance_dirty_pages_ub(struct address_space 
*mapping,
unsigned long pages_written = 0;
unsigned long pause = 1;
struct user_beancounter *ub = get_io_ub();
-
struct backing_dev_info *bdi = mapping->backing_dev_info;
 
+   if (ub == get_ub0())
+   return;
+
for (;;) {
unsigned long nr_to_write = write_chunk - pages_written;
 
-- 
2.1.4



[Devel] [PATCH rhel7] procfs: always expose /proc/<pid>/map_files/ and make it readable

2016-05-16 Thread Cyrill Gorcunov
This is a backport of commit

ML: bdb4d100afe9818aebd1d98ced575c5ef143456c

From: Calvin Owens 

Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
only exposed if CONFIG_CHECKPOINT_RESTORE is set.

Each mapped file region gets a symlink in /proc/<pid>/map_files/
corresponding to the virtual address range at which it is mapped.  The
symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them
to the backing file even if that backing file has been unlinked.

Currently, files which are mapped, unlinked, and closed are impossible to
stat() from userspace.  Exposing /proc/<pid>/map_files/ closes this
functionality "hole".

Not being able to stat() such files makes noticing and explicitly
accounting for the space they use on the filesystem impossible.  You can
work around this by summing up the space used by every file in the
filesystem and subtracting that total from what statfs() tells you, but
that obviously isn't great, and it becomes unworkable once your filesystem
becomes large enough.
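
As a rough userspace sketch of what the exposed directory allows (the PID and
address range below are made up for the example; with this patch, following
the link still requires CAP_SYS_ADMIN, so run it as root):

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
        struct stat st;
        /* hypothetical entry: <start>-<end> of a mapping in task 1234 */
        const char *link = "/proc/1234/map_files/400000-401000";

        if (stat(link, &st) == 0)       /* stat() follows the symlink */
                printf("backing file still occupies %lld 512-byte blocks\n",
                       (long long)st.st_blocks);
        return 0;
}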

This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
adjusts the permissions enforced on it as follows:

* proc_map_files_lookup()
* proc_map_files_readdir()
* map_files_d_revalidate()

Remove the CAP_SYS_ADMIN restriction, leaving only the current
restriction requiring PTRACE_MODE_READ. The information made
available to userspace by these three functions is already
available in /proc/PID/maps with MODE_READ, so I don't see any
reason to limit them any further (see below for more detail).

* proc_map_files_follow_link()

This stub has been added, and requires that the user have
CAP_SYS_ADMIN in order to follow the links in map_files/,
since there was concern on LKML both about the potential for
bypassing permissions on ancestor directories in the path to
files pointed to, and about what happens with more exotic
memory mappings created by some drivers (ie dma-buf).

In older versions of this patch, I changed every permission check in
the four functions above to enforce MODE_ATTACH instead of MODE_READ.
This was an oversight on my part, and after revisiting the discussion
it seems that nobody was concerned about anything outside of what is
made possible by ->follow_link(). So in this version, I've left the
checks for PTRACE_MODE_READ as-is.

[a...@linux-foundation.org: catch up with concurrent proc_pid_follow_link() changes]
Signed-off-by: Calvin Owens 
Reviewed-by: Kees Cook 
Cc: Andy Lutomirski 
Cc: Cyrill Gorcunov 
Cc: Joe Perches 
Cc: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Cyrill Gorcunov 
---

Kostya, please wait for an Ack from Andrew. The patch on its own is not
tied to any of the bugs we're working on right now, but it is useful in
general and will probably help us with renaming of memfd-restored memory
in criu (we use memfd to be able to restore anonymous shared memory in
the userns case, but memfd mangles the backend name; we haven't found
any problem with it yet, but I've been talking to Andrew and he agreed
that we might need to do something about this, and this patch is the
first step).

 fs/proc/base.c |   44 +++-
 1 file changed, 23 insertions(+), 21 deletions(-)

Index: linux-pcs7.git/fs/proc/base.c
===
--- linux-pcs7.git.orig/fs/proc/base.c
+++ linux-pcs7.git/fs/proc/base.c
@@ -1925,8 +1925,6 @@ end_instantiate:
return filldir(dirent, name, len, filp->f_pos, ino, type);
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
-
 /*
  * dname_to_vma_addr - maps a dentry name into two unsigned longs
  * which represent vma start and end addresses.
@@ -1953,11 +1951,6 @@ static int map_files_d_revalidate(struct
if (flags & LOOKUP_RCU)
return -ECHILD;
 
-   if (!capable(CAP_SYS_ADMIN)) {
-   status = -EPERM;
-   goto out_notask;
-   }
-
inode = dentry->d_inode;
task = get_proc_task(inode);
if (!task)
@@ -2048,6 +2041,28 @@ struct map_files_info {
unsigned char   name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
 };
 
+/*
+ * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
+ * symlinks may be used to bypass permissions on ancestor directories in the
+ * path to the file in question.
+ */
+static void *proc_map_files_follow_link(struct dentry *dentry, struct 
nameidata *nd)
+{
+   if (!capable(CAP_SYS_ADMIN))
+   return ERR_PTR(-EPERM);
+
+   return proc_pid_follow_link(dentry, nd);
+}
+
+/*
+ * Identical to proc_pid_link_inode_operations except for follow_link()
+ */
+static const struct inode_operations 

[Devel] [RH6 PATCH] [MS] ext4: collapse a single extent tree block into the inode if possible

2016-05-16 Thread Dmitry Monakhov

Backport ecb94f5fdf4b72547fca022421a9dca1672bddd4
This patch is required for a sane defragmentation procedure.
https://jira.sw.ru/browse/PSBM-46563
#ORIG_MSG:
[PATCH] ext4: collapse a single extent tree block into the inode if possible

If an inode has more than 4 extents, but then later some of the
extents are merged together, we can optimize the file system by moving
the extents up into the inode, and discarding the extent tree block.
This is important, because if there are a large number of inodes with
external extent tree blocks where the contents could fit in the
inode, this can significantly increase the fsck time of the file
system.
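
For reference, a minimal sketch of the precondition (a hypothetical helper,
not the kernel code; the real check in ext4_ext_try_to_merge_up() below
operates on the path[0]/path[1] extent headers):

/*
 * Collapsing is only legal when the tree is exactly one level deep,
 * the root holds a single index entry, and the leaf's extents fit
 * into the space available in the inode's own i_block[] area.
 */
static int can_collapse_into_inode(unsigned depth, unsigned root_entries,
                                   unsigned leaf_entries,
                                   unsigned max_root_entries)
{
        return depth == 1 && root_entries == 1 &&
               leaf_entries <= max_root_entries;
}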

Google-Bug-Id: 6801242

Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Dmitry Monakhov 

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 85c4d4e..5eba717 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1668,10 +1668,54 @@ static int ext4_ext_try_to_merge_right(struct inode 
*inode,
 }
 
 /*
+ * This function does a very simple check to see if we can collapse
+ * an extent tree with a single extent tree leaf block into the inode.
+ */
+static void ext4_ext_try_to_merge_up(handle_t *handle,
+struct inode *inode,
+struct ext4_ext_path *path)
+{
+   size_t s;
+   unsigned max_root = ext4_ext_space_root(inode, 0);
+   ext4_fsblk_t blk;
+
+   if ((path[0].p_depth != 1) ||
+   (le16_to_cpu(path[0].p_hdr->eh_entries) != 1) ||
+   (le16_to_cpu(path[1].p_hdr->eh_entries) > max_root))
+   return;
+
+   /*
+* We need to modify the block allocation bitmap and the block
+* group descriptor to release the extent tree block.  If we
+* can't get the journal credits, give up.
+*/
+   if (ext4_journal_extend(handle, 2))
+   return;
+
+   /*
+* Copy the extent data up to the inode
+*/
+   blk = ext4_idx_pblock(path[0].p_idx);
+   s = le16_to_cpu(path[1].p_hdr->eh_entries) *
+   sizeof(struct ext4_extent_idx);
+   s += sizeof(struct ext4_extent_header);
+
+   memcpy(path[0].p_hdr, path[1].p_hdr, s);
+   path[0].p_depth = 0;
+   path[0].p_ext = EXT_FIRST_EXTENT(path[0].p_hdr) +
+   (path[1].p_ext - EXT_FIRST_EXTENT(path[1].p_hdr));
+   path[0].p_hdr->eh_max = cpu_to_le16(max_root);
+
+   brelse(path[1].p_bh);
+   ext4_free_blocks(handle, inode, blk, 1, EXT4_FREE_BLOCKS_METADATA);
+}
+
+/*
  * This function tries to merge the @ex extent to neighbours in the tree.
  * return 1 if merge left else 0.
  */
-static int ext4_ext_try_to_merge(struct inode *inode,
+static int ext4_ext_try_to_merge(handle_t *handle,
+ struct inode *inode,
  struct ext4_ext_path *path,
  struct ext4_extent *ex) {
struct ext4_extent_header *eh;
@@ -1687,8 +1731,9 @@ static int ext4_ext_try_to_merge(struct inode *inode,
merge_done = ext4_ext_try_to_merge_right(inode, path, ex - 1);
 
if (!merge_done)
-   ret = ext4_ext_try_to_merge_right(inode, path, ex);
+   ret =  ext4_ext_try_to_merge_right(inode, path, ex);
 
+   ext4_ext_try_to_merge_up(handle, inode, path);
return ret;
 }
 
@@ -1897,7 +1942,7 @@ has_space:
 merge:
/* try to merge extents to the right */
if (!(flag & EXT4_GET_BLOCKS_DIO))
-   ext4_ext_try_to_merge(inode, path, nearex);
+   ext4_ext_try_to_merge(handle, inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -1906,7 +1951,7 @@ merge:
if (err)
goto cleanup;
 
-   err = ext4_ext_dirty(handle, inode, path + depth);
+   err = ext4_ext_dirty(handle, inode, path + path->p_depth);
 
 cleanup:
if (npath) {
@@ -2878,9 +2923,9 @@ static int ext4_split_extent_at(handle_t *handle,
ext4_ext_mark_initialized(ex);
 
if (!(flags & EXT4_GET_BLOCKS_DIO))
-   ext4_ext_try_to_merge(inode, path, ex);
+   ext4_ext_try_to_merge(handle, inode, path, ex);
 
-   err = ext4_ext_dirty(handle, inode, path + depth);
+   err = ext4_ext_dirty(handle, inode, path + path->p_depth);
goto out;
}
 
@@ -2894,7 +2939,7 @@ static int ext4_split_extent_at(handle_t *handle,
 * path may lead to new leaf, not to original leaf any more
 * after ext4_ext_insert_extent() returns,
 */
-   err = ext4_ext_dirty(handle, inode, path + depth);
+   err = ext4_ext_dirty(handle, inode, path + path->p_depth);
if (err)
goto fix_extent_len;
 
@@ -2912,8 +2957,8 @@ static int ext4_split_extent_at(handle_t *handle,
goto fix_extent_len;
/* update the extent length and mark as