Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

2016-04-29 Thread Steven Whitehouse

Hi,

On 29/04/16 06:35, NeilBrown wrote:

On Tue, Apr 26 2016, Michal Hocko wrote:


Hi,
we have discussed this topic at LSF/MM this year. There was a general
interest in the scope GFP_NOFS allocation context among some FS
developers. For those who are not aware of the discussion or the issue
I am trying to sort out (or at least start in that direction) please
have a look at patch 1 which adds memalloc_nofs_{save,restore} api
which basically copies what we have for the scope GFP_NOIO allocation
context. I haven't converted any of the FS myself because that is way
beyond my area of expertise but I would be happy to help with further
changes on the MM front as well as in some more generic code paths.

Dave had an idea on how to further improve the reclaim context to be
less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
and FS specific cookie set in the FS allocation context and consumed
by the FS reclaim context to allow doing some provably save actions
that would be skipped due to GFP_NOFS normally.  I like this idea and
I believe we can go that direction regardless of the approach taken here.
Many filesystems simply need to cleanup their NOFS usage first before
diving into a more complex changes.>

This strikes me as over-engineering to work around an unnecessarily
burdensome interface but without details it is hard to be certain.

Exactly what things happen in "FS reclaim context" which may, or may
not, be safe depending on the specific FS allocation context?  Do they
need to happen at all?

My research suggests that for most filesystems the only thing that
happens in reclaim context that is at all troublesome is the final
'evict()' on an inode.  This needs to flush out dirty pages and sync the
inode to storage.  Some time ago we moved most dirty-page writeout out
of the reclaim context and into kswapd.  I think this was an excellent
advance in simplicity.
If we could similarly move evict() into kswapd (and I believe we can)
then most file systems would do nothing in reclaim context that
interferes with allocation context.
evict() is an issue, but moving it into kswapd would be a potential 
problem for GFS2. We already have a memory allocation issue when 
recovery is taking place and memory is short. The code path is as follows:


 1. Inode is scheduled for eviction (which requires deallocation)
 2. The glock is required in order to perform the deallocation, which 
implies getting a DLM lock

 3. Another node in the cluster fails, so needs recovery
 4. When the DLM lock is requested, it gets blocked until recovery is 
complete (for the failed node)

 5. Recovery is performed using a userland fencing utility
 6. Fencing requires memory and then blocks on the eviction
 7. Deadlock (Fencing waiting on memory alloc, memory alloc waiting on 
DLM lock, DLM lock waiting on fencing)


It doesn't happen often, but we've been looking at the best place to 
break that cycle, and one of the things we've been wondering is whether 
we could avoid deallocation evictions from memory related contexts, or 
at least make it async somehow.


The issue is that it is not possible to know in advance whether an 
eviction will result in mearly writing things back to disk (because the 
inode is being dropped from cache, but still resides on disk) which is 
easy to do, or whether it requires a full deallocation (n_link==0) which 
may require significant resources and time,


Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Cluster-devel] [PATCH 16/18] gfs2: use generic posix ACL infrastructure

2013-12-04 Thread Steven Whitehouse
Hi,

On Sun, 2013-12-01 at 03:59 -0800, Christoph Hellwig wrote:
> plain text document attachment
> (0016-gfs2-use-generic-posix-ACL-infrastructure.patch)
> This contains some major refactoring for the create path so that
> inodes are created with the right mode to start with instead of
> fixing it up later.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/gfs2/acl.c   |  229 
> +++
>  fs/gfs2/acl.h   |4 +-
>  fs/gfs2/inode.c |   33 ++--
>  fs/gfs2/xattr.c |4 +-
>  4 files changed, 61 insertions(+), 209 deletions(-)
> 
Looks very good. I'd really like to be able to do something similar with
the security xattrs, in terms of the refactoring that at inode creation
to give the xattrs ahead of the inode allocation itself. That way it
should be possible to allocate the xattr blocks at the same time as the
inode, rather than as an after thought.

Some more comments below

> diff --git a/fs/gfs2/acl.c b/fs/gfs2/acl.c
> index e82e4ac..e6c7a2c 100644
> --- a/fs/gfs2/acl.c
> +++ b/fs/gfs2/acl.c
[snip]
> -
> -static int gfs2_xattr_system_set(struct dentry *dentry, const char *name,
> -  const void *value, size_t size, int flags,
> -  int xtype)
> -{
> - struct inode *inode = dentry->d_inode;
> - struct gfs2_sbd *sdp = GFS2_SB(inode);
> - struct posix_acl *acl = NULL;
> - int error = 0, type;
> -
> - if (!sdp->sd_args.ar_posix_acl)
> - return -EOPNOTSUPP;
> -
> - type = gfs2_acl_type(name);
> - if (type < 0)
> - return type;
> - if (flags & XATTR_CREATE)
> - return -EINVAL;
> - if (type == ACL_TYPE_DEFAULT && !S_ISDIR(inode->i_mode))
> - return value ? -EACCES : 0;
> - if (!uid_eq(current_fsuid(), inode->i_uid) && !capable(CAP_FOWNER))
> - return -EPERM;
> - if (S_ISLNK(inode->i_mode))
> - return -EOPNOTSUPP;
> -
> - if (!value)
> - goto set_acl;
>  
> - acl = posix_acl_from_xattr(&init_user_ns, value, size);
> - if (!acl) {
> - /*
> -  * acl_set_file(3) may request that we set default ACLs with
> -  * zero length -- defend (gracefully) against that here.
> -  */
> - goto out;
> - }
> - if (IS_ERR(acl)) {
> - error = PTR_ERR(acl);
> - goto out;
> - }
> -
> - error = posix_acl_valid(acl);
> - if (error)
> - goto out_release;
> -
> - error = -EINVAL;
>   if (acl->a_count > GFS2_ACL_MAX_ENTRIES)
> - goto out_release;
> + return -EINVAL;
>  
>   if (type == ACL_TYPE_ACCESS) {
>   umode_t mode = inode->i_mode;
> +
>   error = posix_acl_equiv_mode(acl, &mode);
> + if (error < 0)
>  
Andy Price has pointed out a missing "return error;" here

> - if (error <= 0) {
> - posix_acl_release(acl);
> + if (error == 0)
>   acl = NULL;
>  
> - if (error < 0)
> - return error;
> - }
> -

Also, there seems to be a white space error in the xfs patch around line
170 in fs/xfs/xfs_iops.c where there is an added "if (default_acl)" with
a space before the tab,

Steve.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Cluster-devel] [PATCH 16/18] gfs2: use generic posix ACL infrastructure

2013-12-11 Thread Steven Whitehouse
Hi,

On Wed, 2013-12-11 at 02:42 -0800, Christoph Hellwig wrote:
> plain text document attachment
> (0016-gfs2-use-generic-posix-ACL-infrastructure.patch)
> This contains some major refactoring for the create path so that
> inodes are created with the right mode to start with instead of
> fixing it up later.
> 
> Signed-off-by: Christoph Hellwig 
Acked-by: Steven Whitehouse 

A really nice clean up - this is a very useful step forward in
simplifying the create path. Thanks for sorting this out,

Steve.

> ---
>  fs/gfs2/acl.c   |  234 
> +++
>  fs/gfs2/acl.h   |4 +-
>  fs/gfs2/inode.c |   33 ++--
>  fs/gfs2/xattr.c |4 +-
>  4 files changed, 62 insertions(+), 213 deletions(-)
> 
> diff --git a/fs/gfs2/acl.c b/fs/gfs2/acl.c
> index e82e4ac..ba94566 100644
> --- a/fs/gfs2/acl.c
> +++ b/fs/gfs2/acl.c
> @@ -49,10 +49,6 @@ struct posix_acl *gfs2_get_acl(struct inode *inode, int 
> type)
>   if (!ip->i_eattr)
>   return NULL;
>  
> - acl = get_cached_acl(&ip->i_inode, type);
> - if (acl != ACL_NOT_CACHED)
> - return acl;
> -
>   name = gfs2_acl_name(type);
>   if (name == NULL)
>   return ERR_PTR(-EINVAL);
> @@ -80,7 +76,7 @@ static int gfs2_set_mode(struct inode *inode, umode_t mode)
>   return error;
>  }
>  
> -static int gfs2_acl_set(struct inode *inode, int type, struct posix_acl *acl)
> +int gfs2_set_acl(struct inode *inode, struct posix_acl *acl, int type)
>  {
>   int error;
>   int len;
> @@ -88,219 +84,49 @@ static int gfs2_acl_set(struct inode *inode, int type, 
> struct posix_acl *acl)
>   const char *name = gfs2_acl_name(type);
>  
>   BUG_ON(name == NULL);
> - len = posix_acl_to_xattr(&init_user_ns, acl, NULL, 0);
> - if (len == 0)
> - return 0;
> - data = kmalloc(len, GFP_NOFS);
> - if (data == NULL)
> - return -ENOMEM;
> - error = posix_acl_to_xattr(&init_user_ns, acl, data, len);
> - if (error < 0)
> - goto out;
> - error = __gfs2_xattr_set(inode, name, data, len, 0, GFS2_EATYPE_SYS);
> - if (!error)
> - set_cached_acl(inode, type, acl);
> -out:
> - kfree(data);
> - return error;
> -}
> -
> -int gfs2_acl_create(struct gfs2_inode *dip, struct inode *inode)
> -{
> - struct gfs2_sbd *sdp = GFS2_SB(&dip->i_inode);
> - struct posix_acl *acl;
> - umode_t mode = inode->i_mode;
> - int error = 0;
> -
> - if (!sdp->sd_args.ar_posix_acl)
> - return 0;
> - if (S_ISLNK(inode->i_mode))
> - return 0;
> -
> - acl = gfs2_get_acl(&dip->i_inode, ACL_TYPE_DEFAULT);
> - if (IS_ERR(acl))
> - return PTR_ERR(acl);
> - if (!acl) {
> - mode &= ~current_umask();
> - return gfs2_set_mode(inode, mode);
> - }
> -
> - if (S_ISDIR(inode->i_mode)) {
> - error = gfs2_acl_set(inode, ACL_TYPE_DEFAULT, acl);
> - if (error)
> - goto out;
> - }
> -
> - error = __posix_acl_create(&acl, GFP_NOFS, &mode);
> - if (error < 0)
> - return error;
>  
> - if (error == 0)
> - goto munge;
> -
> - error = gfs2_acl_set(inode, ACL_TYPE_ACCESS, acl);
> - if (error)
> - goto out;
> -munge:
> - error = gfs2_set_mode(inode, mode);
> -out:
> - posix_acl_release(acl);
> - return error;
> -}
> -
> -int gfs2_acl_chmod(struct gfs2_inode *ip, struct iattr *attr)
> -{
> - struct inode *inode = &ip->i_inode;
> - struct posix_acl *acl;
> - char *data;
> - unsigned int len;
> - int error;
> -
> - acl = gfs2_get_acl(&ip->i_inode, ACL_TYPE_ACCESS);
> - if (IS_ERR(acl))
> - return PTR_ERR(acl);
> - if (!acl)
> - return gfs2_setattr_simple(inode, attr);
> -
> - error = __posix_acl_chmod(&acl, GFP_NOFS, attr->ia_mode);
> - if (error)
> - return error;
> -
> - len = posix_acl_to_xattr(&init_user_ns, acl, NULL, 0);
> - data = kmalloc(len, GFP_NOFS);
> - error = -ENOMEM;
> - if (data == NULL)
> - goto out;
> - posix_acl_to_xattr(&init_user_ns, acl, data, len);
> - error = gfs2_xattr_acl_chmod(ip, attr, data);
> - kfree(data);
> - set_cached_acl(&ip->i_inode, ACL_TYPE_ACCESS, acl);
> -
> -out:
> - posix_acl_release(acl);
> - return error;
> -}
> -
> -static int gfs2_acl_type(const char *name)
&g

Re: [Cluster-devel] [PATCH v5 6/7] fs: pass iocb to generic_write_sync

2014-11-06 Thread Steven Whitehouse

Hi,

On 05/11/14 21:14, Milosz Tanski wrote:

From: Christoph Hellwig 

Clean up the generic_write_sync by just passing an iocb and a bytes
written / negative errno argument.  In addition to simplifying the
callers this also prepares for passing a per-operation O_DSYNC
flag.  Two callers didn't quite fit that scheme:

  - dio_complete didn't both to update ki_pos as we don't need it
on a iocb that is about to be freed, so we had to add it. Additionally
it also synced out written data in the error case, which has been
changed to operate like the other callers.
  - gfs2 also used generic_write_sync to implement a crude version
of fallocate.  It has been switched to use an open coded variant
instead.

Signed-off-by: Christoph Hellwig 

GFS2 bits:

Acked-by: Steven Whitehouse 

I know that Andy Price has some work in this area too, so in due course 
we'll have to be careful not to create a merge conflict here. Copying in 
Andy so he can see the changes,


Steve.


---
  fs/block_dev.c |  8 +---
  fs/btrfs/file.c|  7 ++-
  fs/cifs/file.c |  8 +---
  fs/direct-io.c |  8 ++--
  fs/ext4/file.c |  8 +---
  fs/gfs2/file.c |  9 +++--
  fs/ntfs/file.c |  8 ++--
  fs/udf/file.c  | 11 ++-
  fs/xfs/xfs_file.c  |  8 +---
  include/linux/fs.h |  8 +---
  mm/filemap.c   | 30 --
  11 files changed, 40 insertions(+), 73 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index cc9d411..c529b1c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1568,18 +1568,12 @@ static long block_ioctl(struct file *file, unsigned 
cmd, unsigned long arg)
   */
  ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
  {
-   struct file *file = iocb->ki_filp;
struct blk_plug plug;
ssize_t ret;
  
  	blk_start_plug(&plug);

ret = __generic_file_write_iter(iocb, from);
-   if (ret > 0) {
-   ssize_t err;
-   err = generic_write_sync(file, iocb->ki_pos - ret, ret);
-   if (err < 0)
-   ret = err;
-   }
+   ret = generic_write_sync(iocb, ret);
blk_finish_plug(&plug);
return ret;
  }
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a18ceab..4f4a6f7 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1820,11 +1820,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 */
BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
BTRFS_I(inode)->last_sub_trans = root->log_transid;
-   if (num_written > 0) {
-   err = generic_write_sync(file, pos, num_written);
-   if (err < 0)
-   num_written = err;
-   }
+
+   num_written = generic_write_sync(iocb, num_written);
  
  	if (sync)

atomic_dec(&BTRFS_I(inode)->sync_writers);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index c485afa..32359de 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2706,13 +2706,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
rc = __generic_file_write_iter(iocb, from);
mutex_unlock(&inode->i_mutex);
  
-		if (rc > 0) {

-   ssize_t err;
-
-   err = generic_write_sync(file, iocb->ki_pos - rc, rc);
-   if (err < 0)
-   rc = err;
-   }
+   rc = generic_write_sync(iocb, rc);
} else {
mutex_unlock(&inode->i_mutex);
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..b72ac83 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -257,12 +257,8 @@ static ssize_t dio_complete(struct dio *dio, loff_t 
offset, ssize_t ret,
inode_dio_done(dio->inode);
if (is_async) {
if (dio->rw & WRITE) {
-   int err;
-
-   err = generic_write_sync(dio->iocb->ki_filp, offset,
-transferred);
-   if (err < 0 && ret > 0)
-   ret = err;
+   dio->iocb->ki_pos = offset + transferred;
+   ret = generic_write_sync(dio->iocb, ret);
}
  
  		aio_complete(dio->iocb, ret, 0);

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index aca7b24..79b000c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -175,13 +175,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
ret = __generic_file_write_iter(iocb, from);
mutex_unlock(&inode->i_mutex);
  
-	if (ret > 0) {

-   ssize_t err;
-
-   err = generic_write_sync(file, iocb->ki_pos - ret, ret);
-   if (err < 0)
-   ret = err;
-   }
+   ret = generi

Re: [RFC v4+ hot_track 03/19] vfs: add I/O frequency update function

2012-11-05 Thread Steven Whitehouse
Hi,

On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote:
> From: Zhi Yong Wu 
> 
>   Add some util helpers to update access frequencies
> for one file or its range.
> 
> Signed-off-by: Zhi Yong Wu 
> ---
>  fs/hot_tracking.c|  179 
> ++
>  fs/hot_tracking.h|7 ++
>  include/linux/hot_tracking.h |2 +
>  3 files changed, 188 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
> index 68591f0..0a7d9a3 100644
> --- a/fs/hot_tracking.c
> +++ b/fs/hot_tracking.c
> @@ -172,6 +172,137 @@ static void hot_inode_tree_exit(struct hot_info *root)
>   }
>  }
>  
> +struct hot_inode_item
> +*hot_inode_item_find(struct hot_info *root, u64 ino)
> +{
> + struct hot_inode_item *he;
> + int ret;
> +
> +again:
> + spin_lock(&root->lock);
> + he = radix_tree_lookup(&root->hot_inode_tree, ino);
> + if (he) {
> + kref_get(&he->hot_inode.refs);
> + spin_unlock(&root->lock);
> + return he;
> + }
> + spin_unlock(&root->lock);
> +
> + he = kmem_cache_zalloc(hot_inode_item_cachep,
> + GFP_KERNEL | GFP_NOFS);
This doesn't look quite right... which of these two did you mean? I
assume probably just GFP_NOFS

> + if (!he)
> + return ERR_PTR(-ENOMEM);
> +
> + hot_inode_item_init(he, ino, &root->hot_inode_tree);
> +
> + ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
> + if (ret) {
> + kmem_cache_free(hot_inode_item_cachep, he);
> + return ERR_PTR(ret);
> + }
> +
> + spin_lock(&root->lock);
> + ret = radix_tree_insert(&root->hot_inode_tree, ino, he);
> + if (ret == -EEXIST) {
> + kmem_cache_free(hot_inode_item_cachep, he);
> + spin_unlock(&root->lock);
> + radix_tree_preload_end();
> + goto again;
> + }
> + spin_unlock(&root->lock);
> + radix_tree_preload_end();
> +
> + kref_get(&he->hot_inode.refs);
> + return he;
> +}
> +EXPORT_SYMBOL_GPL(hot_inode_item_find);
> +
> +static struct hot_range_item
> +*hot_range_item_find(struct hot_inode_item *he,
> + u32 start)
> +{
> + struct hot_range_item *hr;
> + int ret;
> +
> +again:
> + spin_lock(&he->lock);
> + hr = radix_tree_lookup(&he->hot_range_tree, start);
> + if (hr) {
> + kref_get(&hr->hot_range.refs);
> + spin_unlock(&he->lock);
> + return hr;
> + }
> + spin_unlock(&he->lock);
> +
> + hr = kmem_cache_zalloc(hot_range_item_cachep,
> + GFP_KERNEL | GFP_NOFS);
Likewise, here too.

Steve.



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v4+ hot_track 09/19] vfs: add one work queue

2012-11-05 Thread Steven Whitehouse
Hi,

On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote:
> From: Zhi Yong Wu 
> 
>   Add a per-superblock workqueue and a delayed_work
> to run periodic work to update map info on each superblock.
> 
> Signed-off-by: Zhi Yong Wu 
> ---
>  fs/hot_tracking.c|   85 
> ++
>  fs/hot_tracking.h|3 +
>  include/linux/hot_tracking.h |3 +
>  3 files changed, 91 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
> index fff0038..0ef9cad 100644
> --- a/fs/hot_tracking.c
> +++ b/fs/hot_tracking.c
> @@ -15,9 +15,12 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "hot_tracking.h"
>  
> @@ -557,6 +560,67 @@ static void hot_map_array_exit(struct hot_info *root)
>   }
>  }
>  
> +/* Temperature compare function*/
> +static int hot_temp_cmp(void *priv, struct list_head *a,
> + struct list_head *b)
> +{
> + struct hot_comm_item *ap =
> + container_of(a, struct hot_comm_item, n_list);
> + struct hot_comm_item *bp =
> + container_of(b, struct hot_comm_item, n_list);
> +
> + int diff = ap->hot_freq_data.last_temp
> + - bp->hot_freq_data.last_temp;
> + if (diff > 0)
> + return -1;
> + if (diff < 0)
> + return 1;
> + return 0;
> +}
> +
> +/*
> + * Every sync period we update temperatures for
> + * each hot inode item and hot range item for aging
> + * purposes.
> + */
> +static void hot_update_worker(struct work_struct *work)
> +{
> + struct hot_info *root = container_of(to_delayed_work(work),
> + struct hot_info, update_work);
> + struct hot_inode_item *hi_nodes[8];
> + u64 ino = 0;
> + int i, n;
> +
> + while (1) {
> + n = radix_tree_gang_lookup(&root->hot_inode_tree,
> +(void **)hi_nodes, ino,
> +ARRAY_SIZE(hi_nodes));
> + if (!n)
> + break;
> +
> + ino = hi_nodes[n - 1]->i_ino + 1;
> + for (i = 0; i < n; i++) {
> + kref_get(&hi_nodes[i]->hot_inode.refs);
> + hot_map_array_update(
> + &hi_nodes[i]->hot_inode.hot_freq_data, root);
> + hot_range_update(hi_nodes[i], root);
> + hot_inode_item_put(hi_nodes[i]);
> + }
> + }
> +
> + /* Sort temperature map info */
> + for (i = 0; i < HEAT_MAP_SIZE; i++) {
> + list_sort(NULL, &root->heat_inode_map[i].node_list,
> + hot_temp_cmp);
> + list_sort(NULL, &root->heat_range_map[i].node_list,
> + hot_temp_cmp);
> + }
> +

If this list can potentially have one (or more) entries per inode, then
filesystems with a lot of inodes (millions) may potentially exceed the
max size of list which list_sort() can handle. If that happens it still
works, but you'll get a warning message and it won't be as efficient.

It is something that we've run into with list_sort() and GFS2, but it
only happens very rarely,

Steve.



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v4+ hot_track 09/19] vfs: add one work queue

2012-11-05 Thread Steven Whitehouse
Hi,

On Mon, 2012-11-05 at 19:55 +0800, Zhi Yong Wu wrote:
> On Mon, Nov 5, 2012 at 7:21 PM, Steven Whitehouse  wrote:
> > Hi,
> >
> > On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote:
> >> From: Zhi Yong Wu 
> >>
> >>   Add a per-superblock workqueue and a delayed_work
> >> to run periodic work to update map info on each superblock.
> >>
> >> Signed-off-by: Zhi Yong Wu 
> >> ---
> >>  fs/hot_tracking.c|   85 
> >> ++
> >>  fs/hot_tracking.h|3 +
> >>  include/linux/hot_tracking.h |3 +
> >>  3 files changed, 91 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
> >> index fff0038..0ef9cad 100644
> >> --- a/fs/hot_tracking.c
> >> +++ b/fs/hot_tracking.c
> >> @@ -15,9 +15,12 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >> +#include 
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >>  #include 
> >>  #include "hot_tracking.h"
> >>
> >> @@ -557,6 +560,67 @@ static void hot_map_array_exit(struct hot_info *root)
> >>   }
> >>  }
> >>
> >> +/* Temperature compare function*/
> >> +static int hot_temp_cmp(void *priv, struct list_head *a,
> >> + struct list_head *b)
> >> +{
> >> + struct hot_comm_item *ap =
> >> + container_of(a, struct hot_comm_item, n_list);
> >> + struct hot_comm_item *bp =
> >> + container_of(b, struct hot_comm_item, n_list);
> >> +
> >> + int diff = ap->hot_freq_data.last_temp
> >> + - bp->hot_freq_data.last_temp;
> >> + if (diff > 0)
> >> + return -1;
> >> + if (diff < 0)
> >> + return 1;
> >> + return 0;
> >> +}
> >> +
> >> +/*
> >> + * Every sync period we update temperatures for
> >> + * each hot inode item and hot range item for aging
> >> + * purposes.
> >> + */
> >> +static void hot_update_worker(struct work_struct *work)
> >> +{
> >> + struct hot_info *root = container_of(to_delayed_work(work),
> >> + struct hot_info, update_work);
> >> + struct hot_inode_item *hi_nodes[8];
> >> + u64 ino = 0;
> >> + int i, n;
> >> +
> >> + while (1) {
> >> + n = radix_tree_gang_lookup(&root->hot_inode_tree,
> >> +(void **)hi_nodes, ino,
> >> +ARRAY_SIZE(hi_nodes));
> >> + if (!n)
> >> + break;
> >> +
> >> + ino = hi_nodes[n - 1]->i_ino + 1;
> >> + for (i = 0; i < n; i++) {
> >> + kref_get(&hi_nodes[i]->hot_inode.refs);
> >> + hot_map_array_update(
> >> + &hi_nodes[i]->hot_inode.hot_freq_data, root);
> >> + hot_range_update(hi_nodes[i], root);
> >> + hot_inode_item_put(hi_nodes[i]);
> >> + }
> >> + }
> >> +
> >> + /* Sort temperature map info */
> >> + for (i = 0; i < HEAT_MAP_SIZE; i++) {
> >> + list_sort(NULL, &root->heat_inode_map[i].node_list,
> >> + hot_temp_cmp);
> >> + list_sort(NULL, &root->heat_range_map[i].node_list,
> >> + hot_temp_cmp);
> >> + }
> >> +
> >
> > If this list can potentially have one (or more) entries per inode, then
> Only one hot_inode_item per inode, while maybe multiple
> hot_range_items per inode.
> > filesystems with a lot of inodes (millions) may potentially exceed the
> > max size of list which list_sort() can handle. If that happens it still
> > works, but you'll get a warning message and it won't be as efficient.
> I haven't do so large scale test. If we want to find that issue, we
> need to do large scale performance test, before that, i want to make
> sure the code change is correct at first.
> To be honest, for that issue you pointed to, i also have such
> concern.But list_sort() performance looks good from the test result of
&g

Re: [Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

2014-03-13 Thread Steven Whitehouse
Hi,

On Thu, 2014-03-13 at 17:23 +0100, Jan Kara wrote:
> On Thu 13-03-14 10:20:56, Ted Tso wrote:
> > Previously, the no-op "mount -o mount /dev/xxx" operation when the
>   ^^remount
> 
> > file system is already mounted read-write causes an implied,
> > unconditional syncfs().  This seems pretty stupid, and it's certainly
> > documented or guaraunteed to do this, nor is it particularly useful,
> > except in the case where the file system was mounted rw and is getting
> > remounted read-only.
> > 
> > However, it's possible that there might be some file systems that are
> > actually depending on this behavior.  In most file systems, it's
> > probably fine to only call sync_filesystem() when transitioning from
> > read-write to read-only, and there are some file systems where this is
> > not needed at all (for example, for a pseudo-filesystem or something
> > like romfs).
>   Hum, I'd avoid this excercise at least for filesystem where
> sync_filesystem() is obviously useless - proc, debugfs, pstore, devpts,
> also always read-only filesystems such as isofs, qnx4, qnx6, befs, cramfs,
> efs, freevxfs, romfs, squashfs. I think you can find a couple more which
> clearly don't care about sync_filesystem() if you look a bit closer.
> 
>
>   Honza

I guess the same is true for other file systems which are mounted ro
too. So maybe a check for MS_RDONLY before doing the sync in those
cases?

Steve.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Cluster-devel] fallocate vs O_(D)SYNC

2011-11-16 Thread Steven Whitehouse
Hi,

On Wed, 2011-11-16 at 03:42 -0500, Christoph Hellwig wrote:
> It seems all filesystems but XFS ignore O_SYNC for fallocate, and never
> make sure the size update transaction made it to disk.
> 
> Given that a fallocate without FALLOC_FL_KEEP_SIZE very much is a data
> operation (it adds new blocks that return zeroes) that seems like a
> fairly nasty surprise for O_SYNC users.
> 


In GFS2 we zero out the data blocks as we go (since our metadata doesn't
allow us to mark blocks as zeroed at alloc time) and also because we are
mostly interested in being able to do FALLOC_FL_KEEP_SIZE which we use
on our rindex system file in order to ensure that there is always enough
space to expand a filesystem.

So there is no danger of having non-zeroed blocks appearing later, as
that is done before the metadata change.

Our fallocate_chunk() function calls mark_inode_dirty(inode) on each
call, so that fsync should pick that up and ensure that the metadata has
been written back. So we should thus have both data and metadata stable
on disk.

Do you have some evidence that this is not happening?

Steve.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Cluster-devel] fallocate vs O_(D)SYNC

2011-11-16 Thread Steven Whitehouse
Hi,

On Wed, 2011-11-16 at 11:54 +0100, Jan Kara wrote:
> Hello,
> 
> On Wed 16-11-11 09:43:08, Steven Whitehouse wrote:
> > On Wed, 2011-11-16 at 03:42 -0500, Christoph Hellwig wrote:
> > > It seems all filesystems but XFS ignore O_SYNC for fallocate, and never
> > > make sure the size update transaction made it to disk.
> > > 
> > > Given that a fallocate without FALLOC_FL_KEEP_SIZE very much is a data
> > > operation (it adds new blocks that return zeroes) that seems like a
> > > fairly nasty surprise for O_SYNC users.
> > 
> > In GFS2 we zero out the data blocks as we go (since our metadata doesn't
> > allow us to mark blocks as zeroed at alloc time) and also because we are
> > mostly interested in being able to do FALLOC_FL_KEEP_SIZE which we use
> > on our rindex system file in order to ensure that there is always enough
> > space to expand a filesystem.
> > 
> > So there is no danger of having non-zeroed blocks appearing later, as
> > that is done before the metadata change.
> > 
> > Our fallocate_chunk() function calls mark_inode_dirty(inode) on each
> > call, so that fsync should pick that up and ensure that the metadata has
> > been written back. So we should thus have both data and metadata stable
> > on disk.
> > 
> > Do you have some evidence that this is not happening?
>   Yeah, only that nobody calls that fsync() automatically if the fd is
> O_SYNC if I'm right. But maybe calling fdatasync() on the range which was
> fallocated from sys_fallocate() if the fd is O_SYNC would do the trick for
> most filesystems? That would match how we treat O_SYNC for other operations
> as well. I'm just not sure whether XFS wouldn't take unnecessarily big hit
> with this.
> 
>   Honza

Ah, I see now. Sorry, I missed the original point. So that would just be
a VFS addition to check the O_(D)SYNC flag as you suggest. I've no
objections to that, it makes sense to me,

Steve.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Cluster-devel] fallocate vs O_(D)SYNC

2011-11-18 Thread Steven Whitehouse
Hi,

Here is what I'm planning for GFS2:


Add sync of metadata after fallocate for O_SYNC files to ensure that we
meet expectations for everything being on disk in this case.
Unfortunately, the offset and len parameters are modified during the
course of the fallocate function, so I've had to add a couple of new
variables to call generic_write_sync() at the end.

I know that potentially this will sync data as well within the range,
but I think that is a fairly harmless side-effect overall, since we
would not normally expect there to be any dirty data within the range in
question.

Signed-off-by: Steven Whitehouse 
Cc: Christoph Hellwig 
Cc: Benjamin Marzinski 

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 6336bc6..9b6c6ac 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -752,6 +752,8 @@ static long gfs2_fallocate(struct file *file, int mode, 
loff_t offset,
loff_t bytes, max_bytes;
struct gfs2_alloc *al;
int error;
+   const loff_t pos = offset;
+   const loff_t count = len;
loff_t bsize_mask = ~((loff_t)sdp->sd_sb.sb_bsize - 1);
loff_t next = (offset + len - 1) >> sdp->sd_sb.sb_bsize_shift;
loff_t max_chunk_size = UINT_MAX & bsize_mask;
@@ -834,6 +836,9 @@ retry:
gfs2_quota_unlock(ip);
gfs2_alloc_put(ip);
}
+
+   if (error == 0)
+   error = generic_write_sync(file, pos, count);
goto out_unlock;
 
 out_trans_fail:


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Cluster-devel] [PATCH 3/4] gfs2: Use generic handlers of O_SYNC AIO DIO

2012-02-13 Thread Steven Whitehouse
Hi,

Acked-by: Steven Whitehouse 

That looks ok to me,

Steve.

On Fri, 2012-02-10 at 17:04 +0100, Jan Kara wrote:
> Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC
> file.
> 
> Signed-off-by: Jan Kara 
> ---
>  fs/gfs2/aops.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index 501e5cb..9c381ff 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -1034,7 +1034,7 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb 
> *iocb,
>  
>   rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
> offset, nr_segs, gfs2_get_block_direct,
> -   NULL, NULL, 0);
> +   NULL, NULL, DIO_SYNC_WRITES);
>  out:
>   gfs2_glock_dq_m(1, &gh);
>   gfs2_holder_uninit(&gh);


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type

2012-09-05 Thread Steven Whitehouse
Hi,

On Fri, 2012-08-31 at 12:46 -0400, Namjae Jeon wrote:
> This patch is a follow up on below patch:
> 
> [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
> https://patchwork.kernel.org/patch/1385131/
> 
> Signed-off-by: Namjae Jeon 
> Signed-off-by: Vivek Trivedi 
Acked-by: Steven Whitehouse 

for the gfs2 bits,

Steve.

> ---
>  fs/btrfs/export.c   |4 ++--
>  fs/ceph/export.c|4 ++--
>  fs/fuse/inode.c |2 +-
>  fs/gfs2/export.c|4 ++--
>  fs/isofs/export.c   |4 ++--
>  fs/nilfs2/namei.c   |4 ++--
>  fs/ocfs2/export.c   |4 ++--
>  fs/reiserfs/inode.c |4 ++--
>  fs/udf/namei.c  |4 ++--
>  fs/xfs/xfs_export.c |4 ++--
>  mm/cleancache.c |2 +-
>  mm/shmem.c  |2 +-
>  12 files changed, 21 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
> index 614f34a..81ee29e 100644
> --- a/fs/btrfs/export.c
> +++ b/fs/btrfs/export.c
> @@ -22,10 +22,10 @@ static int btrfs_encode_fh(struct inode *inode, u32 *fh, 
> int *max_len,
>  
>   if (parent && (len < BTRFS_FID_SIZE_CONNECTABLE)) {
>   *max_len = BTRFS_FID_SIZE_CONNECTABLE;
> - return 255;
> + return FILEID_INVALID;
>   } else if (len < BTRFS_FID_SIZE_NON_CONNECTABLE) {
>   *max_len = BTRFS_FID_SIZE_NON_CONNECTABLE;
> - return 255;
> + return FILEID_INVALID;
>   }
>  
>   len  = BTRFS_FID_SIZE_NON_CONNECTABLE;
> diff --git a/fs/ceph/export.c b/fs/ceph/export.c
> index 8e1b60e..98bde89 100644
> --- a/fs/ceph/export.c
> +++ b/fs/ceph/export.c
> @@ -79,7 +79,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, 
> int *max_len,
>   if (parent_inode) {
>   /* nfsd wants connectable */
>   *max_len = connected_handle_length;
> - type = 255;
> + type = FILEID_INVALID;
>   } else {
>   dout("encode_fh %p\n", dentry);
>   fh->ino = ceph_ino(inode);
> @@ -88,7 +88,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, 
> int *max_len,
>   }
>   } else {
>   *max_len = handle_length;
> - type = 255;
> + type = FILEID_INVALID;
>   }
>   return type;
>  }
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2690a76..b787a6f 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -672,7 +672,7 @@ static int fuse_encode_fh(struct inode *inode, u32 *fh, 
> int *max_len,
>  
>   if (*max_len < len) {
>   *max_len = len;
> - return  255;
> + return  FILEID_INVALID;
>   }
>  
>   nodeid = get_fuse_inode(inode)->nodeid;
> diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
> index e8ed6d4..f7a8092 100644
> --- a/fs/gfs2/export.c
> +++ b/fs/gfs2/export.c
> @@ -37,10 +37,10 @@ static int gfs2_encode_fh(struct inode *inode, __u32 *p, 
> int *len,
>  
>   if (parent && (*len < GFS2_LARGE_FH_SIZE)) {
>   *len = GFS2_LARGE_FH_SIZE;
> - return 255;
> + return FILEID_INVALID;
>   } else if (*len < GFS2_SMALL_FH_SIZE) {
>   *len = GFS2_SMALL_FH_SIZE;
> - return 255;
> + return FILEID_INVALID;
>   }
>  
>   fh[0] = cpu_to_be32(ip->i_no_formal_ino >> 32);
> diff --git a/fs/isofs/export.c b/fs/isofs/export.c
> index 1d38044..5e693b3 100644
> --- a/fs/isofs/export.c
> +++ b/fs/isofs/export.c
> @@ -125,10 +125,10 @@ isofs_export_encode_fh(struct inode *inode,
>*/
>   if (parent && (len < 5)) {
>   *max_len = 5;
> - return 255;
> + return FILEID_INVALID;
>   } else if (len < 3) {
>   *max_len = 3;
> - return 255;
> + return FILEID_INVALID;
>   }
>  
>   len = 3;
> diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
> index 1d0c0b8..9de78f0 100644
> --- a/fs/nilfs2/namei.c
> +++ b/fs/nilfs2/namei.c
> @@ -517,11 +517,11 @@ static int nilfs_encode_fh(struct inode *inode, __u32 
> *fh, int *lenp,
>  
>   if (parent && *lenp < NILFS_FID_SIZE_CONNECTABLE) {
>   *lenp = NILFS_FID_SIZE_CONNECTABLE;
> - return 255;
> + return FILEID_INVALID;
>   }
>   if (*lenp < NILFS_FID_SIZE_NON_CONNECTABLE) {
>   *lenp = NILFS_FID_SIZE_NON_CONNECTABLE;
> - return 255;
> + return FILEID_INVALID;