Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api
Hi, On 29/04/16 06:35, NeilBrown wrote: On Tue, Apr 26 2016, Michal Hocko wrote: Hi, we have discussed this topic at LSF/MM this year. There was a general interest in the scope GFP_NOFS allocation context among some FS developers. For those who are not aware of the discussion or the issue I am trying to sort out (or at least start in that direction) please have a look at patch 1 which adds memalloc_nofs_{save,restore} api which basically copies what we have for the scope GFP_NOIO allocation context. I haven't converted any of the FS myself because that is way beyond my area of expertise but I would be happy to help with further changes on the MM front as well as in some more generic code paths. Dave had an idea on how to further improve the reclaim context to be less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque and FS specific cookie set in the FS allocation context and consumed by the FS reclaim context to allow doing some provably save actions that would be skipped due to GFP_NOFS normally. I like this idea and I believe we can go that direction regardless of the approach taken here. Many filesystems simply need to cleanup their NOFS usage first before diving into a more complex changes.> This strikes me as over-engineering to work around an unnecessarily burdensome interface but without details it is hard to be certain. Exactly what things happen in "FS reclaim context" which may, or may not, be safe depending on the specific FS allocation context? Do they need to happen at all? My research suggests that for most filesystems the only thing that happens in reclaim context that is at all troublesome is the final 'evict()' on an inode. This needs to flush out dirty pages and sync the inode to storage. Some time ago we moved most dirty-page writeout out of the reclaim context and into kswapd. I think this was an excellent advance in simplicity. If we could similarly move evict() into kswapd (and I believe we can) then most file systems would do nothing in reclaim context that interferes with allocation context. evict() is an issue, but moving it into kswapd would be a potential problem for GFS2. We already have a memory allocation issue when recovery is taking place and memory is short. The code path is as follows: 1. Inode is scheduled for eviction (which requires deallocation) 2. The glock is required in order to perform the deallocation, which implies getting a DLM lock 3. Another node in the cluster fails, so needs recovery 4. When the DLM lock is requested, it gets blocked until recovery is complete (for the failed node) 5. Recovery is performed using a userland fencing utility 6. Fencing requires memory and then blocks on the eviction 7. Deadlock (Fencing waiting on memory alloc, memory alloc waiting on DLM lock, DLM lock waiting on fencing) It doesn't happen often, but we've been looking at the best place to break that cycle, and one of the things we've been wondering is whether we could avoid deallocation evictions from memory related contexts, or at least make it async somehow. The issue is that it is not possible to know in advance whether an eviction will result in mearly writing things back to disk (because the inode is being dropped from cache, but still resides on disk) which is easy to do, or whether it requires a full deallocation (n_link==0) which may require significant resources and time, Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Cluster-devel] [PATCH 16/18] gfs2: use generic posix ACL infrastructure
Hi, On Sun, 2013-12-01 at 03:59 -0800, Christoph Hellwig wrote: > plain text document attachment > (0016-gfs2-use-generic-posix-ACL-infrastructure.patch) > This contains some major refactoring for the create path so that > inodes are created with the right mode to start with instead of > fixing it up later. > > Signed-off-by: Christoph Hellwig > --- > fs/gfs2/acl.c | 229 > +++ > fs/gfs2/acl.h |4 +- > fs/gfs2/inode.c | 33 ++-- > fs/gfs2/xattr.c |4 +- > 4 files changed, 61 insertions(+), 209 deletions(-) > Looks very good. I'd really like to be able to do something similar with the security xattrs, in terms of the refactoring that at inode creation to give the xattrs ahead of the inode allocation itself. That way it should be possible to allocate the xattr blocks at the same time as the inode, rather than as an after thought. Some more comments below > diff --git a/fs/gfs2/acl.c b/fs/gfs2/acl.c > index e82e4ac..e6c7a2c 100644 > --- a/fs/gfs2/acl.c > +++ b/fs/gfs2/acl.c [snip] > - > -static int gfs2_xattr_system_set(struct dentry *dentry, const char *name, > - const void *value, size_t size, int flags, > - int xtype) > -{ > - struct inode *inode = dentry->d_inode; > - struct gfs2_sbd *sdp = GFS2_SB(inode); > - struct posix_acl *acl = NULL; > - int error = 0, type; > - > - if (!sdp->sd_args.ar_posix_acl) > - return -EOPNOTSUPP; > - > - type = gfs2_acl_type(name); > - if (type < 0) > - return type; > - if (flags & XATTR_CREATE) > - return -EINVAL; > - if (type == ACL_TYPE_DEFAULT && !S_ISDIR(inode->i_mode)) > - return value ? -EACCES : 0; > - if (!uid_eq(current_fsuid(), inode->i_uid) && !capable(CAP_FOWNER)) > - return -EPERM; > - if (S_ISLNK(inode->i_mode)) > - return -EOPNOTSUPP; > - > - if (!value) > - goto set_acl; > > - acl = posix_acl_from_xattr(&init_user_ns, value, size); > - if (!acl) { > - /* > - * acl_set_file(3) may request that we set default ACLs with > - * zero length -- defend (gracefully) against that here. > - */ > - goto out; > - } > - if (IS_ERR(acl)) { > - error = PTR_ERR(acl); > - goto out; > - } > - > - error = posix_acl_valid(acl); > - if (error) > - goto out_release; > - > - error = -EINVAL; > if (acl->a_count > GFS2_ACL_MAX_ENTRIES) > - goto out_release; > + return -EINVAL; > > if (type == ACL_TYPE_ACCESS) { > umode_t mode = inode->i_mode; > + > error = posix_acl_equiv_mode(acl, &mode); > + if (error < 0) > Andy Price has pointed out a missing "return error;" here > - if (error <= 0) { > - posix_acl_release(acl); > + if (error == 0) > acl = NULL; > > - if (error < 0) > - return error; > - } > - Also, there seems to be a white space error in the xfs patch around line 170 in fs/xfs/xfs_iops.c where there is an added "if (default_acl)" with a space before the tab, Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Cluster-devel] [PATCH 16/18] gfs2: use generic posix ACL infrastructure
Hi, On Wed, 2013-12-11 at 02:42 -0800, Christoph Hellwig wrote: > plain text document attachment > (0016-gfs2-use-generic-posix-ACL-infrastructure.patch) > This contains some major refactoring for the create path so that > inodes are created with the right mode to start with instead of > fixing it up later. > > Signed-off-by: Christoph Hellwig Acked-by: Steven Whitehouse A really nice clean up - this is a very useful step forward in simplifying the create path. Thanks for sorting this out, Steve. > --- > fs/gfs2/acl.c | 234 > +++ > fs/gfs2/acl.h |4 +- > fs/gfs2/inode.c | 33 ++-- > fs/gfs2/xattr.c |4 +- > 4 files changed, 62 insertions(+), 213 deletions(-) > > diff --git a/fs/gfs2/acl.c b/fs/gfs2/acl.c > index e82e4ac..ba94566 100644 > --- a/fs/gfs2/acl.c > +++ b/fs/gfs2/acl.c > @@ -49,10 +49,6 @@ struct posix_acl *gfs2_get_acl(struct inode *inode, int > type) > if (!ip->i_eattr) > return NULL; > > - acl = get_cached_acl(&ip->i_inode, type); > - if (acl != ACL_NOT_CACHED) > - return acl; > - > name = gfs2_acl_name(type); > if (name == NULL) > return ERR_PTR(-EINVAL); > @@ -80,7 +76,7 @@ static int gfs2_set_mode(struct inode *inode, umode_t mode) > return error; > } > > -static int gfs2_acl_set(struct inode *inode, int type, struct posix_acl *acl) > +int gfs2_set_acl(struct inode *inode, struct posix_acl *acl, int type) > { > int error; > int len; > @@ -88,219 +84,49 @@ static int gfs2_acl_set(struct inode *inode, int type, > struct posix_acl *acl) > const char *name = gfs2_acl_name(type); > > BUG_ON(name == NULL); > - len = posix_acl_to_xattr(&init_user_ns, acl, NULL, 0); > - if (len == 0) > - return 0; > - data = kmalloc(len, GFP_NOFS); > - if (data == NULL) > - return -ENOMEM; > - error = posix_acl_to_xattr(&init_user_ns, acl, data, len); > - if (error < 0) > - goto out; > - error = __gfs2_xattr_set(inode, name, data, len, 0, GFS2_EATYPE_SYS); > - if (!error) > - set_cached_acl(inode, type, acl); > -out: > - kfree(data); > - return error; > -} > - > -int gfs2_acl_create(struct gfs2_inode *dip, struct inode *inode) > -{ > - struct gfs2_sbd *sdp = GFS2_SB(&dip->i_inode); > - struct posix_acl *acl; > - umode_t mode = inode->i_mode; > - int error = 0; > - > - if (!sdp->sd_args.ar_posix_acl) > - return 0; > - if (S_ISLNK(inode->i_mode)) > - return 0; > - > - acl = gfs2_get_acl(&dip->i_inode, ACL_TYPE_DEFAULT); > - if (IS_ERR(acl)) > - return PTR_ERR(acl); > - if (!acl) { > - mode &= ~current_umask(); > - return gfs2_set_mode(inode, mode); > - } > - > - if (S_ISDIR(inode->i_mode)) { > - error = gfs2_acl_set(inode, ACL_TYPE_DEFAULT, acl); > - if (error) > - goto out; > - } > - > - error = __posix_acl_create(&acl, GFP_NOFS, &mode); > - if (error < 0) > - return error; > > - if (error == 0) > - goto munge; > - > - error = gfs2_acl_set(inode, ACL_TYPE_ACCESS, acl); > - if (error) > - goto out; > -munge: > - error = gfs2_set_mode(inode, mode); > -out: > - posix_acl_release(acl); > - return error; > -} > - > -int gfs2_acl_chmod(struct gfs2_inode *ip, struct iattr *attr) > -{ > - struct inode *inode = &ip->i_inode; > - struct posix_acl *acl; > - char *data; > - unsigned int len; > - int error; > - > - acl = gfs2_get_acl(&ip->i_inode, ACL_TYPE_ACCESS); > - if (IS_ERR(acl)) > - return PTR_ERR(acl); > - if (!acl) > - return gfs2_setattr_simple(inode, attr); > - > - error = __posix_acl_chmod(&acl, GFP_NOFS, attr->ia_mode); > - if (error) > - return error; > - > - len = posix_acl_to_xattr(&init_user_ns, acl, NULL, 0); > - data = kmalloc(len, GFP_NOFS); > - error = -ENOMEM; > - if (data == NULL) > - goto out; > - posix_acl_to_xattr(&init_user_ns, acl, data, len); > - error = gfs2_xattr_acl_chmod(ip, attr, data); > - kfree(data); > - set_cached_acl(&ip->i_inode, ACL_TYPE_ACCESS, acl); > - > -out: > - posix_acl_release(acl); > - return error; > -} > - > -static int gfs2_acl_type(const char *name) &g
Re: [Cluster-devel] [PATCH v5 6/7] fs: pass iocb to generic_write_sync
Hi, On 05/11/14 21:14, Milosz Tanski wrote: From: Christoph Hellwig Clean up the generic_write_sync by just passing an iocb and a bytes written / negative errno argument. In addition to simplifying the callers this also prepares for passing a per-operation O_DSYNC flag. Two callers didn't quite fit that scheme: - dio_complete didn't both to update ki_pos as we don't need it on a iocb that is about to be freed, so we had to add it. Additionally it also synced out written data in the error case, which has been changed to operate like the other callers. - gfs2 also used generic_write_sync to implement a crude version of fallocate. It has been switched to use an open coded variant instead. Signed-off-by: Christoph Hellwig GFS2 bits: Acked-by: Steven Whitehouse I know that Andy Price has some work in this area too, so in due course we'll have to be careful not to create a merge conflict here. Copying in Andy so he can see the changes, Steve. --- fs/block_dev.c | 8 +--- fs/btrfs/file.c| 7 ++- fs/cifs/file.c | 8 +--- fs/direct-io.c | 8 ++-- fs/ext4/file.c | 8 +--- fs/gfs2/file.c | 9 +++-- fs/ntfs/file.c | 8 ++-- fs/udf/file.c | 11 ++- fs/xfs/xfs_file.c | 8 +--- include/linux/fs.h | 8 +--- mm/filemap.c | 30 -- 11 files changed, 40 insertions(+), 73 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index cc9d411..c529b1c 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1568,18 +1568,12 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg) */ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from) { - struct file *file = iocb->ki_filp; struct blk_plug plug; ssize_t ret; blk_start_plug(&plug); ret = __generic_file_write_iter(iocb, from); - if (ret > 0) { - ssize_t err; - err = generic_write_sync(file, iocb->ki_pos - ret, ret); - if (err < 0) - ret = err; - } + ret = generic_write_sync(iocb, ret); blk_finish_plug(&plug); return ret; } diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index a18ceab..4f4a6f7 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1820,11 +1820,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, */ BTRFS_I(inode)->last_trans = root->fs_info->generation + 1; BTRFS_I(inode)->last_sub_trans = root->log_transid; - if (num_written > 0) { - err = generic_write_sync(file, pos, num_written); - if (err < 0) - num_written = err; - } + + num_written = generic_write_sync(iocb, num_written); if (sync) atomic_dec(&BTRFS_I(inode)->sync_writers); diff --git a/fs/cifs/file.c b/fs/cifs/file.c index c485afa..32359de 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -2706,13 +2706,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from) rc = __generic_file_write_iter(iocb, from); mutex_unlock(&inode->i_mutex); - if (rc > 0) { - ssize_t err; - - err = generic_write_sync(file, iocb->ki_pos - rc, rc); - if (err < 0) - rc = err; - } + rc = generic_write_sync(iocb, rc); } else { mutex_unlock(&inode->i_mutex); } diff --git a/fs/direct-io.c b/fs/direct-io.c index e181b6b..b72ac83 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -257,12 +257,8 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, inode_dio_done(dio->inode); if (is_async) { if (dio->rw & WRITE) { - int err; - - err = generic_write_sync(dio->iocb->ki_filp, offset, -transferred); - if (err < 0 && ret > 0) - ret = err; + dio->iocb->ki_pos = offset + transferred; + ret = generic_write_sync(dio->iocb, ret); } aio_complete(dio->iocb, ret, 0); diff --git a/fs/ext4/file.c b/fs/ext4/file.c index aca7b24..79b000c 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -175,13 +175,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) ret = __generic_file_write_iter(iocb, from); mutex_unlock(&inode->i_mutex); - if (ret > 0) { - ssize_t err; - - err = generic_write_sync(file, iocb->ki_pos - ret, ret); - if (err < 0) - ret = err; - } + ret = generi
Re: [RFC v4+ hot_track 03/19] vfs: add I/O frequency update function
Hi, On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote: > From: Zhi Yong Wu > > Add some util helpers to update access frequencies > for one file or its range. > > Signed-off-by: Zhi Yong Wu > --- > fs/hot_tracking.c| 179 > ++ > fs/hot_tracking.h|7 ++ > include/linux/hot_tracking.h |2 + > 3 files changed, 188 insertions(+), 0 deletions(-) > > diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c > index 68591f0..0a7d9a3 100644 > --- a/fs/hot_tracking.c > +++ b/fs/hot_tracking.c > @@ -172,6 +172,137 @@ static void hot_inode_tree_exit(struct hot_info *root) > } > } > > +struct hot_inode_item > +*hot_inode_item_find(struct hot_info *root, u64 ino) > +{ > + struct hot_inode_item *he; > + int ret; > + > +again: > + spin_lock(&root->lock); > + he = radix_tree_lookup(&root->hot_inode_tree, ino); > + if (he) { > + kref_get(&he->hot_inode.refs); > + spin_unlock(&root->lock); > + return he; > + } > + spin_unlock(&root->lock); > + > + he = kmem_cache_zalloc(hot_inode_item_cachep, > + GFP_KERNEL | GFP_NOFS); This doesn't look quite right... which of these two did you mean? I assume probably just GFP_NOFS > + if (!he) > + return ERR_PTR(-ENOMEM); > + > + hot_inode_item_init(he, ino, &root->hot_inode_tree); > + > + ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM); > + if (ret) { > + kmem_cache_free(hot_inode_item_cachep, he); > + return ERR_PTR(ret); > + } > + > + spin_lock(&root->lock); > + ret = radix_tree_insert(&root->hot_inode_tree, ino, he); > + if (ret == -EEXIST) { > + kmem_cache_free(hot_inode_item_cachep, he); > + spin_unlock(&root->lock); > + radix_tree_preload_end(); > + goto again; > + } > + spin_unlock(&root->lock); > + radix_tree_preload_end(); > + > + kref_get(&he->hot_inode.refs); > + return he; > +} > +EXPORT_SYMBOL_GPL(hot_inode_item_find); > + > +static struct hot_range_item > +*hot_range_item_find(struct hot_inode_item *he, > + u32 start) > +{ > + struct hot_range_item *hr; > + int ret; > + > +again: > + spin_lock(&he->lock); > + hr = radix_tree_lookup(&he->hot_range_tree, start); > + if (hr) { > + kref_get(&hr->hot_range.refs); > + spin_unlock(&he->lock); > + return hr; > + } > + spin_unlock(&he->lock); > + > + hr = kmem_cache_zalloc(hot_range_item_cachep, > + GFP_KERNEL | GFP_NOFS); Likewise, here too. Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 09/19] vfs: add one work queue
Hi, On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote: > From: Zhi Yong Wu > > Add a per-superblock workqueue and a delayed_work > to run periodic work to update map info on each superblock. > > Signed-off-by: Zhi Yong Wu > --- > fs/hot_tracking.c| 85 > ++ > fs/hot_tracking.h|3 + > include/linux/hot_tracking.h |3 + > 3 files changed, 91 insertions(+), 0 deletions(-) > > diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c > index fff0038..0ef9cad 100644 > --- a/fs/hot_tracking.c > +++ b/fs/hot_tracking.c > @@ -15,9 +15,12 @@ > #include > #include > #include > +#include > +#include > #include > #include > #include > +#include > #include > #include "hot_tracking.h" > > @@ -557,6 +560,67 @@ static void hot_map_array_exit(struct hot_info *root) > } > } > > +/* Temperature compare function*/ > +static int hot_temp_cmp(void *priv, struct list_head *a, > + struct list_head *b) > +{ > + struct hot_comm_item *ap = > + container_of(a, struct hot_comm_item, n_list); > + struct hot_comm_item *bp = > + container_of(b, struct hot_comm_item, n_list); > + > + int diff = ap->hot_freq_data.last_temp > + - bp->hot_freq_data.last_temp; > + if (diff > 0) > + return -1; > + if (diff < 0) > + return 1; > + return 0; > +} > + > +/* > + * Every sync period we update temperatures for > + * each hot inode item and hot range item for aging > + * purposes. > + */ > +static void hot_update_worker(struct work_struct *work) > +{ > + struct hot_info *root = container_of(to_delayed_work(work), > + struct hot_info, update_work); > + struct hot_inode_item *hi_nodes[8]; > + u64 ino = 0; > + int i, n; > + > + while (1) { > + n = radix_tree_gang_lookup(&root->hot_inode_tree, > +(void **)hi_nodes, ino, > +ARRAY_SIZE(hi_nodes)); > + if (!n) > + break; > + > + ino = hi_nodes[n - 1]->i_ino + 1; > + for (i = 0; i < n; i++) { > + kref_get(&hi_nodes[i]->hot_inode.refs); > + hot_map_array_update( > + &hi_nodes[i]->hot_inode.hot_freq_data, root); > + hot_range_update(hi_nodes[i], root); > + hot_inode_item_put(hi_nodes[i]); > + } > + } > + > + /* Sort temperature map info */ > + for (i = 0; i < HEAT_MAP_SIZE; i++) { > + list_sort(NULL, &root->heat_inode_map[i].node_list, > + hot_temp_cmp); > + list_sort(NULL, &root->heat_range_map[i].node_list, > + hot_temp_cmp); > + } > + If this list can potentially have one (or more) entries per inode, then filesystems with a lot of inodes (millions) may potentially exceed the max size of list which list_sort() can handle. If that happens it still works, but you'll get a warning message and it won't be as efficient. It is something that we've run into with list_sort() and GFS2, but it only happens very rarely, Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 09/19] vfs: add one work queue
Hi, On Mon, 2012-11-05 at 19:55 +0800, Zhi Yong Wu wrote: > On Mon, Nov 5, 2012 at 7:21 PM, Steven Whitehouse wrote: > > Hi, > > > > On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote: > >> From: Zhi Yong Wu > >> > >> Add a per-superblock workqueue and a delayed_work > >> to run periodic work to update map info on each superblock. > >> > >> Signed-off-by: Zhi Yong Wu > >> --- > >> fs/hot_tracking.c| 85 > >> ++ > >> fs/hot_tracking.h|3 + > >> include/linux/hot_tracking.h |3 + > >> 3 files changed, 91 insertions(+), 0 deletions(-) > >> > >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c > >> index fff0038..0ef9cad 100644 > >> --- a/fs/hot_tracking.c > >> +++ b/fs/hot_tracking.c > >> @@ -15,9 +15,12 @@ > >> #include > >> #include > >> #include > >> +#include > >> +#include > >> #include > >> #include > >> #include > >> +#include > >> #include > >> #include "hot_tracking.h" > >> > >> @@ -557,6 +560,67 @@ static void hot_map_array_exit(struct hot_info *root) > >> } > >> } > >> > >> +/* Temperature compare function*/ > >> +static int hot_temp_cmp(void *priv, struct list_head *a, > >> + struct list_head *b) > >> +{ > >> + struct hot_comm_item *ap = > >> + container_of(a, struct hot_comm_item, n_list); > >> + struct hot_comm_item *bp = > >> + container_of(b, struct hot_comm_item, n_list); > >> + > >> + int diff = ap->hot_freq_data.last_temp > >> + - bp->hot_freq_data.last_temp; > >> + if (diff > 0) > >> + return -1; > >> + if (diff < 0) > >> + return 1; > >> + return 0; > >> +} > >> + > >> +/* > >> + * Every sync period we update temperatures for > >> + * each hot inode item and hot range item for aging > >> + * purposes. > >> + */ > >> +static void hot_update_worker(struct work_struct *work) > >> +{ > >> + struct hot_info *root = container_of(to_delayed_work(work), > >> + struct hot_info, update_work); > >> + struct hot_inode_item *hi_nodes[8]; > >> + u64 ino = 0; > >> + int i, n; > >> + > >> + while (1) { > >> + n = radix_tree_gang_lookup(&root->hot_inode_tree, > >> +(void **)hi_nodes, ino, > >> +ARRAY_SIZE(hi_nodes)); > >> + if (!n) > >> + break; > >> + > >> + ino = hi_nodes[n - 1]->i_ino + 1; > >> + for (i = 0; i < n; i++) { > >> + kref_get(&hi_nodes[i]->hot_inode.refs); > >> + hot_map_array_update( > >> + &hi_nodes[i]->hot_inode.hot_freq_data, root); > >> + hot_range_update(hi_nodes[i], root); > >> + hot_inode_item_put(hi_nodes[i]); > >> + } > >> + } > >> + > >> + /* Sort temperature map info */ > >> + for (i = 0; i < HEAT_MAP_SIZE; i++) { > >> + list_sort(NULL, &root->heat_inode_map[i].node_list, > >> + hot_temp_cmp); > >> + list_sort(NULL, &root->heat_range_map[i].node_list, > >> + hot_temp_cmp); > >> + } > >> + > > > > If this list can potentially have one (or more) entries per inode, then > Only one hot_inode_item per inode, while maybe multiple > hot_range_items per inode. > > filesystems with a lot of inodes (millions) may potentially exceed the > > max size of list which list_sort() can handle. If that happens it still > > works, but you'll get a warning message and it won't be as efficient. > I haven't do so large scale test. If we want to find that issue, we > need to do large scale performance test, before that, i want to make > sure the code change is correct at first. > To be honest, for that issue you pointed to, i also have such > concern.But list_sort() performance looks good from the test result of &g
Re: [Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()
Hi, On Thu, 2014-03-13 at 17:23 +0100, Jan Kara wrote: > On Thu 13-03-14 10:20:56, Ted Tso wrote: > > Previously, the no-op "mount -o mount /dev/xxx" operation when the > ^^remount > > > file system is already mounted read-write causes an implied, > > unconditional syncfs(). This seems pretty stupid, and it's certainly > > documented or guaraunteed to do this, nor is it particularly useful, > > except in the case where the file system was mounted rw and is getting > > remounted read-only. > > > > However, it's possible that there might be some file systems that are > > actually depending on this behavior. In most file systems, it's > > probably fine to only call sync_filesystem() when transitioning from > > read-write to read-only, and there are some file systems where this is > > not needed at all (for example, for a pseudo-filesystem or something > > like romfs). > Hum, I'd avoid this excercise at least for filesystem where > sync_filesystem() is obviously useless - proc, debugfs, pstore, devpts, > also always read-only filesystems such as isofs, qnx4, qnx6, befs, cramfs, > efs, freevxfs, romfs, squashfs. I think you can find a couple more which > clearly don't care about sync_filesystem() if you look a bit closer. > > > Honza I guess the same is true for other file systems which are mounted ro too. So maybe a check for MS_RDONLY before doing the sync in those cases? Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Cluster-devel] fallocate vs O_(D)SYNC
Hi, On Wed, 2011-11-16 at 03:42 -0500, Christoph Hellwig wrote: > It seems all filesystems but XFS ignore O_SYNC for fallocate, and never > make sure the size update transaction made it to disk. > > Given that a fallocate without FALLOC_FL_KEEP_SIZE very much is a data > operation (it adds new blocks that return zeroes) that seems like a > fairly nasty surprise for O_SYNC users. > In GFS2 we zero out the data blocks as we go (since our metadata doesn't allow us to mark blocks as zeroed at alloc time) and also because we are mostly interested in being able to do FALLOC_FL_KEEP_SIZE which we use on our rindex system file in order to ensure that there is always enough space to expand a filesystem. So there is no danger of having non-zeroed blocks appearing later, as that is done before the metadata change. Our fallocate_chunk() function calls mark_inode_dirty(inode) on each call, so that fsync should pick that up and ensure that the metadata has been written back. So we should thus have both data and metadata stable on disk. Do you have some evidence that this is not happening? Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Cluster-devel] fallocate vs O_(D)SYNC
Hi, On Wed, 2011-11-16 at 11:54 +0100, Jan Kara wrote: > Hello, > > On Wed 16-11-11 09:43:08, Steven Whitehouse wrote: > > On Wed, 2011-11-16 at 03:42 -0500, Christoph Hellwig wrote: > > > It seems all filesystems but XFS ignore O_SYNC for fallocate, and never > > > make sure the size update transaction made it to disk. > > > > > > Given that a fallocate without FALLOC_FL_KEEP_SIZE very much is a data > > > operation (it adds new blocks that return zeroes) that seems like a > > > fairly nasty surprise for O_SYNC users. > > > > In GFS2 we zero out the data blocks as we go (since our metadata doesn't > > allow us to mark blocks as zeroed at alloc time) and also because we are > > mostly interested in being able to do FALLOC_FL_KEEP_SIZE which we use > > on our rindex system file in order to ensure that there is always enough > > space to expand a filesystem. > > > > So there is no danger of having non-zeroed blocks appearing later, as > > that is done before the metadata change. > > > > Our fallocate_chunk() function calls mark_inode_dirty(inode) on each > > call, so that fsync should pick that up and ensure that the metadata has > > been written back. So we should thus have both data and metadata stable > > on disk. > > > > Do you have some evidence that this is not happening? > Yeah, only that nobody calls that fsync() automatically if the fd is > O_SYNC if I'm right. But maybe calling fdatasync() on the range which was > fallocated from sys_fallocate() if the fd is O_SYNC would do the trick for > most filesystems? That would match how we treat O_SYNC for other operations > as well. I'm just not sure whether XFS wouldn't take unnecessarily big hit > with this. > > Honza Ah, I see now. Sorry, I missed the original point. So that would just be a VFS addition to check the O_(D)SYNC flag as you suggest. I've no objections to that, it makes sense to me, Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Cluster-devel] fallocate vs O_(D)SYNC
Hi, Here is what I'm planning for GFS2: Add sync of metadata after fallocate for O_SYNC files to ensure that we meet expectations for everything being on disk in this case. Unfortunately, the offset and len parameters are modified during the course of the fallocate function, so I've had to add a couple of new variables to call generic_write_sync() at the end. I know that potentially this will sync data as well within the range, but I think that is a fairly harmless side-effect overall, since we would not normally expect there to be any dirty data within the range in question. Signed-off-by: Steven Whitehouse Cc: Christoph Hellwig Cc: Benjamin Marzinski diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index 6336bc6..9b6c6ac 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -752,6 +752,8 @@ static long gfs2_fallocate(struct file *file, int mode, loff_t offset, loff_t bytes, max_bytes; struct gfs2_alloc *al; int error; + const loff_t pos = offset; + const loff_t count = len; loff_t bsize_mask = ~((loff_t)sdp->sd_sb.sb_bsize - 1); loff_t next = (offset + len - 1) >> sdp->sd_sb.sb_bsize_shift; loff_t max_chunk_size = UINT_MAX & bsize_mask; @@ -834,6 +836,9 @@ retry: gfs2_quota_unlock(ip); gfs2_alloc_put(ip); } + + if (error == 0) + error = generic_write_sync(file, pos, count); goto out_unlock; out_trans_fail: -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Cluster-devel] [PATCH 3/4] gfs2: Use generic handlers of O_SYNC AIO DIO
Hi, Acked-by: Steven Whitehouse That looks ok to me, Steve. On Fri, 2012-02-10 at 17:04 +0100, Jan Kara wrote: > Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC > file. > > Signed-off-by: Jan Kara > --- > fs/gfs2/aops.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c > index 501e5cb..9c381ff 100644 > --- a/fs/gfs2/aops.c > +++ b/fs/gfs2/aops.c > @@ -1034,7 +1034,7 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb > *iocb, > > rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, > offset, nr_segs, gfs2_get_block_direct, > - NULL, NULL, 0); > + NULL, NULL, DIO_SYNC_WRITES); > out: > gfs2_glock_dq_m(1, &gh); > gfs2_holder_uninit(&gh); -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type
Hi, On Fri, 2012-08-31 at 12:46 -0400, Namjae Jeon wrote: > This patch is a follow up on below patch: > > [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type > https://patchwork.kernel.org/patch/1385131/ > > Signed-off-by: Namjae Jeon > Signed-off-by: Vivek Trivedi Acked-by: Steven Whitehouse for the gfs2 bits, Steve. > --- > fs/btrfs/export.c |4 ++-- > fs/ceph/export.c|4 ++-- > fs/fuse/inode.c |2 +- > fs/gfs2/export.c|4 ++-- > fs/isofs/export.c |4 ++-- > fs/nilfs2/namei.c |4 ++-- > fs/ocfs2/export.c |4 ++-- > fs/reiserfs/inode.c |4 ++-- > fs/udf/namei.c |4 ++-- > fs/xfs/xfs_export.c |4 ++-- > mm/cleancache.c |2 +- > mm/shmem.c |2 +- > 12 files changed, 21 insertions(+), 21 deletions(-) > > diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c > index 614f34a..81ee29e 100644 > --- a/fs/btrfs/export.c > +++ b/fs/btrfs/export.c > @@ -22,10 +22,10 @@ static int btrfs_encode_fh(struct inode *inode, u32 *fh, > int *max_len, > > if (parent && (len < BTRFS_FID_SIZE_CONNECTABLE)) { > *max_len = BTRFS_FID_SIZE_CONNECTABLE; > - return 255; > + return FILEID_INVALID; > } else if (len < BTRFS_FID_SIZE_NON_CONNECTABLE) { > *max_len = BTRFS_FID_SIZE_NON_CONNECTABLE; > - return 255; > + return FILEID_INVALID; > } > > len = BTRFS_FID_SIZE_NON_CONNECTABLE; > diff --git a/fs/ceph/export.c b/fs/ceph/export.c > index 8e1b60e..98bde89 100644 > --- a/fs/ceph/export.c > +++ b/fs/ceph/export.c > @@ -79,7 +79,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, > int *max_len, > if (parent_inode) { > /* nfsd wants connectable */ > *max_len = connected_handle_length; > - type = 255; > + type = FILEID_INVALID; > } else { > dout("encode_fh %p\n", dentry); > fh->ino = ceph_ino(inode); > @@ -88,7 +88,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, > int *max_len, > } > } else { > *max_len = handle_length; > - type = 255; > + type = FILEID_INVALID; > } > return type; > } > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c > index 2690a76..b787a6f 100644 > --- a/fs/fuse/inode.c > +++ b/fs/fuse/inode.c > @@ -672,7 +672,7 @@ static int fuse_encode_fh(struct inode *inode, u32 *fh, > int *max_len, > > if (*max_len < len) { > *max_len = len; > - return 255; > + return FILEID_INVALID; > } > > nodeid = get_fuse_inode(inode)->nodeid; > diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c > index e8ed6d4..f7a8092 100644 > --- a/fs/gfs2/export.c > +++ b/fs/gfs2/export.c > @@ -37,10 +37,10 @@ static int gfs2_encode_fh(struct inode *inode, __u32 *p, > int *len, > > if (parent && (*len < GFS2_LARGE_FH_SIZE)) { > *len = GFS2_LARGE_FH_SIZE; > - return 255; > + return FILEID_INVALID; > } else if (*len < GFS2_SMALL_FH_SIZE) { > *len = GFS2_SMALL_FH_SIZE; > - return 255; > + return FILEID_INVALID; > } > > fh[0] = cpu_to_be32(ip->i_no_formal_ino >> 32); > diff --git a/fs/isofs/export.c b/fs/isofs/export.c > index 1d38044..5e693b3 100644 > --- a/fs/isofs/export.c > +++ b/fs/isofs/export.c > @@ -125,10 +125,10 @@ isofs_export_encode_fh(struct inode *inode, >*/ > if (parent && (len < 5)) { > *max_len = 5; > - return 255; > + return FILEID_INVALID; > } else if (len < 3) { > *max_len = 3; > - return 255; > + return FILEID_INVALID; > } > > len = 3; > diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c > index 1d0c0b8..9de78f0 100644 > --- a/fs/nilfs2/namei.c > +++ b/fs/nilfs2/namei.c > @@ -517,11 +517,11 @@ static int nilfs_encode_fh(struct inode *inode, __u32 > *fh, int *lenp, > > if (parent && *lenp < NILFS_FID_SIZE_CONNECTABLE) { > *lenp = NILFS_FID_SIZE_CONNECTABLE; > - return 255; > + return FILEID_INVALID; > } > if (*lenp < NILFS_FID_SIZE_NON_CONNECTABLE) { > *lenp = NILFS_FID_SIZE_NON_CONNECTABLE; > - return 255; > + return FILEID_INVALID;