Re: [PATCH] Btrfs: add another missing end_page_writeback on submit_extent_page failure

2016-12-21 Thread Liu Bo
On Fri, Dec 16, 2016 at 03:41:50PM +0900, Takafumi Kubota wrote:
> This is actually inspired by Filipe's patch(55e3bd2e0c2e1).
> 
> When submit_extent_page() in __extent_writepage_io() fails,
> Btrfs misses clearing a writeback bit of the failed page.
> This causes the false under-writeback page.
> Then, another sync task hangs in filemap_fdatawait_range(),
> because it waits the false under-writeback page.
> 
> CPU0CPU1
> 
> __extent_writepage_io()
>   ret = submit_extent_page() // fail
> 
>   if (ret)
> SetPageError(page)
> // miss clearing the writeback bit
> 
> sync()
>   ...
>   filemap_fdatawait_range()
> wait_on_page_writeback(page);
> // wait the false under-writeback page
> 
> Signed-off-by: Takafumi Kubota 
> ---
>  fs/btrfs/extent_io.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 1e67723..ef9793b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3443,8 +3443,10 @@ static noinline_for_stack int 
> __extent_writepage_io(struct inode *inode,
>bdev, >bio, max_nr,
>end_bio_extent_writepage,
>0, 0, 0, false);
> - if (ret)
> + if (ret) {
>   SetPageError(page);
> + end_page_writeback(page);
> + }

OK...this could be complex as we don't know which part in
submit_extent_page gets the error, if the page has been added into bio
and bio_end would call end_page_writepage(page) as well, so whichever
comes later, the BUG() in end_page_writeback() would complain.

Looks like commit 55e3bd2e0c2e1 also has the same problem although I
gave it my reviewed-by.

Thanks,

-liubo

>  
>   cur = cur + iosize;
>   pg_offset += iosize;
> -- 
> 1.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2 v2] btrfs: fix error handling when run_delayed_extent_op fails

2016-12-21 Thread Liu Bo
On Tue, Dec 20, 2016 at 01:28:27PM -0500, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> In __btrfs_run_delayed_refs, the error path when run_delayed_extent_op
> fails sets locked_ref->processing = 0 but doesn't re-increment
> delayed_refs->num_heads_ready.  As a result, we end up triggering
> the WARN_ON in btrfs_select_ref_head.
> 
> Fixes: d7df2c796d7 (Btrfs: attach delayed ref updates to delayed ref heads)
> Reported-by: Jon Nelson 
> Signed-off-by: Jeff Mahoney 
> ---
>  fs/btrfs/extent-tree.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index db74889..d74adf1 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2576,7 +2576,10 @@ static noinline int __btrfs_run_delayed_refs(struct 
> btrfs_trans_handle *trans,
>*/
>   if (must_insert_reserved)
>   
> locked_ref->must_insert_reserved = 1;
> + spin_lock(_refs->lock);
>   locked_ref->processing = 0;
> + delayed_refs->num_heads_ready++;
> + spin_unlock(_refs->lock);

Looks good to me.

Reviewed-by: Liu Bo 

Thanks,

-liubo
>   btrfs_debug(fs_info,
>   "run_delayed_extent_op 
> returned %d",
>   ret);
> -- 
> 1.8.5.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: fix locking when we put back a delayed ref that's too new

2016-12-21 Thread Liu Bo
On Tue, Dec 20, 2016 at 01:28:28PM -0500, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> In __btrfs_run_delayed_refs, when we put back a delayed ref that's too
> new, we have already dropped the lock on locked_ref when we set
> ->processing = 0.
> 
> This patch keeps the lock to cover that assignment.
> 
> Fixes: d7df2c796d7 (Btrfs: attach delayed ref updates to delayed ref heads)
> Signed-off-by: Jeff Mahoney 
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index d74adf1..930ac8e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2526,11 +2526,11 @@ static noinline int __btrfs_run_delayed_refs(struct 
> btrfs_trans_handle *trans,
>   if (ref && ref->seq &&
>   btrfs_check_delayed_seq(fs_info, delayed_refs, ref->seq)) {
>   spin_unlock(_ref->lock);
> - btrfs_delayed_ref_unlock(locked_ref);
>   spin_lock(_refs->lock);
>   locked_ref->processing = 0;
>   delayed_refs->num_heads_ready++;
>   spin_unlock(_refs->lock);
> + btrfs_delayed_ref_unlock(locked_ref);

I don't think that this would end up a deadlock as we use mutex_try_lock
for head->mutex everywhere, but I'd rather have it cleaned up.

Reviewed-by: Liu Bo 

Thanks,

-liubo
>   locked_ref = NULL;
>   cond_resched();
>   count++;
> -- 
> 1.8.5.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[bug or by design ?] btrfs defrag compression does not persist

2016-12-21 Thread Anand Jain


A quick design specific question.

The following command converts file-data-extents to the specified
encoder (lzo).

  $ btrfs filesystem defrag -v -r -f -clzo dir/

However the lzo property does not persist through the file modify.
As the above operation does not update the btrfs.compression property.

Question:
 I wonder if this should be a bug or if its by design ?
 What could be the main use case to _associate compression
 at the time of defrag_ ?
 The no-conflict fix will be to add another option to make
 it persistent.

Thanks, Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] fstests: btrfs: Test scrub and replace race for RAID56

2016-12-21 Thread Qu Wenruo
Although by design, btrfs scrub and replace share the same code path, so
they are exclusive to each other.

But the fact is, there is still some critical region not protected well,
so we can have the following kernel panic, especially easy to trigger on
RAID5/6 profiles.

general protection fault:  [#1] SMP
Call Trace:
 [] ? generic_make_request+0xcf/0x290
 [] generic_make_request+0x24/0x290
 [] ? generic_make_request+0xcf/0x290
 [] submit_bio+0x6e/0x120
 [] ? rbio_orig_end_io+0x80/0x80 [btrfs]
 [] finish_rmw+0x401/0x550 [btrfs]
 [] validate_rbio_for_rmw+0x36/0x40 [btrfs]
 [] raid_rmw_end_io+0x7d/0x90 [btrfs]
 [] bio_endio+0x56/0x60
 [] end_workqueue_fn+0x3c/0x40 [btrfs]
 [] btrfs_scrubparity_helper+0xef/0x610 [btrfs]
 [] btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
 [] process_one_work+0x2af/0x720
 [] ? process_one_work+0x22b/0x720
 [] worker_thread+0x4b/0x4f0
 [] ? process_one_work+0x720/0x720
 [] ? process_one_work+0x720/0x720
 [] kthread+0xf3/0x110
 [] ? kthread_park+0x60/0x60
 [] ret_from_fork+0x27/0x40
RIP  [] generic_make_request_checks+0x198/0x5a0
 RSP 
---[ end trace f48aec343095cd83 ]---

Since it's a racy panic, the reproducibility may var on different
platforms.
In my physical test machine, it takes less than 10s to trigger panic,
while in my VM, it takes about 40~60s for one panic,

So this test case uses TIME_FACTOR to meet different needs.

Signed-off-by: Qu Wenruo 
---
 tests/btrfs/133 | 126 
 tests/btrfs/133.out |   2 +
 tests/btrfs/group   |   1 +
 3 files changed, 129 insertions(+)
 create mode 100755 tests/btrfs/133
 create mode 100644 tests/btrfs/133.out

diff --git a/tests/btrfs/133 b/tests/btrfs/133
new file mode 100755
index 000..c912e40
--- /dev/null
+++ b/tests/btrfs/133
@@ -0,0 +1,126 @@
+#! /bin/bash
+# FS QA Test 133
+#
+# Test scrub and replace race for RAID5/6
+#
+# Even these 2 operations are exclusive to each other, they can still
+# cause race and trigger a NULL pointer panic for any multi-device
+# profile.
+#
+# This bug is most obvoious for RAID5/6 profiles, although other profile
+# like RAID0 can also trigger it, the possibility is quite low compared to
+# RAID5/6
+#
+#---
+# Copyright (c) 2016 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 5
+_require_scratch_dev_pool_equal_size
+
+runtime=$((60 * $TIME_FACTOR))
+
+nr_devs=$(($(echo $SCRATCH_DEV_POOL | wc -w) - 1))
+
+run_test()
+{
+   local mkfs_opts=$1
+   local saved_scratch_dev_pool=$SCRATCH_DEV_POOL
+
+   echo "Test $mkfs_opts" >>$seqres.full
+
+   # remove the last device from the SCRATCH_DEV_POOL list so
+   # _scratch_pool_mkfs won't use all devices in pool
+   local last_dev="`echo $SCRATCH_DEV_POOL | $AWK_PROG '{print $NF}'`"
+   SCRATCH_DEV_POOL=`echo $SCRATCH_DEV_POOL | sed -e "s# *$last_dev *##"`
+   _scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1
+   # make sure we created btrfs with desired options
+   if [ $? -ne 0 ]; then
+   echo "mkfs $mkfs_opts failed"
+   SCRATCH_DEV_POOL=$saved_scratch_dev_pool
+   return
+   fi
+   _scratch_mount >>$seqres.full 2>&1
+   SCRATCH_DEV_POOL=$saved_scratch_dev_pool
+
+   # Fill the fs so that each device has at least 64M data
+   # This will slow down replace and increase the possibility to
+   # trigger the bug
+   _pwrite_byte 0xcdcdcdcd 0 $(($nr_devs * 64 * 1024 * 1024)) \
+   $SCRATCH_MNT/file > /dev/null 2>&1
+   sync
+
+   echo -n "Start replace worker: " >>$seqres.full
+   _btrfs_stress_replace $SCRATCH_MNT >>$seqres.full 2>&1 &
+   replace_pid=$!
+   echo 

Re: [PATCH v3 0/6] Convert rollback rework for v4.9

2016-12-21 Thread Qu Wenruo



At 12/21/2016 10:35 PM, David Sterba wrote:

On Mon, Dec 19, 2016 at 02:56:36PM +0800, Qu Wenruo wrote:

Can be fetched from github:
https://github.com/adam900710/btrfs-progs.git convert_rework_for_4.9

This is mainly to fix problems exposed by Chandan's fix for 64K nodesize.


Thanks for the fix. That's a lot of new code that I don't feel
comfortable to put to the upcomming release, so I'll merge 1 and 6 now
and queue the rest for 4.9.1.


Feel free to delay them to 4.9.1.

But since we have regression, I think we'd better delay the following 
patches along with the rollback rework:


btrfs-progs: convert: Fix migrate_super_block() to work with 64k sectorsize

btrfs-progs: convert: Prevent accounting blocks beyond end of device

And the 6th patch of the patchset.


The 2 convert fixes are completely fine, but they slightly change the 
chunk layout, and makes 4K sectorsize convert test fails.
So is the 6th patch, which enhanced the testcases, to make them always 
fail if rollback rework is not applied.


So IMHO it's better to ensure the v4.9 release can pass self tests.
(Even we know it has bugs)

And fixes all the known convert bugs in v4.9.1.

How do you think about this idea?

Thanks,
Qu


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [CORRUPTION FILESYSTEM] Corrupted and unrecoverable file system during the snapshot receive

2016-12-21 Thread Xin Zhou
Hi,
Racing condition can happen, if running multiple transfers to the same 
destination.
Would you like to tell how many transfers are the scripts running at a time to 
a specific hdd?

Thanks,
Xin
 

Sent: Wednesday, December 21, 2016 at 1:11 PM
From: "Chris Murphy" 
To: No recipient address
Cc: "Giuseppe Della Bianca" , "Xin Zhou" , 
"Btrfs BTRFS" 
Subject: Re: [CORRUPTION FILESYSTEM] Corrupted and unrecoverable file system 
during the snapshot receive
On Wed, Dec 21, 2016 at 2:09 PM, Chris Murphy  wrote:
> What about CONFIG_BTRFS_FS_CHECK_INTEGRITY? And then using check_int
> mount option?

This slows things down, and in that case it might avoid the problem if
it's the result of a race condition.

--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [CORRUPTION FILESYSTEM] Corrupted and unrecoverable file system during the snapshot receive

2016-12-21 Thread Chris Murphy
On Wed, Dec 21, 2016 at 2:09 PM, Chris Murphy  wrote:
> What about CONFIG_BTRFS_FS_CHECK_INTEGRITY? And then using check_int
> mount option?

This slows things down, and in that case it might avoid the problem if
it's the result of a race condition.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [CORRUPTION FILESYSTEM] Corrupted and unrecoverable file system during the snapshot receive

2016-12-21 Thread Chris Murphy
What about CONFIG_BTRFS_FS_CHECK_INTEGRITY? And then using check_int
mount option?


Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 12/30] fat: convert to new i_version API

2016-12-21 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 fs/fat/dir.c |  2 +-
 fs/fat/inode.c   |  8 
 fs/fat/namei_msdos.c |  6 +++---
 fs/fat/namei_vfat.c  | 20 ++--
 4 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index 81cecbe6d7cf..76a4cace7543 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -1056,7 +1056,7 @@ int fat_remove_entries(struct inode *dir, struct 
fat_slot_info *sinfo)
brelse(bh);
if (err)
return err;
-   dir->i_version++;
+   inode_inc_iversion_locked(dir);
 
if (nr_slots) {
/*
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 338d2f73eb29..9a8a22111031 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -507,7 +507,7 @@ int fat_fill_inode(struct inode *inode, struct 
msdos_dir_entry *de)
MSDOS_I(inode)->i_pos = 0;
inode->i_uid = sbi->options.fs_uid;
inode->i_gid = sbi->options.fs_gid;
-   inode->i_version++;
+   inode_inc_iversion_locked(inode);
inode->i_generation = get_seconds();
 
if ((de->attr & ATTR_DIR) && !IS_FREE(de->name)) {
@@ -590,7 +590,7 @@ struct inode *fat_build_inode(struct super_block *sb,
goto out;
}
inode->i_ino = iunique(sb, MSDOS_ROOT_INO);
-   inode->i_version = 1;
+   inode_set_iversion(inode, 1);
err = fat_fill_inode(inode, de);
if (err) {
iput(inode);
@@ -1367,7 +1367,7 @@ static int fat_read_root(struct inode *inode)
MSDOS_I(inode)->i_pos = MSDOS_ROOT_INO;
inode->i_uid = sbi->options.fs_uid;
inode->i_gid = sbi->options.fs_gid;
-   inode->i_version++;
+   inode_inc_iversion_locked(inode);
inode->i_generation = 0;
inode->i_mode = fat_make_mode(sbi, ATTR_DIR, S_IRWXUGO);
inode->i_op = sbi->dir_ops;
@@ -1817,7 +1817,7 @@ int fat_fill_super(struct super_block *sb, void *data, 
int silent, int isvfat,
if (!root_inode)
goto out_fail;
root_inode->i_ino = MSDOS_ROOT_INO;
-   root_inode->i_version = 1;
+   inode_set_iversion(root_inode, 1);
error = fat_read_root(root_inode);
if (error < 0) {
iput(root_inode);
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index 7d6a105d601b..459fe1d9b578 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -480,7 +480,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned 
char *old_name,
} else
mark_inode_dirty(old_inode);
 
-   old_dir->i_version++;
+   inode_inc_iversion_locked(old_dir);
old_dir->i_ctime = old_dir->i_mtime = 
current_time(old_dir);
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
@@ -508,7 +508,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned 
char *old_name,
goto out;
new_i_pos = sinfo.i_pos;
}
-   new_dir->i_version++;
+   inode_inc_iversion_locked(new_dir);
 
fat_detach(old_inode);
fat_attach(old_inode, new_i_pos);
@@ -540,7 +540,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned 
char *old_name,
old_sinfo.bh = NULL;
if (err)
goto error_dotdot;
-   old_dir->i_version++;
+   inode_inc_iversion_locked(old_dir);
old_dir->i_ctime = old_dir->i_mtime = ts;
if (IS_DIRSYNC(old_dir))
(void)fat_sync_inode(old_dir);
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 6a7152d0c250..d7db82751a98 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -44,7 +44,7 @@ static int vfat_revalidate_shortname(struct dentry *dentry)
 {
int ret = 1;
spin_lock(>d_lock);
-   if (vfat_d_version(dentry) != d_inode(dentry->d_parent)->i_version)
+   if (inode_cmp_iversion(d_inode(dentry->d_parent), 
vfat_d_version(dentry)))
ret = 0;
spin_unlock(>d_lock);
return ret;
@@ -770,7 +770,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct 
dentry *dentry,
 out:
mutex_unlock(_SB(sb)->s_lock);
if (!inode)
-   vfat_d_version_set(dentry, dir->i_version);
+   vfat_d_version_set(dentry, inode_get_iversion(dir));
return d_splice_alias(inode, dentry);
 error:
mutex_unlock(_SB(sb)->s_lock);
@@ -792,7 +792,7 @@ static int vfat_create(struct inode *dir, struct dentry 
*dentry, umode_t mode,
err = vfat_add_entry(dir, >d_name, 0, 0, , );
if (err)
goto out;
-   dir->i_version++;
+   inode_inc_iversion_locked(dir);
 
inode = fat_build_inode(sb, sinfo.de, sinfo.i_pos);
brelse(sinfo.bh);
@@ -800,7 +800,7 @@ static int vfat_create(struct inode *dir, struct dentry 
*dentry, umode_t mode,
err = 

[RFC PATCH v1 02/30] ecryptfs: remove unnecessary i_version bump

2016-12-21 Thread Jeff Layton
ecryptfs bumps the i_version counter when a new inode is
instantiated, but never touches it again.

Signed-off-by: Jeff Layton 
---
 fs/ecryptfs/inode.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index e7413f82d27b..9493e8614de1 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -64,7 +64,6 @@ static int ecryptfs_inode_set(struct inode *inode, void 
*opaque)
/* i_size will be overwritten for encrypted regular files */
fsstack_copy_inode_size(inode, lower_inode);
inode->i_ino = lower_inode->i_ino;
-   inode->i_version++;
inode->i_mapping->a_ops = _aops;
 
if (S_ISLNK(inode->i_mode))
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 09/30] reiserfs: remove unneeded i_version bump

2016-12-21 Thread Jeff Layton
The i_version field in reiserfs is not initialized and is only ever
updated here. Nothing ever views it, so just remove it.

Signed-off-by: Jeff Layton 
---
 fs/reiserfs/super.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index e314cb30a181..b7cf777fc2ac 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -2521,7 +2521,6 @@ static ssize_t reiserfs_quota_write(struct super_block 
*sb, int type,
return err;
if (inode->i_size < off + len - towrite)
i_size_write(inode, off + len - towrite);
-   inode->i_version++;
inode->i_mtime = inode->i_ctime = current_time(inode);
mark_inode_dirty(inode);
return len - towrite;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 08/30] orangefs: remove initialization of i_version

2016-12-21 Thread Jeff Layton
...as it's completely unused.

Signed-off-by: Jeff Layton 
---
 fs/orangefs/super.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/orangefs/super.c b/fs/orangefs/super.c
index c48859f16e7b..58ad9875eafc 100644
--- a/fs/orangefs/super.c
+++ b/fs/orangefs/super.c
@@ -85,8 +85,6 @@ static void orangefs_inode_cache_ctor(void *req)
 
inode_init_once(_inode->vfs_inode);
init_rwsem(_inode->xattr_sem);
-
-   orangefs_inode->vfs_inode.i_version = 1;
 }
 
 static struct inode *orangefs_alloc_inode(struct super_block *sb)
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 01/30] lustre: don't set f_version in ll_readdir

2016-12-21 Thread Jeff Layton
f_version is only ever used by filesystem-specific code, and nothing in
lustre ever looks at it.

Signed-off-by: Jeff Layton 
---
 drivers/staging/lustre/lustre/llite/dir.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/dir.c 
b/drivers/staging/lustre/lustre/llite/dir.c
index ea5d247a3f70..1d288ec5f03e 100644
--- a/drivers/staging/lustre/lustre/llite/dir.c
+++ b/drivers/staging/lustre/lustre/llite/dir.c
@@ -367,8 +367,6 @@ static int ll_readdir(struct file *filp, struct dir_context 
*ctx)
}
ctx->pos = pos;
ll_finish_md_op_data(op_data);
-   filp->f_version = inode->i_version;
-
 out:
if (!rc)
ll_stats_ops_tally(sbi, LPROC_LL_READDIR, 1);
@@ -1671,7 +1669,6 @@ static loff_t ll_dir_seek(struct file *file, loff_t 
offset, int origin)
else
fd->lfd_pos = offset;
file->f_pos = offset;
-   file->f_version = 0;
}
ret = offset;
}
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2016-12-21 Thread Jeff Layton
tl;dr: I think we can greatly reduce the cost of the inode->i_version
counter, by exploiting the fact that we don't need to increment it
if no one is looking at it. We can also clean up the code to prepare
to eventually expose this value via statx().

The inode->i_version field is supposed to be a value that changes
whenever there is any data or metadata change to the inode. Some
filesystems use it internally to detect directory changes during
readdir. knfsd will use it if the filesystem has MS_I_VERSION
set. IMA will also use it (though it's not clear to me that that
works 100% -- no check for MS_I_VERSION there).

Only btrfs, ext4, and xfs implement it for data changes. Because of
this, these filesystems must log the inode to disk whenever the
i_version counter changes. That has a non-zero performance impact,
especially on write-heavy workloads, because we end up dirtying the
inode metadata on every write, not just when the times change. [1]

It turns out though that none of these users of i_version require that
i_version change on every change to the file. The only real requirement
is that it be different if _something_ changed since the last time we
queried for it. [2]

So, if we simply keep track of when something queries the value, we
can avoid bumping the counter and that metadata update.

This patchset implements this:

It starts with some small cleanup patches to just remove any mention of
the i_version counter in filesystems that don't actually use it.

Then, we add a new set of static inlines for managing the counter. The
initial version should work identically to what we have now. Then, all
of the remaining filesystems that use i_version are converted to the new
inlines.

Once that's in place, we switch to a new implementation that allows us
to track readers of i_version counter, and only bump it when it's
necessary or convenient (i.e. we're going to disk anyway).

The final patch switches from a scheme that uses the i_lock to serialize
the counter updates during write to an atomic64_t. That's a wash
performance-wise in my testing, but I like not having to take the i_lock
down where it could end up nested inside other locks.

With this, we reduce inode metadata updates across all 3 filesystems
down to roughly the frequency of the timestamp granularity, particularly
when it's not being queried (the vastly common case).

The pessimal workload here is 1 byte writes, and it helps that
significantly. Of course, that's not a real-world workload.

A tiobench-example.fio workload also shows some modest performance
gains, and I've gotten mails from the kernel test robot that show some
significant performance gains on some microbenchmarks (case-msync-mt in
the vm-scalability testsuite to be specific).

I'm happy to run other workloads if anyone can suggest them.

At this point, the patchset works and does what it's expected to do in
my own testing. It seems like it's at least a modest performance win
across all 3 major disk-based filesystems. It may also encourage others
to implement i_version as well since it reduces that cost.

Is this an avenue that's worthwhile to pursue?

Note that I think we may have other changes coming in the future that
will make this sort of cleanup necessary anyway. I'd like to plug in the
Ceph change attribute here eventually, and that will require something
like this anyway.

Thoughts, comments and suggestions are welcome...

---

[1]: On ext4 it must be turned on with the i_version mount option,
 mostly due to fears of incurring this impact, AFAICT.

[2]: NFS also recommends that it appear to increase in value over time, so
 that clients can discard metadata updates that are older than ones
 they've already seen.

Jeff Layton (30):
  lustre: don't set f_version in ll_readdir
  ecryptfs: remove unnecessary i_version bump
  ceph: remove the bump of i_version
  f2fs: don't bother setting i_version
  hpfs: don't bother with the i_version counter
  jfs: remove initialization of i_version counter
  nilfs2: remove inode->i_version initialization
  orangefs: remove initialization of i_version
  reiserfs: remove unneeded i_version bump
  ntfs: remove i_version handling
  fs: new API for handling i_version
  fat: convert to new i_version API
  affs: convert to new i_version API
  afs: convert to new i_version API
  btrfs: convert to new i_version API
  exofs: switch to new i_version API
  ext2: convert to new i_version API
  ext4: convert to new i_version API
  nfs: convert to new i_version API
  nfsd: convert to new i_version API
  ocfs2: convert to new i_version API
  ufs: use new i_version API
  xfs: convert to new i_version API
  IMA: switch IMA over to new i_version API
  fs: add a "force" parameter to inode_inc_iversion
  fs: only set S_VERSION when updating times if it has been queried
  xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need
incrementing
  btrfs: only dirty the inode in btrfs_update_time if something was
changed
  fs: track whether the 

[RFC PATCH v1 10/30] ntfs: remove i_version handling

2016-12-21 Thread Jeff Layton
Nothing uses this, and the i_version is never incremented on ntfs.

Signed-off-by: Jeff Layton 
---
 fs/ntfs/inode.c | 9 -
 fs/ntfs/mft.c   | 6 --
 2 files changed, 15 deletions(-)

diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index 7c410f879412..1c1ee489284b 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -560,13 +560,6 @@ static int ntfs_read_locked_inode(struct inode *vi)
ntfs_debug("Entering for i_ino 0x%lx.", vi->i_ino);
 
/* Setup the generic vfs inode parts now. */
-
-   /*
-* This is for checking whether an inode has changed w.r.t. a file so
-* that the file can be updated if necessary (compare with f_version).
-*/
-   vi->i_version = 1;
-
vi->i_uid = vol->uid;
vi->i_gid = vol->gid;
vi->i_mode = 0;
@@ -1240,7 +1233,6 @@ static int ntfs_read_locked_attr_inode(struct inode 
*base_vi, struct inode *vi)
base_ni = NTFS_I(base_vi);
 
/* Just mirror the values from the base inode. */
-   vi->i_version   = base_vi->i_version;
vi->i_uid   = base_vi->i_uid;
vi->i_gid   = base_vi->i_gid;
set_nlink(vi, base_vi->i_nlink);
@@ -1507,7 +1499,6 @@ static int ntfs_read_locked_index_inode(struct inode 
*base_vi, struct inode *vi)
ni  = NTFS_I(vi);
base_ni = NTFS_I(base_vi);
/* Just mirror the values from the base inode. */
-   vi->i_version   = base_vi->i_version;
vi->i_uid   = base_vi->i_uid;
vi->i_gid   = base_vi->i_gid;
set_nlink(vi, base_vi->i_nlink);
diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c
index b6f402194f02..165133321577 100644
--- a/fs/ntfs/mft.c
+++ b/fs/ntfs/mft.c
@@ -2641,12 +2641,6 @@ ntfs_inode *ntfs_mft_record_alloc(ntfs_volume *vol, 
const int mode,
goto undo_mftbmp_alloc;
}
vi->i_ino = bit;
-   /*
-* This is for checking whether an inode has changed w.r.t. a
-* file so that the file can be updated if necessary (compare
-* with f_version).
-*/
-   vi->i_version = 1;
 
/* The owner and group come from the ntfs volume. */
vi->i_uid = vol->uid;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 03/30] ceph: remove the bump of i_version

2016-12-21 Thread Jeff Layton
Eventually, we'll want to wire it up to use the change attribute that
the cluster tracks instead, but for now this is unneeded.

Signed-off-by: Jeff Layton 
---
 fs/ceph/inode.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 398e5328b309..c3406acb36e7 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -796,7 +796,6 @@ static int fill_inode(struct inode *inode, struct page 
*locked_page,
 
/* update inode */
ci->i_version = le64_to_cpu(info->version);
-   inode->i_version++;
inode->i_rdev = le32_to_cpu(info->rdev);
inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 05/30] hpfs: don't bother with the i_version counter

2016-12-21 Thread Jeff Layton
Nothing uses it on HPFS.

Signed-off-by: Jeff Layton 
---
 fs/hpfs/dir.c   | 1 -
 fs/hpfs/dnode.c | 2 --
 fs/hpfs/super.c | 1 -
 3 files changed, 4 deletions(-)

diff --git a/fs/hpfs/dir.c b/fs/hpfs/dir.c
index 7b9150c2e75c..1a9fded00378 100644
--- a/fs/hpfs/dir.c
+++ b/fs/hpfs/dir.c
@@ -149,7 +149,6 @@ static int hpfs_readdir(struct file *file, struct 
dir_context *ctx)
if (unlikely(ret < 0))
goto out;
ctx->pos = ((loff_t) 
hpfs_de_as_down_as_possible(inode->i_sb, hpfs_inode->i_dno) << 4) + 1;
-   file->f_version = inode->i_version;
}
next_pos = ctx->pos;
if (!(de = map_pos_dirent(inode, _pos, ))) {
diff --git a/fs/hpfs/dnode.c b/fs/hpfs/dnode.c
index 86ab7e790b4e..454acb0a009a 100644
--- a/fs/hpfs/dnode.c
+++ b/fs/hpfs/dnode.c
@@ -418,7 +418,6 @@ int hpfs_add_dirent(struct inode *i,
c = 1;
goto ret;
}   
-   i->i_version++;
c = hpfs_add_to_dnode(i, dno, name, namelen, new_de, 0);
ret:
return c;
@@ -725,7 +724,6 @@ int hpfs_remove_dirent(struct inode *i, dnode_secno dno, 
struct hpfs_dirent *de,
return 2;
}
}
-   i->i_version++;
for_all_poss(i, hpfs_pos_del, (t = get_pos(dnode, de)) + 1, 1);
hpfs_delete_de(i->i_sb, dnode, de);
hpfs_mark_4buffers_dirty(qbh);
diff --git a/fs/hpfs/super.c b/fs/hpfs/super.c
index 82067ca22f2b..2fd66c854255 100644
--- a/fs/hpfs/super.c
+++ b/fs/hpfs/super.c
@@ -235,7 +235,6 @@ static struct inode *hpfs_alloc_inode(struct super_block 
*sb)
ei = kmem_cache_alloc(hpfs_inode_cachep, GFP_NOFS);
if (!ei)
return NULL;
-   ei->vfs_inode.i_version = 1;
return >vfs_inode;
 }
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 14/30] afs: convert to new i_version API

2016-12-21 Thread Jeff Layton
For AFS, it's generally treated as an opaque value.

Note that AFS has quite a different definition for this counter. AFS
only increments it on changes to the data, not for the metadata. We'll
need to reconcile that somehow if we ever want to present this to
userspace via statx.

Signed-off-by: Jeff Layton 
---
 fs/afs/fsclient.c | 2 +-
 fs/afs/inode.c| 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index 31c616ab9b40..7958001a2bf7 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -108,7 +108,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp,
vnode->vfs_inode.i_ctime.tv_sec = status->mtime_server;
vnode->vfs_inode.i_mtime= vnode->vfs_inode.i_ctime;
vnode->vfs_inode.i_atime= vnode->vfs_inode.i_ctime;
-   vnode->vfs_inode.i_version  = data_version;
+   inode_set_iversion(>vfs_inode, data_version);
}
 
expected_version = status->data_version;
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 86cc7264c21c..ec3a0274dc95 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -77,7 +77,7 @@ static int afs_inode_map_status(struct afs_vnode *vnode, 
struct key *key)
inode->i_atime  = inode->i_mtime = inode->i_ctime;
inode->i_blocks = 0;
inode->i_generation = vnode->fid.unique;
-   inode->i_version= vnode->status.data_version;
+   inode_set_iversion(inode, vnode->status.data_version);
inode->i_mapping->a_ops = _fs_aops;
 
/* check to see whether a symbolic link is really a mountpoint */
@@ -182,7 +182,7 @@ struct inode *afs_iget_autocell(struct inode *dir, const 
char *dev_name,
inode->i_ctime.tv_nsec  = 0;
inode->i_atime  = inode->i_mtime = inode->i_ctime;
inode->i_blocks = 0;
-   inode->i_version= 0;
+   inode_set_iversion(inode, 0);
inode->i_generation = 0;
 
set_bit(AFS_VNODE_PSEUDODIR, >flags);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 04/30] f2fs: don't bother setting i_version

2016-12-21 Thread Jeff Layton
It's not used in f2fs currently. Just remove the initialization.

Signed-off-by: Jeff Layton 
---
 fs/f2fs/super.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 702638e21c76..78cceb0bed9e 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -572,7 +572,6 @@ static struct inode *f2fs_alloc_inode(struct super_block 
*sb)
init_once((void *) fi);
 
/* Initialize f2fs-specific inode info */
-   fi->vfs_inode.i_version = 1;
atomic_set(>dirty_pages, 0);
fi->i_current_depth = 1;
fi->i_advise = 0;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 07/30] nilfs2: remove inode->i_version initialization

2016-12-21 Thread Jeff Layton
It's never used in nilfs2.

Signed-off-by: Jeff Layton 
---
 fs/nilfs2/super.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 12eeae62a2b1..d157f6f96d9a 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -160,7 +160,6 @@ struct inode *nilfs_alloc_inode(struct super_block *sb)
ii->i_bh = NULL;
ii->i_state = 0;
ii->i_cno = 0;
-   ii->vfs_inode.i_version = 1;
nilfs_mapping_init(>i_btnode_cache, >vfs_inode);
return >vfs_inode;
 }
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 06/30] jfs: remove initialization of i_version counter

2016-12-21 Thread Jeff Layton
It's bumped on quota writes but is otherwise untouched.

Signed-off-by: Jeff Layton 
---
 fs/jfs/super.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/jfs/super.c b/fs/jfs/super.c
index 85671f7f8518..2739600f826e 100644
--- a/fs/jfs/super.c
+++ b/fs/jfs/super.c
@@ -829,7 +829,6 @@ static ssize_t jfs_quota_write(struct super_block *sb, int 
type,
}
if (inode->i_size < off+len-towrite)
i_size_write(inode, off+len-towrite);
-   inode->i_version++;
inode->i_mtime = inode->i_ctime = current_time(inode);
mark_inode_dirty(inode);
inode_unlock(inode);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 17/30] ext2: convert to new i_version API

2016-12-21 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 fs/ext2/dir.c   | 8 
 fs/ext2/super.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index d9650c9508e4..aae282798d18 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -91,7 +91,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
struct inode *dir = mapping->host;
int err = 0;
 
-   dir->i_version++;
+   inode_inc_iversion_locked(dir);
block_write_end(NULL, mapping, pos, len, len, page, NULL);
 
if (pos+len > dir->i_size) {
@@ -292,7 +292,7 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(ext2_chunk_size(inode)-1);
unsigned char *types = NULL;
-   int need_revalidate = file->f_version != inode->i_version;
+   bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
 
if (pos > inode->i_size - EXT2_DIR_REC_LEN(1))
return 0;
@@ -318,8 +318,8 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
offset = ext2_validate_entry(kaddr, offset, 
chunk_mask);
ctx->pos = (n<f_version = inode->i_version;
-   need_revalidate = 0;
+   file->f_version = inode_get_iversion(inode);
+   need_revalidate = false;
}
de = (ext2_dirent *)(kaddr+offset);
limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1);
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 6cb042b53b5b..5c9ba3b3f77b 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -167,7 +167,7 @@ static struct inode *ext2_alloc_inode(struct super_block 
*sb)
if (!ei)
return NULL;
ei->i_block_alloc_info = NULL;
-   ei->vfs_inode.i_version = 1;
+   inode_set_iversion(>vfs_inode, 1);
 #ifdef CONFIG_QUOTA
memset(>i_dquot, 0, sizeof(ei->i_dquot));
 #endif
@@ -1542,7 +1542,7 @@ static ssize_t ext2_quota_write(struct super_block *sb, 
int type,
return err;
if (inode->i_size < off+len-towrite)
i_size_write(inode, off+len-towrite);
-   inode->i_version++;
+   inode_inc_iversion(inode);
inode->i_mtime = inode->i_ctime = current_time(inode);
mark_inode_dirty(inode);
return len - towrite;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 15/30] btrfs: convert to new i_version API

2016-12-21 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 fs/btrfs/delayed-inode.c | 4 ++--
 fs/btrfs/inode.c | 4 ++--
 fs/btrfs/tree-log.c  | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 80982a83c9fd..5e1015705c8b 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1744,7 +1744,7 @@ static void fill_stack_inode_item(struct 
btrfs_trans_handle *trans,
btrfs_set_stack_inode_nbytes(inode_item, inode_get_bytes(inode));
btrfs_set_stack_inode_generation(inode_item,
 BTRFS_I(inode)->generation);
-   btrfs_set_stack_inode_sequence(inode_item, inode->i_version);
+   btrfs_set_stack_inode_sequence(inode_item, inode_get_iversion(inode));
btrfs_set_stack_inode_transid(inode_item, trans->transid);
btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
@@ -1798,7 +1798,7 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
BTRFS_I(inode)->generation = btrfs_stack_inode_generation(inode_item);
 BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item);
 
-   inode->i_version = btrfs_stack_inode_sequence(inode_item);
+   inode_set_iversion_read(inode, btrfs_stack_inode_sequence(inode_item));
inode->i_rdev = 0;
*rdev = btrfs_stack_inode_rdev(inode_item);
BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f2b281ad7af6..a56a45a3e992 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3732,7 +3732,7 @@ static int btrfs_read_locked_inode(struct inode *inode)
BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, inode_item);
BTRFS_I(inode)->last_trans = btrfs_inode_transid(leaf, inode_item);
 
-   inode->i_version = btrfs_inode_sequence(leaf, inode_item);
+   inode_set_iversion_read(inode, btrfs_inode_sequence(leaf, inode_item));
inode->i_generation = BTRFS_I(inode)->generation;
inode->i_rdev = 0;
rdev = btrfs_inode_rdev(leaf, inode_item);
@@ -3903,7 +3903,7 @@ static void fill_inode_item(struct btrfs_trans_handle 
*trans,
 );
btrfs_set_token_inode_generation(leaf, item, BTRFS_I(inode)->generation,
 );
-   btrfs_set_token_inode_sequence(leaf, item, inode->i_version, );
+   btrfs_set_token_inode_sequence(leaf, item, inode_get_iversion(inode), 
);
btrfs_set_token_inode_transid(leaf, item, trans->transid, );
btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, );
btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, );
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index f10bf5213ed8..4acc4d27383a 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3585,7 +3585,7 @@ static void fill_inode_item(struct btrfs_trans_handle 
*trans,
btrfs_set_token_inode_nbytes(leaf, item, inode_get_bytes(inode),
 );
 
-   btrfs_set_token_inode_sequence(leaf, item, inode->i_version, );
+   btrfs_set_token_inode_sequence(leaf, item, inode_get_iversion(inode), 
);
btrfs_set_token_inode_transid(leaf, item, trans->transid, );
btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, );
btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, );
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 16/30] exofs: switch to new i_version API

2016-12-21 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 fs/exofs/dir.c   | 8 
 fs/exofs/super.c | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 42f9a0a0c4ca..753828ef2db1 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -60,7 +60,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
struct inode *dir = mapping->host;
int err = 0;
 
-   dir->i_version++;
+   inode_inc_iversion_locked(dir);
 
if (!PageUptodate(page))
SetPageUptodate(page);
@@ -241,7 +241,7 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
unsigned long n = pos >> PAGE_SHIFT;
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(exofs_chunk_size(inode)-1);
-   int need_revalidate = (file->f_version != inode->i_version);
+   bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
 
if (pos > inode->i_size - EXOFS_DIR_REC_LEN(1))
return 0;
@@ -264,8 +264,8 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
chunk_mask);
ctx->pos = (n<f_version = inode->i_version;
-   need_revalidate = 0;
+   file->f_version = inode_get_iversion(inode);
+   need_revalidate = false;
}
de = (struct exofs_dir_entry *)(kaddr + offset);
limit = kaddr + exofs_last_byte(inode, n) -
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 1076a4233b39..74400d1dcea5 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -159,7 +159,7 @@ static struct inode *exofs_alloc_inode(struct super_block 
*sb)
if (!oi)
return NULL;
 
-   oi->vfs_inode.i_version = 1;
+   inode_set_iversion(>vfs_inode, 1);
return >vfs_inode;
 }
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 21/30] ocfs2: convert to new i_version API

2016-12-21 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 fs/ocfs2/dir.c  | 14 +++---
 fs/ocfs2/inode.c|  2 +-
 fs/ocfs2/namei.c|  2 +-
 fs/ocfs2/quota_global.c |  2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index 3ecb9f337b7d..b78105d404b3 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -1174,7 +1174,7 @@ static int __ocfs2_delete_entry(handle_t *handle, struct 
inode *dir,
le16_add_cpu(>rec_len,
le16_to_cpu(de->rec_len));
de->inode = 0;
-   dir->i_version++;
+   inode_inc_iversion_locked(dir);
ocfs2_journal_dirty(handle, bh);
goto bail;
}
@@ -1729,7 +1729,7 @@ int __ocfs2_add_entry(handle_t *handle,
if (ocfs2_dir_indexed(dir))
ocfs2_recalc_free_list(dir, handle, lookup);
 
-   dir->i_version++;
+   inode_inc_iversion_locked(dir);
ocfs2_journal_dirty(handle, insert_bh);
retval = 0;
goto bail;
@@ -1775,7 +1775,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
 * readdir(2), then we might be pointing to an invalid
 * dirent right now.  Scan from the start of the block
 * to make sure. */
-   if (*f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, *f_version)) {
for (i = 0; i < i_size_read(inode) && i < offset; ) {
de = (struct ocfs2_dir_entry *)
(data->id_data + i);
@@ -1791,7 +1791,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
i += le16_to_cpu(de->rec_len);
}
ctx->pos = offset = i;
-   *f_version = inode->i_version;
+   *f_version = inode_get_iversion(inode);
}
 
de = (struct ocfs2_dir_entry *) (data->id_data + ctx->pos);
@@ -1869,7 +1869,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
 * readdir(2), then we might be pointing to an invalid
 * dirent right now.  Scan from the start of the block
 * to make sure. */
-   if (*f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, *f_version)) {
for (i = 0; i < sb->s_blocksize && i < offset; ) {
de = (struct ocfs2_dir_entry *) (bh->b_data + 
i);
/* It's too expensive to do a full
@@ -1886,7 +1886,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
offset = i;
ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
| offset;
-   *f_version = inode->i_version;
+   *f_version = inode_get_iversion(inode);
}
 
while (ctx->pos < i_size_read(inode)
@@ -1940,7 +1940,7 @@ static int ocfs2_dir_foreach_blk(struct inode *inode, u64 
*f_version,
  */
 int ocfs2_dir_foreach(struct inode *inode, struct dir_context *ctx)
 {
-   u64 version = inode->i_version;
+   u64 version = inode_get_iversion(inode);
ocfs2_dir_foreach_blk(inode, , ctx, true);
return 0;
 }
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 382401d3e88f..31f1e80cd980 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -302,7 +302,7 @@ void ocfs2_populate_inode(struct inode *inode, struct 
ocfs2_dinode *fe,
OCFS2_I(inode)->ip_attr = le32_to_cpu(fe->i_attr);
OCFS2_I(inode)->ip_dyn_features = le16_to_cpu(fe->i_dyn_features);
 
-   inode->i_version = 1;
+   inode_set_iversion(inode, 1);
inode->i_generation = le32_to_cpu(fe->i_generation);
inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev));
inode->i_mode = le16_to_cpu(fe->i_mode);
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 3b0a10d9b36f..c045826b716a 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -1520,7 +1520,7 @@ static int ocfs2_rename(struct inode *old_dir,
mlog_errno(status);
goto bail;
}
-   new_dir->i_version++;
+   inode_inc_iversion(new_dir);
 
if (S_ISDIR(new_inode->i_mode))
ocfs2_set_links_count(newfe, 0);
diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c
index cec495a921e3..a9fad85597e1 100644
--- a/fs/ocfs2/quota_global.c
+++ b/fs/ocfs2/quota_global.c
@@ -288,7 +288,7 @@ ssize_t ocfs2_quota_write(struct super_block *sb, int type,

[RFC PATCH v1 23/30] xfs: convert to new i_version API

2016-12-21 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 fs/xfs/libxfs/xfs_inode_buf.c | 4 ++--
 fs/xfs/xfs_icache.c   | 4 ++--
 fs/xfs/xfs_inode.c| 2 +-
 fs/xfs/xfs_inode_item.c   | 2 +-
 fs/xfs/xfs_trans_inode.c  | 2 +-
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index dd483e2767f7..a45e9ae5a17e 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -263,7 +263,7 @@ xfs_inode_from_disk(
to->di_flags= be16_to_cpu(from->di_flags);
 
if (to->di_version == 3) {
-   inode->i_version = be64_to_cpu(from->di_changecount);
+   inode_set_iversion_read(inode, 
be64_to_cpu(from->di_changecount));
to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
to->di_flags2 = be64_to_cpu(from->di_flags2);
@@ -313,7 +313,7 @@ xfs_inode_to_disk(
to->di_flags = cpu_to_be16(from->di_flags);
 
if (from->di_version == 3) {
-   to->di_changecount = cpu_to_be64(inode->i_version);
+   to->di_changecount = cpu_to_be64(inode_get_iversion_raw(inode));
to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
to->di_flags2 = cpu_to_be64(from->di_flags2);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index ff4d6311c7f4..f5be7fef8f59 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -278,14 +278,14 @@ xfs_reinit_inode(
int error;
uint32_tnlink = inode->i_nlink;
uint32_tgeneration = inode->i_generation;
-   uint64_tversion = inode->i_version;
+   uint64_tversion = inode_get_iversion_raw(inode);
umode_t mode = inode->i_mode;
 
error = inode_init_always(mp->m_super, inode);
 
set_nlink(inode, nlink);
inode->i_generation = generation;
-   inode->i_version = version;
+   inode_set_iversion_read(inode, version);
inode->i_mode = mode;
return error;
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b9557795eb74..dad9c6476d6f 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -851,7 +851,7 @@ xfs_ialloc(
ip->i_d.di_flags = 0;
 
if (ip->i_d.di_version == 3) {
-   inode->i_version = 1;
+   inode_set_iversion(inode, 1);
ip->i_d.di_flags2 = 0;
ip->i_d.di_cowextsize = 0;
ip->i_d.di_crtime.t_sec = (__int32_t)tv.tv_sec;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index d90e7811ccdd..838a045cf9e5 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -364,7 +364,7 @@ xfs_inode_to_log_dinode(
to->di_flags = from->di_flags;
 
if (from->di_version == 3) {
-   to->di_changecount = inode->i_version;
+   to->di_changecount = inode_get_iversion_raw(inode);
to->di_crtime.t_sec = from->di_crtime.t_sec;
to->di_crtime.t_nsec = from->di_crtime.t_nsec;
to->di_flags2 = from->di_flags2;
diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index dab8daa676f9..844c08886170 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -117,7 +117,7 @@ xfs_trans_log_inode(
 */
if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
IS_I_VERSION(VFS_I(ip))) {
-   VFS_I(ip)->i_version++;
+   inode_inc_iversion_locked(VFS_I(ip));
flags |= XFS_ILOG_CORE;
}
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 19/30] nfs: convert to new i_version API

2016-12-21 Thread Jeff Layton
For NFS, we just use the "raw" API since the i_version is mostly
managed by the server. The exception there is when the client
holds a write delegation, but we only need to bump it once
there anyway to handle CB_GETATTR.

Signed-off-by: Jeff Layton 
---
 fs/nfs/delegation.c|  2 +-
 fs/nfs/fscache-index.c |  4 ++--
 fs/nfs/inode.c | 16 
 fs/nfs/nfs4proc.c  |  4 ++--
 fs/nfs/nfstrace.h  |  4 ++--
 fs/nfs/write.c |  2 +-
 6 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index d7df5e67b0c1..36d8a000e664 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -347,7 +347,7 @@ int nfs_inode_set_delegation(struct inode *inode, struct 
rpc_cred *cred, struct
nfs4_stateid_copy(>stateid, >delegation);
delegation->type = res->delegation_type;
delegation->pagemod_limit = res->pagemod_limit;
-   delegation->change_attr = inode->i_version;
+   delegation->change_attr = inode_get_iversion_raw(inode);
delegation->cred = get_rpccred(cred);
delegation->inode = inode;
delegation->flags = 1<vfs_inode.i_ctime;
 
if (NFS_SERVER(>vfs_inode)->nfs_client->rpc_ops->version == 4)
-   auxdata.change_attr = nfsi->vfs_inode.i_version;
+   auxdata.change_attr = inode_get_iversion_raw(>vfs_inode);
 
if (bufmax > sizeof(auxdata))
bufmax = sizeof(auxdata);
@@ -243,7 +243,7 @@ enum fscache_checkaux nfs_fscache_inode_check_aux(void 
*cookie_netfs_data,
auxdata.ctime = nfsi->vfs_inode.i_ctime;
 
if (NFS_SERVER(>vfs_inode)->nfs_client->rpc_ops->version == 4)
-   auxdata.change_attr = nfsi->vfs_inode.i_version;
+   auxdata.change_attr = inode_get_iversion_raw(>vfs_inode);
 
if (memcmp(data, , datalen) != 0)
return FSCACHE_CHECKAUX_OBSOLETE;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 5864146e05e6..d38ec9c92c1d 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -424,7 +424,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct 
nfs_fattr *fattr, st
memset(>i_atime, 0, sizeof(inode->i_atime));
memset(>i_mtime, 0, sizeof(inode->i_mtime));
memset(>i_ctime, 0, sizeof(inode->i_ctime));
-   inode->i_version = 0;
+   inode_set_iversion(inode, 0);
inode->i_size = 0;
clear_nlink(inode);
inode->i_uid = make_kuid(_user_ns, -2);
@@ -449,7 +449,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct 
nfs_fattr *fattr, st
else if (nfs_server_capable(inode, NFS_CAP_CTIME))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR);
if (fattr->valid & NFS_ATTR_FATTR_CHANGE)
-   inode->i_version = fattr->change_attr;
+   inode_set_iversion(inode, fattr->change_attr);
else
nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR
| NFS_INO_REVAL_PAGECACHE);
@@ -1244,8 +1244,8 @@ static unsigned long nfs_wcc_update_inode(struct inode 
*inode, struct nfs_fattr
 
if ((fattr->valid & NFS_ATTR_FATTR_PRECHANGE)
&& (fattr->valid & NFS_ATTR_FATTR_CHANGE)
-   && inode->i_version == fattr->pre_change_attr) {
-   inode->i_version = fattr->change_attr;
+   && !inode_cmp_iversion(inode, fattr->pre_change_attr)) {
+   inode_set_iversion(inode, fattr->change_attr);
if (S_ISDIR(inode->i_mode))
nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
ret |= NFS_INO_INVALID_ATTR;
@@ -1303,7 +1303,7 @@ static int nfs_check_inode_attributes(struct inode 
*inode, struct nfs_fattr *fat
 
if (!nfs_file_has_buffered_writers(nfsi)) {
/* Verify a few of the more important attributes */
-   if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && 
inode->i_version != fattr->change_attr)
+   if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && 
inode_cmp_iversion(inode, fattr->change_attr))
invalid |= NFS_INO_INVALID_ATTR | 
NFS_INO_REVAL_PAGECACHE;
 
if ((fattr->valid & NFS_ATTR_FATTR_MTIME) && 
!timespec_equal(>i_mtime, >mtime))
@@ -1604,7 +1604,7 @@ int nfs_post_op_update_inode_force_wcc_locked(struct 
inode *inode, struct nfs_fa
}
if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 &&
(fattr->valid & NFS_ATTR_FATTR_PRECHANGE) == 0) 

[RFC PATCH v1 25/30] fs: add a "force" parameter to inode_inc_iversion

2016-12-21 Thread Jeff Layton
In some coming changes, we will want to avoid changing the i_version
counter in some cases as that may allow us to avoid having to log a
metadata change to the inode. In other cases we will want to forcibly
change it even when it's not strictly required (i.e. if we're going to
log a time change to disk anyway).

Add a parameter to tell the incrementing routine to forcibly bump the
counter. For now, everyone sets this to "true" and the value is ignored,
but later patches will change this.

Signed-off-by: Jeff Layton 
---
 fs/btrfs/file.c|  4 ++--
 fs/btrfs/inode.c   | 34 +-
 fs/btrfs/ioctl.c   |  4 ++--
 fs/btrfs/xattr.c   |  2 +-
 fs/ext2/super.c|  2 +-
 fs/ext4/inode.c|  4 ++--
 fs/ext4/namei.c|  2 +-
 fs/inode.c |  2 +-
 fs/ocfs2/namei.c   |  2 +-
 include/linux/fs.h |  2 +-
 10 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b5c5da215d05..f301afaf9af5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1804,7 +1804,7 @@ static void update_time_for_write(struct inode *inode)
inode->i_ctime = now;
 
if (IS_I_VERSION(inode))
-   inode_inc_iversion(inode);
+   inode_inc_iversion(inode, true);
 }
 
 static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
@@ -2636,7 +2636,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t 
offset, loff_t len)
if (!trans)
goto out_free;
 
-   inode_inc_iversion(inode);
+   inode_inc_iversion(inode, true);
inode->i_mtime = inode->i_ctime = current_time(inode);
 
trans->block_rsv = _info->trans_block_rsv;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a56a45a3e992..a03e5a1d5e05 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4086,8 +4086,8 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle 
*trans,
goto out;
 
btrfs_i_size_write(dir, dir->i_size - name_len * 2);
-   inode_inc_iversion(inode);
-   inode_inc_iversion(dir);
+   inode_inc_iversion(inode, true);
+   inode_inc_iversion(dir, true);
inode->i_ctime = dir->i_mtime =
dir->i_ctime = current_time(inode);
ret = btrfs_update_inode(trans, root, dir);
@@ -4232,7 +4232,7 @@ int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
}
 
btrfs_i_size_write(dir, dir->i_size - name_len * 2);
-   inode_inc_iversion(dir);
+   inode_inc_iversion(dir, true);
dir->i_mtime = dir->i_ctime = current_time(dir);
ret = btrfs_update_inode_fallback(trans, root, dir);
if (ret)
@@ -5000,7 +5000,7 @@ static int btrfs_setsize(struct inode *inode, struct 
iattr *attr)
 * explicitly if it wants a timestamp update.
 */
if (newsize != oldsize) {
-   inode_inc_iversion(inode);
+   inode_inc_iversion(inode, true);
if (!(mask & (ATTR_CTIME | ATTR_MTIME)))
inode->i_ctime = inode->i_mtime =
current_time(inode);
@@ -5122,7 +5122,7 @@ static int btrfs_setattr(struct dentry *dentry, struct 
iattr *attr)
 
if (attr->ia_valid) {
setattr_copy(inode, attr);
-   inode_inc_iversion(inode);
+   inode_inc_iversion(inode, true);
err = btrfs_dirty_inode(inode);
 
if (!err && attr->ia_valid & ATTR_MODE)
@@ -6030,7 +6030,7 @@ static int btrfs_update_time(struct inode *inode, struct 
timespec *now,
return -EROFS;
 
if (flags & S_VERSION)
-   inode_inc_iversion(inode);
+   inode_inc_iversion(inode, true);
if (flags & S_CTIME)
inode->i_ctime = *now;
if (flags & S_MTIME)
@@ -6354,7 +6354,7 @@ int btrfs_add_link(struct btrfs_trans_handle *trans,
 
btrfs_i_size_write(parent_inode, parent_inode->i_size +
   name_len * 2);
-   inode_inc_iversion(parent_inode);
+   inode_inc_iversion(parent_inode, true);
parent_inode->i_mtime = parent_inode->i_ctime =
current_time(parent_inode);
ret = btrfs_update_inode(trans, root, parent_inode);
@@ -6576,7 +6576,7 @@ static int btrfs_link(struct dentry *old_dentry, struct 
inode *dir,
/* There are several dir indexes for this inode, clear the cache. */
BTRFS_I(inode)->dir_index = 0ULL;
inc_nlink(inode);
-   inode_inc_iversion(inode);
+   inode_inc_iversion(inode, true);
inode->i_ctime = current_time(inode);
ihold(inode);
set_bit(BTRFS_INODE_COPY_EVERYTHING, _I(inode)->runtime_flags);
@@ -9576,10 +9576,10 @@ static int btrfs_rename_exchange(struct inode *old_dir,
}
 
/* Update inode version and ctime/mtime. */
-   inode_inc_iversion(old_dir);
-   inode_inc_iversion(new_dir);
-   inode_inc_iversion(old_inode);
-   

[RFC PATCH v1 20/30] nfsd: convert to new i_version API

2016-12-21 Thread Jeff Layton
Mostly just making sure we use the "get" wrappers so we know when
it is being fetched for later use.

Signed-off-by: Jeff Layton 
---
 fs/nfsd/nfs3xdr.c | 2 +-
 fs/nfsd/nfs4xdr.c | 2 +-
 fs/nfsd/nfsfh.h   | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index dba2ff8eaa68..a645b1dbc4e3 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
printk("nfsd: inode locked twice during operation.\n");
 
err = fh_getattr(fhp, >fh_post_attr);
-   fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
+   fhp->fh_post_change = inode_get_iversion(d_inode(fhp->fh_dentry));
if (err) {
fhp->fh_post_saved = false;
/* Grab the ctime anyway - set_change_info might use it */
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7ecf16be4a44..98bd030f849a 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1969,7 +1969,7 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
 static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode 
*inode)
 {
if (IS_I_VERSION(inode)) {
-   p = xdr_encode_hyper(p, inode->i_version);
+   p = xdr_encode_hyper(p, inode_get_iversion(inode));
} else {
*p++ = cpu_to_be32(stat->ctime.tv_sec);
*p++ = cpu_to_be32(stat->ctime.tv_nsec);
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index f84fe6bf9aee..7a66065f6e74 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -253,7 +253,7 @@ fill_pre_wcc(struct svc_fh *fhp)
fhp->fh_pre_mtime = inode->i_mtime;
fhp->fh_pre_ctime = inode->i_ctime;
fhp->fh_pre_size  = inode->i_size;
-   fhp->fh_pre_change = inode->i_version;
+   fhp->fh_pre_change = inode_get_iversion(inode);
fhp->fh_pre_saved = true;
}
 }
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 13/30] affs: convert to new i_version API

2016-12-21 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 fs/affs/amigaffs.c | 4 ++--
 fs/affs/dir.c  | 4 ++--
 fs/affs/super.c| 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index 0ec65c133b93..00aff7cfc883 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -59,7 +59,7 @@ affs_insert_hash(struct inode *dir, struct buffer_head *bh)
affs_brelse(dir_bh);
 
dir->i_mtime = dir->i_ctime = current_time(dir);
-   dir->i_version++;
+   inode_inc_iversion_locked(dir);
mark_inode_dirty(dir);
 
return 0;
@@ -113,7 +113,7 @@ affs_remove_hash(struct inode *dir, struct buffer_head 
*rem_bh)
affs_brelse(bh);
 
dir->i_mtime = dir->i_ctime = current_time(dir);
-   dir->i_version++;
+   inode_inc_iversion_locked(dir);
mark_inode_dirty(dir);
 
return retval;
diff --git a/fs/affs/dir.c b/fs/affs/dir.c
index f1e7294381c5..a587d57668e3 100644
--- a/fs/affs/dir.c
+++ b/fs/affs/dir.c
@@ -79,7 +79,7 @@ affs_readdir(struct file *file, struct dir_context *ctx)
 * we can jump directly to where we left off.
 */
ino = (u32)(long)file->private_data;
-   if (ino && file->f_version == inode->i_version) {
+   if (ino && inode_cmp_iversion(inode, file->f_version) == 0) {
pr_debug("readdir() left off=%d\n", ino);
goto inside;
}
@@ -129,7 +129,7 @@ affs_readdir(struct file *file, struct dir_context *ctx)
} while (ino);
}
 done:
-   file->f_version = inode->i_version;
+   file->f_version = inode_get_iversion(inode);
file->private_data = (void *)(long)ino;
affs_brelse(fh_bh);
 
diff --git a/fs/affs/super.c b/fs/affs/super.c
index d6384863192c..b3ebf90f9798 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -99,7 +99,7 @@ static struct inode *affs_alloc_inode(struct super_block *sb)
if (!i)
return NULL;
 
-   i->vfs_inode.i_version = 1;
+   inode_set_iversion(>vfs_inode, 1);
i->i_lc = NULL;
i->i_ext_bh = NULL;
i->i_pa_cnt = 0;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 22/30] ufs: use new i_version API

2016-12-21 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 fs/ufs/dir.c   | 8 
 fs/ufs/inode.c | 2 +-
 fs/ufs/super.c | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index de01b8f2aa78..c6e670732cc5 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -46,7 +46,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
struct inode *dir = mapping->host;
int err = 0;
 
-   dir->i_version++;
+   inode_inc_iversion_locked(dir);
block_write_end(NULL, mapping, pos, len, len, page, NULL);
if (pos+len > dir->i_size) {
i_size_write(dir, pos+len);
@@ -427,7 +427,7 @@ ufs_readdir(struct file *file, struct dir_context *ctx)
unsigned long n = pos >> PAGE_SHIFT;
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(UFS_SB(sb)->s_uspi->s_dirblksize - 1);
-   int need_revalidate = file->f_version != inode->i_version;
+   bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
unsigned flags = UFS_SB(sb)->s_flags;
 
UFSD("BEGIN\n");
@@ -454,8 +454,8 @@ ufs_readdir(struct file *file, struct dir_context *ctx)
offset = ufs_validate_entry(sb, kaddr, offset, 
chunk_mask);
ctx->pos = (n<f_version = inode->i_version;
-   need_revalidate = 0;
+   file->f_version = inode_get_iversion(inode);
+   need_revalidate = false;
}
de = (struct ufs_dir_entry *)(kaddr+offset);
limit = kaddr + ufs_last_byte(inode, n) - UFS_DIR_REC_LEN(1);
diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c
index 1bc0bd6a9848..4cb39d555df6 100644
--- a/fs/ufs/inode.c
+++ b/fs/ufs/inode.c
@@ -680,7 +680,7 @@ struct inode *ufs_iget(struct super_block *sb, unsigned 
long ino)
 
if (err)
goto bad_inode;
-   inode->i_version++;
+   inode_inc_iversion_locked(inode);
ufsi->i_lastfrag =
(inode->i_size + uspi->s_fsize - 1) >> uspi->s_fshift;
ufsi->i_dir_start_lookup = 0;
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index f04ab232d08d..ed7038aaa258 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -1398,7 +1398,7 @@ static struct inode *ufs_alloc_inode(struct super_block 
*sb)
if (!ei)
return NULL;
 
-   ei->vfs_inode.i_version = 1;
+   inode_set_iversion(>vfs_inode, 1);
seqlock_init(>meta_lock);
mutex_init(>truncate_mutex);
return >vfs_inode;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 11/30] fs: new API for handling i_version

2016-12-21 Thread Jeff Layton
We already have inode_inc_iversion. Add inode_set_iversion,
inode_get_iversion, inode_cmp_iversion and inode_iversion_need_inc.

These should encompass how i_version is currently accessed and
manipulated so that we can later change the implementation.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h | 109 ++---
 1 file changed, 104 insertions(+), 5 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2ba074328894..075c915fe2b1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1962,18 +1962,117 @@ static inline void inode_dec_link_count(struct inode 
*inode)
 }
 
 /**
+ * inode_set_iversion - set i_version to a particular value
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * Set the inode's i_version. Should generally be called when initializing
+ * a new inode.
+ */
+static inline void
+inode_set_iversion(struct inode *inode, const u64 new)
+{
+   inode->i_version = new;
+}
+
+/**
+ * inode_inc_iversion_locked - increment i_version while protected
+ * @inode: inode to be updated
+ *
+ * Increment the i_version field in the inode. This version is usable
+ * when there is some other sort of lock in play that would prevent
+ * concurrent accessors.
+ */
+static inline void
+inode_inc_iversion_locked(struct inode *inode)
+{
+   inode->i_version++;
+}
+
+/**
+ * inode_set_iversion_read - set i_version to a particular value and flag
+ *  set flag to indicate that it has been viewed
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * Set the inode's i_version, and flag the inode as if it has been viewed.
+ * Should generally be called when initializinga new inode in memory from
+ * from disk.
+ */
+static inline void
+inode_set_iversion_read(struct inode *inode, const u64 new)
+{
+   inode_set_iversion(inode, new);
+}
+
+/**
  * inode_inc_iversion - increments i_version
  * @inode: inode that need to be updated
  *
  * Every time the inode is modified, the i_version field will be incremented.
- * The filesystem has to be mounted with i_version flag
+ * The filesystem has to be mounted with MS_I_VERSION flag.
+ */
+static inline bool
+inode_inc_iversion(struct inode *inode)
+{
+   spin_lock(>i_lock);
+   inode_inc_iversion_locked(inode);
+   spin_unlock(>i_lock);
+   return true;
+}
+
+/**
+ * inode_get_iversion_raw - read i_version without "perturbing" it
+ * @inode: inode from which i_version should be read
+ *
+ * Read the inode i_version counter for an inode without registering it as a
+ * read. Should typically be used by local filesystems that need to store an
+ * i_version on disk.
+ */
+static inline u64
+inode_get_iversion_raw(const struct inode *inode)
+{
+   return inode->i_version;
+}
+
+/**
+ * inode_get_iversion - read i_version for later use
+ * @inode: inode from which i_version should be read
+ *
+ * Read the inode i_version counter. This should be used by callers that wish
+ * to store the returned i_version for later comparison.
+ */
+static inline u64
+inode_get_iversion(const struct inode *inode)
+{
+   return inode_get_iversion_raw(inode);
+}
+
+/**
+ * inode_cmp_iversion - check whether the i_version counter has changed
+ * @inode: inode to check
+ * @old: old value to check against its i_version
+ *
+ * Compare an i_version counter with a previous one. Returns 0 if they are
+ * the same or non-zero if they are different.
  */
+static inline s64
+inode_cmp_iversion(const struct inode *inode, const u64 old)
+{
+   return (s64)inode->i_version - (s64)old;
+}
 
-static inline void inode_inc_iversion(struct inode *inode)
+/**
+ * inode_iversion_need_inc - is the i_version in need of being incremented?
+ * @inode: inode to check
+ *
+ * Returns whether the inode->i_version counter needs incrementing on the next
+ * change.
+ */
+static inline bool
+inode_iversion_need_inc(struct inode *inode)
 {
-   spin_lock(>i_lock);
-   inode->i_version++;
-   spin_unlock(>i_lock);
+   return true;
 }
 
 enum file_time_flags {
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 24/30] IMA: switch IMA over to new i_version API

2016-12-21 Thread Jeff Layton
This change is straightforward, but I have some serious doubts about
this code.

Signed-off-by: Jeff Layton 
---
 security/integrity/ima/ima_api.c  | 2 +-
 security/integrity/ima/ima_main.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index 9df26a2b75ba..1f676313c30b 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -204,7 +204,7 @@ int ima_collect_measurement(struct integrity_iint_cache 
*iint,
} hash;
 
if (!(iint->flags & IMA_COLLECTED)) {
-   u64 i_version = file_inode(file)->i_version;
+   u64 i_version = inode_get_iversion(file_inode(file));
 
if (file->f_flags & O_DIRECT) {
audit_cause = "failed(directio)";
diff --git a/security/integrity/ima/ima_main.c 
b/security/integrity/ima/ima_main.c
index 423d111b3b94..8d95cf42d20c 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -122,7 +122,7 @@ static void ima_check_last_writer(struct 
integrity_iint_cache *iint,
 
inode_lock(inode);
if (atomic_read(>i_writecount) == 1) {
-   if ((iint->version != inode->i_version) ||
+   if (inode_cmp_iversion(inode, iint->version) ||
(iint->flags & IMA_NEW_FILE)) {
iint->flags &= ~(IMA_DONE_MASK | IMA_NEW_FILE);
iint->measured_pcrs = 0;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 28/30] btrfs: only dirty the inode in btrfs_update_time if something was changed

2016-12-21 Thread Jeff Layton
At this point, we know that "now" and the file times may differ, and we
suspect that the i_version has been flagged to be bumped. Attempt to
bump the i_version, and only mark the inode dirty if that actually
occurred or if one of the times was updated.

Signed-off-by: Jeff Layton 
---
 fs/btrfs/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a03e5a1d5e05..65a7065c0fbf 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6025,19 +6025,20 @@ static int btrfs_update_time(struct inode *inode, 
struct timespec *now,
 int flags)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
+   bool dirty = flags & ~S_VERSION;
 
if (btrfs_root_readonly(root))
return -EROFS;
 
if (flags & S_VERSION)
-   inode_inc_iversion(inode, true);
+   dirty |= inode_inc_iversion(inode, dirty);
if (flags & S_CTIME)
inode->i_ctime = *now;
if (flags & S_MTIME)
inode->i_mtime = *now;
if (flags & S_ATIME)
inode->i_atime = *now;
-   return btrfs_dirty_inode(inode);
+   return dirty ? btrfs_dirty_inode(inode) : 0;
 }
 
 /*
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 26/30] fs: only set S_VERSION when updating times if it has been queried

2016-12-21 Thread Jeff Layton
We only really need to update i_version if someone has queried for it
since we last incremented it. By doing that, we can avoid having to
update the inode if the times haven't changed.

If the times have changed, then we go ahead and forcibly increment the
counter, under the assumption that we'll be going to the storage
anyway, and the increment itself is relatively cheap.

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 2484d91ae8ef..584516a32905 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1621,17 +1621,18 @@ static int relatime_need_update(const struct path 
*path, struct inode *inode,
 int generic_update_time(struct inode *inode, struct timespec *time, int flags)
 {
int iflags = I_DIRTY_TIME;
+   bool dirty = flags & ~S_VERSION;
 
if (flags & S_ATIME)
inode->i_atime = *time;
-   if (flags & S_VERSION)
-   inode_inc_iversion(inode, true);
if (flags & S_CTIME)
inode->i_ctime = *time;
if (flags & S_MTIME)
inode->i_mtime = *time;
+   if (flags & S_VERSION)
+   dirty |= inode_inc_iversion(inode, dirty);
 
-   if (!(inode->i_sb->s_flags & MS_LAZYTIME) || (flags & S_VERSION))
+   if (dirty || !(inode->i_sb->s_flags & MS_LAZYTIME))
iflags |= I_DIRTY_SYNC;
__mark_inode_dirty(inode, iflags);
return 0;
@@ -1850,7 +1851,7 @@ int file_update_time(struct file *file)
if (!timespec_equal(>i_ctime, ))
sync_it |= S_CTIME;
 
-   if (IS_I_VERSION(inode))
+   if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
sync_it |= S_VERSION;
 
if (!sync_it)
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 30/30] fs: convert i_version counter over to an atomic64_t

2016-12-21 Thread Jeff Layton
The spinlock is only used to serialize callers that want to increment
the counter. We can achieve the same thing with an atomic64_t and
get the i_lock out of this codepath.

Drop the I_VERS_BUMP flag, and instead, borrow the most significant bit
in the counter to use as the flag. With this change, we can stop taking
the i_lock in this codepath, and can use atomics instead to manage the
thing.

On the query side, if the flag is already set, then we just return the
counter value. Otherwise, we set the flag in our in-memory copy and use
cmpxchg to swap it into place if it hasn't changed. If it has, then we
use the value from the cmpxchg as the new "old" value and try again.

When we go to bump the thing, we fetch the value and check the flag bit.
If it's clear then we don't need to do anything if the update isn't
being forced.

If we do need to update, then we clear the flag in our in-memory copy
and bump the counter (handling any overflow into the flag bit by
resetting the counter to zero). We then do a cmpxchg to swap the updated
value into place if it hasn't changed. If it has changed, then we use
the value we got back from cmpxchg to try again.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h | 82 --
 1 file changed, 48 insertions(+), 34 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 917557faa8e8..401e38d76171 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -621,7 +621,7 @@ struct inode {
struct hlist_head   i_dentry;
struct rcu_head i_rcu;
};
-   u64 i_version;
+   atomic64_t  i_version;
atomic_ti_count;
atomic_ti_dio_count;
atomic_ti_writecount;
@@ -1909,9 +1909,6 @@ static inline bool HAS_UNMAPPED_ID(struct inode *inode)
  * wb stat updates to grab mapping->tree_lock.  See
  * inode_switch_wb_work_fn() for details.
  *
- * I_VERS_BUMP inode->i_version counter must be bumped on the next
- * change. See the inode_*_iversion functions.
- *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC   (1 << 0)
@@ -1932,7 +1929,6 @@ static inline bool HAS_UNMAPPED_ID(struct inode *inode)
 #define __I_DIRTY_TIME_EXPIRED 12
 #define I_DIRTY_TIME_EXPIRED   (1 << __I_DIRTY_TIME_EXPIRED)
 #define I_WB_SWITCH(1 << 13)
-#define I_VERS_BUMP(1 << 14)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 #define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
@@ -1965,6 +1961,14 @@ static inline void inode_dec_link_count(struct inode 
*inode)
mark_inode_dirty(inode);
 }
 
+/*
+ * We borrow the top bit in the i_version to use as a flag to tell us whether
+ * it has been queried since we last bumped it. If it has, then we must bump
+ * it and set the flag. Note that this means that we have to handle wrapping
+ * manually.
+ */
+#define INODE_I_VERSION_QUERIED(1ULL<<63)
+
 /**
  * inode_set_iversion - set i_version to a particular value
  * @inode: inode to set
@@ -1976,7 +1980,7 @@ static inline void inode_dec_link_count(struct inode 
*inode)
 static inline void
 inode_set_iversion(struct inode *inode, const u64 new)
 {
-   inode->i_version = new;
+   atomic64_set(>i_version, new);
 }
 
 /**
@@ -1992,10 +1996,7 @@ inode_set_iversion(struct inode *inode, const u64 new)
 static inline void
 inode_set_iversion_read(struct inode *inode, const u64 new)
 {
-   spin_lock(>i_lock);
-   inode_set_iversion(inode, new);
-   inode->i_state |= I_VERS_BUMP;
-   spin_unlock(>i_lock);
+   inode_set_iversion(inode, new | INODE_I_VERSION_QUERIED);
 }
 
 /**
@@ -2010,16 +2011,26 @@ inode_set_iversion_read(struct inode *inode, const u64 
new)
 static inline bool
 inode_inc_iversion(struct inode *inode, bool force)
 {
-   bool ret = false;
+   u64 cur, old, new;
+
+   cur = (u64)atomic64_read(>i_version);
+   for (;;) {
+   /* If flag is clear then we needn't do anything */
+   if (!force && !(cur & INODE_I_VERSION_QUERIED))
+   return false;
+
+   new = (cur & ~INODE_I_VERSION_QUERIED) + 1;
+
+   /* Did we overflow into flag bit? Reset to 0 if so. */
+   if (unlikely(new == INODE_I_VERSION_QUERIED))
+   new = 0;
 
-   spin_lock(>i_lock);
-   if (force || (inode->i_state & I_VERS_BUMP)) {
-   inode->i_version++;
-   inode->i_state &= ~I_VERS_BUMP;
-   ret = true;
+   old = atomic64_cmpxchg(>i_version, cur, new);
+   if (likely(old == cur))
+   break;
+   cur = old;
}
-   spin_unlock(>i_lock);
-   return ret;
+   return true;
 }
 
 /**
@@ -2027,8 

[RFC PATCH v1 18/30] ext4: convert to new i_version API

2016-12-21 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 fs/ext4/dir.c|  8 
 fs/ext4/inline.c |  6 +++---
 fs/ext4/inode.c  | 12 
 fs/ext4/ioctl.c  |  2 +-
 fs/ext4/namei.c  |  8 
 fs/ext4/super.c  |  2 +-
 6 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index e8b365000d73..cd84f15d3d41 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -207,7 +207,7 @@ static int ext4_readdir(struct file *file, struct 
dir_context *ctx)
 * readdir(2), then we might be pointing to an invalid
 * dirent right now.  Scan from the start of the block
 * to make sure. */
-   if (file->f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, file->f_version)) {
for (i = 0; i < sb->s_blocksize && i < offset; ) {
de = (struct ext4_dir_entry_2 *)
(bh->b_data + i);
@@ -226,7 +226,7 @@ static int ext4_readdir(struct file *file, struct 
dir_context *ctx)
offset = i;
ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
| offset;
-   file->f_version = inode->i_version;
+   file->f_version = inode_get_iversion(inode);
}
 
while (ctx->pos < inode->i_size
@@ -567,10 +567,10 @@ static int ext4_dx_readdir(struct file *file, struct 
dir_context *ctx)
 * cached entries.
 */
if ((!info->curr_node) ||
-   (file->f_version != inode->i_version)) {
+   inode_cmp_iversion(inode, file->f_version)) {
info->curr_node = NULL;
free_rb_tree_fname(>root);
-   file->f_version = inode->i_version;
+   file->f_version = inode_get_iversion(inode);
ret = ext4_htree_fill_tree(file, info->curr_hash,
   info->curr_minor_hash,
   >next_hash);
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 437df6a1a841..6a3d173cb9ec 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1042,7 +1042,7 @@ static int ext4_add_dirent_to_inline(handle_t *handle,
 */
dir->i_mtime = dir->i_ctime = current_time(dir);
ext4_update_dx_flag(dir);
-   dir->i_version++;
+   inode_inc_iversion_locked(dir);
ext4_mark_inode_dirty(handle, dir);
return 1;
 }
@@ -1496,7 +1496,7 @@ int ext4_read_inline_dir(struct file *file,
 * dirent right now.  Scan from the start of the inline
 * dir to make sure.
 */
-   if (file->f_version != inode->i_version) {
+   if (inode_cmp_iversion(inode, file->f_version)) {
for (i = 0; i < extra_size && i < offset;) {
/*
 * "." is with offset 0 and
@@ -1528,7 +1528,7 @@ int ext4_read_inline_dir(struct file *file,
}
offset = i;
ctx->pos = offset;
-   file->f_version = inode->i_version;
+   file->f_version = inode_get_iversion(inode);
}
 
while (ctx->pos < extra_size) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 88d57af1b516..5603e1782c65 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4709,12 +4709,14 @@ struct inode *ext4_iget(struct super_block *sb, 
unsigned long ino)
EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
 
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
-   inode->i_version = le32_to_cpu(raw_inode->i_disk_version);
+   u64 ivers = le32_to_cpu(raw_inode->i_disk_version);
+
if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
-   inode->i_version |=
+   ivers |=
(__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32;
}
+   inode_set_iversion_read(inode, ivers);
}
 
ret = 0;
@@ -5000,11 +5002,13 @@ static int ext4_do_update_inode(handle_t *handle,
}
 
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
-   raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
+   u64 ivers = inode_get_iversion_raw(inode);
+
+   raw_inode->i_disk_version = cpu_to_le32(ivers);
if (ei->i_extra_isize) {
if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
raw_inode->i_version_hi =
-   cpu_to_le32(inode->i_version >> 32);
+   cpu_to_le32(ivers >> 32);
raw_inode->i_extra_isize =
  

[RFC PATCH v1 27/30] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing

2016-12-21 Thread Jeff Layton
We do go ahead and increment it though if XFS_ILOG_CORE is already
set when we get to this point.

Signed-off-by: Jeff Layton 
---
 fs/xfs/xfs_trans_inode.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index 844c08886170..fac422da0bbe 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -110,15 +110,15 @@ xfs_trans_log_inode(
 
/*
 * First time we log the inode in a transaction, bump the inode change
-* counter if it is configured for this to occur. We don't use
-* inode_inc_version() because there is no need for extra locking around
-* i_version as we already hold the inode locked exclusively for
-* metadata modification.
+* counter if it is configured for this to occur. While we hold the
+* inode locked exclusively for metadata modification, we still use
+* inode_inc_iversion as it allows us to avoid setting XFS_ILOG_CORE
+* if the version hasn't been queried since the last bump.
 */
if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
IS_I_VERSION(VFS_I(ip))) {
-   inode_inc_iversion_locked(VFS_I(ip));
-   flags |= XFS_ILOG_CORE;
+   if (inode_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
+   flags |= XFS_ILOG_CORE;
}
 
tp->t_flags |= XFS_TRANS_DIRTY;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 29/30] fs: track whether the i_version has been queried with an i_state flag

2016-12-21 Thread Jeff Layton
NFSv4 has some pretty relaxed rules for the i_version counter that
we can exploit for our own (performance) gain. The rules basically
boil down to:

1) it must steadily increase so that a client can discard change
   attributes that are older than ones it has already seen.

2) the value must be different from the last time we checked it if
   there was a data or metadata change.

This last bit is important, as we don't necessarily need to bump the
counter when no one is querying for it. On a write-intensive workload
this can add up to the metadata being written a lot less.

Add a new I_VERS_BUMP i_state flag that we can use to track when the
i_version has been queried. When it's queried we take the i_lock,
get the value and set the flag and then drop the lock and return it.

When we would go to bump it, we check the flag and only bump the
the counter if it's set and we weren't requested to forcibly bump it.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h | 66 +-
 1 file changed, 46 insertions(+), 20 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 75323e7b6954..917557faa8e8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1909,6 +1909,9 @@ static inline bool HAS_UNMAPPED_ID(struct inode *inode)
  * wb stat updates to grab mapping->tree_lock.  See
  * inode_switch_wb_work_fn() for details.
  *
+ * I_VERS_BUMP inode->i_version counter must be bumped on the next
+ * change. See the inode_*_iversion functions.
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC   (1 << 0)
@@ -1929,6 +1932,7 @@ static inline bool HAS_UNMAPPED_ID(struct inode *inode)
 #define __I_DIRTY_TIME_EXPIRED 12
 #define I_DIRTY_TIME_EXPIRED   (1 << __I_DIRTY_TIME_EXPIRED)
 #define I_WB_SWITCH(1 << 13)
+#define I_VERS_BUMP(1 << 14)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 #define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
@@ -1976,20 +1980,6 @@ inode_set_iversion(struct inode *inode, const u64 new)
 }
 
 /**
- * inode_inc_iversion_locked - increment i_version while protected
- * @inode: inode to be updated
- *
- * Increment the i_version field in the inode. This version is usable
- * when there is some other sort of lock in play that would prevent
- * concurrent accessors.
- */
-static inline void
-inode_inc_iversion_locked(struct inode *inode)
-{
-   inode->i_version++;
-}
-
-/**
  * inode_set_iversion_read - set i_version to a particular value and flag
  *  set flag to indicate that it has been viewed
  * @inode: inode to set
@@ -2002,7 +1992,10 @@ inode_inc_iversion_locked(struct inode *inode)
 static inline void
 inode_set_iversion_read(struct inode *inode, const u64 new)
 {
+   spin_lock(>i_lock);
inode_set_iversion(inode, new);
+   inode->i_state |= I_VERS_BUMP;
+   spin_unlock(>i_lock);
 }
 
 /**
@@ -2011,14 +2004,36 @@ inode_set_iversion_read(struct inode *inode, const u64 
new)
  *
  * Every time the inode is modified, the i_version field will be incremented.
  * The filesystem has to be mounted with MS_I_VERSION flag.
+ *
+ * Returns true if counter was bumped, and false if it wasn't necessary.
  */
 static inline bool
 inode_inc_iversion(struct inode *inode, bool force)
 {
+   bool ret = false;
+
spin_lock(>i_lock);
-   inode_inc_iversion_locked(inode);
+   if (force || (inode->i_state & I_VERS_BUMP)) {
+   inode->i_version++;
+   inode->i_state &= ~I_VERS_BUMP;
+   ret = true;
+   }
spin_unlock(>i_lock);
-   return true;
+   return ret;
+}
+
+/**
+ * inode_inc_iversion_locked - increment i_version while protected
+ * @inode: inode to be updated
+ *
+ * Increment the i_version field in the inode. This version is usable
+ * when there is some other sort of lock in play that would prevent
+ * concurrent increments (typically inode->i_rwsem for write).
+ */
+static inline void
+inode_inc_iversion_locked(struct inode *inode)
+{
+   inode_inc_iversion(inode, true);
 }
 
 /**
@@ -2043,9 +2058,15 @@ inode_get_iversion_raw(const struct inode *inode)
  * to store the returned i_version for later comparison.
  */
 static inline u64
-inode_get_iversion(const struct inode *inode)
+inode_get_iversion(struct inode *inode)
 {
-   return inode_get_iversion_raw(inode);
+   u64 ret;
+
+   spin_lock(>i_lock);
+   inode->i_state |= I_VERS_BUMP;
+   ret = inode->i_version;
+   spin_unlock(>i_lock);
+   return ret;
 }
 
 /**
@@ -2054,7 +2075,7 @@ inode_get_iversion(const struct inode *inode)
  * @old: old value to check against its i_version
  *
  * Compare an i_version counter with a previous one. Returns 0 if they are
- * the same or non-zero if they are different.
+ * the same, greater than zero if the inode's is 

Btrfs progs pre-release 4.9-rc2

2016-12-21 Thread David Sterba
Hi,

a pre-release has been tagged, fixes some problems with receive found by
fstests.

ETA for 4.9 is in +2 days (2016-12-22).

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs_log2phys: cannot lookup extent mapping

2016-12-21 Thread David Hanke

Hi Duncan,

Thank you for your reply. If I've emailed the wrong list, please let me 
know. What I hear you saying, in short, is that btrfs is not yet fully 
stable but current 4.x versions may work better. I'm willing to upgrade, 
but I'm told that the upgrade process may result in total failure, and 
I'm not sure I can trust the contents of the volume either way. Given 
that, it seems I must backup the backup, erase and start over. What 
would you do?


Thank you,

David


On 12/20/16 17:24, Duncan wrote:

David Hanke posted on Tue, 20 Dec 2016 09:52:25 -0600 as excerpted:


I've been using a btrfs-based volume for backups, but lately the
system's been filling the syslog with errors like "btrfs_log2phys:
cannot lookup extent mapping for 7129125486592" at the rate of hundreds
per second. (Please see output below for more details.) Despite the
errors, the files I've looked at appear to be written and read
successfully.

I'm wondering if the contents of the volume are trustworthy and whether
this problem is resolvable without backing up, erasing and starting
over?

Thank you!

David


# uname -a
Linux backup2 3.0.101.RNx86_64.3 #1 SMP Wed Apr 1 16:02:14 PDT 2015
x86_64 GNU/Linux

# btrfs --version
Btrfs v3.17.3

FWIW...

[TL;DR: see the four bottom line choices, at the bottom.]

This is the upstream btrfs development and discussion list for a
filesystem that's still stabilizing (that is, not fully stable and
mature) and that remains under heavy development and bug fixing.  As
such, list focus is heavily forward looking, with an extremely strong
recommendation to use current kernels (and to a lessor extent btrfs
userspace) if you're going to be running btrfs, as these have all the
latest bugfixes.

Put a different way, the general view and strong recommendation of the
list is that because btrfs is still under heavy development, with bug
fixes, some more major than others, every kernel cycle, while we
recognize that choosing to run old and stale^H^Hble kernels and userspace
is a legitimate choice on its own, that choice of stability over support
for the latest and greatest, is viewed as incompatible with choosing to
run a still under heavy development filesystem.  Choosing one OR the
other is strongly recommended.

For list purposes, we recommend and best support the last two kernel
release series in two tracks, LTS/long-term-stable, or current release
track.  On the LTS track, that's the LTS 4.4 and 4.1 series.  On the
current track, 4.9 is the latest release, so 4.9 and 4.8 are best
supported.

Meanwhile, it's worth keeping in mind that the experimental label and
accompanying extremely strong "eat your babies" level warnings weren't
peeled off until IIRC 3.12 or so, meaning anything before that is not
only ancient history in list terms, but also still labeled as "eat your
babies" level experimental.  Why anyone choosing to run an ancient eat-
your-babies level experimental version of a filesystem that's now rather
more stable and mature, tho not yet fully stabilized, is beyond me.  If
they're interested in newer filesystems, running newer and less buggy
versions is reasonable; if they're interested in years-stale level of
stability, then running such filesystems, especially when still labeled
eat-your-babies level experimental back then, seems an extremely odd
choice indeed.

Of course, on-list we do recognize that various distros did and do offer
support at some level for older than list-recommended version btrfs, in
part because they backport fixes from newer versions.  However, because
we're forward development focused we don't track what patches these
distros may or may not have backported and thus aren't in a good position
to provide good support for them.  Instead, users choosing to use such
kernels are generally asked to choose between upgrading to something
reasonably supportable on-list if they wish to go that route, or referred
back to their distros for the support they're in a far better position to
offer, since they know what they've backported and what they haven't,
while we don't.

As for btrfs userspace, the way btrfs works, during normal runtime,
userspace primarily calls the kernel to do the real work, so userspace
version isn't as big a deal unless you're trying to use a feature only
supported by newer versions, except that if it's /too/ old, the impedance
mismatch between the commands as they were then and the commands in
current versions makes support rather more difficult.  However, once
there's a problem, then the age of userspace code becomes more vital, as
then it's actually the userspace code doing the work, and only newer
versions of btrfs check and btrfs restore, for instance, can detect and
fix problems where code has only recently been added to do so.

In general, then, with btrfs-progs releases and versioning synced to that
of the kernel, a reasonable rule of thumb is to run userspace of a
similar version to your kernel, tho unless you're experiencing problems,
getting a 

Re: [PATCH v3 0/6] Convert rollback rework for v4.9

2016-12-21 Thread David Sterba
On Mon, Dec 19, 2016 at 02:56:36PM +0800, Qu Wenruo wrote:
> Can be fetched from github:
> https://github.com/adam900710/btrfs-progs.git convert_rework_for_4.9
> 
> This is mainly to fix problems exposed by Chandan's fix for 64K nodesize.

Thanks for the fix. That's a lot of new code that I don't feel
comfortable to put to the upcomming release, so I'll merge 1 and 6 now
and queue the rest for 4.9.1.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fstests: btrfs: Remove btrfs/047 since upstream don't accept stream-version

2016-12-21 Thread David Sterba
On Wed, Dec 21, 2016 at 10:33:55AM +0800, Eryu Guan wrote:
> On Tue, Dec 20, 2016 at 09:24:56AM +0800, Qu Wenruo wrote:
> > Btrfs upstream doesn't accept stream-version, so the test is never ran
> > on upstream kernel nor btrfs-progs.
> > 
> > Just remove it.
> > 
> > Signed-off-by: Qu Wenruo 
> 
> Looks fine to me, but I'd like to see an ack or review from btrfs
> developers.

We don't have ETA when the stream protocol version bump will happen. The
test can be added back later.

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: Get the highest inode for lost+found

2016-12-21 Thread David Sterba
On Tue, Dec 20, 2016 at 06:08:54AM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> root->highest_inode is not accurate at the time of creating a lost+found
> and it fails because the highest_inode+1 is already present. This could be
> because of fixes after highest_inode is set. Instead, search
> for the highest inode in the tree and use it for lost+found.
> 
> This makes root->highest_inode unnecessary and hence deleted.

There's a slight change in how the highest number is found, but I
haven't found a major problem with that. Currently the highest is
obtained from a "cache", that's being built, so we can be relatively
sure that we found all the correct objects in the filesystem or repaired
them and use the cached versions.

The new approach is to actively search in the existing tree. So this
expects that this will work without encountering any unexpected problems
(corruption etc). This should be still fine IMO, as creating the
lost+found directory searches the tree anyway and modifies it.

> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  cmds-check.c | 46 +++---
>  ctree.h  |  1 -
>  disk-io.c|  1 -
>  3 files changed, 27 insertions(+), 21 deletions(-)
> 
> diff --git a/cmds-check.c b/cmds-check.c
> index 1dba298..a55d00d 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -2853,6 +2853,31 @@ out:
>   return ret;
>  }
>  
> +static int get_highest_inode(struct btrfs_trans_handle *trans,
> + struct btrfs_root *root,
> + struct btrfs_path *path,
> + u64 *highest_ino)
> +{
> + struct btrfs_key key, found_key;
> + int ret;
> +
> + btrfs_init_path(path);
> + key.objectid = BTRFS_LAST_FREE_OBJECTID;
> + key.offset = -1;
> + key.type = BTRFS_INODE_ITEM_KEY;
> + ret = btrfs_search_slot(trans, root, , path, -1, 1);
> + if (ret == 1) {
> + btrfs_item_key_to_cpu(path->nodes[0], _key,
> + path->slots[0] - 1);
> + *highest_ino = found_key.objectid;
> + ret = 0;
> + }

All the ret values should be handled, no? At least the error case < 0,
and == 0 as a sanity check.

Otherwise looks good, I'll wait if Qu has more to comment.

> + if (*highest_ino >= BTRFS_LAST_FREE_OBJECTID)
> + ret = -EOVERFLOW;
> + btrfs_release_path(path);
> + return ret;
> +}
> +
>  static int repair_inode_nlinks(struct btrfs_trans_handle *trans,
>  struct btrfs_root *root,
>  struct btrfs_path *path,
> @@ -2898,11 +2923,9 @@ static int repair_inode_nlinks(struct 
> btrfs_trans_handle *trans,
>   }
>  
>   if (rec->found_link == 0) {
> - lost_found_ino = root->highest_inode;
> - if (lost_found_ino >= BTRFS_LAST_FREE_OBJECTID) {
> - ret = -EOVERFLOW;
> + ret = get_highest_inode(trans, root, path, _found_ino);
> + if (ret < 0)
>   goto out;
> - }
>   lost_found_ino++;
>   ret = btrfs_mkdir(trans, root, dir_name, strlen(dir_name),
> BTRFS_FIRST_FREE_OBJECTID, _found_ino,
> @@ -3266,21 +3289,6 @@ static int check_inode_recs(struct btrfs_root *root,
>   }
>  
>   /*
> -  * We need to record the highest inode number for later 'lost+found'
> -  * dir creation.
> -  * We must select an ino not used/referred by any existing inode, or
> -  * 'lost+found' ino may be a missing ino in a corrupted leaf,
> -  * this may cause 'lost+found' dir has wrong nlinks.
> -  */
> - cache = last_cache_extent(inode_cache);
> - if (cache) {
> - node = container_of(cache, struct ptr_node, cache);
> - rec = node->data;
> - if (rec->ino > root->highest_inode)
> - root->highest_inode = rec->ino;
> - }
> -
> - /*
>* We need to repair backrefs first because we could change some of the
>* errors in the inode recs.
>*
> diff --git a/ctree.h b/ctree.h
> index dd02ef8..0c34ae2 100644
> --- a/ctree.h
> +++ b/ctree.h
> @@ -1177,7 +1177,6 @@ struct btrfs_root {
>  
>  
>   u32 type;
> - u64 highest_inode;
>   u64 last_inode_alloc;
>  
>   /*
> diff --git a/disk-io.c b/disk-io.c
> index 9140a81..2a94d4f 100644
> --- a/disk-io.c
> +++ b/disk-io.c
> @@ -494,7 +494,6 @@ void btrfs_setup_root(u32 nodesize, u32 leafsize, u32 
> sectorsize,
>   root->fs_info = fs_info;
>   root->objectid = objectid;
>   root->last_trans = 0;
> - root->highest_inode = 0;
>   root->last_inode_alloc = 0;
>  
>   INIT_LIST_HEAD(>dirty_list);
> -- 
> 2.10.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  

Re: [PATCH 1/2] btrfs: drop trace_btrfs_all_work_done() from normal_work_helper()

2016-12-21 Thread David Sterba
On Wed, Dec 21, 2016 at 04:45:06PM +0800, Qu Wenruo wrote:
> 
> 
> At 12/21/2016 04:28 PM, Sebastian Andrzej Siewior wrote:
> > On 2016-12-21 08:33:03 [+0800], Qu Wenruo wrote:
> >> The trace point only uses the pointer, and this helps us to pair with
> >> btrfs_work_queued/sched.
> >
> > | /* For situiations that the work is freed */
> > | DECLARE_EVENT_CLASS(btrfs__work__done,
> > |
> > | TP_PROTO(struct btrfs_work *work),
> > |
> > | TP_ARGS(work),
> > |
> > | TP_STRUCT__entry_btrfs(
> > | __field(void *, work)
> > | ),
> > |
> > | TP_fast_assign_btrfs(btrfs_work_owner(work),
> > | __entry->work   = work;
> > | ),
> > |
> > | TP_printk_btrfs("work->%p", __entry->work)
> > | );
> >
> > and btrfs_work_owner exapnds to:
> >
> > | struct btrfs_fs_info *
> > | btrfs_work_owner(struct btrfs_work *work)
> > | {
> > | return work->wq->fs_info;
> > | }
> >
> > voilà
> 
> Oh I got it, thanks very much.
> 
> The btrfs_work_owner() is newly introduced, no wonder I didn't know that.
> 
> 
> I think we can fix it by extracting fs_info pointer before running the 
> work, and using the extracted one in the trace point.

That sounds like an easy fix to preserve the tracepoint.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OOM: Better, but still there on

2016-12-21 Thread Chris Mason

On Wed, Dec 21, 2016 at 12:16:53PM +0100, Michal Hocko wrote:

On Wed 21-12-16 20:00:38, Tetsuo Handa wrote:

One thing to note here, when we are talking about 32b kernel, things
have changed in 4.8 when we moved from the zone based to node based
reclaim (see b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a
per-node basis") and associated patches). It is possible that the
reporter is hitting some pathological path which needs fixing but it
might be also related to something else. So I am rather not trying to
blame 32b yet...


It might be interesting to put tracing on releasepage and see if btrfs 
is pinning pages around.  I can't see how 32bit kernels would be 
different, but maybe we're hitting a weird corner.


-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [CORRUPTION FILESYSTEM] Corrupted and unrecoverable file system during the snapshot receive

2016-12-21 Thread bepi
Hi.

I will insert ' btrfs check ' after each ' receive ' in my script.

I will test again my hardware.
But is not very likely that 2 computers, 3 HDD, 3 partitions, all have issue.

I think that the problem is a concomitance of operations, a race condition, a
random conditions.
I'll try to create a test case.


P.S. For find the problem may need to insert tools as ' coredumper ' and  '
sanitize ' in ' btrfs ', detect in realtime the ' extent ' corruption, and log
detection.


Thank you.

Gdb



Xin Zhou :

> Hi,
> 
> The system seems running some customized scripts continuously backup data
> from a NVME drive to HDDs.
> If the 3 HDDs backup storage are same in btrfs config, and the there is a bug
> in btrfs code,
> they all suppose to fail after the same operation sequence.
> 
> Otherwise, probably one of the HDDs might have issue, or there is a bug in
> layer below btrfs.
> 
> For the customize script, it might be helpful to check the file system
> consistency after each transfer.
> That might be useful to figure out which step generates a corruption, and if
> there is error propagations.
> 
> Regards,
> Xin
>  
>  
> 
> Sent: Monday, December 19, 2016 at 10:55 AM
> From: "Giuseppe Della Bianca" 
> To: "Xin Zhou" 
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [CORRUPTION FILESYSTEM] Corrupted and unrecoverable file system
> during the snapshot receive
> a concrete example
> 
> 
> SNAPSHOT
> 
> /dev/nvme0n1p2 on /tmp/tmp.X3vU6dLLVI type btrfs
> (rw,relatime,ssd,space_cache,subvolid=5,subvol=/)
> 
> btrfsManage SNAPSHOT /
> 
> (2016-12-19 19:44:00) Start btrfsManage
> . . . Start managing SNAPSHOT ' / ' filesystem ' root ' snapshot
> 
> In ' btrfssnapshot ' latest source snapshot ' root-2016-12-18_15:10:01.40 '
> . . . date ' 2016-12-18_15:10:01 ' number ' 40 '
> 
> Creation ' root-2016-12-19_19:44:00.part ' snapshot from ' root ' subvolume
> . . . Create a readonly snapshot of '/tmp/tmp.X3vU6dLLVI/root' in
> '/tmp/tmp.X3vU6dLLVI/btrfssnapshot/root/root-2016-12-19_19:44:00.part'
> 
> Renaming ' root-2016-12-19_19:44:00.part ' into ' root-2016-12-19_19:44:00.41
> ' snapshot
> 
> Source snapshot list of ' root ' subvolume
> . . . btrfssnapshot/root/root-2016-08-28-12-35-01.1
> ]zac[
> . . . btrfssnapshot/root/root-2016-12-19_19:44:00.41
> 
> (2016-12-19 19:44:05) End btrfsManage
> . . . End managing SNAPSHOT ' / ' filesystem ' root ' snapshot
> CORRECTLY
> 
> 
> 
> SEND e RECEIVE
> 
> /dev/nvme0n1p2 on /tmp/tmp.o78czE0Bo6 type btrfs
> (rw,relatime,ssd,space_cache,subvolid=5,subvol=/)
> /dev/sda2 on /tmp/tmp.XcwqQCKq09 type btrfs
> (rw,relatime,space_cache,subvolid=5,subvol=/)
> 
> btrfsManage SEND / /dev/sda2
> 
> (2016-12-19 19:47:24) Start btrfsManage
> . . . Start managing SEND ' / ' filesystem ' root ' snapshot in ' /dev/sda2
> '
> 
> Sending ' root-2016-12-19_19:44:00.41 ' source snapshot to ' btrfsreceive '
> subvolume
> . . . btrfs send -p
> /tmp/tmp.o78czE0Bo6/btrfssnapshot/root/root-2016-12-18_15:10:01.40
> /tmp/tmp.o78czE0Bo6/btrfssnapshot/root/root-2016-12-19_19:44:00.41 | btrfs
> receive /tmp/tmp.XcwqQCKq09/btrfsreceive/root/.part/
> . . . At subvol
> /tmp/tmp.o78czE0Bo6/btrfssnapshot/root/root-2016-12-19_19:44:00.41
> . . . At snapshot root-2016-12-19_19:44:00.41
> 
> Creation ' root-2016-12-19_19:44:00.41 ' snapshot from '
> .part/root-2016-12-19_19:44:00.41 ' subvolume
> . . . Create a readonly snapshot of
> '/tmp/tmp.XcwqQCKq09/btrfsreceive/root/.part/root-2016-12-19_19:44:00.41' in
> '/tmp/tmp.XcwqQCKq09/btrfsreceive/root/root-2016-12-19_19:44:00.41'
> . . . Delete subvolume (commit):
> '/tmp/tmp.XcwqQCKq09/btrfsreceive/root/.part/root-2016-12-19_19:44:00.41'
> 
> Snapshot list in ' /dev/sda2 ' device
> . . . btrfsreceive/data_backup/data_backup-2016-12-17_12:07:00.1
> . . . btrfsreceive/data_storage/data_storage-2016-12-10_17:05:51.1
> . . . btrfsreceive/root/root-2016-08-28-12-35-01.1
> ]zac[
> . . . btrfsreceive/root/root-2016-12-19_19:44:00.41
> 
> (2016-12-19 19:48:37) End btrfsManage
> . . . End managing SEND ' / ' filesystem ' root ' snapshot in ' /dev/sda2 '
> CORRECTLY
> 
> 
> 
> > Hi Giuseppe,
> >
> > Would you like to tell some details about:
> > 1. the XYZ snapshot was taken from which subvolume
> > 2. where the base (initial) snapshot is stored
> > 3. The 3 partitions receives the same snapshot, are they in the same btrfs
> > configuration and subvol structure?
> >
> > Also, would you send the link reports "two files unreadable error" post
> > mentioned in step 2? Hope can see the message and figure out if the issue
> > first comes from sender or receiver side.
> >
> > Thanks,
> > Xin
> >
> >
>  
> 





This mail has been sent using Alpikom webmail system
http://www.alpikom.it

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: OOM: Better, but still there on

2016-12-21 Thread Michal Hocko
On Wed 21-12-16 20:00:38, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > TL;DR
> > there is another version of the debugging patch. Just revert the
> > previous one and apply this one instead. It's still not clear what
> > is going on but I suspect either some misaccounting or unexpeted
> > pages on the LRU lists. I have added one more tracepoint, so please
> > enable also mm_vmscan_inactive_list_is_low.
> > 
> > Hopefully the additional data will tell us more.
> > 
> > On Tue 20-12-16 03:08:29, Nils Holland wrote:
[...]
> > > http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz
> > 
> > This is the stall report:
> > [ 1661.485568] btrfs-transacti: page alloction stalls for 611058ms, 
> > order:0, mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
> > [ 1661.485859] CPU: 1 PID: 1950 Comm: btrfs-transacti Not tainted 
> > 4.9.0-gentoo #4
> > 
> > pid 1950 is trying to allocate for a _long_ time. Considering that this
> > is the only stall report, this means that reclaim took really long so we
> > didn't get to the page allocator for that long. It sounds really crazy!
> 
> warn_alloc() reports only if !__GFP_NOWARN.

yes and the above allocation clear is !__GFP_NOWARN allocation which is
reported after 611s! If there are no prior/lost warn_alloc() then it
implies we have spent _that_ much time in the reclaim. Considering the
tracing data we cannot really rule that out. All the reclaimers would
fight over the lru_lock and considering we are scanning the whole LRU
this will take some time.

[...]

> By the way, Michal, I'm feeling strange because it seems to me that your
> analysis does not refer to the implications of "x86_32 kernel". Maybe
> you already referred x86_32 by "they are from the highmem zone" though.

yes Highmem as well all those scanning anomalies is the 32b kernel
specific thing. I believe I have already mentioned that the 32b kernel
suffers from some inherent issues but I would like to understand what is
going on here before blaming the 32b.

One thing to note here, when we are talking about 32b kernel, things
have changed in 4.8 when we moved from the zone based to node based
reclaim (see b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a
per-node basis") and associated patches). It is possible that the
reporter is hitting some pathological path which needs fixing but it
might be also related to something else. So I am rather not trying to
blame 32b yet...

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OOM: Better, but still there on

2016-12-21 Thread Tetsuo Handa
Michal Hocko wrote:
> TL;DR
> there is another version of the debugging patch. Just revert the
> previous one and apply this one instead. It's still not clear what
> is going on but I suspect either some misaccounting or unexpeted
> pages on the LRU lists. I have added one more tracepoint, so please
> enable also mm_vmscan_inactive_list_is_low.
> 
> Hopefully the additional data will tell us more.
> 
> On Tue 20-12-16 03:08:29, Nils Holland wrote:
> > On Mon, Dec 19, 2016 at 02:45:34PM +0100, Michal Hocko wrote:
> > 
> > > Unfortunatelly shrink_active_list doesn't have any tracepoint so we do
> > > not know whether we managed to rotate those pages. If they are referenced
> > > quickly enough we might just keep refaulting them... Could you try to 
> > > apply
> > > the followin diff on top what you have currently. It should add some more
> > > tracepoint data which might tell us more. We can reduce the amount of
> > > tracing data by enabling only mm_vmscan_lru_isolate,
> > > mm_vmscan_lru_shrink_inactive and mm_vmscan_lru_shrink_active.
> > 
> > So, the results are in! I applied your patch and rebuild the kernel,
> > then I rebooted the machine, set up tracing so that only the three
> > events you mentioned were being traced, and captured the output over
> > the network.
> > 
> > Things went a bit different this time: The trace events started to
> > appear after a while and a whole lot of them were generated, but
> > suddenly they stopped. A short while later, we get

"cat /debug/trace/trace_pipe > /dev/udp/$ip/$port" stops reporting if
/bin/cat is disturbed by page fault and/or memory allocation needed for
sending UDP packets. Since netconsole can send UDP packets without involving
memory allocation, printk() is preferable than tracing under OOM.

> 
> It is possible that you are hitting multiple issues so it would be
> great to focus at one at the time. The underlying problem might be
> same/similar in the end but this is hard to tell now. Could you try to
> reproduce and provide data for the OOM killer situation as well?
>  
> > [ 1661.485568] btrfs-transacti: page alloction stalls for 611058ms, 
> > order:0, mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
> > 
> > along with a backtrace and memory information, and then there was
> > silence.
> 
> > When I walked up to the machine, it had completely died; it
> > wouldn't turn on its screen on key press any more, blindly trying to
> > reboot via SysRequest had no effect, but the caps lock LED also wasn't
> > blinking, like it normally does when a kernel panic occurs. Good
> > question what state it was in. The OOM reaper didn't really seem to
> > kick in and kill processes this time, it seems.
> > 
> > The complete capture is up at:
> > 
> > http://ftp.tisys.org/pub/misc/teela_2016-12-20.log.xz
> 
> This is the stall report:
> [ 1661.485568] btrfs-transacti: page alloction stalls for 611058ms, order:0, 
> mode:0x2420048(GFP_NOFS|__GFP_HARDWALL|__GFP_MOVABLE)
> [ 1661.485859] CPU: 1 PID: 1950 Comm: btrfs-transacti Not tainted 
> 4.9.0-gentoo #4
> 
> pid 1950 is trying to allocate for a _long_ time. Considering that this
> is the only stall report, this means that reclaim took really long so we
> didn't get to the page allocator for that long. It sounds really crazy!

warn_alloc() reports only if !__GFP_NOWARN.

We can report where they were looping using kmallocwd at
http://lkml.kernel.org/r/1478416501-10104-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
(and extend it to call printk() for reporting values using SystemTap which your
trace hooks would report, only during memory allocations are stalling, without
delay caused by page fault and/or memory allocation needed for sending UDP 
packets).

But if trying to reboot via SysRq-b did not work, I think that the system
was in hard lockup state. That would be a different problem.

By the way, Michal, I'm feeling strange because it seems to me that your
analysis does not refer to the implications of "x86_32 kernel". Maybe
you already referred x86_32 by "they are from the highmem zone" though.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: drop trace_btrfs_all_work_done() from normal_work_helper()

2016-12-21 Thread Sebastian Andrzej Siewior
On 2016-12-21 16:45:06 [+0800], Qu Wenruo wrote:
> > | DECLARE_EVENT_CLASS(btrfs__work__done,
> > |
> > | TP_PROTO(struct btrfs_work *work),
> > |
> > | TP_ARGS(work),
> > |
> > | TP_STRUCT__entry_btrfs(
> > | __field(void *, work)
> > | ),
> > |
> > | TP_fast_assign_btrfs(btrfs_work_owner(work),
> > | __entry->work   = work;
> > | ),
> > |
> > | TP_printk_btrfs("work->%p", __entry->work)
> > | );
> > 
> > and btrfs_work_owner exapnds to:
> > 
> > | struct btrfs_fs_info *
> > | btrfs_work_owner(struct btrfs_work *work)
> > | {
> > | return work->wq->fs_info;
> > | }
> > 
> > voilà
> 
> Oh I got it, thanks very much.
> 
> The btrfs_work_owner() is newly introduced, no wonder I didn't know that.

It was introduced in cb001095ca70 ("btrfs: plumb fs_info into
btrfs_work") which is v4.8-rc1. The usage in trace points started in
bc074524e123 ("btrfs: prefix fsid to all trace events") which is also
v4.8-rc1. So whatever is done to get it fixed, it should be pushed
stable for v4.8+.

> Thanks,
> Qu

Sebastian
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: drop trace_btrfs_all_work_done() from normal_work_helper()

2016-12-21 Thread Qu Wenruo



At 12/21/2016 04:28 PM, Sebastian Andrzej Siewior wrote:

On 2016-12-21 08:33:03 [+0800], Qu Wenruo wrote:

The trace point only uses the pointer, and this helps us to pair with
btrfs_work_queued/sched.


| /* For situiations that the work is freed */
| DECLARE_EVENT_CLASS(btrfs__work__done,
|
| TP_PROTO(struct btrfs_work *work),
|
| TP_ARGS(work),
|
| TP_STRUCT__entry_btrfs(
| __field(void *, work)
| ),
|
| TP_fast_assign_btrfs(btrfs_work_owner(work),
| __entry->work   = work;
| ),
|
| TP_printk_btrfs("work->%p", __entry->work)
| );

and btrfs_work_owner exapnds to:

| struct btrfs_fs_info *
| btrfs_work_owner(struct btrfs_work *work)
| {
| return work->wq->fs_info;
| }

voilà


Oh I got it, thanks very much.

The btrfs_work_owner() is newly introduced, no wonder I didn't know that.


I think we can fix it by extracting fs_info pointer before running the 
work, and using the extracted one in the trace point.


Thanks,
Qu





But I still don't understand why backtrace is triggered.
Since we're just recording a pointer, not touching it.

Would you please explain the problem with more details on how it trigger the
problem?


enabled all events played with the fs which was just an upgrade and git
tree sync + checkout so nothing special.



So I think we should either remove the tracepoint completely or change
the arguments to take something else than a potentially freed 'work'.


I'm mostly OK to remove the tracepoint, but such all_workd_done() trace
should still help to determine if it's a workqueue stalled.

Thanks,
Qu


Sebastian
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: drop trace_btrfs_all_work_done() from normal_work_helper()

2016-12-21 Thread Sebastian Andrzej Siewior
On 2016-12-21 08:33:03 [+0800], Qu Wenruo wrote:
> The trace point only uses the pointer, and this helps us to pair with
> btrfs_work_queued/sched.

| /* For situiations that the work is freed */
| DECLARE_EVENT_CLASS(btrfs__work__done,
| 
| TP_PROTO(struct btrfs_work *work),
| 
| TP_ARGS(work),
| 
| TP_STRUCT__entry_btrfs(
| __field(void *, work)
| ),
| 
| TP_fast_assign_btrfs(btrfs_work_owner(work),
| __entry->work   = work;
| ),
| 
| TP_printk_btrfs("work->%p", __entry->work)
| );

and btrfs_work_owner exapnds to:

| struct btrfs_fs_info *
| btrfs_work_owner(struct btrfs_work *work)
| {
| return work->wq->fs_info;
| }
 
voilà

> But I still don't understand why backtrace is triggered.
> Since we're just recording a pointer, not touching it.
> 
> Would you please explain the problem with more details on how it trigger the
> problem?

enabled all events played with the fs which was just an upgrade and git
tree sync + checkout so nothing special.

> > 
> > So I think we should either remove the tracepoint completely or change
> > the arguments to take something else than a potentially freed 'work'.
> 
> I'm mostly OK to remove the tracepoint, but such all_workd_done() trace
> should still help to determine if it's a workqueue stalled.
> 
> Thanks,
> Qu

Sebastian
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html