Re: [PATCH 09/10] Btrfs: add tracepoint for em's EEXIST case
On 22.12.2017 21:07, Liu Bo wrote:
> On Fri, Dec 22, 2017 at 10:56:31AM +0200, Nikolay Borisov wrote:
>> On 22.12.2017 00:42, Liu Bo wrote:
>>> This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the
>>> subtle bugs around merge_extent_mapping.
>>
>> In the next patch you are already making the function which takes all
>> these values noinline, meaning you can attach a kprobe so you can
>> interrogate the args via systemtap, perf probe or even bpf. So I'd
>> rather not add this tracepoint, since the general sentiment seems to be
>> that tracepoints are ABI and so have to be maintained.
>
> The following patch's noinline function is inside the else {} branch,
> so a kprobe would give us something back only when it runs into the
> else {}.
>
> Since subtle bugs may lie in this area, a simple tracepoint like this
> can help us understand them more efficiently.

In that case, if you make search_extent_mapping noinline and put a kprobe/ret kprobe there, you should be able to gain the same information. We should really be careful when adding trace events...
> thanks,
> -liubo
>
>>> Signed-off-by: Liu Bo
>>> ---
>>>  fs/btrfs/extent_map.c        |  1 +
>>>  include/trace/events/btrfs.h | 35 +++
>>>  2 files changed, 36 insertions(+)
>>>
>>> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
>>> index a8b7e24..40e4d30 100644
>>> --- a/fs/btrfs/extent_map.c
>>> +++ b/fs/btrfs/extent_map.c
>>> @@ -539,6 +539,7 @@ int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
>>>  	ret = 0;
>>>
>>>  	existing = search_extent_mapping(em_tree, start, len);
>>> +	trace_btrfs_handle_em_exist(existing, em, start, len);
>>>
>>>  	/*
>>>  	 * existing will always be non-NULL, since there must be
>>> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
>>> index 4342a32..b7ffcf7 100644
>>> --- a/include/trace/events/btrfs.h
>>> +++ b/include/trace/events/btrfs.h
>>> @@ -249,6 +249,41 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
>>>  		__entry->refs, __entry->compress_type)
>>>  );
>>>
>>> +TRACE_EVENT(btrfs_handle_em_exist,
>>> +
>>> +	TP_PROTO(const struct extent_map *existing, const struct extent_map *map,
>>> +		 u64 start, u64 len),
>>> +
>>> +	TP_ARGS(existing, map, start, len),
>>> +
>>> +	TP_STRUCT__entry(
>>> +		__field(u64, e_start)
>>> +		__field(u64, e_len)
>>> +		__field(u64, map_start)
>>> +		__field(u64, map_len)
>>> +		__field(u64, start)
>>> +		__field(u64, len)
>>> +	),
>>> +
>>> +	TP_fast_assign(
>>> +		__entry->e_start   = existing->start;
>>> +		__entry->e_len     = existing->len;
>>> +		__entry->map_start = map->start;
>>> +		__entry->map_len   = map->len;
>>> +		__entry->start     = start;
>>> +		__entry->len       = len;
>>> +	),
>>> +
>>> +	TP_printk("start=%llu len=%llu "
>>> +		  "existing(start=%llu len=%llu) "
>>> +		  "em(start=%llu len=%llu)",
>>> +		  (unsigned long long)__entry->start,
>>> +		  (unsigned long long)__entry->len,
>>> +		  (unsigned long long)__entry->e_start,
>>> +		  (unsigned long long)__entry->e_len,
>>> +		  (unsigned long long)__entry->map_start,
>>> +		  (unsigned long long)__entry->map_len)
>>> +);
>>> +
>>>  /* file extent item */
>>>  DECLARE_EVENT_CLASS(btrfs__file_extent_item_regular,
>>>
>
-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Unexpected raid1 behaviour
Tomasz Pala posted on Sat, 23 Dec 2017 03:52:47 +0100 as excerpted:

> On Fri, Dec 22, 2017 at 14:04:43 -0700, Chris Murphy wrote:
>
>> I'm pretty sure degraded boot timeout policy is handled by dracut. The
>
> Well, last time I've checked, dracut on a systemd system couldn't even
> generate a systemd-less image.

?? Unless it changed recently (I /chose/ a systemd-based dracut setup here, so I'd not be aware if it did), dracut can indeed do systemd-less initr* images. Dracut is modular, and systemd is one of the modules, enabled by default on a systemd system but not required, as far as I know, because I had a dracut setup without the systemd module for some time after I switched to systemd for my main sysinit, and I verified it didn't install systemd in the initr* until I activated the systemd module.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Unexpected raid1 behaviour
Tomasz Pala posted on Sat, 23 Dec 2017 05:08:16 +0100 as excerpted:

> On Tue, Dec 19, 2017 at 17:08:28 -0700, Chris Murphy wrote:
>
>>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I
>>> safely add "degraded" to the mount options? My primary concern is the
>>> [...]
>>
>> Well it only does rw once, then the next degraded is ro - there are
>> patches dealing with this better but I don't know the state. And
>> there's no resync code that I'm aware of, absolutely it's not good
>> enough to just kick off a full scrub - that has huge performance
>> implications and I'd consider it a regression compared to functionality
>> in LVM and mdadm RAID by default with the write intent bitmap. Without
>> some equivalent short cut, automatic degraded means a
>
> I read about the 'scrub' all over the time here, so let me ask this
> directly, as this is also not documented clearly:
>
> 1. is the full scrub required after ANY desync? (like: degraded mount
> followed by readding old device)?

It is very strongly recommended.

> 2. if the scrub is omitted - is it possible that btrfs return invalid
> data (from the desynced and readded drive)?

Were invalid data returned, it would be a bug. However, a reasonably common refrain here is that btrfs is "still stabilizing, not yet fully stable and mature", so occasional bugs can be expected, tho both the ideal and experience suggest that they're gradually reducing in frequency and severity as time goes on and we get closer to "fully stable and mature". Which of course is why both having usable and tested backups, and keeping current with the kernel, are strongly recommended as well: the first in case one of those bugs does hit and it's severe enough to take out your working btrfs, the second because later kernels have fewer known bugs in the first place.
Functioning as designed and as intent-coded, in the case of a desync btrfs will use the copy with the latest generation/transid serial, and thus should never return older data from the desynced device. Further, btrfs is designed to be self-healing and will transparently rewrite the out-of-sync copy, syncing it in the process, as it comes across each stale block. But the only way to be sure everything's consistent again is that scrub, and of course if something should happen to the only current copy while the desync still has the other copy stale, /then/ you lose data.

And as I said, that's functioning as designed and intent-coded, assuming no bugs, an explicitly unsafe assumption given btrfs' "still stabilizing" state. So... "strongly recommended" indeed, tho in theory it shouldn't be absolutely required as long as unlucky fate doesn't strike before the data is transparently synced in normal usage. YMMV, but I definitely do those scrubs here.

> 3. is the scrub required to be scheduled on regular basis? By 'required'
> I mean by design/implementation issues/quirks, _not_ related to possible
> hardware malfunctions.

Perhaps I'm tempting fate, but I don't do scheduled/regular scrubs here. Only if I have an ungraceful shutdown or see complaints in the log (which I tail to a system status dashboard, so I'd be likely to notice a problem one way or the other pretty quickly).
But I do keep those backups. It has been quite some time now since I had to use them - I'd say about 18 months to two years - and the last time it happened I was actually able to use btrfs restore and avoid having to use the backups themselves. Even so, I did spend some significant money upgrading my backups to all-SSD in order to make updating them easier and to encourage me to keep them much more current than I had been (btrfs restore saved me more trouble than I'm comfortable admitting, given that I /did/ have backups, but they weren't the freshest at the time). If, like some people, I had my backups offsite and would have to download them if I actually needed them, I'd potentially be rather stricter and schedule regular scrubs.

So by design and intention-coding, no, regularly scheduled scrubs aren't "required". But I'd treat them the same as I would on non-btrfs raid, or a bit stricter given the above-discussed btrfs stability status. If you'd be uncomfortable not scheduling regular scrubs on your non-btrfs raid, you'd better be uncomfortable not scheduling them on btrfs as well!

And as always, btrfs or no btrfs, scrub or no scrub, have your backups, or you are literally defining your data as not worth the time/trouble/resources necessary to do them, and some day - maybe 10 minutes from now, maybe 10 years from now - fate's going to call you on that definition! (Yes, I know /you/ know that, or we'd not have this thread, which demonstrates that you /do/ care about your data. But it's as much about the lurkers and googlers coming across the thread later as it is the direct participants...)

-- 
Duncan - List replies
Re: Unexpected raid1 behaviour
On Tue, Dec 19, 2017 at 17:08:28 -0700, Chris Murphy wrote:

>>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I
>>> safely add "degraded" to the mount options? My primary concern is the
>>> [...]
>
> Well it only does rw once, then the next degraded is ro - there are
> patches dealing with this better but I don't know the state. And
> there's no resync code that I'm aware of, absolutely it's not good
> enough to just kick off a full scrub - that has huge performance
> implications and I'd consider it a regression compared to
> functionality in LVM and mdadm RAID by default with the write intent
> bitmap. Without some equivalent short cut, automatic degraded means a

I read about 'scrub' all the time here, so let me ask this directly, as this is also not documented clearly:

1. is the full scrub required after ANY desync? (like: degraded mount followed by readding the old device)?
2. if the scrub is omitted - is it possible that btrfs returns invalid data (from the desynced and readded drive)?
3. is the scrub required to be scheduled on a regular basis? By 'required' I mean by design/implementation issues/quirks, _not_ related to possible hardware malfunctions.

-- 
Tomasz Pala
Re: Unexpected raid1 behaviour
On Fri, Dec 22, 2017 at 14:04:43 -0700, Chris Murphy wrote:

> I'm pretty sure degraded boot timeout policy is handled by dracut. The

Well, last time I've checked, dracut on a systemd system couldn't even generate a systemd-less image.

> kernel doesn't just automatically assemble an md array as soon as it's
> possible (degraded) and then switch to normal operation as other

MD devices are explicitly listed in mdadm.conf (for mdadm --assemble --scan), on the kernel command line, or in the metadata of autodetected partitions (fd).

> devices appear. I have no idea how LVM manages the delay policy for
> multiple devices.

I *guess* it's not about waiting, but simply being executed after the devices are ready. And there is a VERY long history of various init systems having problems booting systems that use multi-layer setups (LVM/MD under or above LUKS, not to mention remote ones that need networking to be set up). All of this works reasonably well under systemd - except for btrfs, which uses a single device node to match an entire group of devices. That is convenient for a living person (no need to switch between /dev/mdX and /dev/sdX), but impossible for userspace tools to guess automatically. There is only a probe IOCTL, which doesn't handle degraded mode.

> I don't think the delay policy belongs in the kernel.

That is exactly why systemd waits for the appropriate udev state.

> It's pie in the sky, and unicorns, but it sure would be nice to have
> standardization rather than everyone rolling their own solution. The

There was a de facto standard, I think - expose the component devices or require them to be specified. Apparently no such thing exists in btrfs, so it must be handled in a btrfs-specific way. Also note that MD can be assembled by the kernel itself, while btrfs cannot (so an initrd is required for the rootfs).

-- 
Tomasz Pala
[PATCH] Btrfs: remove unused wait in btrfs_stripe_hash
In fact nobody is waiting on @wait's waitqueue, so it can be safely removed.

Signed-off-by: Liu Bo
---
 fs/btrfs/ctree.h  |  1 -
 fs/btrfs/raid56.c | 10 --
 2 files changed, 11 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 13c260b..b2e09fe 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -679,7 +679,6 @@ enum btrfs_orphan_cleanup_state {
 /* used by the raid56 code to lock stripes for read/modify/write */
 struct btrfs_stripe_hash {
 	struct list_head hash_list;
-	wait_queue_head_t wait;
 	spinlock_t lock;
 };
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 3940906..9fa45e0 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -231,7 +231,6 @@ int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info)
 		cur = h + i;
 		INIT_LIST_HEAD(&cur->hash_list);
 		spin_lock_init(&cur->lock);
-		init_waitqueue_head(&cur->wait);
 	}

 	x = cmpxchg(&info->stripe_hash_table, NULL, table);
@@ -815,15 +814,6 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
 		}

 			goto done_nolock;
-			/*
-			 * The barrier for this waitqueue_active is not needed,
-			 * we're protected by h->lock and can't miss a wakeup.
-			 */
-		} else if (waitqueue_active(&h->wait)) {
-			spin_unlock(&rbio->bio_list_lock);
-			spin_unlock_irqrestore(&h->lock, flags);
-			wake_up(&h->wait);
-			goto done_nolock;
 		}
 	}
 done:
-- 
2.9.4
Re: [PATCH v4 14/19] xfs: convert to new i_version API
On Fri, Dec 22, 2017 at 07:05:51AM -0500, Jeff Layton wrote: > From: Jeff Layton > > Signed-off-by: Jeff Layton > --- > fs/xfs/libxfs/xfs_inode_buf.c | 7 +-- > fs/xfs/xfs_icache.c | 5 +++-- > fs/xfs/xfs_inode.c| 3 ++- > fs/xfs/xfs_inode_item.c | 3 ++- > fs/xfs/xfs_trans_inode.c | 4 +++- > 5 files changed, 15 insertions(+), 7 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c > index 6b7989038d75..b9c0bf80669c 100644 > --- a/fs/xfs/libxfs/xfs_inode_buf.c > +++ b/fs/xfs/libxfs/xfs_inode_buf.c > @@ -32,6 +32,8 @@ > #include "xfs_ialloc.h" > #include "xfs_dir2.h" > > +#include /me wonders if these ought to be in fs/xfs/xfs_linux.h since this is libxfs, but seeing as I already let that horse escape I might as well clean it up separately. Looks ok, Acked-by: Darrick J. Wong --D > + > /* > * Check that none of the inode's in the buffer have a next > * unlinked field of 0. > @@ -264,7 +266,8 @@ xfs_inode_from_disk( > to->di_flags= be16_to_cpu(from->di_flags); > > if (to->di_version == 3) { > - inode->i_version = be64_to_cpu(from->di_changecount); > + inode_set_iversion_queried(inode, > +be64_to_cpu(from->di_changecount)); > to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec); > to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec); > to->di_flags2 = be64_to_cpu(from->di_flags2); > @@ -314,7 +317,7 @@ xfs_inode_to_disk( > to->di_flags = cpu_to_be16(from->di_flags); > > if (from->di_version == 3) { > - to->di_changecount = cpu_to_be64(inode->i_version); > + to->di_changecount = cpu_to_be64(inode_peek_iversion(inode)); > to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec); > to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec); > to->di_flags2 = cpu_to_be64(from->di_flags2); > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index 43005fbe8b1e..4c315adb05e6 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -37,6 +37,7 @@ > > #include > #include > +#include > > /* > * Allocate and 
initialise an xfs_inode. > @@ -293,14 +294,14 @@ xfs_reinit_inode( > int error; > uint32_tnlink = inode->i_nlink; > uint32_tgeneration = inode->i_generation; > - uint64_tversion = inode->i_version; > + uint64_tversion = inode_peek_iversion(inode); > umode_t mode = inode->i_mode; > > error = inode_init_always(mp->m_super, inode); > > set_nlink(inode, nlink); > inode->i_generation = generation; > - inode->i_version = version; > + inode_set_iversion_queried(inode, version); > inode->i_mode = mode; > return error; > } > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > index 801274126648..dfc5e60d8af3 100644 > --- a/fs/xfs/xfs_inode.c > +++ b/fs/xfs/xfs_inode.c > @@ -16,6 +16,7 @@ > * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > */ > #include > +#include > > #include "xfs.h" > #include "xfs_fs.h" > @@ -833,7 +834,7 @@ xfs_ialloc( > ip->i_d.di_flags = 0; > > if (ip->i_d.di_version == 3) { > - inode->i_version = 1; > + inode_set_iversion(inode, 1); > ip->i_d.di_flags2 = 0; > ip->i_d.di_cowextsize = 0; > ip->i_d.di_crtime.t_sec = (int32_t)tv.tv_sec; > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > index 6ee5c3bf19ad..7571abf5dfb3 100644 > --- a/fs/xfs/xfs_inode_item.c > +++ b/fs/xfs/xfs_inode_item.c > @@ -30,6 +30,7 @@ > #include "xfs_buf_item.h" > #include "xfs_log.h" > > +#include > > kmem_zone_t *xfs_ili_zone; /* inode log item zone */ > > @@ -354,7 +355,7 @@ xfs_inode_to_log_dinode( > to->di_next_unlinked = NULLAGINO; > > if (from->di_version == 3) { > - to->di_changecount = inode->i_version; > + to->di_changecount = inode_peek_iversion(inode); > to->di_crtime.t_sec = from->di_crtime.t_sec; > to->di_crtime.t_nsec = from->di_crtime.t_nsec; > to->di_flags2 = from->di_flags2; > diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c > index daa7615497f9..225544327c4f 100644 > --- a/fs/xfs/xfs_trans_inode.c > +++ b/fs/xfs/xfs_trans_inode.c > @@ -28,6 +28,8 @@ > #include "xfs_inode_item.h" > #include "xfs_trace.h" > > 
+#include > + > /* > * Add a locked inode to the transaction. > * > @@ -117,7 +119,7 @@ xfs_trans_log_inode( >*/ > if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) && > IS_I_VERSION(VFS_I(ip))) { > - VFS_I(ip)->i_version++; > + inode_inc_iversion(VFS_I(ip)); > flags |= XFS_ILOG_CORE; > } > > -- > 2.14.3 > > -- > To un
Re: [PATCH v4 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
On Fri, Dec 22, 2017 at 07:05:54AM -0500, Jeff Layton wrote:
> From: Jeff Layton
>
> If XFS_ILOG_CORE is already set then go ahead and increment it.
>
> Signed-off-by: Jeff Layton

Looks ok,
Acked-by: Darrick J. Wong

> ---
>  fs/xfs/xfs_trans_inode.c | 14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
> index 225544327c4f..4a89da4b6fe7 100644
> --- a/fs/xfs/xfs_trans_inode.c
> +++ b/fs/xfs/xfs_trans_inode.c
> @@ -112,15 +112,17 @@ xfs_trans_log_inode(
>
>  	/*
>  	 * First time we log the inode in a transaction, bump the inode change
> -	 * counter if it is configured for this to occur. We don't use
> -	 * inode_inc_version() because there is no need for extra locking around
> -	 * i_version as we already hold the inode locked exclusively for
> -	 * metadata modification.
> +	 * counter if it is configured for this to occur. While we have the
> +	 * inode locked exclusively for metadata modification, we can usually
> +	 * avoid setting XFS_ILOG_CORE if no one has queried the value since
> +	 * the last time it was incremented. If we have XFS_ILOG_CORE already
> +	 * set however, then go ahead and bump the i_version counter
> +	 * unconditionally.
>  	 */
>  	if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
>  	    IS_I_VERSION(VFS_I(ip))) {
> -		inode_inc_iversion(VFS_I(ip));
> -		flags |= XFS_ILOG_CORE;
> +		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
> +			flags |= XFS_ILOG_CORE;
>  	}
>
>  	tp->t_flags |= XFS_TRANS_DIRTY;
> -- 
> 2.14.3
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs allow compression on NoDataCow files? (AFAIK Not, but it does)
On Thu, 21 Dec 2017 13:51:40 -0500, Chris Mason wrote:

> On 12/20/2017 03:59 PM, Timofey Titovets wrote:
>> How to reproduce:
>> touch test_file
>> chattr +C test_file
>> dd if=/dev/zero of=test_file bs=1M count=1
>> btrfs fi def -vrczlib test_file
>> filefrag -v test_file
>>
>> Filesystem type is: 9123683e
>> File size of test_file is 1048576 (256 blocks of 4096 bytes)
>>  ext: logical_offset: physical_offset: length: expected: flags:
>>    0:    0..   31: 72917050.. 72917081:  32:            encoded
>>    1:   32..   63: 72917118.. 72917149:  32:  72917082: encoded
>>    2:   64..   95: 72919494.. 72919525:  32:  72917150: encoded
>>    3:   96..  127: 72927576.. 72927607:  32:  72919526: encoded
>>    4:  128..  159: 72943261.. 72943292:  32:  72927608: encoded
>>    5:  160..  191: 72944929.. 72944960:  32:  72943293: encoded
>>    6:  192..  223: 72944952.. 72944983:  32:  72944961: encoded
>>    7:  224..  255: 72967084.. 72967115:  32:  72944984: last,encoded,eof
>> test_file: 8 extents found
>>
>> I can't find right now where that error happens in the code,
>> but it's reproducible on Linux 4.14.8
>
> We'll silently cow in a few cases, this is one.

I think the question was about compression, not cow. I can reproduce this behavior:

$ touch nocow.dat
$ touch cow.dat
$ chattr +c cow.dat
$ chattr +C nocow.dat
$ dd if=/dev/zero of=cow.dat count=1 bs=1M
$ dd if=/dev/zero of=nocow.dat count=1 bs=1M
$ filefrag -v cow.dat
Filesystem type is: 9123683e
File size of cow.dat is 1048576 (256 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..   31: 1044845154..1044845185:  32:             encoded,shared
   1:   32..   63: 1044845166..1044845197:  32: 1044845186: encoded,shared
   2:   64..   95: 1044845167..1044845198:  32: 1044845198: encoded,shared
   3:   96..  127: 1044851064..1044851095:  32: 1044845199: encoded,shared
   4:  128..  159: 1044851065..1044851096:  32: 1044851096: encoded,shared
   5:  160..  191: 1044852160..1044852191:  32: 1044851097: encoded,shared
   6:  192..  223: 1044943106..1044943137:  32: 1044852192: encoded,shared
   7:  224..  255: 1045054792..1045054823:  32: 1044943138: last,encoded,shared,eof
cow.dat: 8 extents found
$ filefrag -v nocow.dat
Filesystem type is: 9123683e
File size of nocow.dat is 1048576 (256 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..  255: 1196077983..1196078238: 256:             last,shared,eof
nocow.dat: 1 extent found

Now it seems to be compressed (8x 128k extents):

$ filefrag -v nocow.dat
Filesystem type is: 9123683e
File size of nocow.dat is 1048576 (256 blocks of 4096 bytes)
 ext: logical_offset: physical_offset: length: expected: flags:
   0:    0..   31: 1121866367..1121866398:  32:             encoded,shared
   1:   32..   63: 1121866369..1121866400:  32: 1121866399: encoded,shared
   2:   64..   95: 1121866370..1121866401:  32: 1121866401: encoded,shared
   3:   96..  127: 1121866371..1121866402:  32: 1121866402: encoded,shared
   4:  128..  159: 1121866372..1121866403:  32: 1121866403: encoded,shared
   5:  160..  191: 1121866373..1121866404:  32: 1121866404: encoded,shared
   6:  192..  223: 1121866374..1121866405:  32: 1121866405: encoded,shared
   7:  224..  255: 1121866375..1121866406:  32: 1121866406: last,encoded,shared,eof
nocow.dat: 8 extents found

-- 
Regards,
Kai
Replies to list-only preferred.
Re: [PATCH v4 01/19] fs: new API for handling inode->i_version
On Sat, 2017-12-23 at 10:14 +1100, NeilBrown wrote: > On Fri, Dec 22 2017, Jeff Layton wrote: > > > From: Jeff Layton > > > > Add a documentation blob that explains what the i_version field is, how > > it is expected to work, and how it is currently implemented by various > > filesystems. > > > > We already have inode_inc_iversion. Add several other functions for > > manipulating and accessing the i_version counter. For now, the > > implementation is trivial and basically works the way that all of the > > open-coded i_version accesses work today. > > > > Future patches will convert existing users of i_version to use the new > > API, and then convert the backend implementation to do things more > > efficiently. > > > > Signed-off-by: Jeff Layton > > --- > > fs/btrfs/file.c | 1 + > > fs/btrfs/inode.c | 1 + > > fs/btrfs/ioctl.c | 1 + > > fs/btrfs/xattr.c | 1 + > > fs/ext4/inode.c | 1 + > > fs/ext4/namei.c | 1 + > > fs/inode.c | 1 + > > include/linux/fs.h | 15 > > include/linux/iversion.h | 205 > > +++ > > 9 files changed, 212 insertions(+), 15 deletions(-) > > create mode 100644 include/linux/iversion.h > > > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c > > index eb1bac7c8553..c95d7b2efefb 100644 > > --- a/fs/btrfs/file.c > > +++ b/fs/btrfs/file.c > > @@ -31,6 +31,7 @@ > > #include > > #include > > #include > > +#include > > #include "ctree.h" > > #include "disk-io.h" > > #include "transaction.h" > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > > index e1a7f3cb5be9..27f008b33fc1 100644 > > --- a/fs/btrfs/inode.c > > +++ b/fs/btrfs/inode.c > > @@ -43,6 +43,7 @@ > > #include > > #include > > #include > > +#include > > #include "ctree.h" > > #include "disk-io.h" > > #include "transaction.h" > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > > index 2ef8acaac688..aa452c9e2eff 100644 > > --- a/fs/btrfs/ioctl.c > > +++ b/fs/btrfs/ioctl.c > > @@ -43,6 +43,7 @@ > > #include > > #include > > #include > > +#include > > #include "ctree.h" > > #include "disk-io.h" > 
> #include "transaction.h" > > diff --git a/fs/btrfs/xattr.c b/fs/btrfs/xattr.c > > index 2c7e53f9ff1b..5258c1714830 100644 > > --- a/fs/btrfs/xattr.c > > +++ b/fs/btrfs/xattr.c > > @@ -23,6 +23,7 @@ > > #include > > #include > > #include > > +#include > > #include "ctree.h" > > #include "btrfs_inode.h" > > #include "transaction.h" > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > index 7df2c5644e59..fa5d8bc52d2d 100644 > > --- a/fs/ext4/inode.c > > +++ b/fs/ext4/inode.c > > @@ -39,6 +39,7 @@ > > #include > > #include > > #include > > +#include > > > > #include "ext4_jbd2.h" > > #include "xattr.h" > > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c > > index 798b3ac680db..bcf0dff517be 100644 > > --- a/fs/ext4/namei.c > > +++ b/fs/ext4/namei.c > > @@ -34,6 +34,7 @@ > > #include > > #include > > #include > > +#include > > #include "ext4.h" > > #include "ext4_jbd2.h" > > > > diff --git a/fs/inode.c b/fs/inode.c > > index 03102d6ef044..19e72f500f71 100644 > > --- a/fs/inode.c > > +++ b/fs/inode.c > > @@ -18,6 +18,7 @@ > > #include /* for inode_has_buffers */ > > #include > > #include > > +#include > > #include > > #include "internal.h" > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 511fbaabf624..76382c24e9d0 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -2036,21 +2036,6 @@ static inline void inode_dec_link_count(struct inode > > *inode) > > mark_inode_dirty(inode); > > } > > > > -/** > > - * inode_inc_iversion - increments i_version > > - * @inode: inode that need to be updated > > - * > > - * Every time the inode is modified, the i_version field will be > > incremented. 
> > - * The filesystem has to be mounted with i_version flag > > - */ > > - > > -static inline void inode_inc_iversion(struct inode *inode) > > -{ > > - spin_lock(&inode->i_lock); > > - inode->i_version++; > > - spin_unlock(&inode->i_lock); > > -} > > - > > enum file_time_flags { > > S_ATIME = 1, > > S_MTIME = 2, > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h > > new file mode 100644 > > index ..bb50d27c71f9 > > --- /dev/null > > +++ b/include/linux/iversion.h > > @@ -0,0 +1,205 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +#ifndef _LINUX_IVERSION_H > > +#define _LINUX_IVERSION_H > > + > > +#include > > + > > +/* > > + * The change attribute (i_version) is mandated by NFSv4 and is mostly for > > + * knfsd, but is also used for other purposes (e.g. IMA). The i_version > > must > > + * appear different to observers if there was a change to the inode's data > > or > > + * metadata since it was last queried. > > + * > > + * It should be considered an opaque value by observers. If it remains the > > same > >
Re: [PATCH v4 01/19] fs: new API for handling inode->i_version
On Fri, Dec 22 2017, Jeff Layton wrote: > From: Jeff Layton > > Add a documentation blob that explains what the i_version field is, how > it is expected to work, and how it is currently implemented by various > filesystems. > > We already have inode_inc_iversion. Add several other functions for > manipulating and accessing the i_version counter. For now, the > implementation is trivial and basically works the way that all of the > open-coded i_version accesses work today. > > Future patches will convert existing users of i_version to use the new > API, and then convert the backend implementation to do things more > efficiently. > > Signed-off-by: Jeff Layton > --- > fs/btrfs/file.c | 1 + > fs/btrfs/inode.c | 1 + > fs/btrfs/ioctl.c | 1 + > fs/btrfs/xattr.c | 1 + > fs/ext4/inode.c | 1 + > fs/ext4/namei.c | 1 + > fs/inode.c | 1 + > include/linux/fs.h | 15 > include/linux/iversion.h | 205 > +++ > 9 files changed, 212 insertions(+), 15 deletions(-) > create mode 100644 include/linux/iversion.h > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c > index eb1bac7c8553..c95d7b2efefb 100644 > --- a/fs/btrfs/file.c > +++ b/fs/btrfs/file.c > @@ -31,6 +31,7 @@ > #include > #include > #include > +#include > #include "ctree.h" > #include "disk-io.h" > #include "transaction.h" > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index e1a7f3cb5be9..27f008b33fc1 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -43,6 +43,7 @@ > #include > #include > #include > +#include > #include "ctree.h" > #include "disk-io.h" > #include "transaction.h" > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index 2ef8acaac688..aa452c9e2eff 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -43,6 +43,7 @@ > #include > #include > #include > +#include > #include "ctree.h" > #include "disk-io.h" > #include "transaction.h" > diff --git a/fs/btrfs/xattr.c b/fs/btrfs/xattr.c > index 2c7e53f9ff1b..5258c1714830 100644 > --- a/fs/btrfs/xattr.c > +++ b/fs/btrfs/xattr.c > @@ -23,6 +23,7 
@@ > #include > #include > #include > +#include > #include "ctree.h" > #include "btrfs_inode.h" > #include "transaction.h" > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 7df2c5644e59..fa5d8bc52d2d 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -39,6 +39,7 @@ > #include > #include > #include > +#include > > #include "ext4_jbd2.h" > #include "xattr.h" > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c > index 798b3ac680db..bcf0dff517be 100644 > --- a/fs/ext4/namei.c > +++ b/fs/ext4/namei.c > @@ -34,6 +34,7 @@ > #include > #include > #include > +#include > #include "ext4.h" > #include "ext4_jbd2.h" > > diff --git a/fs/inode.c b/fs/inode.c > index 03102d6ef044..19e72f500f71 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -18,6 +18,7 @@ > #include /* for inode_has_buffers */ > #include > #include > +#include > #include > #include "internal.h" > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 511fbaabf624..76382c24e9d0 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2036,21 +2036,6 @@ static inline void inode_dec_link_count(struct inode > *inode) > mark_inode_dirty(inode); > } > > -/** > - * inode_inc_iversion - increments i_version > - * @inode: inode that need to be updated > - * > - * Every time the inode is modified, the i_version field will be incremented. 
> - * The filesystem has to be mounted with i_version flag
> - */
> -
> -static inline void inode_inc_iversion(struct inode *inode)
> -{
> - spin_lock(&inode->i_lock);
> - inode->i_version++;
> - spin_unlock(&inode->i_lock);
> -}
> -
> enum file_time_flags {
> S_ATIME = 1,
> S_MTIME = 2,
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> new file mode 100644
> index ..bb50d27c71f9
> --- /dev/null
> +++ b/include/linux/iversion.h
> @@ -0,0 +1,205 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_IVERSION_H
> +#define _LINUX_IVERSION_H
> +
> +#include
> +
> +/*
> + * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> + * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> + * appear different to observers if there was a change to the inode's data or
> + * metadata since it was last queried.
> + *
> + * It should be considered an opaque value by observers. If it remains the same

You keep using that word ... I don't think it means what you think it means.
Change that sentence to:

  Observers see i_version as a 64-bit number which never decreases.

and the rest still makes perfect sense.

Thanks,
NeilBrown

> + * since it was last checked, then nothing has changed in the inode. If it's
> + * different then something has changed. Observers cannot
Re: Unexpected raid1 behaviour
On Fri, Dec 22, 2017 at 9:05 AM, Tomasz Pala wrote: > On Thu, Dec 21, 2017 at 07:27:23 -0500, Austin S. Hemmelgarn wrote: > >> Also, it's not 'up to the filesystem', it's 'up to the underlying >> device'. LUKS, LVM, MD, and everything else that's an actual device >> layer is what systemd waits on. XFS, ext4, and any other filesystem >> except BTRFS (and possibly ZFS, but I'm not 100% sure about that) >> provides absolutely _NOTHING_ to wait on. Systemd just chose to handle > > You wait for all the devices to settle. One might have dozen of drives > including some attached via network and it might take a time to become > available. Since systemd knows nothing about underlying components, it > simply waits for the btrfs itself to announce it's ready. I'm pretty sure degraded boot timeout policy is handled by dracut. The kernel doesn't just automatically assemble an md array as soon as it's possible (degraded) and then switch to normal operation as other devices appear. I have no idea how LVM manages the delay policy for multiple devices. I don't think the delay policy belongs in the kernel. It's pie in the sky, and unicorns, but it sure would be nice to have standardization rather than everyone rolling their own solution. The Red Hat Stratis folks will need something to do this for their solution so yet another one is about to be developed... -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
On Fri, Dec 15, 2017 at 11:58:27AM -0800, Chris Mason wrote: > refcounts have a generic implementation and an asm optimized one. The > generic version has extra debugging to make sure that once a refcount > goes to zero, refcount_inc won't increase it. > > The btrfs delayed inode code wasn't expecting this, and we're tripping > over the warnings when the generic refcounts are used. We ended up with > this race: > > Process A Process B > btrfs_get_delayed_node() > spin_lock(root->inode_lock) > radix_tree_lookup() > __btrfs_release_delayed_node() > refcount_dec_and_test(&delayed_node->refs) > our refcount is now zero > refcount_add(2) <--- > warning here, refcount > unchanged > > spin_lock(root->inode_lock) > radix_tree_delete() > > With the generic refcounts, we actually warn again when process B above > tries to release his refcount because refcount_add() turned into a > no-op. > > We saw this in production on older kernels without the asm optimized > refcounts. > > The fix used here is to use refcount_inc_not_zero() to detect when the > object is in the middle of being freed and return NULL. This is almost > always the right answer anyway, since we usually end up pitching the > delayed_node if it didn't have fresh data in it. > > This also changes __btrfs_release_delayed_node() to remove the extra > check for zero refcounts before radix tree deletion. > btrfs_get_delayed_node() was the only path that was allowing refcounts > to go from zero to one. 
> Reviewed-by: Liu Bo -liubo > Signed-off-by: Chris Mason > Fixes: 6de5f18e7b0da > cc: #4.12+ > --- > fs/btrfs/delayed-inode.c | 45 ++--- > 1 file changed, 34 insertions(+), 11 deletions(-) > > diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c > index 5d73f79..84c54af 100644 > --- a/fs/btrfs/delayed-inode.c > +++ b/fs/btrfs/delayed-inode.c > @@ -87,6 +87,7 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node( > > spin_lock(&root->inode_lock); > node = radix_tree_lookup(&root->delayed_nodes_tree, ino); > + > if (node) { > if (btrfs_inode->delayed_node) { > refcount_inc(&node->refs); /* can be accessed */ > @@ -94,9 +95,30 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node( > spin_unlock(&root->inode_lock); > return node; > } > - btrfs_inode->delayed_node = node; > - /* can be accessed and cached in the inode */ > - refcount_add(2, &node->refs); > + > + /* it's possible that we're racing into the middle of > + * removing this node from the radix tree. In this case, > + * the refcount was zero and it should never go back > + * to one. Just return NULL like it was never in the radix > + * at all; our release function is in the process of removing > + * it. > + * > + * Some implementations of refcount_inc refuse to > + * bump the refcount once it has hit zero. If we don't do > + * this dance here, refcount_inc() may decide to > + * just WARN_ONCE() instead of actually bumping the refcount. > + * > + * If this node is properly in the radix, we want to > + * bump the refcount twice, once for the inode > + * and once for this get operation. 
> + */ > + if (refcount_inc_not_zero(&node->refs)) { > + refcount_inc(&node->refs); > + btrfs_inode->delayed_node = node; > + } else { > + node = NULL; > + } > + > spin_unlock(&root->inode_lock); > return node; > } > @@ -254,17 +276,18 @@ static void __btrfs_release_delayed_node( > mutex_unlock(&delayed_node->mutex); > > if (refcount_dec_and_test(&delayed_node->refs)) { > - bool free = false; > struct btrfs_root *root = delayed_node->root; > + > spin_lock(&root->inode_lock); > - if (refcount_read(&delayed_node->refs) == 0) { > - radix_tree_delete(&root->delayed_nodes_tree, > - delayed_node->inode_id); > - free = true; > - } > + /* > + * once our refcount goes to zero, nobody is allowed to > + * bump it back up. We can delete it now > + */ > + ASSERT(refcount_read(&delayed_node->refs) == 0); > + radix_tree_
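The "refuse to bump a refcount that already hit zero" pattern used in the fix above can be sketched in userspace with C11 atomics. This is an illustration of the idea only, not the kernel's refcount_t or the btrfs code; the names are made up:

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy object with a reference count. Once the count has dropped to
 * zero the object is in the middle of being freed, so a lookup must
 * refuse to resurrect it. */
struct obj {
	atomic_int refs;
};

/* Mirrors the idea of refcount_inc_not_zero(): take a reference only
 * if the count is still non-zero. Returning 0 lets the caller act as
 * if the object was never found, which is what the patched
 * btrfs_get_delayed_node() does by returning NULL. */
static int obj_get_not_zero(struct obj *o)
{
	int old = atomic_load(&o->refs);

	do {
		if (old == 0)
			return 0;	/* mid-free: act as if not found */
	} while (!atomic_compare_exchange_weak(&o->refs, &old, old + 1));

	return 1;
}
```

The compare-and-swap loop is what makes "check for zero, then increment" one atomic step; a plain load followed by an increment would reintroduce the race described in the commit message.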
Re: [PATCH 07/10] Btrfs: fix unexpected EEXIST from btrfs_get_extent
On Fri, Dec 22, 2017 at 02:10:45PM +0200, Nikolay Borisov wrote:
>
>
> On 22.12.2017 00:42, Liu Bo wrote:
> > This fixes a corner case that is caused by a race of dio write vs dio
> > read/write.
> >
> > dio write:
> > [0, 32k) -> [0, 8k) + [8k, 32k)
> >
> > dio read/write:
> >
> > While get_extent() with [0, 4k), [0, 8k) is found as existing em, even
> > though start == existing->start, em is [0, 32k),
> > extent_map_end(em) > extent_map_end(existing),
> > then it goes thru merge_extent_mapping() which tries to add a [8k, 8k)
> > (with a length 0), and get_extent ends up returning -EEXIST, and dio
> > read/write will get -EEXIST which is confusing applications.
>
> I think this paragraph should be expanded a bit since you've crammed a
> lot of information in few words.
>

OK, it's not easy to explain the problem, either. Probably I should
also attach the following as an extra explanation,

+ * Suppose that no extent map has been loaded into memory yet.
+ * There is a file extent [0, 32K), two jobs are running concurrently
+ * against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
+ * read from [0, 4K) or [4K, 8K).
+ *
+ * t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K).
+ *
+ * t1                                  t2
+ * btrfs_get_blocks_direct()           btrfs_get_blocks_direct()
+ *  -> btrfs_get_extent()               -> btrfs_get_extent()
+ *                                       -> lookup_extent_mapping()
+ *  -> add_extent_mapping()             -> lookup_extent_mapping()
+ *     # load [0, 32K)
+ *  -> btrfs_new_extent_direct()
+ *      -> btrfs_drop_extent_cache()
+ *         # split [0, 32K)
+ *      -> add_extent_mapping()
+ *         # add [8K, 32K)
+ *                                       -> add_extent_mapping()
+ *                                          # handle -EEXIST when adding
+ *                                          # [0, 32K)

> In the below graphs em should be extent we'd like to insert, no?
> So given that it always encompasses the existing (0, 8k given the
> above description) then em should be 0, 32k ?

Yes and yes.
thanks,
-liubo

> >
> > Here I concluded all the possible situations,
> >
> > 1) start < existing->start
> >    (start falls in front of the existing em, next to the prev em)
> >
> > 2) start == existing->start
> >    (start sits exactly at the front edge of the existing em)
> >
> > 3) start > existing->start && start < (existing->start + existing->len)
> >    (start falls inside the existing em)
> >
> > 4) start >= (existing->start + existing->len)
> >    (start falls behind the existing em, next to the next em)
> >
> > After going thru the above case by case, it turns out that if start is
> > within existing em (front inclusive), then the existing em should be
> > returned, otherwise, we try our best to merge candidate em with
> > sibling ems to form a larger em.
> >
> > Reported-by: David Vallender
> > Signed-off-by: Liu Bo
> > ---
> >  fs/btrfs/extent_map.c | 25 ++---
> >  1 file changed, 10 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> > index 6653b08..d386cfb 100644
> > --- a/fs/btrfs/extent_map.c
> > +++ b/fs/btrfs/extent_map.c
> > @@ -483,7 +483,7 @@ static struct extent_map *prev_extent_map(struct extent_map *em)
> >  static int merge_extent_mapping(struct extent_map_tree *em_tree,
> >  				struct extent_map *existing,
> >  				struct extent_map *em,
> > -				u64 map_start)
> > +				u64 map_start, u64 map_len)
> >  {
> >  	struct extent_map *prev;
> >  	struct extent_map *next;
> > @@ -496,9 +496,13 @@ static int merge_extent_mapping(struct extent_map_tree *em_tree,
> >  	if (existing->start > map_start) {
> >  		next = existing;
> >  		prev = prev_extent_map(next);
> > +		if (prev)
> > +			ASSERT(extent_map_end(prev) <= map_start);
> >  	} else {
> >  		prev = existing;
> >  		next = next_extent_map(prev);
> > +		if (next)
> > +			ASSERT(map_start + map_len <= ne
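The case analysis in the patch description reduces to one "front inclusive" containment test. A toy sketch of that rule (not the btrfs code; the struct and helper names here are made up):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of an in-memory extent map, for illustration. */
struct toy_em {
	uint64_t start;
	uint64_t len;
};

static uint64_t toy_em_end(const struct toy_em *em)
{
	return em->start + em->len;
}

/* Cases 2 and 3 return 1: start is within the existing em, front
 * inclusive, so the existing em should be handed back to the caller.
 * Cases 1 and 4 return 0 and would instead take the path that merges
 * the candidate em with its sibling ems. */
static int start_within_em(const struct toy_em *existing, uint64_t start)
{
	return start >= existing->start && start < toy_em_end(existing);
}
```

With an existing em of [8K, 16K), starts of 4K, 8K, 12K, and 16K exercise cases 1 through 4 respectively.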
Re: [PATCH 09/10] Btrfs: add tracepoint for em's EEXIST case
On Fri, Dec 22, 2017 at 10:56:31AM +0200, Nikolay Borisov wrote: > > > On 22.12.2017 00:42, Liu Bo wrote: > > This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the > > subtle bugs around merge_extent_mapping. > > In the next patch you are already making the function which takes all > these values noinline, meaning you can attach a kprobe so you can > interrogate the args via systemtap,perf probe or even bpf. So I'd rather > not add this tracepoint since the general sentiment seems to be that > tracepoints are ABI and so have to be maintained. > The following patch of noinline function is inside the else {} branch, so kprobe would give us something back only when it runs to else{}. Since subtle bugs may lie in this area, a simple tracepoint like this can help us understand them more efficiently. thanks, -liubo > > > > Signed-off-by: Liu Bo > > --- > > fs/btrfs/extent_map.c| 1 + > > include/trace/events/btrfs.h | 35 +++ > > 2 files changed, 36 insertions(+) > > > > diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c > > index a8b7e24..40e4d30 100644 > > --- a/fs/btrfs/extent_map.c > > +++ b/fs/btrfs/extent_map.c > > @@ -539,6 +539,7 @@ int btrfs_add_extent_mapping(struct extent_map_tree > > *em_tree, > > ret = 0; > > > > existing = search_extent_mapping(em_tree, start, len); > > + trace_btrfs_handle_em_exist(existing, em, start, len); > > > > /* > > * existing will always be non-NULL, since there must be > > diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h > > index 4342a32..b7ffcf7 100644 > > --- a/include/trace/events/btrfs.h > > +++ b/include/trace/events/btrfs.h > > @@ -249,6 +249,41 @@ TRACE_EVENT_CONDITION(btrfs_get_extent, > > __entry->refs, __entry->compress_type) > > ); > > > > +TRACE_EVENT(btrfs_handle_em_exist, > > + > > + TP_PROTO(const struct extent_map *existing, const struct extent_map > > *map, u64 start, u64 len), > > + > > + TP_ARGS(existing, map, start, len), > > + > > + TP_STRUCT__entry( > > + 
__field(u64, e_start ) > > + __field(u64, e_len ) > > + __field(u64, map_start ) > > + __field(u64, map_len ) > > + __field(u64, start ) > > + __field(u64, len ) > > + ), > > + > > + TP_fast_assign( > > + __entry->e_start= existing->start; > > + __entry->e_len = existing->len; > > + __entry->map_start = map->start; > > + __entry->map_len= map->len; > > + __entry->start = start; > > + __entry->len= len; > > + ), > > + > > + TP_printk("start=%llu len=%llu " > > + "existing(start=%llu len=%llu) " > > + "em(start=%llu len=%llu)", > > + (unsigned long long)__entry->start, > > + (unsigned long long)__entry->len, > > + (unsigned long long)__entry->e_start, > > + (unsigned long long)__entry->e_len, > > + (unsigned long long)__entry->map_start, > > + (unsigned long long)__entry->map_len) > > +); > > + > > /* file extent item */ > > DECLARE_EVENT_CLASS(btrfs__file_extent_item_regular,
Re: [PATCH 04/10] Btrfs: extent map selftest: buffered write vs dio read
On Fri, Dec 22, 2017 at 09:51:24AM +0200, Nikolay Borisov wrote: > > > On 22.12.2017 00:42, Liu Bo wrote: > > This test case simulates the racy situation of buffered write vs dio > > read, and see if btrfs_get_extent() would return -EEXIST. > > Isn't mixing dio/buffered IO on the same file (range?) considered > dangerous in any case? They are, but it is sometimes the way how applications work. thanks, -liubo > > > > > Signed-off-by: Liu Bo > > --- > > fs/btrfs/tests/extent-map-tests.c | 73 > > +++ > > 1 file changed, 73 insertions(+) > > > > diff --git a/fs/btrfs/tests/extent-map-tests.c > > b/fs/btrfs/tests/extent-map-tests.c > > index 0407396..2adf55f 100644 > > --- a/fs/btrfs/tests/extent-map-tests.c > > +++ b/fs/btrfs/tests/extent-map-tests.c > > @@ -181,6 +181,78 @@ static void test_case_2(struct extent_map_tree > > *em_tree) > > free_extent_map_tree(em_tree); > > } > > > > +static void __test_case_3(struct extent_map_tree *em_tree, u64 start) > > +{ > > + struct extent_map *em; > > + u64 len = SZ_4K; > > + int ret; > > + > > + em = alloc_extent_map(); > > + if (!em) > > + /* Skip this test on error. */ > > + return; > > + > > + /* Add [4K, 8K) */ > > + em->start = SZ_4K; > > + em->len = SZ_4K; > > + em->block_start = SZ_4K; > > + em->block_len = SZ_4K; > > + ret = add_extent_mapping(em_tree, em, 0); > > + ASSERT(ret == 0); > > + free_extent_map(em); > > + > > + em = alloc_extent_map(); > > + if (!em) > > + goto out; > > + > > + /* Add [0, 16K) */ > > + em->start = 0; > > + em->len = SZ_16K; > > + em->block_start = 0; > > + em->block_len = SZ_16K; > > + ret = btrfs_add_extent_mapping(em_tree, &em, start, len); > > + if (ret) > > + test_msg("case3 [0x%llx 0x%llx): ret %d\n", > > +start, start + len, ret); > > + /* > > +* Since bytes within em are contiguous, em->block_start is identical to > > +* em->start. 
> > +*/ > > + if (em && > > + (start < em->start || start + len > extent_map_end(em) || > > +em->start != em->block_start || em->len != em->block_len)) > > + test_msg("case3 [0x%llx 0x%llx): ret %d em (start 0x%llx len > > 0x%llx block_start 0x%llx block_len 0x%llx)\n", > > +start, start + len, ret, em->start, em->len, > > +em->block_start, em->block_len); > > + free_extent_map(em); > > +out: > > + /* free memory */ > > + free_extent_map_tree(em_tree); > > +} > > + > > +/* > > + * Test scenario: > > + * > > + * Suppose that no extent map has been loaded into memory yet. > > + * There is a file extent [0, 16K), two jobs are running concurrently > > + * against it, t1 is buffered writing to [4K, 8K) and t2 is doing dio > > + * read from [0, 4K) or [8K, 12K) or [12K, 16K). > > + * > > + * t1 goes ahead of t2 and adds em [4K, 8K) into tree. > > + * > > + * t1 t2 > > + * cow_file_range()btrfs_get_extent() > > + *-> lookup_extent_mapping() > > + * -> add_extent_mapping() > > + *-> add_extent_mapping() > > + */ > > +static void test_case_3(struct extent_map_tree *em_tree) > > +{ > > + __test_case_3(em_tree, 0); > > + __test_case_3(em_tree, SZ_8K); > > + __test_case_3(em_tree, (12 * 1024ULL)); > > +} > > + > > int btrfs_test_extent_map() > > { > > struct extent_map_tree *em_tree; > > @@ -196,6 +268,7 @@ int btrfs_test_extent_map() > > > > test_case_1(em_tree); > > test_case_2(em_tree); > > + test_case_3(em_tree); > > > > kfree(em_tree); > > return 0; > >
Re: [PATCH 03/10] Btrfs: add extent map selftests
On Fri, Dec 22, 2017 at 09:42:17AM +0200, Nikolay Borisov wrote: > > > On 22.12.2017 00:42, Liu Bo wrote: > > We've observed that btrfs_get_extent() and merge_extent_mapping() could > > return -EEXIST in several cases, and they are caused by some racy > > condition, e.g dio read vs dio write, which makes the problem very tricky > > to reproduce. > > > > This adds extent map selftests in order to simulate those racy situations. > > > > Signed-off-by: Liu Bo > > --- > > fs/btrfs/Makefile | 2 +- > > fs/btrfs/tests/btrfs-tests.c | 3 + > > fs/btrfs/tests/btrfs-tests.h | 1 + > > fs/btrfs/tests/extent-map-tests.c | 202 > > ++ > > 4 files changed, 207 insertions(+), 1 deletion(-) > > create mode 100644 fs/btrfs/tests/extent-map-tests.c > > > > diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile > > index 6fe881d..0c43736 100644 > > --- a/fs/btrfs/Makefile > > +++ b/fs/btrfs/Makefile > > @@ -19,4 +19,4 @@ btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o > > btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \ > > tests/extent-buffer-tests.o tests/btrfs-tests.o \ > > tests/extent-io-tests.o tests/inode-tests.o tests/qgroup-tests.o \ > > - tests/free-space-tree-tests.o > > + tests/free-space-tree-tests.o tests/extent-map-tests.o > > diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c > > index d3f2537..9786d8c 100644 > > --- a/fs/btrfs/tests/btrfs-tests.c > > +++ b/fs/btrfs/tests/btrfs-tests.c > > @@ -277,6 +277,9 @@ int btrfs_run_sanity_tests(void) > > goto out; > > } > > } > > + ret = btrfs_test_extent_map(); > > + if (ret) > > + goto out; > > out: > > btrfs_destroy_test_fs(); > > return ret; > > diff --git a/fs/btrfs/tests/btrfs-tests.h b/fs/btrfs/tests/btrfs-tests.h > > index 266f1e3..bc0615b 100644 > > --- a/fs/btrfs/tests/btrfs-tests.h > > +++ b/fs/btrfs/tests/btrfs-tests.h > > @@ -33,6 +33,7 @@ int btrfs_test_extent_io(u32 sectorsize, u32 nodesize); > > int btrfs_test_inodes(u32 sectorsize, u32 nodesize); > > int 
btrfs_test_qgroups(u32 sectorsize, u32 nodesize); > > int btrfs_test_free_space_tree(u32 sectorsize, u32 nodesize); > > +int btrfs_test_extent_map(void); > > struct inode *btrfs_new_test_inode(void); > > struct btrfs_fs_info *btrfs_alloc_dummy_fs_info(u32 nodesize, u32 > > sectorsize); > > void btrfs_free_dummy_fs_info(struct btrfs_fs_info *fs_info); > > diff --git a/fs/btrfs/tests/extent-map-tests.c > > b/fs/btrfs/tests/extent-map-tests.c > > new file mode 100644 > > index 000..0407396 > > --- /dev/null > > +++ b/fs/btrfs/tests/extent-map-tests.c > > @@ -0,0 +1,202 @@ > > +/* > > + * Copyright (C) 2017 Oracle. All rights reserved. > > + * > > + * This program is free software; you can redistribute it and/or > > + * modify it under the terms of the GNU General Public > > + * License v2 as published by the Free Software Foundation. > > + * > > + * This program is distributed in the hope that it will be useful, > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + * General Public License for more details. > > + * > > + * You should have received a copy of the GNU General Public > > + * License along with this program; if not, write to the > > + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, > > + * Boston, MA 021110-1307, USA. 
> > + */ > > + > > +#include > > +#include "btrfs-tests.h" > > +#include "../ctree.h" > > + > > +static void free_extent_map_tree(struct extent_map_tree *em_tree) > > +{ > > + struct extent_map *em; > > + struct rb_node *node; > > + > > + while (!RB_EMPTY_ROOT(&em_tree->map)) { > > + node = rb_first(&em_tree->map); > > + em = rb_entry(node, struct extent_map, rb_node); > > + remove_extent_mapping(em_tree, em); > > + > > +#ifdef CONFIG_BTRFS_DEBUG > > + if (refcount_read(&em->refs) != 1) { > > + test_msg( > > +"Oops we have a em leak: em (start 0x%llx len > > 0x%llx block_start 0x%llx block_len 0x%llx) refs %d\n", > > +em->start, em->len, em->block_start, > > +em->block_len, refcount_read(&em->refs)); > > + > > + refcount_set(&em->refs, 1); > > + } > > +#endif > > + free_extent_map(em); > > + } > > +} > > + > > +/* > > + * Test scenario: > > + * > > + * Suppose that no extent map has been loaded into memory yet, there is a > > file > > + * extent [0, 16K), followed by another file extent [16K, 20K), two dio > > reads > > + * are entering btrfs_get_extent() concurrently, t1 is reading [8K, 16K), > > t2 is > > + * reading [0, 8K) > > + * > > + * t1t2 > > + * btrfs_get_extent() btrfs_get_extent() > > + *-> lookup_extent_ma
Re: [PATCH 01/10] Btrfs: add helper for em merge logic
On Fri, Dec 22, 2017 at 09:23:40AM +0200, Nikolay Borisov wrote: > > > On 22.12.2017 00:42, Liu Bo wrote: > > This is a prepare work for the following extent map selftest, which > > runs tests against em merge logic. > > > > Signed-off-by: Liu Bo > > --- > > fs/btrfs/ctree.h | 2 ++ > > fs/btrfs/inode.c | 101 > > ++- > > 2 files changed, 58 insertions(+), 45 deletions(-) > > > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > > index b2e09fe..328f40f 100644 > > --- a/fs/btrfs/ctree.h > > +++ b/fs/btrfs/ctree.h > > @@ -3148,6 +3148,8 @@ struct btrfs_delalloc_work > > *btrfs_alloc_delalloc_work(struct inode *inode, > > int delay_iput); > > void btrfs_wait_and_free_delalloc_work(struct btrfs_delalloc_work *work); > > > > +int btrfs_add_extent_mapping(struct extent_map_tree *em_tree, > > +struct extent_map **em_in, u64 start, u64 len); > > struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode, > > struct page *page, size_t pg_offset, u64 start, > > u64 len, int create); > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > > index e1a7f3c..527df6f 100644 > > --- a/fs/btrfs/inode.c > > +++ b/fs/btrfs/inode.c > > @@ -6911,6 +6911,61 @@ static noinline int uncompress_inline(struct > > btrfs_path *path, > > return ret; > > } > > > > +int btrfs_add_extent_mapping(struct extent_map_tree *em_tree, > > +struct extent_map **em_in, u64 start, u64 len) > > How about adding the following comment above the function: > > /** > > * btrfs_add_extent_mapping - try to add given extent mapping > > * @em_tree - the extent tree into which we want to add the mapping > > * @em_in - extent we are inserting > > * @start - the start of the logical range of the extent we are adding > > * @len - logical length of the extent > > * > > * Insert @em_in into @em_tree. In case there is an overlapping range, handle > > * the -EEXIST by either: > > * a) Returning the existing extent in @em_in if there is a full overlap > > * b) Merge the extents if they are near each other. 
> > * Returns 0 on success or a negative error code
> > */

Appreciate your comments. Sure, comments are always welcome.

> Also one thing which I'm not very clear on is why do we need the
> start/len, aren't those already set in em_in?

[start, start+len) is within *em_in, i.e.

  |---------------*em_in---------------|
      |-----------|
      start       start+len

What we really care about is [start, start+len), which is passed to
btrfs_get_extent(). For example, if a file extent item on disk is
[0, 32k), given a tuple (start=0, len=4k), reading [0, 4k) would end up
with *em_in being [0, 32k). So if add_extent_mapping(*em_in) returns
EEXIST, then we know the 'existing' em overlaps with *em_in, and we need
to check whether it overlaps with [start, start+len).

thanks,
-liubo

> >
> > +{
> > +	int ret;
> > +	struct extent_map *em = *em_in;
> > +
> > +	ret = add_extent_mapping(em_tree, em, 0);
> > +	/* it is possible that someone inserted the extent into the tree
> > +	 * while we had the lock dropped. It is also possible that
> > +	 * an overlapping map exists in the tree
> > +	 */
> > +	if (ret == -EEXIST) {
> > +		struct extent_map *existing;
> > +
> > +		ret = 0;
> > +
> > +		existing = search_extent_mapping(em_tree, start, len);
> > +
> > +		/*
> > +		 * existing will always be non-NULL, since there must be
> > +		 * extent causing the -EEXIST.
> > +		 */
> > +		if (existing->start == em->start &&
> > +		    extent_map_end(existing) >= extent_map_end(em) &&
> > +		    em->block_start == existing->block_start) {
> > +			/*
> > +			 * The existing extent map already encompasses the
> > +			 * entire extent map we tried to add.
> > +			 */
> > +			free_extent_map(em);
> > +			*em_in = existing;
> > +			ret = 0;
> > +		} else if (start >= extent_map_end(existing) ||
> > +			   start <= existing->start) {
> > +			/*
> > +			 * The existing extent map is the one nearest to
> > +			 * the [start, start +
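Liu Bo's point, that the overlap has to be judged against [start, start + len) rather than against the (possibly much larger) *em_in, can be sketched with a hypothetical helper. This is an illustration of the reasoning, not the kernel function:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of an extent map, for illustration. */
struct toy_em {
	uint64_t start;
	uint64_t len;
};

/* After -EEXIST, ask whether the em already in the tree fully covers
 * the range the caller originally requested. If it does, the existing
 * em can simply be returned; *em_in being larger is irrelevant. */
static int existing_covers_request(const struct toy_em *existing,
				   uint64_t start, uint64_t len)
{
	return start >= existing->start &&
	       start + len <= existing->start + existing->len;
}
```

Using the numbers from the example above: with a file extent of [0, 32k) already in the tree, a read of [0, 4k) is fully covered even though *em_in also came back as [0, 32k).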
Re: Unexpected raid1 behaviour
On Thu, Dec 21, 2017 at 07:27:23 -0500, Austin S. Hemmelgarn wrote: > No, it isn't. You can just make the damn mount call with the supplied > options. If it succeeds, the volume was ready, if it fails, it wasn't, > it's that simple, and there's absolutely no reason that systemd can't > just do that in a loop until it succeeds or a timeout is reached. That There is no such loop, so if mount would happen before all the required devices show up, it would either definitely fail, or if there were 'degraded' in fstab, just start degraded. > any of these issues with the volume being completely unusable in a > degraded state. > > Also, it's not 'up to the filesystem', it's 'up to the underlying > device'. LUKS, LVM, MD, and everything else that's an actual device > layer is what systemd waits on. XFS, ext4, and any other filesystem > except BTRFS (and possibly ZFS, but I'm not 100% sure about that) > provides absolutely _NOTHING_ to wait on. Systemd just chose to handle You wait for all the devices to settle. One might have dozen of drives including some attached via network and it might take a time to become available. Since systemd knows nothing about underlying components, it simply waits for the btrfs itself to announce it's ready. > BTRFS like a device layer, and not a filesystem, so we have this crap to As btrfs handles many devices in "lower part", this effectively is device layer. Mounting /dev/sda happens to mount various other /dev/sd* that are _not_ explicitly exposed, so there is really not an alternative. Except for the 'mount loop' which is a no-go. > deal with, as well as the fact that it makes it impossible to manually > mount a BTRFS volume with missing or failed devices in degraded mode > under systemd (because it unmounts it damn near instantly because it > somehow thinks it knows better than the user what the user wants to do). This seems to be some distro-specific misconfiguration, didn't happen to me on plain systemd/udev. 
What is the reproducing scenario?

>> This integration issue was so far silently ignored both by btrfs and
>> systemd developers.

> It's been ignored by BTRFS devs because there is _nothing_ wrong on this
> side other than the naming choice for the ioctl. Systemd is _THE ONLY_
> init system which has this issue, every other one works just fine.

Not true - mounting btrfs without "btrfs device scan" doesn't work at all
without udev rules (which mimic the behaviour of that command). Let me
repeat the example from Dec 19th:

1. create a 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
2. reboot the system into a clean state (init=/bin/sh), or remove the
   btrfs-scan tool,
3. try:
   mount /dev/sda /test - fails
   mount /dev/sdb /test - works
4. reboot again and try in reversed order:
   mount /dev/sdb /test - fails
   mount /dev/sda /test - works

> As far as the systemd side, I have no idea why they are ignoring it,
> though I suspect it's the usual spoiled brat mentality that seems to be
> present about everything that people complain about regarding systemd.

Explanation above. This is the point where _you_ need to stop ignoring
the fact that you simply cannot just try mounting devices in a loop, as
this would render any NAS/FC/iSCSI-backed or more complicated setup
unusable, or hide problems in case of temporary connection issues.
systemd waits for the _underlying_ device - unless btrfs exposes a list
of the _actual_ devices to wait for, there is nothing systemd can do
except wait for the btrfs itself.

-- 
Tomasz Pala
[RFC PATCH] Btrfs: replace custom heuristic ws allocation logic with mempool API
Currently the btrfs compression code uses a custom wrapper to store
allocated compression/heuristic workspaces. That logic tries to keep at
least ncpu+1 workspaces of each type. As far as I can see, it fully
reimplements the mempool API, so using mempool can simplify the code and
allow for cleaning it up.

This is a proof-of-concept patch; I have tested it (at least it works),
and future versions will look mostly the same. If this is acceptable, the
next steps will be:
1. Create mempool_alloc_w(), which will resize the mempool to the
   appropriate size (ncpu+1) and will create the appropriate mempool if
   creation failed in __init.
2. Convert the per-compression workspaces to mempools.

Thanks.

Signed-off-by: Timofey Titovets
Cc: David Sterba
---
 fs/btrfs/compression.c | 123 -
 1 file changed, 39 insertions(+), 84 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 208334aa6c6e..cf47089b9ec0 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -34,6 +34,7 @@
 #include
 #include
 #include
+#include
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -768,14 +769,11 @@ struct heuristic_ws {
 	struct bucket_item *bucket;
 	/* Sorting buffer */
 	struct bucket_item *bucket_b;
-	struct list_head list;
 };

-static void free_heuristic_ws(struct list_head *ws)
+static void heuristic_ws_free(void *element, void *pool_data)
 {
-	struct heuristic_ws *workspace;
-
-	workspace = list_entry(ws, struct heuristic_ws, list);
+	struct heuristic_ws *workspace = (struct heuristic_ws *) element;

 	kvfree(workspace->sample);
 	kfree(workspace->bucket);
@@ -783,13 +781,12 @@ static void free_heuristic_ws(struct list_head *ws)
 	kfree(workspace);
 }

-static struct list_head *alloc_heuristic_ws(void)
+static void *heuristic_ws_alloc(gfp_t gfp_mask, void *pool_data)
 {
-	struct heuristic_ws *ws;
+	struct heuristic_ws *ws = kmalloc(sizeof(*ws), GFP_KERNEL);

-	ws = kzalloc(sizeof(*ws), GFP_KERNEL);
 	if (!ws)
-		return ERR_PTR(-ENOMEM);
+		return ws;

 	ws->sample = kvmalloc(MAX_SAMPLE_SIZE, GFP_KERNEL);
 	if
(!ws->sample) @@ -803,11 +800,14 @@ static struct list_head *alloc_heuristic_ws(void) if (!ws->bucket_b) goto fail; - INIT_LIST_HEAD(&ws->list); - return &ws->list; + return ws; + fail: - free_heuristic_ws(&ws->list); - return ERR_PTR(-ENOMEM); + kvfree(ws->sample); + kfree(ws->bucket); + kfree(ws->bucket_b); + kfree(ws); + return NULL; } struct workspaces_list { @@ -821,10 +821,9 @@ struct workspaces_list { wait_queue_head_t ws_wait; }; +static mempool_t *btrfs_heuristic_ws_pool; static struct workspaces_list btrfs_comp_ws[BTRFS_COMPRESS_TYPES]; -static struct workspaces_list btrfs_heuristic_ws; - static const struct btrfs_compress_op * const btrfs_compress_op[] = { &btrfs_zlib_compress, &btrfs_lzo_compress, @@ -836,20 +835,15 @@ void __init btrfs_init_compress(void) struct list_head *workspace; int i; - INIT_LIST_HEAD(&btrfs_heuristic_ws.idle_ws); - spin_lock_init(&btrfs_heuristic_ws.ws_lock); - atomic_set(&btrfs_heuristic_ws.total_ws, 0); - init_waitqueue_head(&btrfs_heuristic_ws.ws_wait); + /* +* Try preallocate pool with minimum size for successful +* initialization of btrfs module +*/ + btrfs_heuristic_ws_pool = mempool_create(1, heuristic_ws_alloc, + heuristic_ws_free, NULL); - workspace = alloc_heuristic_ws(); - if (IS_ERR(workspace)) { - pr_warn( - "BTRFS: cannot preallocate heuristic workspace, will try later\n"); - } else { - atomic_set(&btrfs_heuristic_ws.total_ws, 1); - btrfs_heuristic_ws.free_ws = 1; - list_add(workspace, &btrfs_heuristic_ws.idle_ws); - } + if (IS_ERR(btrfs_heuristic_ws_pool)) + pr_warn("BTRFS: cannot preallocate heuristic workspace, will try later\n"); for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) { INIT_LIST_HEAD(&btrfs_comp_ws[i].idle_ws); @@ -878,7 +872,7 @@ void __init btrfs_init_compress(void) * Preallocation makes a forward progress guarantees and we do not return * errors. 
*/ -static struct list_head *__find_workspace(int type, bool heuristic) +static struct list_head *find_workspace(int type) { struct list_head *workspace; int cpus = num_online_cpus(); @@ -890,19 +884,11 @@ static struct list_head *__find_workspace(int type, bool heuristic) wait_queue_head_t *ws_wait; int *free_ws; - if (heuristic) { - idle_ws = &btrfs_heuristic_ws.idle_ws; - ws_lock = &btrfs_heuristic_ws.ws_lock; - total_ws = &btrfs_heuristic_ws.total_ws; -
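The conversion above leans on the mempool contract: mempool_create(min_nr, alloc_fn, free_fn, pool_data) preallocates min_nr elements through caller-supplied callbacks (here the analogues of heuristic_ws_alloc()/heuristic_ws_free()), and later allocations fall back to those spares when the allocator fails. Below is a minimal userspace sketch of that contract; all names (tiny_pool_*, ws_alloc) are illustrative, not the kernel API:

```c
#include <stdlib.h>

typedef void *(*pool_alloc_fn)(void *pool_data);
typedef void (*pool_free_fn)(void *element, void *pool_data);

struct tiny_pool {
	void **spares;          /* preallocated elements */
	int min_nr, curr_nr;
	pool_alloc_fn alloc;
	pool_free_fn free_elem;
	void *pool_data;
};

struct tiny_pool *tiny_pool_create(int min_nr, pool_alloc_fn a,
				   pool_free_fn f, void *data)
{
	struct tiny_pool *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->spares = calloc(min_nr, sizeof(void *));
	p->min_nr = min_nr;
	p->curr_nr = 0;
	p->alloc = a;
	p->free_elem = f;
	p->pool_data = data;
	for (int i = 0; i < min_nr; i++) {
		void *e = a(data);

		if (!e) {	/* like mempool_create(): fail outright */
			while (p->curr_nr)
				f(p->spares[--p->curr_nr], data);
			free(p->spares);
			free(p);
			return NULL;
		}
		p->spares[p->curr_nr++] = e;
	}
	return p;
}

void *tiny_pool_alloc(struct tiny_pool *p)
{
	void *e = p->alloc(p->pool_data);

	if (e)
		return e;
	/* allocator failed: hand out a preallocated spare */
	return p->curr_nr ? p->spares[--p->curr_nr] : NULL;
}

void tiny_pool_free(struct tiny_pool *p, void *e)
{
	if (p->curr_nr < p->min_nr)	/* top the spares back up first */
		p->spares[p->curr_nr++] = e;
	else
		p->free_elem(e, p->pool_data);
}

/* Stand-ins for heuristic_ws_alloc()/heuristic_ws_free() */
void *ws_alloc(void *pool_data)
{
	(void)pool_data;
	return malloc(64);
}

void ws_free(void *element, void *pool_data)
{
	(void)pool_data;
	free(element);
}
```

This is also why the patch changes heuristic_ws_alloc() to return NULL on failure rather than ERR_PTR(-ENOMEM): a mempool alloc callback signals failure only through a NULL return.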
[PATCH v4 00/19] fs: rework and optimize i_version handling in filesystems
From: Jeff Layton v4: - fix SB_LAZYTIME handling in generic_update_time - add memory barriers to patch to convert i_version field to atomic64_t v3: - move i_version handling functions to new header file - document that the kernel-managed i_version implementation will appear to increase over time - fix inode_cmp_iversion to handle wraparound correctly v2: - xfs should use inode_peek_iversion instead of inode_peek_iversion_raw - rework file_update_time patch - don't dirty inode when only S_ATIME is set and SB_LAZYTIME is enabled - better comments and documentation I think this is now approaching merge readiness. Special thanks to Jan Kara and Dave Chinner who helped me tighten up the memory barriers in the final patch. tl;dr: I think we can greatly reduce the cost of the inode->i_version counter, by exploiting the fact that we don't need to increment it if no one is looking at it. We can also clean up the code to prepare to eventually expose this value via statx(). Note that this set relies on a few patches that are in other trees. The full stack that I've been testing with is here: https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/log/?h=iversion The inode->i_version field is supposed to be a value that changes whenever there is any data or metadata change to the inode. Some filesystems use it internally to detect directory changes during readdir. knfsd will use it if the filesystem has MS_I_VERSION set. IMA will also use it to optimize away some remeasurement if it's available. NFS and AFS just use it to store an opaque change attribute from the server. Only btrfs, ext4, and xfs increment it for data changes. Because of this, these filesystems must log the inode to disk whenever the i_version counter changes. That has a non-zero performance impact, especially on write-heavy workloads, because we end up dirtying the inode metadata on every write, not just when the times change. 
It turns out, though, that none of these users of i_version require that it change on every change to the file. The only real requirement is that it be different if something changed since the last time we queried for it. If we keep track of when something queries the value, we can avoid bumping the counter (and the associated on-disk update) when nothing has queried it since it was last incremented. This patchset changes the code to only bump the i_version counter when it's strictly necessary, or when we're updating the inode metadata anyway (e.g. when the times change). It takes the approach of converting the existing accessors of i_version to a new API, while leaving the underlying implementation mostly the same. The last patch then converts the existing implementation to keep track of whether the value has been queried since it was last incremented, and uses that to avoid incrementing the counter when it can. With this, we reduce inode metadata updates across all 3 filesystems down to roughly the frequency of the timestamp granularity, particularly when the counter is not being queried (the vastly common case). I can see measurable performance gains on xfs and ext4 with iversion enabled when streaming small (4k) I/Os. btrfs shows some slight gain in testing, but not quite the magnitude that xfs and ext4 show. I'm not sure why yet and would appreciate some input from btrfs folks. My goal is to get this into linux-next fairly soon. If it shows no problems then we can look at merging it for 4.16, or 4.17 if all of the prerequisite patches are not yet merged.
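The "only bump when someone looked" scheme can be sketched with a single atomic word: keep the counter shifted left by one bit and use the low bit as a "queried since last increment" flag. This mirrors the approach the final patch of the series takes; the userspace names below are illustrative, not the kernel helpers.

```c
#include <stdatomic.h>
#include <stdint.h>

#define I_VERSION_QUERIED 1ULL

static _Atomic uint64_t i_version;	/* stands in for inode->i_version */

/* Called on writes: increment only if a reader saw the current value.
 * Returns 1 if the counter was bumped (inode must be logged), 0 if the
 * bump -- and the on-disk update -- could be skipped. */
static int maybe_inc_iversion(int force)
{
	uint64_t cur = atomic_load(&i_version);

	for (;;) {
		uint64_t new;

		if (!force && !(cur & I_VERSION_QUERIED))
			return 0;	/* nobody queried it: skip the bump */
		/* add 1 to the shifted counter, clear the queried bit */
		new = (cur + 2) & ~I_VERSION_QUERIED;
		if (atomic_compare_exchange_weak(&i_version, &cur, new))
			return 1;
		/* cur was reloaded by the failed CAS; retry */
	}
}

/* Called by observers (e.g. knfsd): fetch the value and mark it
 * queried so the next write is forced to increment. */
static uint64_t query_iversion(void)
{
	uint64_t cur = atomic_fetch_or(&i_version, I_VERSION_QUERIED);

	return cur >> 1;
}
```

With this layout a write-heavy workload that nobody is watching never dirties the inode for the counter alone, yet any observer is guaranteed to see a new value after the next change.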
Jeff Layton (19): fs: new API for handling inode->i_version fs: don't take the i_lock in inode_inc_iversion fat: convert to new i_version API affs: convert to new i_version API afs: convert to new i_version API btrfs: convert to new i_version API exofs: switch to new i_version API ext2: convert to new i_version API ext4: convert to new i_version API nfs: convert to new i_version API nfsd: convert to new i_version API ocfs2: convert to new i_version API ufs: use new i_version API xfs: convert to new i_version API IMA: switch IMA over to new i_version API fs: only set S_VERSION when updating times if necessary xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing btrfs: only dirty the inode in btrfs_update_time if something was changed fs: handle inode->i_version more efficiently fs/affs/amigaffs.c| 5 +- fs/affs/dir.c | 5 +- fs/affs/super.c | 3 +- fs/afs/fsclient.c | 3 +- fs/afs/inode.c| 5 +- fs/btrfs/delayed-inode.c | 7 +- fs/btrfs/file.c | 1 + fs/btrfs/inode.c | 12 +- fs/btrfs/ioctl.c | 1 + fs/btrfs/tree-log.c | 4 +- fs/btrfs/xattr.c | 1 + fs/exofs/dir.c| 9 +- fs/exofs/super.c | 3 +- fs/ext2/dir.c
[PATCH v4 01/19] fs: new API for handling inode->i_version
From: Jeff Layton Add a documentation blob that explains what the i_version field is, how it is expected to work, and how it is currently implemented by various filesystems. We already have inode_inc_iversion. Add several other functions for manipulating and accessing the i_version counter. For now, the implementation is trivial and basically works the way that all of the open-coded i_version accesses work today. Future patches will convert existing users of i_version to use the new API, and then convert the backend implementation to do things more efficiently. Signed-off-by: Jeff Layton --- fs/btrfs/file.c | 1 + fs/btrfs/inode.c | 1 + fs/btrfs/ioctl.c | 1 + fs/btrfs/xattr.c | 1 + fs/ext4/inode.c | 1 + fs/ext4/namei.c | 1 + fs/inode.c | 1 + include/linux/fs.h | 15 include/linux/iversion.h | 205 +++ 9 files changed, 212 insertions(+), 15 deletions(-) create mode 100644 include/linux/iversion.h diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index eb1bac7c8553..c95d7b2efefb 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -31,6 +31,7 @@ #include #include #include +#include #include "ctree.h" #include "disk-io.h" #include "transaction.h" diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e1a7f3cb5be9..27f008b33fc1 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -43,6 +43,7 @@ #include #include #include +#include #include "ctree.h" #include "disk-io.h" #include "transaction.h" diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 2ef8acaac688..aa452c9e2eff 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -43,6 +43,7 @@ #include #include #include +#include #include "ctree.h" #include "disk-io.h" #include "transaction.h" diff --git a/fs/btrfs/xattr.c b/fs/btrfs/xattr.c index 2c7e53f9ff1b..5258c1714830 100644 --- a/fs/btrfs/xattr.c +++ b/fs/btrfs/xattr.c @@ -23,6 +23,7 @@ #include #include #include +#include #include "ctree.h" #include "btrfs_inode.h" #include "transaction.h" diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 
7df2c5644e59..fa5d8bc52d2d 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -39,6 +39,7 @@ #include #include #include +#include #include "ext4_jbd2.h" #include "xattr.h" diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 798b3ac680db..bcf0dff517be 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -34,6 +34,7 @@ #include #include #include +#include #include "ext4.h" #include "ext4_jbd2.h" diff --git a/fs/inode.c b/fs/inode.c index 03102d6ef044..19e72f500f71 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -18,6 +18,7 @@ #include /* for inode_has_buffers */ #include #include +#include #include #include "internal.h" diff --git a/include/linux/fs.h b/include/linux/fs.h index 511fbaabf624..76382c24e9d0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2036,21 +2036,6 @@ static inline void inode_dec_link_count(struct inode *inode) mark_inode_dirty(inode); } -/** - * inode_inc_iversion - increments i_version - * @inode: inode that need to be updated - * - * Every time the inode is modified, the i_version field will be incremented. - * The filesystem has to be mounted with i_version flag - */ - -static inline void inode_inc_iversion(struct inode *inode) -{ - spin_lock(&inode->i_lock); - inode->i_version++; - spin_unlock(&inode->i_lock); -} - enum file_time_flags { S_ATIME = 1, S_MTIME = 2, diff --git a/include/linux/iversion.h b/include/linux/iversion.h new file mode 100644 index ..bb50d27c71f9 --- /dev/null +++ b/include/linux/iversion.h @@ -0,0 +1,205 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_IVERSION_H +#define _LINUX_IVERSION_H + +#include + +/* + * The change attribute (i_version) is mandated by NFSv4 and is mostly for + * knfsd, but is also used for other purposes (e.g. IMA). The i_version must + * appear different to observers if there was a change to the inode's data or + * metadata since it was last queried. + * + * It should be considered an opaque value by observers. 
If it remains the same + * since it was last checked, then nothing has changed in the inode. If it's + * different then something has changed. Observers cannot infer anything about + * the nature or magnitude of the changes from the value, only that the inode + * has changed in some fashion. + * + * Not all filesystems properly implement the i_version counter. Subsystems that + * want to use i_version field on an inode should first check whether the + * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro). + * + * Those that set SB_I_VERSION will automatically have their i_version counter + * incremented on writes to normal files. If the SB_I_VERSION is not set, then + * the VFS will not touch it on writes, and the filesystem can use it how it + * wishes. Note th
[PATCH v4 03/19] fat: convert to new i_version API
From: Jeff Layton Signed-off-by: Jeff Layton --- fs/fat/dir.c | 3 ++- fs/fat/inode.c | 9 + fs/fat/namei_msdos.c | 7 --- fs/fat/namei_vfat.c | 22 +++--- 4 files changed, 22 insertions(+), 19 deletions(-) diff --git a/fs/fat/dir.c b/fs/fat/dir.c index b833ffeee1e1..8e100c3bf72c 100644 --- a/fs/fat/dir.c +++ b/fs/fat/dir.c @@ -16,6 +16,7 @@ #include #include #include +#include #include "fat.h" /* @@ -1055,7 +1056,7 @@ int fat_remove_entries(struct inode *dir, struct fat_slot_info *sinfo) brelse(bh); if (err) return err; - dir->i_version++; + inode_inc_iversion(dir); if (nr_slots) { /* diff --git a/fs/fat/inode.c b/fs/fat/inode.c index 20a0a89eaca5..ffbbf0520d9e 100644 --- a/fs/fat/inode.c +++ b/fs/fat/inode.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "fat.h" #ifndef CONFIG_FAT_DEFAULT_IOCHARSET @@ -507,7 +508,7 @@ int fat_fill_inode(struct inode *inode, struct msdos_dir_entry *de) MSDOS_I(inode)->i_pos = 0; inode->i_uid = sbi->options.fs_uid; inode->i_gid = sbi->options.fs_gid; - inode->i_version++; + inode_inc_iversion(inode); inode->i_generation = get_seconds(); if ((de->attr & ATTR_DIR) && !IS_FREE(de->name)) { @@ -590,7 +591,7 @@ struct inode *fat_build_inode(struct super_block *sb, goto out; } inode->i_ino = iunique(sb, MSDOS_ROOT_INO); - inode->i_version = 1; + inode_set_iversion(inode, 1); err = fat_fill_inode(inode, de); if (err) { iput(inode); @@ -1377,7 +1378,7 @@ static int fat_read_root(struct inode *inode) MSDOS_I(inode)->i_pos = MSDOS_ROOT_INO; inode->i_uid = sbi->options.fs_uid; inode->i_gid = sbi->options.fs_gid; - inode->i_version++; + inode_inc_iversion(inode); inode->i_generation = 0; inode->i_mode = fat_make_mode(sbi, ATTR_DIR, S_IRWXUGO); inode->i_op = sbi->dir_ops; @@ -1828,7 +1829,7 @@ int fat_fill_super(struct super_block *sb, void *data, int silent, int isvfat, if (!root_inode) goto out_fail; root_inode->i_ino = MSDOS_ROOT_INO; - root_inode->i_version = 1; + inode_set_iversion(root_inode, 1); error = 
fat_read_root(root_inode); if (error < 0) { iput(root_inode); diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c index d24d2758a363..582ca731a6c9 100644 --- a/fs/fat/namei_msdos.c +++ b/fs/fat/namei_msdos.c @@ -7,6 +7,7 @@ */ #include +#include #include "fat.h" /* Characters that are undesirable in an MS-DOS file name */ @@ -480,7 +481,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name, } else mark_inode_dirty(old_inode); - old_dir->i_version++; + inode_inc_iversion(old_dir); old_dir->i_ctime = old_dir->i_mtime = current_time(old_dir); if (IS_DIRSYNC(old_dir)) (void)fat_sync_inode(old_dir); @@ -508,7 +509,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name, goto out; new_i_pos = sinfo.i_pos; } - new_dir->i_version++; + inode_inc_iversion(new_dir); fat_detach(old_inode); fat_attach(old_inode, new_i_pos); @@ -540,7 +541,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name, old_sinfo.bh = NULL; if (err) goto error_dotdot; - old_dir->i_version++; + inode_inc_iversion(old_dir); old_dir->i_ctime = old_dir->i_mtime = ts; if (IS_DIRSYNC(old_dir)) (void)fat_sync_inode(old_dir); diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c index 02c03a3a..cefea792cde8 100644 --- a/fs/fat/namei_vfat.c +++ b/fs/fat/namei_vfat.c @@ -20,7 +20,7 @@ #include #include #include - +#include #include "fat.h" static inline unsigned long vfat_d_version(struct dentry *dentry) @@ -46,7 +46,7 @@ static int vfat_revalidate_shortname(struct dentry *dentry) { int ret = 1; spin_lock(&dentry->d_lock); - if (vfat_d_version(dentry) != d_inode(dentry->d_parent)->i_version) + if (inode_cmp_iversion(d_inode(dentry->d_parent), vfat_d_version(dentry))) ret = 0; spin_unlock(&dentry->d_lock); return ret; @@ -759,7 +759,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct dentry *dentry, out: mutex_unlock(&MSDOS_SB(sb)->s_lock); if (!inode) - vfat_d_version_set(dentry, dir->i_version); + vfat_d_version_set(dentry, 
inode_query_iversion(dir)); return d_splice_alias(inode, dentry); error: mutex_unlock(&MSDOS_SB(sb)->s_lock); @@ -
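The vfat hunks above cache the parent directory's change counter in a negative dentry at lookup time (vfat_d_version_set) and later declare the dentry stale once the directory's counter moves (vfat_revalidate_shortname). A stripped-down sketch of that pattern; plain structs stand in for the kernel's inode/dentry and all names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

struct dir_inode { uint64_t i_version; };
struct neg_dentry { uint64_t d_version; };

static uint64_t query_iversion(struct dir_inode *dir)
{
	return dir->i_version;	/* kernel: inode_query_iversion() */
}

/* vfat_lookup(): stamp the dentry when the name was not found */
static void stamp_negative(struct neg_dentry *d, struct dir_inode *dir)
{
	d->d_version = query_iversion(dir);
}

/* vfat_revalidate_shortname(): the cached "no such name" answer is
 * still valid iff the directory has not changed since the stamp */
static bool negative_still_valid(struct neg_dentry *d, struct dir_inode *dir)
{
	return d->d_version == dir->i_version; /* kernel: !inode_cmp_iversion() */
}
```

Note that the real lookup path must use the querying accessor, not a raw peek: marking the counter as queried is what forces a later create or rename in the directory to actually bump it.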
[PATCH v4 05/19] afs: convert to new i_version API
From: Jeff Layton For AFS, it's generally treated as an opaque value, so we use the *_raw variants of the API here. Note that AFS has quite a different definition for this counter. AFS only increments it on changes to the data, not for the metadata. We'll need to reconcile that somehow if we ever want to present this to userspace via statx. Signed-off-by: Jeff Layton --- fs/afs/fsclient.c | 3 ++- fs/afs/inode.c| 5 +++-- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c index b90ef39ae914..88ec38c2d83c 100644 --- a/fs/afs/fsclient.c +++ b/fs/afs/fsclient.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "internal.h" #include "afs_fs.h" @@ -124,7 +125,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp, vnode->vfs_inode.i_ctime.tv_sec = status->mtime_client; vnode->vfs_inode.i_mtime= vnode->vfs_inode.i_ctime; vnode->vfs_inode.i_atime= vnode->vfs_inode.i_ctime; - vnode->vfs_inode.i_version = data_version; + inode_set_iversion_raw(&vnode->vfs_inode, data_version); } expected_version = status->data_version; diff --git a/fs/afs/inode.c b/fs/afs/inode.c index 3415eb7484f6..dcd2e08d6cdb 100644 --- a/fs/afs/inode.c +++ b/fs/afs/inode.c @@ -21,6 +21,7 @@ #include #include #include +#include #include "internal.h" static const struct inode_operations afs_symlink_inode_operations = { @@ -89,7 +90,7 @@ static int afs_inode_map_status(struct afs_vnode *vnode, struct key *key) inode->i_atime = inode->i_mtime = inode->i_ctime; inode->i_blocks = 0; inode->i_generation = vnode->fid.unique; - inode->i_version= vnode->status.data_version; + inode_set_iversion_raw(inode, vnode->status.data_version); inode->i_mapping->a_ops = &afs_fs_aops; read_sequnlock_excl(&vnode->cb_lock); @@ -218,7 +219,7 @@ struct inode *afs_iget_autocell(struct inode *dir, const char *dev_name, inode->i_ctime.tv_nsec = 0; inode->i_atime = inode->i_mtime = inode->i_ctime; inode->i_blocks = 0; - inode->i_version= 0; + 
inode_set_iversion_raw(inode, 0); inode->i_generation = 0; set_bit(AFS_VNODE_PSEUDODIR, &vnode->flags); -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
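The reason the *_raw accessors exist: once the kernel-managed implementation reserves the low bit of i_version for its own bookkeeping (as the last patch in this series does), a filesystem-managed value such as AFS's server-supplied data_version must bypass that encoding and round-trip verbatim. A userspace sketch under that assumption, with illustrative names:

```c
#include <stdint.h>

struct fake_inode { uint64_t i_version; };

/* kernel-managed: value is stored shifted, low bit reserved */
static void set_iversion(struct fake_inode *i, uint64_t v)
{
	i->i_version = v << 1;		/* cf. inode_set_iversion() */
}

static uint64_t peek_iversion(const struct fake_inode *i)
{
	return i->i_version >> 1;	/* cf. inode_peek_iversion() */
}

/* raw: store and fetch exactly what the server handed us */
static void set_iversion_raw(struct fake_inode *i, uint64_t v)
{
	i->i_version = v;		/* cf. inode_set_iversion_raw() */
}

static uint64_t peek_iversion_raw(const struct fake_inode *i)
{
	return i->i_version;		/* cf. inode_peek_iversion_raw() */
}
```

Mixing the two would silently halve or double an opaque change attribute, which is why the AFS and NFS conversions in this series use the raw variants throughout.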
[PATCH v4 06/19] btrfs: convert to new i_version API
From: Jeff Layton Signed-off-by: Jeff Layton --- fs/btrfs/delayed-inode.c | 7 +-- fs/btrfs/inode.c | 6 -- fs/btrfs/tree-log.c | 4 +++- 3 files changed, 12 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index 5d73f79ded8b..6a246ae2bcb2 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -18,6 +18,7 @@ */ #include +#include #include "delayed-inode.h" #include "disk-io.h" #include "transaction.h" @@ -1700,7 +1701,8 @@ static void fill_stack_inode_item(struct btrfs_trans_handle *trans, btrfs_set_stack_inode_nbytes(inode_item, inode_get_bytes(inode)); btrfs_set_stack_inode_generation(inode_item, BTRFS_I(inode)->generation); - btrfs_set_stack_inode_sequence(inode_item, inode->i_version); + btrfs_set_stack_inode_sequence(inode_item, + inode_peek_iversion(inode)); btrfs_set_stack_inode_transid(inode_item, trans->transid); btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev); btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags); @@ -1754,7 +1756,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev) BTRFS_I(inode)->generation = btrfs_stack_inode_generation(inode_item); BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item); - inode->i_version = btrfs_stack_inode_sequence(inode_item); + inode_set_iversion_queried(inode, + btrfs_stack_inode_sequence(inode_item)); inode->i_rdev = 0; *rdev = btrfs_stack_inode_rdev(inode_item); BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 27f008b33fc1..ac8692849a81 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3778,7 +3778,8 @@ static int btrfs_read_locked_inode(struct inode *inode) BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, inode_item); BTRFS_I(inode)->last_trans = btrfs_inode_transid(leaf, inode_item); - inode->i_version = btrfs_inode_sequence(leaf, inode_item); + inode_set_iversion_queried(inode, + btrfs_inode_sequence(leaf, inode_item)); 
inode->i_generation = BTRFS_I(inode)->generation; inode->i_rdev = 0; rdev = btrfs_inode_rdev(leaf, inode_item); @@ -3946,7 +3947,8 @@ static void fill_inode_item(struct btrfs_trans_handle *trans, &token); btrfs_set_token_inode_generation(leaf, item, BTRFS_I(inode)->generation, &token); - btrfs_set_token_inode_sequence(leaf, item, inode->i_version, &token); + btrfs_set_token_inode_sequence(leaf, item, inode_peek_iversion(inode), + &token); btrfs_set_token_inode_transid(leaf, item, trans->transid, &token); btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, &token); btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, &token); diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 7bf9b31561db..1b7d92075c1f 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "tree-log.h" #include "disk-io.h" #include "locking.h" @@ -3609,7 +3610,8 @@ static void fill_inode_item(struct btrfs_trans_handle *trans, btrfs_set_token_inode_nbytes(leaf, item, inode_get_bytes(inode), &token); - btrfs_set_token_inode_sequence(leaf, item, inode->i_version, &token); + btrfs_set_token_inode_sequence(leaf, item, + inode_peek_iversion(inode), &token); btrfs_set_token_inode_transid(leaf, item, trans->transid, &token); btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, &token); btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, &token); -- 2.14.3
[PATCH v4 07/19] exofs: switch to new i_version API
From: Jeff Layton Signed-off-by: Jeff Layton --- fs/exofs/dir.c | 9 + fs/exofs/super.c | 3 ++- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c index 98233a97b7b8..c5a53fcc43ea 100644 --- a/fs/exofs/dir.c +++ b/fs/exofs/dir.c @@ -31,6 +31,7 @@ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */ +#include #include "exofs.h" static inline unsigned exofs_chunk_size(struct inode *inode) @@ -60,7 +61,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len) struct inode *dir = mapping->host; int err = 0; - dir->i_version++; + inode_inc_iversion(dir); if (!PageUptodate(page)) SetPageUptodate(page); @@ -241,7 +242,7 @@ exofs_readdir(struct file *file, struct dir_context *ctx) unsigned long n = pos >> PAGE_SHIFT; unsigned long npages = dir_pages(inode); unsigned chunk_mask = ~(exofs_chunk_size(inode)-1); - int need_revalidate = (file->f_version != inode->i_version); + bool need_revalidate = inode_cmp_iversion(inode, file->f_version); if (pos > inode->i_size - EXOFS_DIR_REC_LEN(1)) return 0; @@ -264,8 +265,8 @@ exofs_readdir(struct file *file, struct dir_context *ctx) chunk_mask); ctx->pos = (n<<PAGE_SHIFT) + offset; - file->f_version = inode->i_version; - need_revalidate = 0; + file->f_version = inode_query_iversion(inode); + need_revalidate = false; } de = (struct exofs_dir_entry *)(kaddr + offset); limit = kaddr + exofs_last_byte(inode, n) - diff --git a/fs/exofs/super.c b/fs/exofs/super.c index 819624cfc8da..7e244093c0e5 100644 --- a/fs/exofs/super.c +++ b/fs/exofs/super.c @@ -38,6 +38,7 @@ #include #include #include +#include #include "exofs.h" @@ -159,7 +160,7 @@ static struct inode *exofs_alloc_inode(struct super_block *sb) if (!oi) return NULL; - oi->vfs_inode.i_version = 1; + inode_set_iversion(&oi->vfs_inode, 1); return &oi->vfs_inode; } -- 2.14.3
[PATCH v4 08/19] ext2: convert to new i_version API
From: Jeff Layton Signed-off-by: Jeff Layton Reviewed-by: Jan Kara --- fs/ext2/dir.c | 9 + fs/ext2/super.c | 5 +++-- 2 files changed, 8 insertions(+), 6 deletions(-) diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c index 987647986f47..4111085a129f 100644 --- a/fs/ext2/dir.c +++ b/fs/ext2/dir.c @@ -26,6 +26,7 @@ #include #include #include +#include typedef struct ext2_dir_entry_2 ext2_dirent; @@ -92,7 +93,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len) struct inode *dir = mapping->host; int err = 0; - dir->i_version++; + inode_inc_iversion(dir); block_write_end(NULL, mapping, pos, len, len, page, NULL); if (pos+len > dir->i_size) { @@ -293,7 +294,7 @@ ext2_readdir(struct file *file, struct dir_context *ctx) unsigned long npages = dir_pages(inode); unsigned chunk_mask = ~(ext2_chunk_size(inode)-1); - int need_revalidate = file->f_version != inode->i_version; + bool need_revalidate = inode_cmp_iversion(inode, file->f_version); if (pos > inode->i_size - EXT2_DIR_REC_LEN(1)) return 0; @@ -319,8 +320,8 @@ ext2_readdir(struct file *file, struct dir_context *ctx) offset = ext2_validate_entry(kaddr, offset, chunk_mask); ctx->pos = (n<<PAGE_SHIFT) + offset; - file->f_version = inode->i_version; - need_revalidate = 0; + file->f_version = inode_query_iversion(inode); + need_revalidate = false; } de = (ext2_dirent *)(kaddr+offset); limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1); diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 7646818ab266..554c98b8a93a 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -33,6 +33,7 @@ #include #include #include +#include #include "ext2.h" #include "xattr.h" #include "acl.h" @@ -184,7 +185,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb) if (!ei) return NULL; ei->i_block_alloc_info = NULL; - ei->vfs_inode.i_version = 1; + inode_set_iversion(&ei->vfs_inode, 1); #ifdef CONFIG_QUOTA memset(&ei->i_dquot, 0, sizeof(ei->i_dquot)); #endif @@ -1569,7 +1570,7 @@ static ssize_t ext2_quota_write(struct
super_block *sb, int type, return err; if (inode->i_size < off+len-towrite) i_size_write(inode, off+len-towrite); - inode->i_version++; + inode_inc_iversion(inode); inode->i_mtime = inode->i_ctime = current_time(inode); mark_inode_dirty(inode); return len - towrite; -- 2.14.3
[PATCH v4 09/19] ext4: convert to new i_version API
From: Jeff Layton Signed-off-by: Jeff Layton Acked-by: Theodore Ts'o --- fs/ext4/dir.c| 9 + fs/ext4/inline.c | 7 --- fs/ext4/inode.c | 12 fs/ext4/ioctl.c | 3 ++- fs/ext4/namei.c | 4 ++-- fs/ext4/super.c | 3 ++- fs/ext4/xattr.c | 5 +++-- 7 files changed, 26 insertions(+), 17 deletions(-) diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c index d5babc9f222b..afda0a0499ce 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -25,6 +25,7 @@ #include #include #include +#include #include "ext4.h" #include "xattr.h" @@ -208,7 +209,7 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx) * readdir(2), then we might be pointing to an invalid * dirent right now. Scan from the start of the block * to make sure. */ - if (file->f_version != inode->i_version) { + if (inode_cmp_iversion(inode, file->f_version)) { for (i = 0; i < sb->s_blocksize && i < offset; ) { de = (struct ext4_dir_entry_2 *) (bh->b_data + i); @@ -227,7 +228,7 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx) offset = i; ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1)) | offset; - file->f_version = inode->i_version; + file->f_version = inode_query_iversion(inode); } while (ctx->pos < inode->i_size @@ -568,10 +569,10 @@ static int ext4_dx_readdir(struct file *file, struct dir_context *ctx) * cached entries. 
*/ if ((!info->curr_node) || - (file->f_version != inode->i_version)) { + inode_cmp_iversion(inode, file->f_version)) { info->curr_node = NULL; free_rb_tree_fname(&info->root); - file->f_version = inode->i_version; + file->f_version = inode_query_iversion(inode); ret = ext4_htree_fill_tree(file, info->curr_hash, info->curr_minor_hash, &info->next_hash); diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c index 1367553c43bb..a8b987b71173 100644 --- a/fs/ext4/inline.c +++ b/fs/ext4/inline.c @@ -14,6 +14,7 @@ #include #include +#include #include "ext4_jbd2.h" #include "ext4.h" @@ -1042,7 +1043,7 @@ static int ext4_add_dirent_to_inline(handle_t *handle, */ dir->i_mtime = dir->i_ctime = current_time(dir); ext4_update_dx_flag(dir); - dir->i_version++; + inode_inc_iversion(dir); return 1; } @@ -1494,7 +1495,7 @@ int ext4_read_inline_dir(struct file *file, * dirent right now. Scan from the start of the inline * dir to make sure. */ - if (file->f_version != inode->i_version) { + if (inode_cmp_iversion(inode, file->f_version)) { for (i = 0; i < extra_size && i < offset;) { /* * "." 
is with offset 0 and @@ -1526,7 +1527,7 @@ int ext4_read_inline_dir(struct file *file, } offset = i; ctx->pos = offset; - file->f_version = inode->i_version; + file->f_version = inode_query_iversion(inode); } while (ctx->pos < extra_size) { diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index fa5d8bc52d2d..1b0d54b372f2 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4874,12 +4874,14 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode); if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) { - inode->i_version = le32_to_cpu(raw_inode->i_disk_version); + u64 ivers = le32_to_cpu(raw_inode->i_disk_version); + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) { if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) - inode->i_version |= + ivers |= (__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32; } + inode_set_iversion_queried(inode, ivers); } ret = 0; @@ -5165,11 +5167,13 @@ static int ext4_do_update_inode(handle_t *handle, } if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) { - raw_inode->i_disk_version = cpu_to_le32(inode->i_version); + u64 ivers = inode_peek_iversion(inode); + + raw_inode->i_disk_version = cpu_to_le32(ivers); if (ei->i_extra_isize) { if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
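The ext4 hunks above split the 64-bit value across two 32-bit on-disk fields: i_disk_version carries the low word and, when the large inode has room for i_version_hi, that field carries the high word. Ignoring the little-endian conversions, the split and join look like this (helper names are illustrative):

```c
#include <stdint.h>

/* Stand-in for the two raw-inode fields ext4 uses on disk */
struct ondisk { uint32_t i_disk_version; uint32_t i_version_hi; };

static void split_version(uint64_t ivers, struct ondisk *raw)
{
	raw->i_disk_version = (uint32_t)ivers;		/* low 32 bits */
	raw->i_version_hi   = (uint32_t)(ivers >> 32);	/* high 32 bits */
}

static uint64_t join_version(const struct ondisk *raw)
{
	return (uint64_t)raw->i_disk_version |
	       ((uint64_t)raw->i_version_hi << 32);
}
```

On small inodes without room for i_version_hi, only the low word survives a round trip, which is why the read path in the hunk above OR-s in the high word only when EXT4_FITS_IN_INODE() says it exists.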
[PATCH v4 04/19] affs: convert to new i_version API
From: Jeff Layton Signed-off-by: Jeff Layton --- fs/affs/amigaffs.c | 5 +++-- fs/affs/dir.c | 5 +++-- fs/affs/super.c| 3 ++- 3 files changed, 8 insertions(+), 5 deletions(-) diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c index 0f0e6925e97d..14a6c1b90c9f 100644 --- a/fs/affs/amigaffs.c +++ b/fs/affs/amigaffs.c @@ -10,6 +10,7 @@ */ #include +#include #include "affs.h" /* @@ -60,7 +61,7 @@ affs_insert_hash(struct inode *dir, struct buffer_head *bh) affs_brelse(dir_bh); dir->i_mtime = dir->i_ctime = current_time(dir); - dir->i_version++; + inode_inc_iversion(dir); mark_inode_dirty(dir); return 0; @@ -114,7 +115,7 @@ affs_remove_hash(struct inode *dir, struct buffer_head *rem_bh) affs_brelse(bh); dir->i_mtime = dir->i_ctime = current_time(dir); - dir->i_version++; + inode_inc_iversion(dir); mark_inode_dirty(dir); return retval; diff --git a/fs/affs/dir.c b/fs/affs/dir.c index a105e77df2c1..d180b46453cf 100644 --- a/fs/affs/dir.c +++ b/fs/affs/dir.c @@ -14,6 +14,7 @@ * */ +#include #include "affs.h" static int affs_readdir(struct file *, struct dir_context *); @@ -80,7 +81,7 @@ affs_readdir(struct file *file, struct dir_context *ctx) * we can jump directly to where we left off. 
*/ ino = (u32)(long)file->private_data; - if (ino && file->f_version == inode->i_version) { + if (ino && inode_cmp_iversion(inode, file->f_version) == 0) { pr_debug("readdir() left off=%d\n", ino); goto inside; } @@ -130,7 +131,7 @@ affs_readdir(struct file *file, struct dir_context *ctx) } while (ino); } done: - file->f_version = inode->i_version; + file->f_version = inode_query_iversion(inode); file->private_data = (void *)(long)ino; affs_brelse(fh_bh); diff --git a/fs/affs/super.c b/fs/affs/super.c index 1117e36134cc..e602619aed9d 100644 --- a/fs/affs/super.c +++ b/fs/affs/super.c @@ -21,6 +21,7 @@ #include #include #include +#include #include "affs.h" static int affs_statfs(struct dentry *dentry, struct kstatfs *buf); @@ -102,7 +103,7 @@ static struct inode *affs_alloc_inode(struct super_block *sb) if (!i) return NULL; - i->vfs_inode.i_version = 1; + inode_set_iversion(&i->vfs_inode, 1); i->i_lc = NULL; i->i_ext_bh = NULL; i->i_pa_cnt = 0; -- 2.14.3
[PATCH v4 10/19] nfs: convert to new i_version API
From: Jeff Layton For NFS, we just use the "raw" API since the i_version is mostly managed by the server. The exception there is when the client holds a write delegation, but we only need to bump it once there anyway to handle CB_GETATTR. Signed-off-by: Jeff Layton --- fs/nfs/delegation.c| 3 ++- fs/nfs/fscache-index.c | 5 +++-- fs/nfs/inode.c | 18 +- fs/nfs/nfs4proc.c | 10 ++ fs/nfs/nfstrace.h | 5 +++-- fs/nfs/write.c | 8 +++- 6 files changed, 26 insertions(+), 23 deletions(-) diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c index ade44ca0c66c..d8b47624fee2 100644 --- a/fs/nfs/delegation.c +++ b/fs/nfs/delegation.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -347,7 +348,7 @@ int nfs_inode_set_delegation(struct inode *inode, struct rpc_cred *cred, struct nfs4_stateid_copy(&delegation->stateid, &res->delegation); delegation->type = res->delegation_type; delegation->pagemod_limit = res->pagemod_limit; - delegation->change_attr = inode->i_version; + delegation->change_attr = inode_peek_iversion_raw(inode); delegation->cred = get_rpccred(cred); delegation->inode = inode; delegation->flags = 1< #include #include +#include #include "internal.h" #include "fscache.h" @@ -211,7 +212,7 @@ static uint16_t nfs_fscache_inode_get_aux(const void *cookie_netfs_data, auxdata.ctime = nfsi->vfs_inode.i_ctime; if (NFS_SERVER(&nfsi->vfs_inode)->nfs_client->rpc_ops->version == 4) - auxdata.change_attr = nfsi->vfs_inode.i_version; + auxdata.change_attr = inode_peek_iversion_raw(&nfsi->vfs_inode); if (bufmax > sizeof(auxdata)) bufmax = sizeof(auxdata); @@ -243,7 +244,7 @@ enum fscache_checkaux nfs_fscache_inode_check_aux(void *cookie_netfs_data, auxdata.ctime = nfsi->vfs_inode.i_ctime; if (NFS_SERVER(&nfsi->vfs_inode)->nfs_client->rpc_ops->version == 4) - auxdata.change_attr = nfsi->vfs_inode.i_version; + auxdata.change_attr = inode_peek_iversion_raw(&nfsi->vfs_inode); if (memcmp(data, &auxdata, datalen) != 0) return FSCACHE_CHECKAUX_OBSOLETE; diff 
--git a/fs/nfs/inode.c b/fs/nfs/inode.c index b992d2382ffa..0b85cca1184b 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -38,8 +38,8 @@ #include #include #include - #include +#include #include "nfs4_fs.h" #include "callback.h" @@ -483,7 +483,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr, st memset(&inode->i_atime, 0, sizeof(inode->i_atime)); memset(&inode->i_mtime, 0, sizeof(inode->i_mtime)); memset(&inode->i_ctime, 0, sizeof(inode->i_ctime)); - inode->i_version = 0; + inode_set_iversion_raw(inode, 0); inode->i_size = 0; clear_nlink(inode); inode->i_uid = make_kuid(&init_user_ns, -2); @@ -508,7 +508,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr, st else if (nfs_server_capable(inode, NFS_CAP_CTIME)) nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR); if (fattr->valid & NFS_ATTR_FATTR_CHANGE) - inode->i_version = fattr->change_attr; + inode_set_iversion_raw(inode, fattr->change_attr); else nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR | NFS_INO_REVAL_PAGECACHE); @@ -1289,8 +1289,8 @@ static unsigned long nfs_wcc_update_inode(struct inode *inode, struct nfs_fattr if ((fattr->valid & NFS_ATTR_FATTR_PRECHANGE) && (fattr->valid & NFS_ATTR_FATTR_CHANGE) - && inode->i_version == fattr->pre_change_attr) { - inode->i_version = fattr->change_attr; + && !inode_cmp_iversion(inode, fattr->pre_change_attr)) { + inode_set_iversion_raw(inode, fattr->change_attr); if (S_ISDIR(inode->i_mode)) nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA); ret |= NFS_INO_INVALID_ATTR; @@ -1348,7 +1348,7 @@ static int nfs_check_inode_attributes(struct inode *inode, struct nfs_fattr *fat if (!nfs_file_has_buffered_writers(nfsi)) { /* Verify a few of the more important attributes */ - if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode->i_version != fattr->change_attr) + if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode_cmp_iversion(inode, fattr->change_attr)) invalid |= NFS_INO_INVALID_ATTR | 
NFS_INO_REVAL_PAGECACHE; if ((fattr->valid & NFS_ATTR_FATTR_MTIME) && !timespec_equal(&inode->i_mtime, &fattr->mtime)) @@ -1642,7 +1642,7 @@
[PATCH v4 13/19] ufs: use new i_version API
From: Jeff Layton Signed-off-by: Jeff Layton --- fs/ufs/dir.c | 9 + fs/ufs/inode.c | 3 ++- fs/ufs/super.c | 3 ++- 3 files changed, 9 insertions(+), 6 deletions(-) diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c index 2edc1755b7c5..50dfce000864 100644 --- a/fs/ufs/dir.c +++ b/fs/ufs/dir.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "ufs_fs.h" #include "ufs.h" @@ -47,7 +48,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, unsigned len) struct inode *dir = mapping->host; int err = 0; - dir->i_version++; + inode_inc_iversion(dir); block_write_end(NULL, mapping, pos, len, len, page, NULL); if (pos+len > dir->i_size) { i_size_write(dir, pos+len); @@ -428,7 +429,7 @@ ufs_readdir(struct file *file, struct dir_context *ctx) unsigned long n = pos >> PAGE_SHIFT; unsigned long npages = dir_pages(inode); unsigned chunk_mask = ~(UFS_SB(sb)->s_uspi->s_dirblksize - 1); - int need_revalidate = file->f_version != inode->i_version; + bool need_revalidate = inode_cmp_iversion(inode, file->f_version); unsigned flags = UFS_SB(sb)->s_flags; UFSD("BEGIN\n"); @@ -455,8 +456,8 @@ ufs_readdir(struct file *file, struct dir_context *ctx) offset = ufs_validate_entry(sb, kaddr, offset, chunk_mask); ctx->pos = (ni_version; - need_revalidate = 0; + file->f_version = inode_query_iversion(inode); + need_revalidate = false; } de = (struct ufs_dir_entry *)(kaddr+offset); limit = kaddr + ufs_last_byte(inode, n) - UFS_DIR_REC_LEN(1); diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c index afb601c0dda0..c843ec858cf7 100644 --- a/fs/ufs/inode.c +++ b/fs/ufs/inode.c @@ -36,6 +36,7 @@ #include #include #include +#include #include "ufs_fs.h" #include "ufs.h" @@ -693,7 +694,7 @@ struct inode *ufs_iget(struct super_block *sb, unsigned long ino) if (err) goto bad_inode; - inode->i_version++; + inode_inc_iversion(inode); ufsi->i_lastfrag = (inode->i_size + uspi->s_fsize - 1) >> uspi->s_fshift; ufsi->i_dir_start_lookup = 0; diff --git a/fs/ufs/super.c b/fs/ufs/super.c index 
4d497e9c6883..b6ba80e05bff 100644 --- a/fs/ufs/super.c +++ b/fs/ufs/super.c @@ -88,6 +88,7 @@ #include #include #include +#include #include "ufs_fs.h" #include "ufs.h" @@ -1440,7 +1441,7 @@ static struct inode *ufs_alloc_inode(struct super_block *sb) if (!ei) return NULL; - ei->vfs_inode.i_version = 1; + inode_set_iversion(&ei->vfs_inode, 1); seqlock_init(&ei->meta_lock); mutex_init(&ei->truncate_mutex); return &ei->vfs_inode; -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/10] Btrfs: fix unexpected EEXIST from btrfs_get_extent
On 22.12.2017 00:42, Liu Bo wrote: > This fixes a corner case that is caused by a race of dio write vs dio > read/write. > > dio write: > [0, 32k) -> [0, 8k) + [8k, 32k) > > dio read/write: > > While get_extent() with [0, 4k), [0, 8k) is found as existing em, even > though start == existing->start, em is [0, 32k), > extent_map_end(em) > extent_map_end(existing), > then it goes thru merge_extent_mapping() which tries to add a [8k, 8k) > (with a length 0), and get_extent ends up returning -EEXIST, and dio > read/write will get -EEXIST which is confusing applications. I think this paragraph should be expanded a bit since you've crammed a lot of information in few words. In the below graphs em should be extent we'd like to insert, no? So given that it always encompasses the existing (0, 8k given the above description) then em should be 0, 32k ? > > Here I concluded all the possible situations, > 1) start < existing->start > > +---+em+---+ > +--prev---+ | +-+ | > | | | | | | > +-+ + +---+existing++ ++ > + > | > + > start > > 2) start == existing->start > > +em+ > | +-+ | > | | | | > + +existing-+ + > | > | > + > start > > 3) start > existing->start && start < (existing->start + existing->len) > > +em+ > | +-+ | > | | | | > + +existing-+ + >| >| >+ > start > > 4) start >= (existing->start + existing->len) > > +---+em+---+ > | +-+ | +--next---+ > | | | | | | > + +---+existing++ + +-+ > + > | > + >start > > After going thru the above case by case, it turns out that if start is > within existing em (front inclusive), then the existing em should be > returned, otherwise, we try our best to merge candidate em with > sibling ems to form a larger em. 
> > Reported-by: David Vallender > Signed-off-by: Liu Bo > --- > fs/btrfs/extent_map.c | 25 ++--- > 1 file changed, 10 insertions(+), 15 deletions(-) > > diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c > index 6653b08..d386cfb 100644 > --- a/fs/btrfs/extent_map.c > +++ b/fs/btrfs/extent_map.c > @@ -483,7 +483,7 @@ static struct extent_map *prev_extent_map(struct > extent_map *em) > static int merge_extent_mapping(struct extent_map_tree *em_tree, > struct extent_map *existing, > struct extent_map *em, > - u64 map_start) > + u64 map_start, u64 map_len) > { > struct extent_map *prev; > struct extent_map *next; > @@ -496,9 +496,13 @@ static int merge_extent_mapping(struct extent_map_tree > *em_tree, > if (existing->start > map_start) { > next = existing; > prev = prev_extent_map(next); > + if (prev) > + ASSERT(extent_map_end(prev) <= map_start); > } else { > prev = existing; > next = next_extent_map(prev); > + if (next) > + ASSERT(map_start + map_len <= next->start); > } > > start = prev ? extent_map_end(prev) : em->start; > @@ -540,35 +544,26 @@ int btrfs_add_extent_mapping(struct extent_map_tree > *em_tree, >* existing will always be non-NULL, since there must be >* extent causing the -EEXIST. >*/ > - if (existing->start == em->start && > - extent_map_end(existing) >= extent_map_end(em) && > - em->block_start == existing->block_start) { > - /* > - * The existing extent map already encompasses the > - * entire extent map we tried to add. > - */ > + if (start >= existing->start && > + start < extent_map_end(existing)) { > free_extent_map(em); > *em_in = existing; > ret = 0; > - } else if (start >= extent_map_end(existing) || > - start <= existing->start) { > + } else { > /* >* The existing extent map is the one nearest to >* the [start, start + len) range which overlaps >*/ > ret = merge_extent_mapping(em_tree, existing, > -em, start); > +
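The four cases enumerated in the commit message collapse to a single predicate: if the lookup start falls inside the existing extent map (front-inclusive, cases 2 and 3), the existing em is reused; otherwise (cases 1 and 4) merge_extent_mapping() runs. The following is a userspace sketch of that decision, not kernel code — `reuse_existing` is a hypothetical name, and the struct is reduced to the two fields the check needs:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

struct extent_map {
	u64 start;
	u64 len;
};

static u64 extent_map_end(const struct extent_map *em)
{
	return em->start + em->len;
}

/*
 * Mirrors the patched check in btrfs_add_extent_mapping():
 * return 1 when start lies within [existing->start, extent_map_end(existing))
 * (cases 2 and 3 above), so the existing em is returned to the caller;
 * return 0 for cases 1 and 4, where merge_extent_mapping() is tried.
 */
static int reuse_existing(const struct extent_map *existing, u64 start)
{
	return start >= existing->start && start < extent_map_end(existing);
}
```

With the numbers from the commit message (existing em [0, 32k)), a dio read at start 0 or 8k reuses the existing em instead of attempting the degenerate [8k, 8k) merge that produced the spurious -EEXIST.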
[PATCH v4 14/19] xfs: convert to new i_version API
From: Jeff Layton Signed-off-by: Jeff Layton --- fs/xfs/libxfs/xfs_inode_buf.c | 7 +-- fs/xfs/xfs_icache.c | 5 +++-- fs/xfs/xfs_inode.c| 3 ++- fs/xfs/xfs_inode_item.c | 3 ++- fs/xfs/xfs_trans_inode.c | 4 +++- 5 files changed, 15 insertions(+), 7 deletions(-) diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c index 6b7989038d75..b9c0bf80669c 100644 --- a/fs/xfs/libxfs/xfs_inode_buf.c +++ b/fs/xfs/libxfs/xfs_inode_buf.c @@ -32,6 +32,8 @@ #include "xfs_ialloc.h" #include "xfs_dir2.h" +#include + /* * Check that none of the inode's in the buffer have a next * unlinked field of 0. @@ -264,7 +266,8 @@ xfs_inode_from_disk( to->di_flags= be16_to_cpu(from->di_flags); if (to->di_version == 3) { - inode->i_version = be64_to_cpu(from->di_changecount); + inode_set_iversion_queried(inode, + be64_to_cpu(from->di_changecount)); to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec); to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec); to->di_flags2 = be64_to_cpu(from->di_flags2); @@ -314,7 +317,7 @@ xfs_inode_to_disk( to->di_flags = cpu_to_be16(from->di_flags); if (from->di_version == 3) { - to->di_changecount = cpu_to_be64(inode->i_version); + to->di_changecount = cpu_to_be64(inode_peek_iversion(inode)); to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec); to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec); to->di_flags2 = cpu_to_be64(from->di_flags2); diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 43005fbe8b1e..4c315adb05e6 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -37,6 +37,7 @@ #include #include +#include /* * Allocate and initialise an xfs_inode. 
@@ -293,14 +294,14 @@ xfs_reinit_inode( int error; uint32_tnlink = inode->i_nlink; uint32_tgeneration = inode->i_generation; - uint64_tversion = inode->i_version; + uint64_tversion = inode_peek_iversion(inode); umode_t mode = inode->i_mode; error = inode_init_always(mp->m_super, inode); set_nlink(inode, nlink); inode->i_generation = generation; - inode->i_version = version; + inode_set_iversion_queried(inode, version); inode->i_mode = mode; return error; } diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 801274126648..dfc5e60d8af3 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -16,6 +16,7 @@ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA */ #include +#include #include "xfs.h" #include "xfs_fs.h" @@ -833,7 +834,7 @@ xfs_ialloc( ip->i_d.di_flags = 0; if (ip->i_d.di_version == 3) { - inode->i_version = 1; + inode_set_iversion(inode, 1); ip->i_d.di_flags2 = 0; ip->i_d.di_cowextsize = 0; ip->i_d.di_crtime.t_sec = (int32_t)tv.tv_sec; diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 6ee5c3bf19ad..7571abf5dfb3 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -30,6 +30,7 @@ #include "xfs_buf_item.h" #include "xfs_log.h" +#include kmem_zone_t*xfs_ili_zone; /* inode log item zone */ @@ -354,7 +355,7 @@ xfs_inode_to_log_dinode( to->di_next_unlinked = NULLAGINO; if (from->di_version == 3) { - to->di_changecount = inode->i_version; + to->di_changecount = inode_peek_iversion(inode); to->di_crtime.t_sec = from->di_crtime.t_sec; to->di_crtime.t_nsec = from->di_crtime.t_nsec; to->di_flags2 = from->di_flags2; diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c index daa7615497f9..225544327c4f 100644 --- a/fs/xfs/xfs_trans_inode.c +++ b/fs/xfs/xfs_trans_inode.c @@ -28,6 +28,8 @@ #include "xfs_inode_item.h" #include "xfs_trace.h" +#include + /* * Add a locked inode to the transaction. 
* @@ -117,7 +119,7 @@ xfs_trans_log_inode( */ if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) && IS_I_VERSION(VFS_I(ip))) { - VFS_I(ip)->i_version++; + inode_inc_iversion(VFS_I(ip)); flags |= XFS_ILOG_CORE; } -- 2.14.3
[PATCH v4 11/19] nfsd: convert to new i_version API
From: Jeff Layton

Mostly just making sure we use the "get" wrappers so we know when it is
being fetched for later use.

Signed-off-by: Jeff Layton
---
 fs/nfsd/nfsfh.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 43f31cf49bae..b8444189223b 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 
 static inline __u32 ino_t_to_u32(ino_t ino)
 {
@@ -259,7 +260,7 @@ static inline u64 nfsd4_change_attribute(struct inode *inode)
 	chattr = inode->i_ctime.tv_sec;
 	chattr <<= 30;
 	chattr += inode->i_ctime.tv_nsec;
-	chattr += inode->i_version;
+	chattr += inode_query_iversion(inode);
 	return chattr;
 }
-- 
2.14.3
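The packing done by nfsd4_change_attribute() is simple enough to check in userspace: the ctime seconds are shifted left by 30 bits to leave room for the nanoseconds field (always < 2^30), and the i_version counter is folded in on top. This is a sketch with a hypothetical name, not nfsd code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of nfsd4_change_attribute(): fold ctime into the
 * i_version counter so the NFSv4 change attribute still moves even
 * when i_version alone would not. tv_nsec < 10^9 < 2^30, so the
 * 30-bit shift keeps seconds and nanoseconds from colliding.
 */
static uint64_t change_attr_model(int64_t ctime_sec, long ctime_nsec,
				  uint64_t i_version)
{
	uint64_t chattr = (uint64_t)ctime_sec;

	chattr <<= 30;
	chattr += (uint64_t)ctime_nsec;
	chattr += i_version;
	return chattr;
}
```

Note the result is a sum, not a bitfield: a large i_version can carry into the seconds portion, which is acceptable here because the attribute only needs to change when the inode changes, not to be decodable.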
[PATCH v4 15/19] IMA: switch IMA over to new i_version API
From: Jeff Layton

Signed-off-by: Jeff Layton
---
 security/integrity/ima/ima_api.c  | 3 ++-
 security/integrity/ima/ima_main.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index c7e8db0ea4c0..c6ae42266270 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 
 #include "ima.h"
 
@@ -215,7 +216,7 @@ int ima_collect_measurement(struct integrity_iint_cache *iint,
 	 * which do not support i_version, support is limited to an initial
 	 * measurement/appraisal/audit.
	 */
-	i_version = file_inode(file)->i_version;
+	i_version = inode_query_iversion(inode);
 	hash.hdr.algo = algo;
 
 	/* Initialize hash digest to 0's in case of failure */
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 50b8254d..06a70c5a2329 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 
 #include "ima.h"
 
@@ -128,7 +129,7 @@ static void ima_check_last_writer(struct integrity_iint_cache *iint,
 	inode_lock(inode);
 	if (atomic_read(&inode->i_writecount) == 1) {
 		if (!IS_I_VERSION(inode) ||
-		    (iint->version != inode->i_version) ||
+		    inode_cmp_iversion(inode, iint->version) ||
 		    (iint->flags & IMA_NEW_FILE)) {
 			iint->flags &= ~(IMA_DONE_MASK | IMA_NEW_FILE);
 			iint->measured_pcrs = 0;
-- 
2.14.3
[PATCH v4 16/19] fs: only set S_VERSION when updating times if necessary
From: Jeff Layton

We only really need to update i_version if someone has queried for it
since we last incremented it. By doing that, we can avoid having to
update the inode if the times haven't changed.

If the times have changed, then we go ahead and forcibly increment the
counter, under the assumption that we'll be going to the storage anyway,
and the increment itself is relatively cheap.

Signed-off-by: Jeff Layton
---
 fs/inode.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 19e72f500f71..2fa920188759 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1635,17 +1635,21 @@ static int relatime_need_update(const struct path *path, struct inode *inode,
 int generic_update_time(struct inode *inode, struct timespec *time, int flags)
 {
 	int iflags = I_DIRTY_TIME;
+	bool dirty = false;
 
 	if (flags & S_ATIME)
 		inode->i_atime = *time;
 	if (flags & S_VERSION)
-		inode_inc_iversion(inode);
+		dirty |= inode_maybe_inc_iversion(inode, dirty);
 	if (flags & S_CTIME)
 		inode->i_ctime = *time;
 	if (flags & S_MTIME)
 		inode->i_mtime = *time;
+	if ((flags & (S_ATIME | S_CTIME | S_MTIME)) &&
+	    !(inode->i_sb->s_flags & SB_LAZYTIME))
+		dirty = true;
 
-	if (!(inode->i_sb->s_flags & SB_LAZYTIME) || (flags & S_VERSION))
+	if (dirty)
 		iflags |= I_DIRTY_SYNC;
 	__mark_inode_dirty(inode, iflags);
 	return 0;
@@ -1864,7 +1868,7 @@ int file_update_time(struct file *file)
 	if (!timespec_equal(&inode->i_ctime, &now))
 		sync_it |= S_CTIME;
 
-	if (IS_I_VERSION(inode))
+	if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
 		sync_it |= S_VERSION;
 
 	if (!sync_it)
-- 
2.14.3
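The new I_DIRTY_SYNC decision in generic_update_time() can be modeled in userspace. This is a simplified sketch, not kernel code: `needs_dirty_sync` is a hypothetical name, the flag values are local stand-ins, and the inode_maybe_inc_iversion() call is reduced to a boolean saying whether the counter had been queried (the force argument is false at that point in the patched function, since dirty starts out false):

```c
#include <assert.h>
#include <stdbool.h>

#define S_ATIME   (1 << 0)	/* local stand-ins for the kernel flags */
#define S_MTIME   (1 << 1)
#define S_CTIME   (1 << 2)
#define S_VERSION (1 << 3)

/*
 * Model of the patched generic_update_time(): the inode is marked
 * I_DIRTY_SYNC only when the i_version bump actually happened
 * (i.e. someone queried the counter since the last increment) or
 * when a time update cannot be deferred via lazytime.
 */
static bool needs_dirty_sync(int flags, bool lazytime, bool iversion_queried)
{
	bool dirty = false;

	if (flags & S_VERSION)
		/* models inode_maybe_inc_iversion(inode, false) */
		dirty |= iversion_queried;
	if ((flags & (S_ATIME | S_CTIME | S_MTIME)) && !lazytime)
		dirty = true;
	return dirty;
}
```

The interesting case is S_VERSION alone with no queries outstanding: the old code always set I_DIRTY_SYNC, while the patched logic skips the inode update entirely.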
[PATCH v4 12/19] ocfs2: convert to new i_version API
From: Jeff Layton Signed-off-by: Jeff Layton Reviewed-by: Jan Kara --- fs/ocfs2/dir.c | 15 --- fs/ocfs2/inode.c| 3 ++- fs/ocfs2/namei.c| 3 ++- fs/ocfs2/quota_global.c | 3 ++- 4 files changed, 14 insertions(+), 10 deletions(-) diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c index febe6312ceff..32f9c72dff17 100644 --- a/fs/ocfs2/dir.c +++ b/fs/ocfs2/dir.c @@ -42,6 +42,7 @@ #include #include #include +#include #include @@ -1174,7 +1175,7 @@ static int __ocfs2_delete_entry(handle_t *handle, struct inode *dir, le16_add_cpu(&pde->rec_len, le16_to_cpu(de->rec_len)); de->inode = 0; - dir->i_version++; + inode_inc_iversion(dir); ocfs2_journal_dirty(handle, bh); goto bail; } @@ -1729,7 +1730,7 @@ int __ocfs2_add_entry(handle_t *handle, if (ocfs2_dir_indexed(dir)) ocfs2_recalc_free_list(dir, handle, lookup); - dir->i_version++; + inode_inc_iversion(dir); ocfs2_journal_dirty(handle, insert_bh); retval = 0; goto bail; @@ -1775,7 +1776,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode, * readdir(2), then we might be pointing to an invalid * dirent right now. Scan from the start of the block * to make sure. */ - if (*f_version != inode->i_version) { + if (inode_cmp_iversion(inode, *f_version)) { for (i = 0; i < i_size_read(inode) && i < offset; ) { de = (struct ocfs2_dir_entry *) (data->id_data + i); @@ -1791,7 +1792,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode, i += le16_to_cpu(de->rec_len); } ctx->pos = offset = i; - *f_version = inode->i_version; + *f_version = inode_query_iversion(inode); } de = (struct ocfs2_dir_entry *) (data->id_data + ctx->pos); @@ -1869,7 +1870,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode, * readdir(2), then we might be pointing to an invalid * dirent right now. Scan from the start of the block * to make sure. 
*/ - if (*f_version != inode->i_version) { + if (inode_cmp_iversion(inode, *f_version)) { for (i = 0; i < sb->s_blocksize && i < offset; ) { de = (struct ocfs2_dir_entry *) (bh->b_data + i); /* It's too expensive to do a full @@ -1886,7 +1887,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode, offset = i; ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1)) | offset; - *f_version = inode->i_version; + *f_version = inode_query_iversion(inode); } while (ctx->pos < i_size_read(inode) @@ -1940,7 +1941,7 @@ static int ocfs2_dir_foreach_blk(struct inode *inode, u64 *f_version, */ int ocfs2_dir_foreach(struct inode *inode, struct dir_context *ctx) { - u64 version = inode->i_version; + u64 version = inode_query_iversion(inode); ocfs2_dir_foreach_blk(inode, &version, ctx, true); return 0; } diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c index 1a1e0078ab38..d51b80edd972 100644 --- a/fs/ocfs2/inode.c +++ b/fs/ocfs2/inode.c @@ -28,6 +28,7 @@ #include #include #include +#include #include @@ -302,7 +303,7 @@ void ocfs2_populate_inode(struct inode *inode, struct ocfs2_dinode *fe, OCFS2_I(inode)->ip_attr = le32_to_cpu(fe->i_attr); OCFS2_I(inode)->ip_dyn_features = le16_to_cpu(fe->i_dyn_features); - inode->i_version = 1; + inode_set_iversion(inode, 1); inode->i_generation = le32_to_cpu(fe->i_generation); inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev)); inode->i_mode = le16_to_cpu(fe->i_mode); diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c index 3b0a10d9b36f..c801eddc4bf3 100644 --- a/fs/ocfs2/namei.c +++ b/fs/ocfs2/namei.c @@ -41,6 +41,7 @@ #include #include #include +#include #include @@ -1520,7 +1521,7 @@ static int ocfs2_rename(struct inode *old_dir, mlog_errno(status); goto bail; } - new_dir->i_version++; + inode_inc_iversion(new_dir); if (S_ISDIR(new_inode->i_mode)) ocfs2_set_links_count(newfe, 0); diff
[PATCH v4 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
From: Jeff Layton

If XFS_ILOG_CORE is already set then go ahead and increment it.

Signed-off-by: Jeff Layton
---
 fs/xfs/xfs_trans_inode.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index 225544327c4f..4a89da4b6fe7 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -112,15 +112,17 @@ xfs_trans_log_inode(
 
 	/*
 	 * First time we log the inode in a transaction, bump the inode change
-	 * counter if it is configured for this to occur. We don't use
-	 * inode_inc_version() because there is no need for extra locking around
-	 * i_version as we already hold the inode locked exclusively for
-	 * metadata modification.
+	 * counter if it is configured for this to occur. While we have the
+	 * inode locked exclusively for metadata modification, we can usually
+	 * avoid setting XFS_ILOG_CORE if no one has queried the value since
+	 * the last time it was incremented. If we have XFS_ILOG_CORE already
+	 * set however, then go ahead and bump the i_version counter
+	 * unconditionally.
	 */
 	if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
 	    IS_I_VERSION(VFS_I(ip))) {
-		inode_inc_iversion(VFS_I(ip));
-		flags |= XFS_ILOG_CORE;
+		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
+			flags |= XFS_ILOG_CORE;
 	}
 
 	tp->t_flags |= XFS_TRANS_DIRTY;
-- 
2.14.3
[PATCH v4 18/19] btrfs: only dirty the inode in btrfs_update_time if something was changed
From: Jeff Layton

At this point, we know that "now" and the file times may differ, and we
suspect that the i_version has been flagged to be bumped. Attempt to
bump the i_version, and only mark the inode dirty if that actually
occurred or if one of the times was updated.

Signed-off-by: Jeff Layton
---
 fs/btrfs/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ac8692849a81..76245323a7c8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6107,19 +6107,20 @@ static int btrfs_update_time(struct inode *inode, struct timespec *now,
 			     int flags)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
+	bool dirty = flags & ~S_VERSION;
 
 	if (btrfs_root_readonly(root))
 		return -EROFS;
 
 	if (flags & S_VERSION)
-		inode_inc_iversion(inode);
+		dirty |= inode_maybe_inc_iversion(inode, dirty);
 	if (flags & S_CTIME)
 		inode->i_ctime = *now;
 	if (flags & S_MTIME)
 		inode->i_mtime = *now;
 	if (flags & S_ATIME)
 		inode->i_atime = *now;
-	return btrfs_dirty_inode(inode);
+	return dirty ? btrfs_dirty_inode(inode) : 0;
 }
 
 /*
-- 
2.14.3
[PATCH v4 19/19] fs: handle inode->i_version more efficiently
From: Jeff Layton Since i_version is mostly treated as an opaque value, we can exploit that fact to avoid incrementing it when no one is watching. With that change, we can avoid incrementing the counter on writes, unless someone has queried for it since it was last incremented. If the a/c/mtime don't change, and the i_version hasn't changed, then there's no need to dirty the inode metadata on a write. Convert the i_version counter to an atomic64_t, and use the lowest order bit to hold a flag that will tell whether anyone has queried the value since it was last incremented. When we go to maybe increment it, we fetch the value and check the flag bit. If it's clear then we don't need to do anything if the update isn't being forced. If we do need to update, then we increment the counter by 2, and clear the flag bit, and then use a CAS op to swap it into place. If that works, we return true. If it doesn't then do it again with the value that we fetch from the CAS operation. On the query side, if the flag is already set, then we just shift the value down by 1 bit and return it. Otherwise, we set the flag in our on-stack value and again use cmpxchg to swap it into place if it hasn't changed. If it has, then we use the value from the cmpxchg as the new "old" value and try again. This method allows us to avoid incrementing the counter on writes (and dirtying the metadata) under typical workloads. We only need to increment if it has been queried since it was last changed. 
Signed-off-by: Jeff Layton --- include/linux/fs.h | 2 +- include/linux/iversion.h | 208 ++- 2 files changed, 154 insertions(+), 56 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 76382c24e9d0..6804d075933e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -639,7 +639,7 @@ struct inode { struct hlist_head i_dentry; struct rcu_head i_rcu; }; - u64 i_version; + atomic64_t i_version; atomic_ti_count; atomic_ti_dio_count; atomic_ti_writecount; diff --git a/include/linux/iversion.h b/include/linux/iversion.h index e08c634779df..cef242e54489 100644 --- a/include/linux/iversion.h +++ b/include/linux/iversion.h @@ -5,6 +5,8 @@ #include /* + * The inode->i_version field: + * --- * The change attribute (i_version) is mandated by NFSv4 and is mostly for * knfsd, but is also used for other purposes (e.g. IMA). The i_version must * appear different to observers if there was a change to the inode's data or @@ -27,86 +29,171 @@ * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.). * We consider these sorts of filesystems to have a kernel-managed i_version. * + * This implementation uses the low bit in the i_version field as a flag to + * track when the value has been queried. If it has not been queried since it + * was last incremented, we can skip the increment in most cases. + * + * In the event that we're updating the ctime, we will usually go ahead and + * bump the i_version anyway. Since that has to go to stable storage in some + * fashion, we might as well increment it as well. + * + * With this implementation, the value should always appear to observers to + * increase over time if the file has changed. It's recommended to use + * inode_cmp_iversion() helper to compare values. + * * Note that some filesystems (e.g. NFS and AFS) just use the field to store - * a server-provided value (for the most part). For that reason, those + * a server-provided value for the most part. 
For that reason, those * filesystems do not set SB_I_VERSION. These filesystems are considered to * have a self-managed i_version. + * + * Persistently storing the i_version + * -- + * Queries of the i_version field are not gated on them hitting the backing + * store. It's always possible that the host could crash after allowing + * a query of the value but before it has made it to disk. + * + * To mitigate this problem, filesystems should always use + * inode_set_iversion_queried when loading an existing inode from disk. This + * ensures that the next attempted inode increment will result in the value + * changing. + * + * Storing the value to disk therefore does not count as a query, so those + * filesystems should use inode_peek_iversion to grab the value to be stored. + * There is no need to flag the value as having been queried in that case. */ +/* + * We borrow the lowest bit in the i_version to use as a flag to tell whether + * it has been queried since we last incremented it. If it has, then we must + * increment it on the next change. After that, we can clear the flag and + * avoid incrementing it again until it has again been queried. + */ +#define I_VERSION_QUERIED_SHIFT(1) +#define I_VERSION_QUERIED
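The queried-flag scheme described in the commit message can be exercised in userspace with C11 atomics. This is a model of the two patched helpers, not the kernel implementation itself (which uses atomic64_try_cmpxchg on inode->i_version); the names and the flag layout follow the patch, everything else is a stand-in:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef _Atomic uint64_t atomic_u64;

#define I_VERSION_QUERIED	1ULL	/* low bit: value was seen since last bump */
#define I_VERSION_INCREMENT	2ULL	/* counter lives in bits 63:1 */

/*
 * Bump the counter only if it has been queried since the last increment
 * (or if forced, e.g. because the inode is being dirtied anyway).
 * Returns true when an increment actually happened.
 */
static bool maybe_inc_iversion(atomic_u64 *iv, bool force)
{
	uint64_t cur = atomic_load(iv);

	for (;;) {
		uint64_t new;

		/* nobody has looked since the last bump: skip the increment */
		if (!force && !(cur & I_VERSION_QUERIED))
			return false;
		/* clear the queried flag and add 2, leaving bit 0 clear */
		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
		if (atomic_compare_exchange_weak(iv, &cur, new))
			return true;
		/* on failure, cur now holds the fresh value; retry */
	}
}

/*
 * Read the counter and mark it as queried, guaranteeing that the next
 * change will increment it.
 */
static uint64_t query_iversion(atomic_u64 *iv)
{
	uint64_t cur = atomic_load(iv);

	for (;;) {
		if ((cur & I_VERSION_QUERIED) ||
		    atomic_compare_exchange_weak(iv, &cur,
						 cur | I_VERSION_QUERIED))
			return cur >> 1;	/* strip the flag bit */
	}
}
```

A write-heavy, query-free workload then never increments the counter (and never dirties the inode for it), while a single query arms the next increment, which is the whole point of the patch.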
[PATCH v4 02/19] fs: don't take the i_lock in inode_inc_iversion
From: Jeff Layton

The rationale for taking the i_lock when incrementing this value is
lost in antiquity. The readers of the field don't take it (at least not
universally), so my assumption is that it was only done here to
serialize incrementors.

If that is indeed the case, then we can drop the i_lock from this
codepath and treat the field as an atomic64_t for the purposes of
incrementing it. This allows us to use inode_inc_iversion without any
danger of lock inversion.

Note that the read side is not fetched atomically with this change. The
assumption here is that this is not a critical issue, since the
i_version is not fully synchronized with anything else anyway.

Signed-off-by: Jeff Layton
---
 include/linux/iversion.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index bb50d27c71f9..e08c634779df 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -104,12 +104,13 @@ inode_set_iversion_queried(struct inode *inode, const u64 new)
 static inline bool
 inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
-	spin_lock(&inode->i_lock);
-	inode->i_version++;
-	spin_unlock(&inode->i_lock);
+	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
+
+	atomic64_inc(ivp);
 	return true;
 }
 
+
 /**
  * inode_inc_iversion - forcibly increment i_version
  * @inode: inode that needs to be updated
-- 
2.14.3
[RFC PATCH bpf-next 3/3] error-injection: Separate error-injection from kprobe
The error-injection framework is not limited to kprobes or bpf; other
kernel subsystems (e.g. livepatch, ftrace) can use it freely for
checking the safety of error injection. So this patch separates the
error-injection framework from kprobes.

Some differences have been made:

- The "kprobe" word is removed from all APIs and structures.
- BPF_ALLOW_ERROR_INJECTION() is renamed to ALLOW_ERROR_INJECTION(),
  since it is no longer limited to BPF.
- CONFIG_FUNCTION_ERROR_INJECTION is the config item for this feature.
  It is automatically enabled if the arch supports the error-injection
  feature for kprobes, ftrace, etc.

Signed-off-by: Masami Hiramatsu
---
 arch/Kconfig                           |    2 
 arch/x86/Kconfig                       |    2 
 arch/x86/include/asm/error-injection.h |   12 ++
 arch/x86/kernel/kprobes/ftrace.c       |   14 --
 arch/x86/lib/Makefile                  |    2 
 arch/x86/lib/error-inject.c            |   19 +++
 fs/btrfs/disk-io.c                     |    2 
 fs/btrfs/free-space-cache.c            |    2 
 include/asm-generic/error-injection.h  |   20 +++
 include/asm-generic/vmlinux.lds.h      |   14 +-
 include/linux/bpf.h                    |   12 --
 include/linux/error-injection.h        |   21 +++
 include/linux/kprobes.h                |    1 
 include/linux/module.h                 |    6 -
 kernel/kprobes.c                       |  163 --
 kernel/module.c                        |    8 +
 kernel/trace/Kconfig                   |    2 
 kernel/trace/bpf_trace.c               |    2 
 kernel/trace/trace_kprobe.c            |    3 
 lib/Kconfig.debug                      |    4 +
 lib/Makefile                           |    1 
 lib/error-inject.c                     |  200
 22 files changed, 302 insertions(+), 210 deletions(-)
 create mode 100644 arch/x86/include/asm/error-injection.h
 create mode 100644 arch/x86/lib/error-inject.c
 create mode 100644 include/asm-generic/error-injection.h
 create mode 100644 include/linux/error-injection.h
 create mode 100644 lib/error-inject.c

diff --git a/arch/Kconfig b/arch/Kconfig
index d3f4aaf9cb7a..97376accfb14 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -196,7 +196,7 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
 	bool
 
-config HAVE_KPROBE_OVERRIDE
+config HAVE_FUNCTION_ERROR_INJECTION
 	bool
 
 config HAVE_NMI
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 04d66e6fa447..fc519e3ae754 100644
--- a/arch/x86/Kconfig
+++ 
b/arch/x86/Kconfig @@ -154,7 +154,7 @@ config X86 select HAVE_KERNEL_XZ select HAVE_KPROBES select HAVE_KPROBES_ON_FTRACE - select HAVE_KPROBE_OVERRIDE + select HAVE_FUNCTION_ERROR_INJECTION select HAVE_KRETPROBES select HAVE_KVM select HAVE_LIVEPATCH if X86_64 diff --git a/arch/x86/include/asm/error-injection.h b/arch/x86/include/asm/error-injection.h new file mode 100644 index ..d89759a0354c --- /dev/null +++ b/arch/x86/include/asm/error-injection.h @@ -0,0 +1,12 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_ERROR_INJECTION_H +#define _ASM_ERROR_INJECTION_H + +#include +#include +#include + +asmlinkage void just_return_func(void); +void override_function_to_return(struct pt_regs *regs); + +#endif /* _ASM_ERROR_INJECTION_H */ diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c index 1ea748d682fd..8dc0161cec8f 100644 --- a/arch/x86/kernel/kprobes/ftrace.c +++ b/arch/x86/kernel/kprobes/ftrace.c @@ -97,17 +97,3 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p) p->ainsn.boostable = false; return 0; } - -asmlinkage void override_func(void); -asm( - ".type override_func, @function\n" - "override_func:\n" - " ret\n" - ".size override_func, .-override_func\n" -); - -void arch_ftrace_kprobe_override_function(struct pt_regs *regs) -{ - regs->ip = (unsigned long)&override_func; -} -NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function); diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile index 7b181b61170e..081f09435d28 100644 --- a/arch/x86/lib/Makefile +++ b/arch/x86/lib/Makefile @@ -26,6 +26,8 @@ lib-y += memcpy_$(BITS).o lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o +lib-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o + obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o diff --git a/arch/x86/lib/error-inject.c b/arch/x86/lib/error-inject.c new file mode 100644 index ..1998d4ae161e --- /dev/null +++ 
b/arch/x86/lib/error-inject.c @@ -0,0 +1,19 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include + +asmlinkage void just_return_func(void); + +asm( + ".type just_return_func, @function\n" + "just_return_func:\n" + " ret\n" + ".size just_return_func, .-just_return_func\n" +); + +void ov
[RFC PATCH bpf-next 2/3] tracing/kprobe: bpf: Compare instruction pointer with original one
Compare the instruction pointer with the original one on the stack, instead
of using the per-cpu bpf_kprobe_override flag.

This patch also consolidates the reset_current_kprobe() and
preempt_enable_no_resched() blocks. Those can be done in one place.

Signed-off-by: Masami Hiramatsu
---
 kernel/trace/bpf_trace.c    |    1 -
 kernel/trace/trace_kprobe.c |   21 +++--
 2 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d663660f8392..cefa9b0e396c 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -83,7 +83,6 @@ EXPORT_SYMBOL_GPL(trace_call_bpf);
 #ifdef CONFIG_BPF_KPROBE_OVERRIDE
 BPF_CALL_2(bpf_override_return, struct pt_regs *, regs, unsigned long, rc)
 {
-	__this_cpu_write(bpf_kprobe_override, 1);
 	regs_set_return_value(regs, rc);
 	arch_ftrace_kprobe_override_function(regs);
 	return 0;
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 265e3e27e8dc..a7c7035963f2 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -42,8 +42,6 @@ struct trace_kprobe {
 	(offsetof(struct trace_kprobe, tp.args) +	\
 	(sizeof(struct probe_arg) * (n)))
 
-DEFINE_PER_CPU(int, bpf_kprobe_override);
-
 static nokprobe_inline bool trace_kprobe_is_return(struct trace_kprobe *tk)
 {
 	return tk->rp.handler != NULL;
@@ -1204,6 +1202,7 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
 	int rctx;
 
 	if (bpf_prog_array_valid(call)) {
+		unsigned long orig_ip = instruction_pointer(regs);
 		int ret;
 
 		ret = trace_call_bpf(call, regs);
@@ -1211,12 +1210,13 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
 		/*
 		 * We need to check and see if we modified the pc of the
 		 * pt_regs, and if so clear the kprobe and return 1 so that we
-		 * don't do the instruction skipping. Also reset our state so
-		 * we are clean the next pass through.
+		 * don't do the single stepping.
+		 * The ftrace kprobe handler leaves it up to us to re-enable
+		 * preemption here before returning if we've modified the ip.
 		 */
-		if (__this_cpu_read(bpf_kprobe_override)) {
-			__this_cpu_write(bpf_kprobe_override, 0);
+		if (orig_ip != instruction_pointer(regs)) {
 			reset_current_kprobe();
+			preempt_enable_no_resched();
 			return 1;
 		}
 		if (!ret)
@@ -1324,15 +1324,8 @@ static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs)
 	if (tk->tp.flags & TP_FLAG_TRACE)
 		kprobe_trace_func(tk, regs);
 #ifdef CONFIG_PERF_EVENTS
-	if (tk->tp.flags & TP_FLAG_PROFILE) {
+	if (tk->tp.flags & TP_FLAG_PROFILE)
 		ret = kprobe_perf_func(tk, regs);
-		/*
-		 * The ftrace kprobe handler leaves it up to us to re-enable
-		 * preemption here before returning if we've modified the ip.
-		 */
-		if (ret)
-			preempt_enable_no_resched();
-	}
 #endif
 	return ret;
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH bpf-next 1/3] tracing/kprobe: bpf: Check error injectable event is on function entry
Check whether the error injectable event is on function entry or not.
Currently it checks whether the event is an ftrace-based kprobe, but that
is wrong. It should check whether the event is on the entry of the target
function. Since error injection will override a function to just return
with a modified return value, that operation must be done before the
target function starts making a stackframe.

As a side effect, bpf error injection no longer needs to depend on the
function tracer. It can work with sw-breakpoint based kprobe events too.

Signed-off-by: Masami Hiramatsu
---
 kernel/trace/Kconfig        |    2 --
 kernel/trace/bpf_trace.c    |    6 +++---
 kernel/trace/trace_kprobe.c |    8 +---
 kernel/trace/trace_probe.h  |   12 ++--
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index ae3a2d519e50..6400e1bf97c5 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -533,9 +533,7 @@ config FUNCTION_PROFILER
 config BPF_KPROBE_OVERRIDE
 	bool "Enable BPF programs to override a kprobed function"
 	depends on BPF_EVENTS
-	depends on KPROBES_ON_FTRACE
 	depends on HAVE_KPROBE_OVERRIDE
-	depends on DYNAMIC_FTRACE_WITH_REGS
 	default n
 	help
 	 Allows BPF to override the execution of a probed function and
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f6d2327ecb59..d663660f8392 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event *event,
 	int ret = -EEXIST;
 
 	/*
-	 * Kprobe override only works for ftrace based kprobes, and only if they
-	 * are on the opt-in list.
+	 * Kprobe override only works if they are on the function entry,
+	 * and only if they are on the opt-in list.
 	 */
 	if (prog->kprobe_override &&
-	    (!trace_kprobe_ftrace(event->tp_event) ||
+	    (!trace_kprobe_on_func_entry(event->tp_event) ||
 	     !trace_kprobe_error_injectable(event->tp_event)))
 		return -EINVAL;
 
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 91f4b57dab82..265e3e27e8dc 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -88,13 +88,15 @@ static nokprobe_inline unsigned long trace_kprobe_nhit(struct trace_kprobe *tk)
 	return nhit;
 }
 
-int trace_kprobe_ftrace(struct trace_event_call *call)
+bool trace_kprobe_on_func_entry(struct trace_event_call *call)
 {
 	struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
-	return kprobe_ftrace(&tk->rp.kp);
+
+	return kprobe_on_func_entry(tk->rp.kp.addr, tk->rp.kp.symbol_name,
+				    tk->rp.kp.offset);
 }
 
-int trace_kprobe_error_injectable(struct trace_event_call *call)
+bool trace_kprobe_error_injectable(struct trace_event_call *call)
 {
 	struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
 	unsigned long addr;
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 5e54d748c84c..e101c5bb9eda 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -252,8 +252,8 @@ struct symbol_cache;
 unsigned long update_symbol_cache(struct symbol_cache *sc);
 void free_symbol_cache(struct symbol_cache *sc);
 struct symbol_cache *alloc_symbol_cache(const char *sym, long offset);
-int trace_kprobe_ftrace(struct trace_event_call *call);
-int trace_kprobe_error_injectable(struct trace_event_call *call);
+bool trace_kprobe_on_func_entry(struct trace_event_call *call);
+bool trace_kprobe_error_injectable(struct trace_event_call *call);
 #else	/* uprobes do not support symbol fetch methods */
 #define fetch_symbol_u8			NULL
@@ -280,14 +280,14 @@ alloc_symbol_cache(const char *sym, long offset)
 	return NULL;
 }
 
-static inline int trace_kprobe_ftrace(struct trace_event_call *call)
+static inline bool trace_kprobe_on_func_entry(struct trace_event_call *call)
 {
-	return 0;
+	return false;
 }
 
-static inline int trace_kprobe_error_injectable(struct trace_event_call *call)
+static inline bool trace_kprobe_error_injectable(struct trace_event_call *call)
 {
-	return 0;
+	return false;
 }
 #endif /* CONFIG_KPROBE_EVENTS */
[RFC PATCH bpf-next 0/3] Separate error injection framework from kprobes
Hi Josef and Alexei,

Here are the patches which describe what I think is a more "natural"
introduction of the error injection APIs. Basically what I did in this
series is to separate error injection from kprobes and put it in a new
small error-injection subsystem, which currently provides the whitelist
and a just-return function stub.

There are 2 main reasons why I separated it from kprobes.

- kprobes users can modify the execution path not only at error-injection
  whitelist functions but also at other functions. I don't like to suggest
  to users that such a limitation comes from kprobes itself.
- This error injection information is also useful for ftrace
  (function-hook) and livepatch. It should not be limited by
  CONFIG_KPROBES. So I introduced CONFIG_FUNCTION_ERROR_INJECTION for
  this feature.

This series also has some improvement suggestions.

- [1/3] The "kprobe override function" feature is not limited to
  ftrace-based kprobes; you can use it with sw-breakpoint based kprobes
  too. Also, you must check that the kprobe is on the entry of the
  function, right before it sets up the stackframe.
- [2/3] If we store the original instruction pointer and compare it with
  regs->ip, we don't need the per-cpu bpf_kprobe_override. Also,
  reset_current_kprobe() and preempt_enable_no_resched() don't need to be
  separated.

Any thoughts? If this is good, I will also add a MAINTAINERS entry for
this feature and add some testcases using kprobes and ftrace to inject
errors. (And maybe we also need a document on how to use it.)

BTW, it seems there are many error injection frameworks in lib/. We may
also need to consider the distinctions among these.

Thank you,

---

Masami Hiramatsu (3):
      tracing/kprobe: bpf: Check error injectable event is on function entry
      tracing/kprobe: bpf: Compare instruction pointer with original one
      error-injection: Separate error-injection from kprobe

 arch/Kconfig                           |    2 
 arch/x86/Kconfig                       |    2 
 arch/x86/include/asm/error-injection.h |   12 ++
 arch/x86/kernel/kprobes/ftrace.c       |   14 --
 arch/x86/lib/Makefile                  |    2 
 arch/x86/lib/error-inject.c            |   19 +++
 fs/btrfs/disk-io.c                     |    2 
 fs/btrfs/free-space-cache.c            |    2 
 include/asm-generic/error-injection.h  |   20 +++
 include/asm-generic/vmlinux.lds.h      |   14 +-
 include/linux/bpf.h                    |   12 --
 include/linux/error-injection.h        |   21 +++
 include/linux/kprobes.h                |    1 
 include/linux/module.h                 |    6 -
 kernel/kprobes.c                       |  163 --
 kernel/module.c                        |    8 +
 kernel/trace/Kconfig                   |    4 -
 kernel/trace/bpf_trace.c               |    9 +
 kernel/trace/trace_kprobe.c            |   32 ++---
 kernel/trace/trace_probe.h             |   12 +-
 lib/Kconfig.debug                      |    4 +
 lib/Makefile                           |    1 
 lib/error-inject.c                     |  200 
 23 files changed, 323 insertions(+), 239 deletions(-)
 create mode 100644 arch/x86/include/asm/error-injection.h
 create mode 100644 arch/x86/lib/error-inject.c
 create mode 100644 include/asm-generic/error-injection.h
 create mode 100644 include/linux/error-injection.h
 create mode 100644 lib/error-inject.c

--
Signature
Re: [PATCH 08/10] Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping
On 22.12.2017 00:42, Liu Bo wrote:
> This is a subtle case, so in order to understand the problem, it'd be good to
> know the content of existing and em when any error occurs.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/extent_map.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index d386cfb..a8b7e24 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -550,17 +550,23 @@ int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
>  			*em_in = existing;
>  			ret = 0;
>  		} else {
> +			u64 orig_start = em->start;
> +			u64 orig_len = em->len;
> +
>  			/*
>  			 * The existing extent map is the one nearest to
>  			 * the [start, start + len) range which overlaps
>  			 */
>  			ret = merge_extent_mapping(em_tree, existing,
>  						   em, start, len);
> -			free_extent_map(existing);
>  			if (ret) {
>  				free_extent_map(em);
>  				*em_in = NULL;
> +				WARN_ONCE(ret, KERN_INFO "Unexpected error %d: merge existing(start %llu len %llu) with em(start %llu len %llu)\n",

I think this KERN_INFO is spurious and should be removed?

> +					  ret, existing->start, existing->len,
> +					  orig_start, orig_len);
>  			}
> +			free_extent_map(existing);
>  		}
>  	}
> 
Re: [PATCH 09/10] Btrfs: add tracepoint for em's EEXIST case
On 22.12.2017 00:42, Liu Bo wrote:
> This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the
> subtle bugs around merge_extent_mapping.

In the next patch you are already making the function which takes all
these values noinline, meaning you can attach a kprobe so you can
interrogate the args via systemtap, perf probe or even bpf. So I'd rather
not add this tracepoint, since the general sentiment seems to be that
tracepoints are ABI and so have to be maintained.

> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/extent_map.c        |  1 +
>  include/trace/events/btrfs.h | 35 +++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index a8b7e24..40e4d30 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -539,6 +539,7 @@ int btrfs_add_extent_mapping(struct extent_map_tree *em_tree,
>  		ret = 0;
> 
>  		existing = search_extent_mapping(em_tree, start, len);
> +		trace_btrfs_handle_em_exist(existing, em, start, len);
> 
>  		/*
>  		 * existing will always be non-NULL, since there must be
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index 4342a32..b7ffcf7 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -249,6 +249,41 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
>  		  __entry->refs, __entry->compress_type)
>  );
> 
> +TRACE_EVENT(btrfs_handle_em_exist,
> +
> +	TP_PROTO(const struct extent_map *existing, const struct extent_map *map, u64 start, u64 len),
> +
> +	TP_ARGS(existing, map, start, len),
> +
> +	TP_STRUCT__entry(
> +		__field(u64, e_start)
> +		__field(u64, e_len)
> +		__field(u64, map_start)
> +		__field(u64, map_len)
> +		__field(u64, start)
> +		__field(u64, len)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->e_start	= existing->start;
> +		__entry->e_len		= existing->len;
> +		__entry->map_start	= map->start;
> +		__entry->map_len	= map->len;
> +		__entry->start		= start;
> +		__entry->len		= len;
> +	),
> +
> +	TP_printk("start=%llu len=%llu "
> +		  "existing(start=%llu len=%llu) "
> +		  "em(start=%llu len=%llu)",
> +		  (unsigned long long)__entry->start,
> +		  (unsigned long long)__entry->len,
> +		  (unsigned long long)__entry->e_start,
> +		  (unsigned long long)__entry->e_len,
> +		  (unsigned long long)__entry->map_start,
> +		  (unsigned long long)__entry->map_len)
> +);
> +
>  /* file extent item */
>  DECLARE_EVENT_CLASS(btrfs__file_extent_item_regular,
> 
Re: xfstests/btrfs/100 and 151 failure
> Thanks for the config file, will check with this and update the status.

script/100 passed on today's run [1] with my kernel config too. Maybe a
false alarm :-)

[1]: https://github.com/Lakshmipathi/btrfsqa/blob/master/results/results_2017-12-22_07:12/

thanks!

Cheers,
Lakshmipathi.G
http://www.giis.co.in http://www.webminal.org http://www.btrfs.in
[PATCH v2.1 09/10] btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT
Unlike the previous method that tried to commit the transaction inside
qgroup_reserve(), this time we will try to commit the transaction using
fs_info->transaction_kthread, to avoid nested transactions and the need
to worry about lock context.

Since it's an asynchronous function call and we won't wait for the
transaction commit, unlike the previous method we must call it before we
hit the qgroup limit.

So this patch will use the ratio and size of the qgroup meta_pertrans
reservation as an indicator to check if we should trigger a transaction
commit. (meta_prealloc won't be cleaned by transaction commit; it's
useless for this anyway.)

Signed-off-by: Qu Wenruo
---
v2.1
  Allow transaction_kthread to commit without waiting for the commit
  interval. The previous version didn't really work due to the commit
  interval limit. Now btrfs can handle small file creation much better,
  although btrfs/139 still hits EDQUOT earlier than expected.
---
 fs/btrfs/ctree.h       |  6 ++
 fs/btrfs/disk-io.c     |  1 +
 fs/btrfs/qgroup.c      | 43 +++--
 fs/btrfs/transaction.c |  1 +
 fs/btrfs/transaction.h | 14 ++
 5 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0c58f92c2d44..141a36571bd7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -724,6 +724,12 @@ struct btrfs_delayed_root;
  */
 #define BTRFS_FS_EXCL_OP			16
 
+/*
+ * To info transaction_kthread we need an immediate commit so it doesn't
+ * need to wait for commit_interval
+ */
+#define BTRFS_FS_NEED_ASYNC_COMMIT		17
+
 struct btrfs_fs_info {
 	u8 fsid[BTRFS_FSID_SIZE];
 	u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d8eaadac4330..a33e1da195ec 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1787,6 +1787,7 @@ static int transaction_kthread(void *arg)
 		now = get_seconds();
 		if (cur->state < TRANS_STATE_BLOCKED &&
+		    !test_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags) &&
 		    (now < cur->start_time ||
 		     now - cur->start_time < fs_info->commit_interval)) {
 			spin_unlock(&fs_info->trans_lock);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 6d5b476feb08..57d8fd605655 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 
 #include "ctree.h"
 #include "transaction.h"
@@ -2395,8 +2396,21 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-static bool qgroup_check_limits(const struct btrfs_qgroup *qg, u64 num_bytes)
+/*
+ * Two limits to commit transaction in advance.
+ *
+ * For RATIO, it will be 1/RATIO of the remaining limit
+ * (excluding data and prealloc meta) as threshold.
+ * For SIZE, it will be in byte unit as threshold.
+ */
+#define QGROUP_PERTRANS_RATIO		32
+#define QGROUP_PERTRANS_SIZE		SZ_32M
+static bool qgroup_check_limits(struct btrfs_fs_info *fs_info,
+				const struct btrfs_qgroup *qg, u64 num_bytes)
 {
+	u64 limit;
+	u64 threshold;
+
 	if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) &&
 	    qgroup_rsv_total(qg) + (s64)qg->rfer + num_bytes > qg->max_rfer)
 		return false;
@@ -2405,6 +2419,31 @@ static bool qgroup_check_limits(const struct btrfs_qgroup *qg, u64 num_bytes)
 	    qgroup_rsv_total(qg) + (s64)qg->excl + num_bytes > qg->max_excl)
 		return false;
 
+	/*
+	 * Even if we passed the check, it's better to check if reservation
+	 * for meta_pertrans is pushing us near limit.
+	 * If there is too much pertrans reservation or it's near the limit,
+	 * let's try commit transaction to free some, using transaction_kthread
+	 */
+	if ((qg->lim_flags & (BTRFS_QGROUP_LIMIT_MAX_RFER |
+			      BTRFS_QGROUP_LIMIT_MAX_EXCL))) {
+		if (qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_EXCL)
+			limit = qg->max_excl;
+		else
+			limit = qg->max_rfer;
+		threshold = (limit - qg->rsv.values[BTRFS_QGROUP_RSV_DATA] -
+			     qg->rsv.values[BTRFS_QGROUP_RSV_META_PREALLOC]) /
+			    QGROUP_PERTRANS_RATIO;
+		threshold = min_t(u64, threshold, QGROUP_PERTRANS_SIZE);
+
+		/*
+		 * Use transaction_kthread to commit transaction, so we no
+		 * longer need to bother nested transaction nor lock context.
+		 */
+		if (qg->rsv.values[BTRFS_QGROUP_RSV_META_PERTRANS] > threshold)
+			btrfs_commit_transaction_locksafe(fs_info);
+	}
+
 	return true;
 }
 
@@ -2454,7 +2493,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes, bool enforce,
 		qg = unode_aux_to_qgroup(unode);