Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Chris Murphy
On Thu, Mar 2, 2017 at 6:48 PM, Chris Murphy  wrote:

>
> Again, my data is fine. The problem I'm having is this:
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/filesystems/btrfs.txt?id=refs/tags/v4.10.1
>
> Which says in the first line, in part, "focusing on fault tolerance,
> repair and easy administration", and quite frankly, with this sort of
> enduring bug in a file system that's nearly 10 years old now, that
> claim is rendered misleading, and possibly dishonest. How do we describe this
> file system as focusing on fault tolerance when, in the identical
> scenario using mdadm or LVM raid, the user's data is not mishandled
> like it is on Btrfs with multiple devices?


I think until these problems are fixed, the Btrfs status page should
describe RAID 1 and 10 as mostly OK, with this problem as the reason
for it not being OK.

-- 
Chris Murphy


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-03 Thread J. Bruce Fields
On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> tl;dr: I think we can greatly reduce the cost of the inode->i_version
> counter, by exploiting the fact that we don't need to increment it
> if no one is looking at it. We can also clean up the code to prepare
> to eventually expose this value via statx().
> 
> The inode->i_version field is supposed to be a value that changes
> whenever there is any data or metadata change to the inode. Some
> filesystems use it internally to detect directory changes during
> readdir. knfsd will use it if the filesystem has MS_I_VERSION
> set. IMA will also use it (though it's not clear to me that that
> works 100% -- no check for MS_I_VERSION there).

I'm a little confused about the internal uses for directories.  Is it
necessarily kept on disk?  In cases where it's not, there's not the same
performance issue, right?  Is there any risk these patches make
performance slightly worse in that case?

> Only btrfs, ext4, and xfs implement it for data changes. Because of
> this, these filesystems must log the inode to disk whenever the
> i_version counter changes.

On those filesystems that's done for both files and directories, right?

--b.

> That has a non-zero performance impact,
> especially on write-heavy workloads, because we end up dirtying the
> inode metadata on every write, not just when the times change. [1]
> 
> It turns out though that none of these users of i_version require that
> i_version change on every change to the file. The only real requirement
> is that it be different if _something_ changed since the last time we
> queried for it. [2]
> 
> So, if we simply keep track of when something queries the value, we
> can avoid bumping the counter and that metadata update.
> 
> This patchset implements this:
> 
> It starts with some small cleanup patches to just remove any mention of
> the i_version counter in filesystems that don't actually use it.
> 
> Then, we add a new set of static inlines for managing the counter. The
> initial version should work identically to what we have now. Then, all
> of the remaining filesystems that use i_version are converted to the new
> inlines.
> 
> Once that's in place, we switch to a new implementation that allows us
> to track readers of i_version counter, and only bump it when it's
> necessary or convenient (i.e. we're going to disk anyway).
> 
> The final patch switches from a scheme that uses the i_lock to serialize
> the counter updates during write to an atomic64_t. That's a wash
> performance-wise in my testing, but I like not having to take the i_lock
> down where it could end up nested inside other locks.
> 
> With this, we reduce inode metadata updates across all 3 filesystems
> down to roughly the frequency of the timestamp granularity, particularly
> when it's not being queried (the vastly common case).
> 
> The pessimal workload here is 1 byte writes, and it helps that
> significantly. Of course, that's not a real-world workload.
> 
> A tiobench-example.fio workload also shows some modest performance
> gains, and I've gotten mails from the kernel test robot that show some
> significant performance gains on some microbenchmarks (case-msync-mt in
> the vm-scalability testsuite to be specific).
> 
> I'm happy to run other workloads if anyone can suggest them.
> 
> At this point, the patchset works and does what it's expected to do in
> my own testing. It seems like it's at least a modest performance win
> across all 3 major disk-based filesystems. It may also encourage others
> to implement i_version as well since it reduces that cost.
> 
> Is this an avenue that's worthwhile to pursue?
> 
> Note that I think we may have other changes coming in the future that
> will make this sort of cleanup necessary anyway. I'd like to plug in the
> Ceph change attribute here eventually, and that will require something
> like this anyway.
> 
> Thoughts, comments and suggestions are welcome...
> 
> ---
> 
> [1]: On ext4 it must be turned on with the i_version mount option,
>  mostly due to fears of incurring this impact, AFAICT.
> 
> [2]: NFS also recommends that it appear to increase in value over time, so
>  that clients can discard metadata updates that are older than ones
>  they've already seen.
> 
> Jeff Layton (30):
>   lustre: don't set f_version in ll_readdir
>   ecryptfs: remove unnecessary i_version bump
>   ceph: remove the bump of i_version
>   f2fs: don't bother setting i_version
>   hpfs: don't bother with the i_version counter
>   jfs: remove initialization of i_version counter
>   nilfs2: remove inode->i_version initialization
>   orangefs: remove initialization of i_version
>   reiserfs: remove unneeded i_version bump
>   ntfs: remove i_version handling
>   fs: new API for handling i_version
>   fat: convert to new i_version API
>   affs: convert to new i_version API
>   afs: convert to new i_version API
>   btrfs: convert to new i_version API
>   exofs: switch to new i_version API
>   
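
For readers following the scheme described in the cover letter above, here is a
minimal, self-contained sketch of the "only bump it when someone has queried it"
idea. It is not code from the series: the demo_* names are made up, and the
top-bit QUERIED flag merely echoes the INODE_I_VERSION_QUERIED flag that patch
30/30 (quoted further down in this digest) introduces; the exact flag handling,
wrapping, and the _raw/_locked variants of the real patches are all ignored here.

/*
 * Illustrative sketch only -- not the patchset's implementation.
 * Convention assumed here: a query sets the flag, a successful bump
 * clears it, so a bump is skipped when nothing was queried since the
 * last one.
 */
#include <linux/atomic.h>
#include <linux/types.h>

#define DEMO_I_VERSION_QUERIED	(1ULL << 63)

struct demo_inode {
	atomic64_t i_version;	/* counter in bits 0-62, flag in bit 63 */
};

/* Modification path: only pay for the bump (and the inode logging that
 * follows) if a reader has asked for the value since the last bump. */
static inline bool demo_inc_iversion(struct demo_inode *inode)
{
	u64 old, new, cur;

	for (;;) {
		old = atomic64_read(&inode->i_version);
		if (!(old & DEMO_I_VERSION_QUERIED))
			return false;			/* nobody looked; skip it */
		new = (old & ~DEMO_I_VERSION_QUERIED) + 1;
		cur = atomic64_cmpxchg(&inode->i_version, old, new);
		if (cur == old)
			return true;			/* caller must log the inode */
	}
}

/* Query path (e.g. a getattr): return the counter and mark it as seen. */
static inline u64 demo_get_iversion(struct demo_inode *inode)
{
	u64 old, cur;

	for (;;) {
		old = atomic64_read(&inode->i_version);
		if (old & DEMO_I_VERSION_QUERIED)
			return old & ~DEMO_I_VERSION_QUERIED;
		cur = atomic64_cmpxchg(&inode->i_version, old,
				       old | DEMO_I_VERSION_QUERIED);
		if (cur == old)
			return old;
	}
}

The cmpxchg loops are only there to show that reader and writer can race
without a spinlock; the actual series keeps an i_lock-based variant until the
final patch switches to an atomic64_t, as the cover letter explains.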

[PATCH] Btrfs: add file item tracepoint

2017-03-03 Thread Liu Bo
While debugging truncate problems, I found that these tracepoints could
help us quickly know what went wrong.

Two sets of tracepoints are created to track regular/prealloc file items
and inline file items respectively. I made inline a separate set since
inline file items care about far fewer fields than regular ones.

This adds four tracepoints:
- btrfs_get_extent_show_fi_regular
- btrfs_get_extent_show_fi_inline
- btrfs_truncate_show_fi_regular
- btrfs_truncate_show_fi_inline

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c |  15 +
 include/trace/events/btrfs.h | 133 +++
 2 files changed, 148 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5652f5f..8a26712 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4401,9 +4401,17 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
item_end +=
btrfs_file_extent_num_bytes(leaf, fi);
+
+   trace_btrfs_truncate_show_fi_regular(
+   BTRFS_I(inode), leaf, fi,
+   found_key.offset);
} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
item_end += btrfs_file_extent_inline_len(leaf,
 path->slots[0], fi);
+
+   trace_btrfs_truncate_show_fi_inline(
+   BTRFS_I(inode), leaf, fi, path->slots[0],
+   found_key.offset);
}
item_end--;
}
@@ -6814,11 +6822,18 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
found_type == BTRFS_FILE_EXTENT_PREALLOC) {
extent_end = extent_start +
   btrfs_file_extent_num_bytes(leaf, item);
+
+   trace_btrfs_get_extent_show_fi_regular(inode, leaf, item,
+  extent_start);
} else if (found_type == BTRFS_FILE_EXTENT_INLINE) {
size_t size;
size = btrfs_file_extent_inline_len(leaf, path->slots[0], item);
extent_end = ALIGN(extent_start + size,
   fs_info->sectorsize);
+
+   trace_btrfs_get_extent_show_fi_inline(inode, leaf, item,
+ path->slots[0],
+ extent_start);
}
 next:
if (start >= extent_end) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index a3c3cab..6353d0f 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -12,6 +12,7 @@ struct btrfs_root;
 struct btrfs_fs_info;
 struct btrfs_inode;
 struct extent_map;
+struct btrfs_file_extent_item;
 struct btrfs_ordered_extent;
 struct btrfs_delayed_ref_node;
 struct btrfs_delayed_tree_ref;
@@ -54,6 +55,12 @@ struct btrfs_qgroup_extent_record;
  (obj >= BTRFS_ROOT_TREE_OBJECTID &&   \
   obj <= BTRFS_QUOTA_TREE_OBJECTID)) ? __show_root_type(obj) : "-"
 
+#define show_fi_type(type) \
+   __print_symbolic(type,  \
+{ BTRFS_FILE_EXTENT_INLINE,"INLINE" }, \
+{ BTRFS_FILE_EXTENT_REG,   "REG"}, \
+{ BTRFS_FILE_EXTENT_PREALLOC,  "PREALLOC"})
+
 #define BTRFS_GROUP_FLAGS  \
{ BTRFS_BLOCK_GROUP_DATA,   "DATA"},\
{ BTRFS_BLOCK_GROUP_SYSTEM, "SYSTEM"},  \
@@ -232,6 +239,132 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
  __entry->refs, __entry->compress_type)
 );
 
+/* file extent item */
+DECLARE_EVENT_CLASS(btrfs__file_extent_item_regular,
+
+   TP_PROTO(struct btrfs_inode *bi, struct extent_buffer *l, struct btrfs_file_extent_item *fi, u64 start),
+
+   TP_ARGS(bi, l, fi, start),
+   
+   TP_STRUCT__entry_btrfs(
+   __field(u64,root_obj)
+   __field(u64,ino )
+   __field(loff_t, isize   )
+   __field(u64,disk_isize  )
+   __field(u64,num_bytes   )
+   __field(u64,ram_bytes   )
+   __field(u64,disk_bytenr )
+   __field(u64,disk_num_bytes  )
+   __field(u64,extent_offset   )
+   __field(u8, extent_type )
+   __field(u8, compression )
+   __field(u64,extent_start)
+   __field(u64, 

Re: [RFC PATCH v1 11/30] fs: new API for handling i_version

2017-03-03 Thread Jeff Layton
On Sat, 2017-03-04 at 10:55 +1100, NeilBrown wrote:
> On Wed, Dec 21 2016, Jeff Layton wrote:
> 
> > We already have inode_inc_iversion. Add inode_set_iversion,
> > inode_get_iversion, inode_cmp_iversion and inode_iversion_need_inc.
> 
> This list of added interfaces is incomplete.
> And some of these interfaces could really use some justification up
> front.
> 
> You later add a "force" parameter to inode_inc_version.
> Why not do that up front here?
> 

First, thanks to you and Bruce for having a look.

Yes, it's clear from this and the earlier emails that I didn't do
enough documentation and explanation. I'll plan to respin this when I
get a chance, and lay out the justification and discussion a bit more.

I'll also make sure not to change the API midstream like this when I
respin.

> > +
> > +/**
> > + * inode_get_iversion - read i_version for later use
> > + * @inode: inode from which i_version should be read
> > + *
> > + * Read the inode i_version counter. This should be used by callers that 
> > wish
> > + * to store the returned i_version for later comparison.
> > + */
> > +static inline u64
> > +inode_get_iversion(const struct inode *inode)
> > +{
> > +   return inode_get_iversion_raw(inode);
> > +}
> 
> I don't understand why this can use the _raw version rather than the
> _read version.
> Surely you need to know about any changes after this read.
> 

The approach here was to convert everything to a new API and have it
work much like the code works today and then to morph it into something
that only conditionally bumps the counter.

In the later implementation, yes you do need to know about changes
after the read, but in the initial implementation it doesn't matter
since the counter is bumped on every change anyway.

I'll try to do a better job laying out this rationale in follow-on
postings.
-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-03 Thread Jeff Layton
On Fri, 2017-03-03 at 18:00 -0500, J. Bruce Fields wrote:
> On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> > tl;dr: I think we can greatly reduce the cost of the inode->i_version
> > counter, by exploiting the fact that we don't need to increment it
> > if no one is looking at it. We can also clean up the code to prepare
> > to eventually expose this value via statx().
> > 
> > The inode->i_version field is supposed to be a value that changes
> > whenever there is any data or metadata change to the inode. Some
> > filesystems use it internally to detect directory changes during
> > readdir. knfsd will use it if the filesystem has MS_I_VERSION
> > set. IMA will also use it (though it's not clear to me that that
> > works 100% -- no check for MS_I_VERSION there).
> 
> I'm a little confused about the internal uses for directories.  Is it
> necessarily kept on disk?

No, they aren't necessarily stored on disk, and in fact they aren't on
most (maybe all?) of those filesystems. It's just a convenient place to
store a dir change value that is subsequently checked in readdir
operations.

That's part of the "fun" of untangling this mess. ;)

> In cases where it's not, there's not the same
> performance issue, right? 

Right, I don't think it's really a big deal for most of those and I'm
not terribly concerned about the i_version counter on directory change
operations. An extra atomic op on a directory change seems unlikely to
hurt anything.

The main purpose of this is to improve the situation with small writes.
This should also help pave the way for fs' like NFS and Ceph that
implement a version counter but may not necessarily bump it on every
change.

I think once we have things more consistent, we'll be able to consider
exposing the i_version counter to userland via statx.

> Is there any risk these patches make
> performance slightly worse in that case?
> 

Maybe, but I think that risk is pretty low. Note that I haven't measured
that specifically here, so I could be completely wrong.

If it is a problem, we could consider unioning this thing with a non-
atomic field for the directory change cases, but that would add some
complexity and I'm not sure it's worth it. I'd want to measure it first.

> > Only btrfs, ext4, and xfs implement it for data changes. Because of
> > this, these filesystems must log the inode to disk whenever the
> > i_version counter changes.
> 
> On those filesystems that's done for both files and directories, right?
> 

Yes.

> > That has a non-zero performance impact,
> > especially on write-heavy workloads, because we end up dirtying the
> > inode metadata on every write, not just when the times change. [1]
> > 
> > It turns out though that none of these users of i_version require that
> > i_version change on every change to the file. The only real requirement
> > is that it be different if _something_ changed since the last time we
> > queried for it. [2]
> > 
> > So, if we simply keep track of when something queries the value, we
> > can avoid bumping the counter and that metadata update.
> > 
> > This patchset implements this:
> > 
> > It starts with some small cleanup patches to just remove any mention of
> > the i_version counter in filesystems that don't actually use it.
> > 
> > Then, we add a new set of static inlines for managing the counter. The
> > initial version should work identically to what we have now. Then, all
> > of the remaining filesystems that use i_version are converted to the new
> > inlines.
> > 
> > Once that's in place, we switch to a new implementation that allows us
> > to track readers of i_version counter, and only bump it when it's
> > necessary or convenient (i.e. we're going to disk anyway).
> > 
> > The final patch switches from a scheme that uses the i_lock to serialize
> > the counter updates during write to an atomic64_t. That's a wash
> > performance-wise in my testing, but I like not having to take the i_lock
> > down where it could end up nested inside other locks.
> > 
> > With this, we reduce inode metadata updates across all 3 filesystems
> > down to roughly the frequency of the timestamp granularity, particularly
> > when it's not being queried (the vastly common case).
> > 
> > The pessimal workload here is 1 byte writes, and it helps that
> > significantly. Of course, that's not a real-world workload.
> > 
> > A tiobench-example.fio workload also shows some modest performance
> > gains, and I've gotten mails from the kernel test robot that show some
> > significant performance gains on some microbenchmarks (case-msync-mt in
> > the vm-scalability testsuite to be specific).
> > 
> > I'm happy to run other workloads if anyone can suggest them.
> > 
> > At this point, the patchset works and does what it's expected to do in
> > my own testing. It seems like it's at least a modest performance win
> > across all 3 major disk-based filesystems. It may also encourage others
> > to implement i_version as well since it reduces that 

Re: [RFC PATCH v1 29/30] fs: track whether the i_version has been queried with an i_state flag

2017-03-03 Thread Jeff Layton
On Sat, 2017-03-04 at 11:03 +1100, NeilBrown wrote:
> On Wed, Dec 21 2016, Jeff Layton wrote:
> 
> > @@ -2072,7 +2093,12 @@ inode_cmp_iversion(const struct inode *inode, const 
> > u64 old)
> >  static inline bool
> >  inode_iversion_need_inc(struct inode *inode)
> >  {
> > -   return true;
> > +   bool ret;
> > +
> > +   spin_lock(&inode->i_lock);
> > +   ret = inode->i_state & I_VERS_BUMP;
> > +   spin_unlock(&inode->i_lock);
> > +   return ret;
> >  }
> >  
> 
> I know this code gets removed, so this isn't really important.
> But why do you take the spinlock here?  What are you racing against?
> 
> Thanks,
> NeilBrown

I think I was worried about I_VERS_BUMP being set or cleared during an
increment or query. It is quite possible that that spinlock is not
necessary.
-- 
Jeff Layton 


Re: [RFC PATCH v1 29/30] fs: track whether the i_version has been queried with an i_state flag

2017-03-03 Thread NeilBrown
On Wed, Dec 21 2016, Jeff Layton wrote:

> @@ -2072,7 +2093,12 @@ inode_cmp_iversion(const struct inode *inode, const 
> u64 old)
>  static inline bool
>  inode_iversion_need_inc(struct inode *inode)
>  {
> - return true;
> + bool ret;
> +
> + spin_lock(&inode->i_lock);
> + ret = inode->i_state & I_VERS_BUMP;
> + spin_unlock(&inode->i_lock);
> + return ret;
>  }
>  

I know this code gets removed, so this isn't really important.
But why do you take the spinlock here?  What are you racing against?

Thanks,
NeilBrown




Re: [RFC PATCH v1 11/30] fs: new API for handling i_version

2017-03-03 Thread Jeff Layton
On Fri, 2017-03-03 at 17:36 -0500, J. Bruce Fields wrote:
> The patch ordering is a little annoying as I'd like to be able to be
> able to verify the implementation at the same time these new interfaces
> are added, but, I don't know, I don't have a better idea.
> 

Fair point. My thinking was "add new API, convert everything over to it,
and then convert the implementation to do something different". Maybe
that's the wrong approach?

> Anyway, various nits:
> 
> On Wed, Dec 21, 2016 at 12:03:28PM -0500, Jeff Layton wrote:
> > We already have inode_inc_iversion. Add inode_set_iversion,
> > inode_get_iversion, inode_cmp_iversion and inode_iversion_need_inc.
> > 
> > These should encompass how i_version is currently accessed and
> > manipulated so that we can later change the implementation.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  include/linux/fs.h | 109 ++---
> >  1 file changed, 104 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 2ba074328894..075c915fe2b1 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1962,18 +1962,117 @@ static inline void inode_dec_link_count(struct inode *inode)
> >  }
> >  
> >  /**
> > + * inode_set_iversion - set i_version to a particular value
> > + * @inode: inode to set
> > + * @new: new i_version value to set
> > + *
> > + * Set the inode's i_version. Should generally be called when initializing
> > + * a new inode.
> > + */
> > +static inline void
> > +inode_set_iversion(struct inode *inode, const u64 new)
> > +{
> > +   inode->i_version = new;
> > +}
> > +
> > +/**
> > + * inode_inc_iversion_locked - increment i_version while protected
> > + * @inode: inode to be updated
> > + *
> > + * Increment the i_version field in the inode. This version is usable
> > + * when there is some other sort of lock in play that would prevent
> > + * concurrent accessors.
> > + */
> > +static inline void
> > +inode_inc_iversion_locked(struct inode *inode)
> > +{
> > +   inode->i_version++;
> > +}
> > +
> > +/**
> > + * inode_set_iversion_read - set i_version to a particular value and flag
> > + *  set flag to indicate that it has been viewed
> 
> s/flag set flag/set flag/.
> 
> > + * @inode: inode to set
> > + * @new: new i_version value to set
> > + *
> > + * Set the inode's i_version, and flag the inode as if it has been viewed.
> > + * Should generally be called when initializinga new inode in memory from
> > + * from disk.
> 
> s/from from/from/.
> 
> > + */
> > +static inline void
> > +inode_set_iversion_read(struct inode *inode, const u64 new)
> > +{
> > +   inode_set_iversion(inode, new);
> > +}
> > +
> > +/**
> >   * inode_inc_iversion - increments i_version
> >   * @inode: inode that need to be updated
> >   *
> >   * Every time the inode is modified, the i_version field will be 
> > incremented.
> > - * The filesystem has to be mounted with i_version flag
> > + * The filesystem has to be mounted with MS_I_VERSION flag.
> 
> I'm not sure why that comment's there.  Filesystems can choose to
> increment i_version without requiring the mount option if they want,
> can't they?
> 

Yeah, it should probably be removed. It honestly doesn't make much sense
since we can end up bumping the thing regardless of whether it's set.

> > + */
> > +static inline bool
> 
> Document what the return value means?
> 

Will do.

> > +inode_inc_iversion(struct inode *inode)
> > +{
> > +   spin_lock(&inode->i_lock);
> > +   inode_inc_iversion_locked(inode);
> > +   spin_unlock(&inode->i_lock);
> > +   return true;
> > +}
> > +
> > +/**
> > + * inode_get_iversion_raw - read i_version without "perturbing" it
> > + * @inode: inode from which i_version should be read
> > + *
> > + * Read the inode i_version counter for an inode without registering it as 
> > a
> > + * read. Should typically be used by local filesystems that need to store 
> > an
> > + * i_version on disk.
> > + */
> > +static inline u64
> > +inode_get_iversion_raw(const struct inode *inode)
> > +{
> > +   return inode->i_version;
> > +}
> > +
> > +/**
> > + * inode_get_iversion - read i_version for later use
> > + * @inode: inode from which i_version should be read
> > + *
> > + * Read the inode i_version counter. This should be used by callers that 
> > wish
> > + * to store the returned i_version for later comparison.
> 
> I'm not sure what "for later comparison" means.  This is the "read"
> operation that actually flags the i_version as read, which you'd use
> (for example) to implement NFS getattr?  I wonder if there's a better
> way to say that.
> 

Yeah, I'm definitely open to better suggestions on naming.

> > + */
> > +static inline u64
> > +inode_get_iversion(const struct inode *inode)
> > +{
> > +   return inode_get_iversion_raw(inode);
> > +}
> > +
> > +/**
> > + * inode_cmp_iversion - check whether the i_version counter has changed
> > + * @inode: 

Re: [RFC PATCH v1 30/30] fs: convert i_version counter over to an atomic64_t

2017-03-03 Thread NeilBrown
On Wed, Dec 21 2016, Jeff Layton wrote:

>  
> +/*
> + * We borrow the top bit in the i_version to use as a flag to tell us whether
> + * it has been queried since we last bumped it. If it has, then we must bump
> + * it and set the flag. Note that this means that we have to handle wrapping
> + * manually.
> + */
> +#define INODE_I_VERSION_QUERIED  (1ULL<<63)
> +

I would prefer that the least significant bit were used, rather than the
most significant.

Partly, this is because the "queried" state is less significant than
that "number has changed" state.
But most, this would mean we wouldn't need inode_cmp_iversion() at all.
We could just use "<" or ">=" or whatever.
The number returned by inode_get_iversion() would always be even (or
maybe odd) and wrapping (after the end of time) would "just work".

Thanks,
NeilBrown
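
To make the trade-off concrete, here is a small sketch of the
least-significant-bit placement Neil describes. The names are hypothetical and
this is not code from the series; it only illustrates why, with the flag in bit
0, the counter occupies the upper 63 bits, "odd value plus one" both clears the
flag and advances the counter, and samples can be compared with ordinary
arithmetic so no dedicated inode_cmp_iversion() helper is needed.

/* Sketch of the LSB-flag alternative; hypothetical names, not series code. */
#include <linux/atomic.h>
#include <linux/types.h>

#define DEMO_QUERIED	1ULL		/* bit 0: value has been read */

/* Query: set the flag and report the value with the flag masked off,
 * so callers always see an even number. */
static inline u64 demo_get_iversion_lsb(atomic64_t *v)
{
	u64 old = atomic64_read(v);

	while (!(old & DEMO_QUERIED)) {
		u64 cur = atomic64_cmpxchg(v, old, old | DEMO_QUERIED);
		if (cur == old)
			break;
		old = cur;
	}
	return old & ~DEMO_QUERIED;
}

/* Bump: if the value is odd (queried), adding 1 clears the flag and
 * increments the counter held in bits 63..1 in one go; a wrap of the
 * full 64-bit word needs no special casing. */
static inline bool demo_inc_iversion_lsb(atomic64_t *v)
{
	u64 old, cur;

	for (;;) {
		old = atomic64_read(v);
		if (!(old & DEMO_QUERIED))
			return false;		/* not queried since last bump */
		cur = atomic64_cmpxchg(v, old, old + 1);
		if (cur == old)
			return true;
	}
}

/* "Did it change since I sampled it?" is then plain arithmetic on the
 * even samples, which also stays meaningful across a (far-future) wrap. */
static inline bool demo_iversion_is_newer(u64 now, u64 then)
{
	return (s64)(now - then) > 0;
}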




Re: [RFC PATCH v1 11/30] fs: new API for handling i_version

2017-03-03 Thread NeilBrown
On Wed, Dec 21 2016, Jeff Layton wrote:

> We already have inode_inc_iversion. Add inode_set_iversion,
> inode_get_iversion, inode_cmp_iversion and inode_iversion_need_inc.

This list of added interfaces is incomplete.
And some of these interfaces could really use some justification up
front.

You later add a "force" parameter to inode_inc_version.
Why not do that up front here?

> +
> +/**
> + * inode_get_iversion - read i_version for later use
> + * @inode: inode from which i_version should be read
> + *
> + * Read the inode i_version counter. This should be used by callers that wish
> + * to store the returned i_version for later comparison.
> + */
> +static inline u64
> +inode_get_iversion(const struct inode *inode)
> +{
> + return inode_get_iversion_raw(inode);
> +}

I don't understand why this can use the _raw version rather than the
_read version.
Surely you need to know about any changes after this read.

Thanks,
NeilBrown




Re: [RFC PATCH v1 11/30] fs: new API for handling i_version

2017-03-03 Thread J. Bruce Fields
The patch ordering is a little annoying as I'd like to be able to be
able to verify the implementation at the same time these new interfaces
are added, but, I don't know, I don't have a better idea.

Anyway, various nits:

On Wed, Dec 21, 2016 at 12:03:28PM -0500, Jeff Layton wrote:
> We already have inode_inc_iversion. Add inode_set_iversion,
> inode_get_iversion, inode_cmp_iversion and inode_iversion_need_inc.
> 
> These should encompass how i_version is currently accessed and
> manipulated so that we can later change the implementation.
> 
> Signed-off-by: Jeff Layton 
> ---
>  include/linux/fs.h | 109 ++---
>  1 file changed, 104 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2ba074328894..075c915fe2b1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1962,18 +1962,117 @@ static inline void inode_dec_link_count(struct inode *inode)
>  }
>  
>  /**
> + * inode_set_iversion - set i_version to a particular value
> + * @inode: inode to set
> + * @new: new i_version value to set
> + *
> + * Set the inode's i_version. Should generally be called when initializing
> + * a new inode.
> + */
> +static inline void
> +inode_set_iversion(struct inode *inode, const u64 new)
> +{
> + inode->i_version = new;
> +}
> +
> +/**
> + * inode_inc_iversion_locked - increment i_version while protected
> + * @inode: inode to be updated
> + *
> + * Increment the i_version field in the inode. This version is usable
> + * when there is some other sort of lock in play that would prevent
> + * concurrent accessors.
> + */
> +static inline void
> +inode_inc_iversion_locked(struct inode *inode)
> +{
> + inode->i_version++;
> +}
> +
> +/**
> + * inode_set_iversion_read - set i_version to a particular value and flag
> + *set flag to indicate that it has been viewed

s/flag set flag/set flag/.

> + * @inode: inode to set
> + * @new: new i_version value to set
> + *
> + * Set the inode's i_version, and flag the inode as if it has been viewed.
> + * Should generally be called when initializinga new inode in memory from
> + * from disk.

s/from from/from/.

> + */
> +static inline void
> +inode_set_iversion_read(struct inode *inode, const u64 new)
> +{
> + inode_set_iversion(inode, new);
> +}
> +
> +/**
>   * inode_inc_iversion - increments i_version
>   * @inode: inode that need to be updated
>   *
>   * Every time the inode is modified, the i_version field will be incremented.
> - * The filesystem has to be mounted with i_version flag
> + * The filesystem has to be mounted with MS_I_VERSION flag.

I'm not sure why that comment's there.  Filesystems can choose to
increment i_version without requiring the mount option if they want,
can't they?

> + */
> +static inline bool

Document what the return value means?

> +inode_inc_iversion(struct inode *inode)
> +{
> + spin_lock(&inode->i_lock);
> + inode_inc_iversion_locked(inode);
> + spin_unlock(&inode->i_lock);
> + return true;
> +}
> +
> +/**
> + * inode_get_iversion_raw - read i_version without "perturbing" it
> + * @inode: inode from which i_version should be read
> + *
> + * Read the inode i_version counter for an inode without registering it as a
> + * read. Should typically be used by local filesystems that need to store an
> + * i_version on disk.
> + */
> +static inline u64
> +inode_get_iversion_raw(const struct inode *inode)
> +{
> + return inode->i_version;
> +}
> +
> +/**
> + * inode_get_iversion - read i_version for later use
> + * @inode: inode from which i_version should be read
> + *
> + * Read the inode i_version counter. This should be used by callers that wish
> + * to store the returned i_version for later comparison.

I'm not sure what "for later comparison" means.  This is the "read"
operation that actually flags the i_version as read, which you'd use
(for example) to implement NFS getattr?  I wonder if there's a better
way to say that.

> + */
> +static inline u64
> +inode_get_iversion(const struct inode *inode)
> +{
> + return inode_get_iversion_raw(inode);
> +}
> +
> +/**
> + * inode_cmp_iversion - check whether the i_version counter has changed
> + * @inode: inode to check
> + * @old: old value to check against its i_version
> + *
> + * Compare an i_version counter with a previous one. Returns 0 if they are
> + * the same or non-zero if they are different.

Does this flag the i_version as read?  What's it for?  (OK, I guess I'll
find out after a few more patches...).

--b.

>   */
> +static inline s64
> +inode_cmp_iversion(const struct inode *inode, const u64 old)
> +{
> + return (s64)inode->i_version - (s64)old;
> +}
>  
> -static inline void inode_inc_iversion(struct inode *inode)
> +/**
> + * inode_iversion_need_inc - is the i_version in need of being incremented?
> + * @inode: inode to check
> + *
> + * Returns whether the inode->i_version counter needs incrementing on the 
> 
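
For what "store the returned i_version for later comparison" might look like in
practice, here is a hedged usage sketch built only on the helpers quoted above.
Since those helpers come from this unmerged series rather than mainline
linux/fs.h, the include is notional, and the consumer (a cache-validity check
roughly in the spirit of what knfsd or IMA would do, per the cover letter) is
hypothetical.

/* Usage sketch against the proposed API; illustrative only. */
#include <linux/fs.h>		/* assumes the series' inode_*_iversion helpers */
#include <linux/types.h>

struct demo_cache_entry {
	u64 change_attr;	/* i_version sampled when the cache was filled */
};

static void demo_fill_cache(struct demo_cache_entry *ce, struct inode *inode)
{
	/* The "read" that should flag the counter as queried, so that any
	 * later modification is forced to bump it. */
	ce->change_attr = inode_get_iversion(inode);
}

static bool demo_cache_still_valid(const struct demo_cache_entry *ce,
				   const struct inode *inode)
{
	/* inode_cmp_iversion() returns 0 when nothing has changed since
	 * the value was sampled above. */
	return inode_cmp_iversion(inode, ce->change_attr) == 0;
}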

Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Kai Krakow
Am Fri, 3 Mar 2017 07:19:06 -0500
schrieb "Austin S. Hemmelgarn" :

> On 2017-03-03 00:56, Kai Krakow wrote:
> > Am Thu, 2 Mar 2017 11:37:53 +0100
> > schrieb Adam Borowski :
> >  
> >> On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote:  
>  [...]  
> >>
> >> Well, there's Qu's patch at:
> >> https://www.spinics.net/lists/linux-btrfs/msg47283.html
> >> but it doesn't apply cleanly nor is easy to rebase to current
> >> kernels. 
>  [...]  
> >>
> >> Well, yeah.  The current check is naive and wrong.  It does have a
> >> purpose, just fails in this, very common, case.  
> >
> > I guess the reasoning behind this is: Creating any more chunks on
> > this drive will make raid1 chunks with only one copy. Adding
> > another drive later will not replay the copies without user
> > interaction. Is that true?
> >
> > If yes, this may leave you with a mixed case of having a raid1 drive
> > with some chunks not mirrored and some mirrored. When the other
> > drive goes missing later, you are losing data or even the whole
> > filesystem although you were left with the (wrong) impression of
> > having a mirrored drive setup...
> >
> > Is this how it works?
> >
> > If yes, a real patch would also need to replay the missing copies
> > after adding a new drive.
> >  
> The problem is that that would use some serious disk bandwidth
> without user intervention.  The way from userspace to fix this is to
> scrub the FS.  It would essentially be the same from kernel space,
> which means that if you had a multi-TB FS and this happened, you'd be
> running at below capacity in terms of bandwidth for quite some time.
> If this were to be implemented, it would have to be keyed off of the
> per-chunk degraded check (so that _only_ the chunks that need it get
> touched), and there would need to be a switch to disable it.

Well, I'd expect that a replaced drive would involve reduced bandwidth
for a while. Every traditional RAID does this. The key solution there
is that you can limit bandwidth and/or define priorities (BG rebuild
rate).

Btrfs OTOH could be a lot smarter, only rebuilding chunks that are
affected. The kernel can already do IO priorities and some sort of
bandwidth limiting should also be possible. I think IO throttling is
already implemented in the kernel somewhere (at least with 4.10) and
also in btrfs. So the basics are there.

In a RAID setup, performance should never have priority over redundancy
by default.

If performance is an important factor, I suggest working with SSD
writeback caches. This is already possible with different kernel
techniques like mdcache or bcache. Proper hardware controllers also
support this in hardware. It's cheap to have a mirrored SSD
writeback cache of 1TB or so if your setup already contains a multiple
terabytes array. Such a setup has huge performance benefits in setups
we deploy (tho, not btrfs related).

Also, adding/replacing a drive is usually not a totally unplanned
event. Except for hot spares, a missing drive will be replaced at the
time you arrive on-site. If performance is a factor, this can be done
at the same time as manually starting the process. So why shouldn't it be
done automatically?

-- 
Regards,
Kai

Replies to list-only preferred.



Re: [PATCH] Btrfs: fix file corruption after cloning inline extents

2017-03-03 Thread Liu Bo
On Fri, Mar 03, 2017 at 03:36:36PM +, Filipe Manana wrote:
> On Fri, Mar 3, 2017 at 12:36 AM, Liu Bo  wrote:
> > On Thu, Mar 02, 2017 at 02:18:21PM -0800, Liu Bo wrote:
> >> On Tue, Jul 14, 2015 at 04:34:48PM +0100, fdman...@kernel.org wrote:
> >> > From: Filipe Manana 
> >> >
> >> > Using the clone ioctl (or extent_same ioctl, which calls the same extent
> >> > cloning function as well) we end up allowing copying an inline extent from
> >> > the source file into a non-zero offset of the destination file. This is
> >> > something not expected and that the btrfs code is not prepared to deal
> >> > with - all inline extents must be at a file offset equal to 0.
> >> >
> >>
> >> Somehow I failed to reproduce the BUG_ON with this case.
> >>
> >> > For example, the following excerpt of a test case for fstests triggers
> >> > a crash/BUG_ON() on a write operation after an inline extent is cloned
> >> > into a non-zero offset:
> >> >
> >> >   _scratch_mkfs >>$seqres.full 2>&1
> >> >   _scratch_mount
> >> >
> >> >   # Create our test files. File foo has the same 2K of data at offset 4K
> >> >   # as file bar has at its offset 0.
> >> >   $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \
> >> >   -c "pwrite -S 0xbb 4k 2K" \
> >> >   -c "pwrite -S 0xcc 8K 4K" \
> >> >   $SCRATCH_MNT/foo | _filter_xfs_io
> >> >
> >> >   # File bar consists of a single inline extent (2K size).
> >> >   $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \
> >> >  $SCRATCH_MNT/bar | _filter_xfs_io
> >> >
> >> >   # Now call the clone ioctl to clone the extent of file bar into file
> >> >   # foo at its offset 4K. This made file foo have an inline extent at
> >> >   # offset 4K, something which the btrfs code can not deal with in future
> >> >   # IO operations because all inline extents are supposed to start at an
> >> >   # offset of 0, resulting in all sorts of chaos.
> >> >   # So here we validate that clone ioctl returns an EOPNOTSUPP, which is
> >> >   # what it returns for other cases dealing with inlined extents.
> >> >   $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \
> >> >   $SCRATCH_MNT/bar $SCRATCH_MNT/foo
> >> >
> >> >   # Because of the inline extent at offset 4K, the following write made
> >> >   # the kernel crash with a BUG_ON().
> >> >   $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io
> >> >
> >>
> >> On 4.10, after allowing to clone an inline extent to dst file's offset 
> >> greater
> >> than zero, I followed the test case manually and got these
> >>
> >> [root@localhost trinity]# /home/btrfs-progs/btrfs-debugfs -f /mnt/btrfs/foo
> >> (257 0): ram 4096 disk 12648448 disk_size 4096
> >> (257 4096): ram 2048 disk 0 disk_size 2048 -- inline
> >> (257 8192): ram 4096 disk 12656640 disk_size 4096
> >> file: /mnt/btrfs/foo extents 3 disk size 10240 logical size 12288 ratio 1.20
> >>
> >> [root@localhost trinity]# xfs_io -f -c "pwrite 6k 2k" /mnt/btrfs/foo
> >> wrote 2048/2048 bytes at offset 6144
> >> 2 KiB, 1 ops; 0. sec (12.520 MiB/sec and 6410.2564 ops/sec)
> >>
> >> [root@localhost trinity]# sync
> >> [root@localhost trinity]# /home/btrfs-progs/btrfs-debugfs -f /mnt/btrfs/foo
> >> (257 0): ram 4096 disk 12648448 disk_size 4096
> >> (257 4096): ram 4096 disk 12582912 disk_size 4096
> >> (257 8192): ram 4096 disk 12656640 disk_size 4096
> >> file: /mnt/btrfs/foo extents 3 disk size 12288 logical size 12288 ratio 1.00
> >>
> >>
> >> Looks like we now are able to cope with these inline extents?
> >
> > I went back to test against v4.1 and v4.5, and it turns out that we got the
> > below BUG_ON() in MM and -EIO when writing to the inline extent, because
> > when writing to the page that covers the inline extent, it first reads the
> > page to get an uptodate page for writing; in readpage(), for an inline
> > extent, btrfs_get_extent() always searches the fs tree to read the inline
> > data out to the page and then tries to insert an em, and -EEXIST would be
> > returned if there is an existing one.
> >
> > However, after commit 8dff9c853410 ("Btrfs: deal with duplciates during
> > extent_map insertion in btrfs_get_extent"), we have that fixed, so now we can
> > read/write inline extents even if they've been mixed with other regular
> > extents.
> >
> > But...I'm not 100% sure whether such files (with mixing inline with regular)
> > would have any problems other than read/write.  Let me know if you could
> > think of a corruption due to that.
> 
> Without thinking too much and after doing some quick tests that passed
> successfully, I'm not seeing where things can go wrong.
> However it's odd to have a mix of inline and non-inline extents, since
> the only cases where we create inline extents are for zero offsets with
> sizes smaller than page_size. I am not entirely sure if, even
> after the side effects of that commit, it would be safe to allow clone
> operation to leave inline 

Re: assertion failed: last_size == new_size, file: fs/btrfs/inode.c

2017-03-03 Thread Dave Jones
On Thu, Mar 02, 2017 at 06:04:33PM -0800, Liu Bo wrote:
 > On Thu, Mar 02, 2017 at 07:58:01AM -0800, Liu Bo wrote:
 > > On Wed, Mar 01, 2017 at 03:03:19PM -0500, Dave Jones wrote:
 > > > On Tue, Feb 28, 2017 at 05:12:01PM -0800, Liu Bo wrote:
 > > >  > On Mon, Feb 27, 2017 at 11:23:42AM -0500, Dave Jones wrote:
 > > >  > > On Mon, Feb 27, 2017 at 07:53:48AM -0800, Liu Bo wrote:
 > > >  > >  > On Sun, Feb 26, 2017 at 07:18:42PM -0500, Dave Jones wrote:
 > > >  > >  > > Hitting this fairly frequently.. I'm not sure if this is the 
 > > > same bug I've
 > > >  > >  > > been hitting occasionally since 4.9. The assertion looks new 
 > > > to me at least.
 > > >  > >  > >
 > > >  > >  > 
 > > >  > >  > It was recently introduced by my commit and used to catch data 
 > > > loss at truncate.
 > > >  > >  > 
 > > >  > >  > Were you running the test with a mkfs.btrfs -O NO_HOLES?
 > > >  > >  > (We just queued a fix for the NO_HOLES case in btrfs-next.)
 > > >  > > 
 > > >  > > No, a fs created with default mkfs.btrfs options.
 > > >  > 
 > > >  > I have this patch[1] to fix a bug which results in file hole extent, 
 > > > and this
 > > >  > bug could lead us to hit the assertion.
 > > >  > 
 > > >  > Would you try to run the test w/ it, please?
 > > >  > 
 > > >  > [1]: https://patchwork.kernel.org/patch/9597281/
 > > > 
 > > > Made no difference. Still see the same trace & assertion.
 > > 
 > > Some updates here, I've got it reproduced, somehow a corner case ends up
 > > with an inline file extent followed by some pre-alloc extents, along the
 > > way, isize also got updated unexpectedly.  Will try to narrow it down.
 > >
 > 
 > I realized that btrfs now could tolerate files that mix inline extents with
 > regular extents, so we don't need this ASSERT() anymore, will send a patch to
 > remove it.

note that as well as the assertion trigger, it's always immediately
followed by kernel BUG at fs/btrfs/ctree.h:3422!

Dave



Re: assertion failed: last_size == new_size, file: fs/btrfs/inode.c

2017-03-03 Thread Liu Bo
On Fri, Mar 03, 2017 at 01:04:39PM -0500, Dave Jones wrote:
> On Thu, Mar 02, 2017 at 06:04:33PM -0800, Liu Bo wrote:
>  > On Thu, Mar 02, 2017 at 07:58:01AM -0800, Liu Bo wrote:
>  > > On Wed, Mar 01, 2017 at 03:03:19PM -0500, Dave Jones wrote:
>  > > > On Tue, Feb 28, 2017 at 05:12:01PM -0800, Liu Bo wrote:
>  > > >  > On Mon, Feb 27, 2017 at 11:23:42AM -0500, Dave Jones wrote:
>  > > >  > > On Mon, Feb 27, 2017 at 07:53:48AM -0800, Liu Bo wrote:
>  > > >  > >  > On Sun, Feb 26, 2017 at 07:18:42PM -0500, Dave Jones wrote:
>  > > >  > >  > > Hitting this fairly frequently.. I'm not sure if this is the 
> same bug I've
>  > > >  > >  > > been hitting occasionally since 4.9. The assertion looks new 
> to me at least.
>  > > >  > >  > >
>  > > >  > >  > 
>  > > >  > >  > It was recently introduced by my commit and used to catch data 
> loss at truncate.
>  > > >  > >  > 
>  > > >  > >  > Were you running the test with a mkfs.btrfs -O NO_HOLES?
>  > > >  > >  > (We just queued a fix for the NO_HOLES case in btrfs-next.)
>  > > >  > > 
>  > > >  > > No, a fs created with default mkfs.btrfs options.
>  > > >  > 
>  > > >  > I have this patch[1] to fix a bug which results in file hole 
> extent, and this
>  > > >  > bug could lead us to hit the assertion.
>  > > >  > 
>  > > >  > Would you try to run the test w/ it, please?
>  > > >  > 
>  > > >  > [1]: https://patchwork.kernel.org/patch/9597281/
>  > > > 
>  > > > Made no difference. Still see the same trace & assertion.
>  > > 
>  > > Some updates here, I've got it reproduced, somehow a corner case ends up
>  > > with an inline file extent followed by some pre-alloc extents, along the
>  > > way, isize also got updated unexpectedly.  Will try to narrow it down.
>  > >
>  > 
>  > I realized that btrfs now could tolerate files that mix inline extents with
>  > regular extents, so we don't need this ASSERT() anymore, will send a patch 
> to
>  > remove it.
> 
> note that as well as the assertion trigger, it's always immediately
> followed by kernel BUG at fs/btrfs/ctree.h:3422!

I think it's because we do ASSERT() like,

#ifdef CONFIG_BTRFS_ASSERT

__cold
static inline void assfail(char *expr, char *file, int line)
{
pr_err("assertion failed: %s, file: %s, line: %d\n",
   expr, file, line);
BUG();
}

#define ASSERT(expr)\
(likely(expr) ? (void)0 : assfail(#expr, __FILE__, __LINE__))
#else
#define ASSERT(expr)((void)0)
#endif


Thanks,

-liubo


[PATCH v2] btrfs-progs: report I/O errors when closing the filesystem

2017-03-03 Thread Omar Sandoval
From: Omar Sandoval 

If the final fsync() on the Btrfs device fails, we just swallow the
error and don't alert the user in any way. This was uncovered by xfstest
generic/405, which checks that mkfs fails when it encounters EIO.

Signed-off-by: Omar Sandoval 
---
Changes from v1: return -errno instead of -1

 disk-io.c | 4 ++--
 volumes.c | 9 +++--
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index 2a94d4fc..48fb3c6b 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -1817,10 +1817,10 @@ int close_ctree_fs_info(struct btrfs_fs_info *fs_info)
free_fs_roots_tree(&fs_info->fs_root_tree);
 
btrfs_release_all_roots(fs_info);
-   btrfs_close_devices(fs_info->fs_devices);
+   ret = btrfs_close_devices(fs_info->fs_devices);
btrfs_cleanup_all_caches(fs_info);
btrfs_free_fs_info(fs_info);
-   return 0;
+   return ret;
 }
 
 int clean_tree_block(struct btrfs_trans_handle *trans, struct btrfs_root *root,
diff --git a/volumes.c b/volumes.c
index a0a85edd..f7d82d51 100644
--- a/volumes.c
+++ b/volumes.c
@@ -160,6 +160,7 @@ int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
 {
struct btrfs_fs_devices *seed_devices;
struct btrfs_device *device;
+   int ret = 0;
 
 again:
if (!fs_devices)
@@ -168,7 +169,11 @@ again:
device = list_entry(fs_devices->devices.next,
struct btrfs_device, dev_list);
if (device->fd != -1) {
-   fsync(device->fd);
+   if (fsync(device->fd) == -1) {
+   warning("fsync on device %llu failed: %s",
+   device->devid, strerror(errno));
+   ret = -errno;
+   }
if (posix_fadvise(device->fd, 0, 0, POSIX_FADV_DONTNEED))
	fprintf(stderr, "Warning, could not drop caches\n");
close(device->fd);
@@ -197,7 +202,7 @@ again:
free(fs_devices);
}
 
-   return 0;
+   return ret;
 }
 
 void btrfs_close_all_devices(void)
-- 
2.12.0



[PATCH] btrfs-progs: report I/O errors when closing the filesystem

2017-03-03 Thread Omar Sandoval
From: Omar Sandoval 

If the final fsync() on the Btrfs device fails, we just swallow the
error and don't alert the user in any way. This was uncovered by xfstest
generic/405, which checks that mkfs fails when it encounters EIO.

Signed-off-by: Omar Sandoval 
---
 disk-io.c | 4 ++--
 volumes.c | 9 +++--
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index 2a94d4fc..48fb3c6b 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -1817,10 +1817,10 @@ int close_ctree_fs_info(struct btrfs_fs_info *fs_info)
free_fs_roots_tree(&fs_info->fs_root_tree);
 
btrfs_release_all_roots(fs_info);
-   btrfs_close_devices(fs_info->fs_devices);
+   ret = btrfs_close_devices(fs_info->fs_devices);
btrfs_cleanup_all_caches(fs_info);
btrfs_free_fs_info(fs_info);
-   return 0;
+   return ret;
 }
 
 int clean_tree_block(struct btrfs_trans_handle *trans, struct btrfs_root *root,
diff --git a/volumes.c b/volumes.c
index a0a85edd..89335d35 100644
--- a/volumes.c
+++ b/volumes.c
@@ -160,6 +160,7 @@ int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
 {
struct btrfs_fs_devices *seed_devices;
struct btrfs_device *device;
+   int ret = 0;
 
 again:
if (!fs_devices)
@@ -168,7 +169,11 @@ again:
device = list_entry(fs_devices->devices.next,
struct btrfs_device, dev_list);
if (device->fd != -1) {
-   fsync(device->fd);
+   if (fsync(device->fd) == -1) {
+   warning("fsync on device %llu failed: %s",
+   device->devid, strerror(errno));
+   ret = -1;
+   }
if (posix_fadvise(device->fd, 0, 0, POSIX_FADV_DONTNEED))
	fprintf(stderr, "Warning, could not drop caches\n");
close(device->fd);
@@ -197,7 +202,7 @@ again:
free(fs_devices);
}
 
-   return 0;
+   return ret;
 }
 
 void btrfs_close_all_devices(void)
-- 
2.12.0



Re: [PATCH] Btrfs: fix file corruption after cloning inline extents

2017-03-03 Thread Filipe Manana
On Fri, Mar 3, 2017 at 12:36 AM, Liu Bo  wrote:
> On Thu, Mar 02, 2017 at 02:18:21PM -0800, Liu Bo wrote:
>> On Tue, Jul 14, 2015 at 04:34:48PM +0100, fdman...@kernel.org wrote:
>> > From: Filipe Manana 
>> >
>> > Using the clone ioctl (or extent_same ioctl, which calls the same extent
>> > cloning function as well) we end up allowing copying an inline extent from
>> > the source file into a non-zero offset of the destination file. This is
>> > something not expected and that the btrfs code is not prepared to deal
>> > with - all inline extents must be at a file offset equal to 0.
>> >
>>
>> Somehow I failed to reproduce the BUG_ON with this case.
>>
>> > For example, the following excerpt of a test case for fstests triggers
>> > a crash/BUG_ON() on a write operation after an inline extent is cloned
>> > into a non-zero offset:
>> >
>> >   _scratch_mkfs >>$seqres.full 2>&1
>> >   _scratch_mount
>> >
>> >   # Create our test files. File foo has the same 2K of data at offset 4K
>> >   # as file bar has at its offset 0.
>> >   $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \
>> >   -c "pwrite -S 0xbb 4k 2K" \
>> >   -c "pwrite -S 0xcc 8K 4K" \
>> >   $SCRATCH_MNT/foo | _filter_xfs_io
>> >
>> >   # File bar consists of a single inline extent (2K size).
>> >   $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \
>> >  $SCRATCH_MNT/bar | _filter_xfs_io
>> >
>> >   # Now call the clone ioctl to clone the extent of file bar into file
>> >   # foo at its offset 4K. This made file foo have an inline extent at
>> >   # offset 4K, something which the btrfs code can not deal with in future
>> >   # IO operations because all inline extents are supposed to start at an
>> >   # offset of 0, resulting in all sorts of chaos.
>> >   # So here we validate that clone ioctl returns an EOPNOTSUPP, which is
>> >   # what it returns for other cases dealing with inlined extents.
>> >   $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \
>> >   $SCRATCH_MNT/bar $SCRATCH_MNT/foo
>> >
>> >   # Because of the inline extent at offset 4K, the following write made
>> >   # the kernel crash with a BUG_ON().
>> >   $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io
>> >
>>
>> On 4.10, after allowing to clone an inline extent to dst file's offset 
>> greater
>> than zero, I followed the test case manually and got these
>>
>> [root@localhost trinity]# /home/btrfs-progs/btrfs-debugfs -f /mnt/btrfs/foo
>> (257 0): ram 4096 disk 12648448 disk_size 4096
>> (257 4096): ram 2048 disk 0 disk_size 2048 -- inline
>> (257 8192): ram 4096 disk 12656640 disk_size 4096
>> file: /mnt/btrfs/foo extents 3 disk size 10240 logical size 12288 ratio 1.20
>>
>> [root@localhost trinity]# xfs_io -f -c "pwrite 6k 2k" /mnt/btrfs/foo
>> wrote 2048/2048 bytes at offset 6144
>> 2 KiB, 1 ops; 0. sec (12.520 MiB/sec and 6410.2564 ops/sec)
>>
>> [root@localhost trinity]# sync
>> [root@localhost trinity]# /home/btrfs-progs/btrfs-debugfs -f /mnt/btrfs/foo
>> (257 0): ram 4096 disk 12648448 disk_size 4096
>> (257 4096): ram 4096 disk 12582912 disk_size 4096
>> (257 8192): ram 4096 disk 12656640 disk_size 4096
>> file: /mnt/btrfs/foo extents 3 disk size 12288 logical size 12288 ratio 1.00
>>
>>
>> Looks like we now are able to cope with these inline extents?
>
> I went back to test against v4.1 and v4.5, and it turns out that we got the
> below BUG_ON() in MM and -EIO when writing to the inline extent, because when
> writing to the page that covers the inline extent, it first reads the page to
> get an uptodate page for writing; in readpage(), for an inline extent,
> btrfs_get_extent() always searches the fs tree to read the inline data out to
> the page and then tries to insert an em, and -EEXIST would be returned if
> there is an existing one.
>
> However, after commit 8dff9c853410 ("Btrfs: deal with duplciates during
> extent_map insertion in btrfs_get_extent"), we have that fixed, so now we can
> read/write inline extents even if they've been mixed with other regular extents.
>
> But...I'm not 100% sure whether such files (with mixing inline with regular)
> would have any problems other than read/write.  Let me know if you
> could
> think of a corruption due to that.

Without thinking too much and after doing some quick tests that passed
successfully, I'm not seeing where things can go wrong.
However it's odd to have a mix of inline and non-inline extents, since
the only cases where we create inline extents are for zero offsets with
sizes smaller than page_size. I am not entirely sure if, even
after the side effects of that commit, it would be safe to allow clone
operation to leave inline extents at the destination like this. A lot
more testing and verification should be done.

thanks

>
> Thanks,
>
> -liubo
>>
>>
>> Thanks,
>>
>> -liubo
>>
>>
>> >   status=0
>> >   exit
>> >
>> > The stack trace of the BUG_ON() triggered by the last write is:
>> >
>> >   [152154.035903] 

Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Austin S. Hemmelgarn

On 2017-03-03 00:56, Kai Krakow wrote:

Am Thu, 2 Mar 2017 11:37:53 +0100
schrieb Adam Borowski :


On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote:

[1717713.408675] BTRFS warning (device dm-8): missing devices (1)
exceeds the limit (0), writeable mount is not allowed
[1717713.446453] BTRFS error (device dm-8): open_ctree failed

[chris@f25s ~]$ uname -r
4.9.8-200.fc25.x86_64

I thought this was fixed. I'm still getting a one time degraded rw
mount, after that it's no longer allowed, which really doesn't make
any sense because those single chunks are on the drive I'm trying to
mount.


Well, there's Qu's patch at:
https://www.spinics.net/lists/linux-btrfs/msg47283.html
but it doesn't apply cleanly nor is easy to rebase to current kernels.


I don't understand what problem this proscription is trying to
avoid. If it's OK to mount rw,degraded once, then it's OK to allow
it twice. If it's not OK twice, it's not OK once.


Well, yeah.  The current check is naive and wrong.  It does have a
purpose, just fails in this, very common, case.


I guess the reasoning behind this is: Creating any more chunks on this
drive will make raid1 chunks with only one copy. Adding another drive
later will not replay the copies without user interaction. Is that true?

If yes, this may leave you with a mixed case of having a raid1 drive
with some chunks not mirrored and some mirrored. When the other drive
goes missing later, you are losing data or even the whole filesystem
although you were left with the (wrong) impression of having a
mirrored drive setup...

Is this how it works?

If yes, a real patch would also need to replay the missing copies after
adding a new drive.

The problem is that that would use some serious disk bandwidth without 
user intervention.  The way from userspace to fix this is to scrub the 
FS.  It would essentially be the same from kernel space, which means 
that if you had a multi-TB FS and this happened, you'd be running at 
below capacity in terms of bandwidth for quite some time.  If this were 
to be implemented, it would have to be keyed off of the per-chunk 
degraded check (so that _only_ the chunks that need it get touched), and 
there would need to be a switch to disable it.



Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Austin S. Hemmelgarn

On 2017-03-02 19:47, Peter Grandi wrote:

[ ... ] Meanwhile, the problem as I understand it is that at
the first raid1 degraded writable mount, no single-mode chunks
exist, but without the second device, they are created.  [
... ]


That does not make any sense, unless there is a fundamental
mistake in the design of the 'raid1' profile, which this and
other situations make me think is a possibility: that the
category of "mirrored" 'raid1' chunk does not exist in the Btrfs
chunk manager. That is, a chunk is either 'raid1' if it has a
mirror, or if it has no mirror it must be 'single'.

If a member device of a 'raid1' profile multidevice volume
disappears there will be "unmirrored" 'raid1' profile chunks and
some code path must recognize them as such, but the logic of the
code does not allow their creation. Question: how does the code
know that a specific 'raid1' chunk is mirrored or not? The chunk
must have a link (member, offset) to its mirror, doesn't it?

What makes me think that "unmirrored" 'raid1' profile chunks are
"not a thing" is that it is impossible to remove explicitly a
member device from a 'raid1' profile volume: first one has to
'convert' to 'single', and then  the 'remove' copies back to the
remaining devices the 'single' chunks that are on the explicitly
'remove'd device. Which to me seems absurd.
It is; there should be a way to do this as a single operation.  The 
reason this is currently the case, though, is a simple one: 'btrfs device 
delete' is just a special instance of balance that prevents new chunks 
being allocated on the device being removed and balances all the chunks 
on that device so they end up on other devices.  It currently does no 
profile conversion, but having that as an option would actually be 
_very_ useful from a data safety perspective.
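
For reference, the two-step dance this currently forces when shrinking a
two-device raid1 down to one device looks roughly like the following sketch;
the device name and mount point are placeholders, and reducing metadata
redundancy may additionally require the force flag:

  # 1) convert the chunks away from raid1 first
  btrfs balance start -dconvert=single -mconvert=single -f /mnt
  # 2) only then can the device be dropped; its chunks are migrated off it
  btrfs device remove /dev/sdb /mnt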


Going further in my speculation, I suspect that at the core of
the Btrfs multidevice design there is a persistent "confusion"
(to use a euphemism) between volumes having a profile, and
merely chunks having a profile.
There generally is.  The profile is entirely a property of the chunks 
(each chunk literally has a bit of metadata that says what profile it 
is), not the volume.  There's some metadata in the volume somewhere that 
says what profile to use for new chunks of each type (I think), but that 
doesn't dictate what chunk profiles there are on the volume.  This whole 
arrangement is actually pretty important for fault tolerance in general, 
since during a conversion you have _both_ profiles for that chunk type 
at the same time on the same filesystem (new chunks will get allocated 
with the new type though), and the kernel has to be able to handle a 
partially converted FS.


My additional guess is that the original design concept had
multidevice volumes to be merely containers for chunks of
whichever mixed profiles, so a subvolume could have 'raid1'
profile metadata and 'raid0' profile data, and another could
have 'raid10' profile metadata and data, but since handling this
turned out to be too hard, this was compromised into volumes
having all metadata chunks with the same profile and all data
chunks with the same profile, which requires special-case handling of
corner cases, like volumes being converted or missing member
devices.
Actually, the only bits missing that would be needed to do this are 
stuff to segregate the data of given subvolumes completely from each 
other (ie, make sure they can't be in the same chunks at all).  Doing 
that is hard, so we don't have per-subvolume profiles yet.  It's fully 
possible to have a mix of profiles on a given volume though.  Some old 
versions of mkfs actually did this (you'd end up with a small single 
profile chunk of each type on a FS that used different profiles), and 
the filesystem is in exactly that state when converting between profiles 
for a given chunk type.  New chunks will only be generated with one 
profile, but you can have whatever other mix you want essentially (in 
fact, one of the handful of regression tests I run when I'm checking 
patches explicitly creates a filesystem with one data and one system 
chunk of every profile and makes sure the kernel can still access it 
correctly).
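
As an illustration, that kind of mixed state is directly visible from
userspace, since the usage report prints one line per (type, profile) pair;
a partially converted filesystem therefore shows something like the
following (the numbers here are made up):

  $ btrfs filesystem df /mnt
  Data, RAID1: total=10.00GiB, used=8.21GiB
  Data, single: total=1.00GiB, used=512.00MiB
  System, RAID1: total=32.00MiB, used=16.00KiB
  Metadata, RAID1: total=1.00GiB, used=620.00MiB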

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5] btrfs: Handle delalloc error correctly to avoid ordered extent hang

2017-03-03 Thread Filipe Manana
On Fri, Mar 3, 2017 at 12:45 AM, Qu Wenruo  wrote:
>
>
> At 03/03/2017 01:28 AM, Filipe Manana wrote:
>>
>> On Tue, Feb 28, 2017 at 2:28 AM, Qu Wenruo 
>> wrote:
>>>
>>> [BUG]
>>> Reports about btrfs hang running btrfs/124 with default mount option and
>>> btrfs/125 with nospace_cache or space_cache=v2 mount options, with
>>> following backtrace.
>>>
>>> Call Trace:
>>>  __schedule+0x2d4/0xae0
>>>  schedule+0x3d/0x90
>>>  btrfs_start_ordered_extent+0x160/0x200 [btrfs]
>>>  ? wake_atomic_t_function+0x60/0x60
>>>  btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
>>>  btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
>>>  btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
>>>  process_one_work+0x2af/0x720
>>>  ? process_one_work+0x22b/0x720
>>>  worker_thread+0x4b/0x4f0
>>>  kthread+0x10f/0x150
>>>  ? process_one_work+0x720/0x720
>>>  ? kthread_create_on_node+0x40/0x40
>>>  ret_from_fork+0x2e/0x40
>>>
>>> [CAUSE]
>>> The problem is caused by the error handler in run_delalloc_nocow() not
>>> handling errors from btrfs_reloc_clone_csums() well.
>>
>>
>> The cause is bad error handling in general, not specific to
>> btrfs_reloc_clone_csums().
>> Keep in mind that you're giving a cause for specific failure scenario
>> while providing a solution to a more broader problem.
>
>
> Right, I'll update the commit message in next update.
>
>>
>>>
>>> Error handlers in run_delalloc_nocow() and cow_file_range() will clear
>>> dirty flags and finish writeback for remaining pages like the following:
>>
>>
>> They don't finish writeback because writeback isn't even started.
>> Writeback is started when a bio is about to be submitted, at
>> __extent_writepage_io().
>>
>>>
>>> |<---------------------- delalloc range ----------------------->|
>>> | Ordered extent 1 | Ordered extent 2                            |
>>> |   Submitted OK   | reloc_clone_csums() error                   |
>>> |<>|               |<------------- cleanup range --------------->|
>>>  ||
>>>  \_=> First page handled by end_extent_writepage() in
>>>       __extent_writepage()
>>>
>>> This behavior has two problems:
>>> 1) Ordered extent 2 will never finish
>>
>>
>> Neither will ordered extent 1.
>
>
> Not always.
> If ordered extent 1 is only 1 page large, then it can finish.

Yes, but that's not the case in your example... that's why I pointed it out...

>
> So here I introduced ordered extent 2 for this corner case.
>
>
>>
>>>Ordered extent 2 is already submitted, which relies endio hooks to
>>>wait for all its pages to finish.
>>
>>
>> submitted -> created
>>
>> endio hooks don't wait for pages to finish. What you want to say is
>> that the ordered extent is marked as complete by the endio hooks.
>>
>>
>>>
>>>However since we finish writeback in error handler, ordered extent 2
>>>will never finish.
>>
>>
>> finish -> complete
>>
>> Again, we don't even reach the point of starting writeback. And
>> neither ordered extent 2 nor ordered extent 1 complete.
>>
>>>
>>> 2) Metadata underflow
>>>btrfs_finish_ordered_io() for ordered extent will free its reserved
>>>metadata space, while error handlers will also free metadata space of
>>>the remaining range, which covers ordered extent 2.
>>>
>>>So even if problem 1) is solved, we can still underflow the metadata
>>>reservation, which will lead to a deadly btrfs assertion.
>>>
>>> [FIX]
>>> This patch will resolve the problem in two steps:
>>> 1) Introduce btrfs_cleanup_ordered_extents() to cleanup ordered extents
>>>Slightly modify one existing function,
>>>btrfs_endio_direct_write_update_ordered() to handle free space inode
>>>just like btrfs_writepage_endio_hook() and skip first page to
>>>co-operate with end_extent_writepage().
>>>
>>>So btrfs_cleanup_ordered_extents() will search all submitted ordered
>>>extent in specified range, and clean them up except the first page.
>>>
>>> 2) Make error handlers skip any range covered by ordered extent
>>>For run_delalloc_nocow() and cow_file_range(), only allow error
>>>handlers to clean up pages/extents not covered by submitted ordered
>>>extent.
>>>
>>>For compression, it's calling writepage_end_io_hook() itself to handle
>>>its error, and any submitted ordered extent will have its bio
>>>submitted, so no need to worry about compression part.
>>>
>>> After the fix, the clean up will happen like:
>>>
>>> |<------------------------------ delalloc range ------------------------------>|
>>> | Ordered extent 1 | Ordered extent 2                                           |
>>> |   Submitted OK   | reloc_clone_csums() error                                  |
>>> |<>|<-- Cleaned up by cleanup_ordered_extents -->|<----- old error handler ---->|
>>>  ||
>>>  \_=> First page handled by end_extent_writepage() in
>>>       __extent_writepage()
>>>
>>> Suggested-by: Filipe Manana 
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>> v2:
>>>   Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
>>>   outstanding extents, which is already done 

Re: [4.7.2] btrfs_run_delayed_refs:2963: errno=-17 Object already exists

2017-03-03 Thread Marc Joliet
On Friday 03 March 2017 09:00:10 Qu Wenruo wrote:
> > FWIW, as per my later messages, after mounting with clear_cache and
> > letting
> > btrfs-cleaner finish, btrfs-check did *not* print out those errors after
> > running again.  It's now about two weeks later that the file system is
> > showing problems again.
> 
> If btrfs-check didn't print out *any* error, then it should be mostly 
> fine. (Unless there is some case we don't expose yet)
> 
> The problem should be caused by kernel AFAIK.

So you think it could be a regression in 4.9?  Should I try 4.10?  Or is it 
more likely just an undiscovered bug?

> > Oh, and just in case it's relevant, the file system was created with
> > btrfs-
> > convert (a long time, maybe 1.5 years ago, though; it was originally
> > ext4).
> 
> Not sure if it's related.
> But at least for that old convert, its chunk layout is somewhat rare 
> and sometimes even bug-prone.
> 
> Did you balance the btrfs after convert? If so, it should be more like a 
> traditional btrfs then.

Yes, I'm fairly certain I did that, as that is what the btrfs wiki recommends.
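
For reference, the post-convert cleanup the wiki describes is roughly the
following sketch (the mount point is a placeholder):

  # drop the saved ext4 image once the converted filesystem looks good
  btrfs subvolume delete /mnt/ext2_saved
  # rewrite the convert-time chunk layout into a normal btrfs layout
  btrfs balance start /mnt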

> Personally speaking I don't think it is related to your bug, but much 
> like a normal extent tree corruption as seen on the mailing list.

OK, so is there anything else I can do?

Greetings
-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup


signature.asc
Description: This is a digitally signed message part.


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Adam Borowski
On Fri, Mar 03, 2017 at 06:56:22AM +0100, Kai Krakow wrote:
> > > I don't understand what problem this proscription is trying to
> > > avoid. If it's OK to mount rw,degraded once, then it's OK to allow
> > > it twice. If it's not OK twice, it's not OK once.  
> > 
> > Well, yeah.  The current check is naive and wrong.  It does have a
> > purpose, just fails in this, very common, case.
> 
> I guess the reasoning behind this is: Creating any more chunks on this
> drive will make raid1 chunks with only one copy. Adding another drive
> later will not replay the copies without user interaction. Is that true?
> 
> If yes, this may leave you with a mixed case of having a raid1 drive
> with some chunks not mirrored and some mirrored. When the other drive
> goes missing later, you are losing data or even the whole filesystem,
> although you were left with the (wrong) impression of having a
> mirrored drive setup...

Ie, you want a degraded mount to create degraded raid1 chunks rather than
single ones?  Good idea, it would solve the most common case with least
surprise to the user.

But there are other scenarios where Qu's patch[-set] is needed.  For
example, if you try to convert a single-disk filesystem to raid1, yet the
new shiny disk you just added decides to remind you of the words "infant
mortality" halfway through the conversion.

Or, if you have degraded raid1 chunks and something bad happens during
recovery.  Having the required number of devices, despite passing the
current bogus check, doesn't mean you can continue.  Qu's patch checks
whether at least one copy of every chunk is present.

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.7.2] btrfs_run_delayed_refs:2963: errno=-17 Object already exists

2017-03-03 Thread Marc Joliet
On Friday 03 March 2017 09:09:57 Qu Wenruo wrote:
> At 03/02/2017 05:44 PM, Marc Joliet wrote:
> > On Wednesday 01 March 2017 19:14:07 Marc Joliet wrote:
> >> In any
> >> case, I started btrfs-check on the device itself.
> > 
> > OK, it's still running, but the output so far is:
> > 
> > # btrfs check --mode=lowmem --progress /dev/sdb2
> > Checking filesystem on /dev/sdb2
> > UUID: f97b3cda-15e8-418b-bb9b-235391ef2a38
> > ERROR: shared extent[3826242740224 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3826442825728 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3826744471552 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3827106349056 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3827141001216 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3827150958592 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3827251724288 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3827433795584 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3827536166912 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3827536183296 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3827621646336 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3828179406848 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3828267970560 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3828284530688 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3828714246144 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3828794187776 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3829161340928 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3829373693952 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3830252130304 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3830421159936 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3830439141376 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3830441398272 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3830785138688 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3831099297792 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3831128768512 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3831371513856 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3831535570944 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3831591952384 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3831799398400 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3831829250048 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3831829512192 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3832011440128 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3832011767808 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3832023920640 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3832024678400 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3832027316224 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3832028762112 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3832030236672 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3832030330880 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3832161079296 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3832164904960 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3832164945920 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3832613765120 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3833727565824 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3833914073088 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared extent[3833929310208 4096] lost its parent (parent:
> > 3827251183616, level: 0)
> > ERROR: shared 

[PATCH 11/17] fs, btrfs: convert compressed_bio.pending_bios from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/compression.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index c7721a6..10e6b28 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -44,7 +44,7 @@
 
 struct compressed_bio {
/* number of bios pending for this compressed extent */
-   atomic_t pending_bios;
+   refcount_t pending_bios;
 
/* the pages with the compressed data on them */
struct page **compressed_pages;
@@ -161,7 +161,7 @@ static void end_compressed_bio_read(struct bio *bio)
/* if there are more bios still pending for this compressed
 * extent, just exit
 */
-   if (!atomic_dec_and_test(&cb->pending_bios))
+   if (!refcount_dec_and_test(&cb->pending_bios))
goto out;
 
inode = cb->inode;
@@ -274,7 +274,7 @@ static void end_compressed_bio_write(struct bio *bio)
/* if there are more bios still pending for this compressed
 * extent, just exit
 */
-   if (!atomic_dec_and_test(&cb->pending_bios))
+   if (!refcount_dec_and_test(&cb->pending_bios))
goto out;
 
/* ok, we're the last bio for this extent, step one is to
@@ -342,7 +342,7 @@ int btrfs_submit_compressed_write(struct inode *inode, u64 
start,
cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
if (!cb)
return -ENOMEM;
-   atomic_set(&cb->pending_bios, 0);
+   refcount_set(&cb->pending_bios, 0);
cb->errors = 0;
cb->inode = inode;
cb->start = start;
@@ -363,7 +363,7 @@ int btrfs_submit_compressed_write(struct inode *inode, u64 
start,
bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
bio->bi_private = cb;
bio->bi_end_io = end_compressed_bio_write;
-   atomic_inc(&cb->pending_bios);
+   refcount_set(&cb->pending_bios, 1);
 
/* create and submit bios for the compressed pages */
bytes_left = compressed_len;
@@ -388,7 +388,7 @@ int btrfs_submit_compressed_write(struct inode *inode, u64 
start,
 * we inc the count.  Otherwise, the cb might get
 * freed before we're done setting it up
 */
-   atomic_inc(&cb->pending_bios);
+   refcount_inc(&cb->pending_bios);
ret = btrfs_bio_wq_end_io(fs_info, bio,
  BTRFS_WQ_ENDIO_DATA);
BUG_ON(ret); /* -ENOMEM */
@@ -607,7 +607,7 @@ int btrfs_submit_compressed_read(struct inode *inode, 
struct bio *bio,
if (!cb)
goto out;
 
-   atomic_set(&cb->pending_bios, 0);
+   refcount_set(&cb->pending_bios, 0);
cb->errors = 0;
cb->inode = inode;
cb->mirror_num = mirror_num;
@@ -656,7 +656,7 @@ int btrfs_submit_compressed_read(struct inode *inode, 
struct bio *bio,
bio_set_op_attrs (comp_bio, REQ_OP_READ, 0);
comp_bio->bi_private = cb;
comp_bio->bi_end_io = end_compressed_bio_read;
-   atomic_inc(&cb->pending_bios);
+   refcount_set(&cb->pending_bios, 1);
 
for (pg_index = 0; pg_index < nr_pages; pg_index++) {
page = cb->compressed_pages[pg_index];
@@ -685,7 +685,7 @@ int btrfs_submit_compressed_read(struct inode *inode, 
struct bio *bio,
 * we inc the count.  Otherwise, the cb might get
 * freed before we're done setting it up
 */
-   atomic_inc(&cb->pending_bios);
+   refcount_inc(&cb->pending_bios);
 
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
ret = btrfs_lookup_bio_sums(inode, comp_bio,
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Anand Jain



AFAIK, no, it hasn't been fixed, at least not in mainline, because the
patches to fix it got stuck in some long-running project patch queue
(IIRC, the one for on-degraded auto-device-replace), with no timeline
known to me on mainline merge.

Meanwhile, the problem as I understand it is that at the first raid1
degraded writable mount, no single-mode chunks exist, but without the
second device, they are created.



It might be an accidental feature introduced in the patch [1].
RFC [2] (only lightly tested) tried to correct it. But, if the accidental
feature works better than the traditional RAID1 approach, then the
workaround fix [3] will help; however, for the accidental feature
I am not sure whether it would support all the failure-recovery/
FS-is-full cases.

[1]
  commit 95669976bd7d30ae265db938ecb46a6b7f8cb893
  Btrfs: don't consider the missing device when allocating new chunks

[2]
  [PATCH 0/2] [RFC] btrfs: create degraded-RAID1 chunks

[3]
  Patches 01/13 to 05/13 of the below patch set (which were needed
  to test the rest of the patches in the set).
  [PATCH v6 00/13] Introduce device state 'failed', spare device and 
auto replace.



Hope this sheds some light on the long standing issue.

Thanks, Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 06/17] fs, btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/backref.c | 2 +-
 fs/btrfs/delayed-ref.c | 8 
 fs/btrfs/delayed-ref.h | 8 +---
 fs/btrfs/disk-io.c | 2 +-
 fs/btrfs/extent-tree.c | 6 +++---
 5 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 7699e16..1163383 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1286,7 +1286,7 @@ static int find_parent_nodes(struct btrfs_trans_handle 
*trans,
head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
if (head) {
if (!mutex_trylock(&head->mutex)) {
-   atomic_inc(&head->node.refs);
+   refcount_inc(&head->node.refs);
spin_unlock(&delayed_refs->lock);
 
btrfs_release_path(path);
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 6eb8095..be70d90 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -164,7 +164,7 @@ int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
if (mutex_trylock(&head->mutex))
return 0;

-   atomic_inc(&head->node.refs);
+   refcount_inc(&head->node.refs);
spin_unlock(&delayed_refs->lock);

mutex_lock(&head->mutex);
@@ -590,7 +590,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
delayed_refs = &trans->transaction->delayed_refs;
 
/* first set the basic ref node struct up */
-   atomic_set(&ref->refs, 1);
+   refcount_set(&ref->refs, 1);
ref->bytenr = bytenr;
ref->num_bytes = num_bytes;
ref->ref_mod = count_mod;
@@ -682,7 +682,7 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
delayed_refs = &trans->transaction->delayed_refs;
 
/* first set the basic ref node struct up */
-   atomic_set(&ref->refs, 1);
+   refcount_set(&ref->refs, 1);
ref->bytenr = bytenr;
ref->num_bytes = num_bytes;
ref->ref_mod = 1;
@@ -739,7 +739,7 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
seq = atomic64_read(&fs_info->tree_mod_seq);
 
/* first set the basic ref node struct up */
-   atomic_set(&ref->refs, 1);
+   refcount_set(&ref->refs, 1);
ref->bytenr = bytenr;
ref->num_bytes = num_bytes;
ref->ref_mod = 1;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 0e537f9..c0264ff 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -18,6 +18,8 @@
 #ifndef __DELAYED_REF__
 #define __DELAYED_REF__
 
+#include <linux/refcount.h>
+
 /* these are the possible values of struct btrfs_delayed_ref_node->action */
 #define BTRFS_ADD_DELAYED_REF1 /* add one backref to the tree */
 #define BTRFS_DROP_DELAYED_REF   2 /* delete one backref from the tree */
@@ -53,7 +55,7 @@ struct btrfs_delayed_ref_node {
u64 seq;
 
/* ref count on this data structure */
-   atomic_t refs;
+   refcount_t refs;
 
/*
 * how many refs is this entry adding or deleting.  For
@@ -220,8 +222,8 @@ btrfs_free_delayed_extent_op(struct btrfs_delayed_extent_op 
*op)
 
 static inline void btrfs_put_delayed_ref(struct btrfs_delayed_ref_node *ref)
 {
-   WARN_ON(atomic_read(&ref->refs) == 0);
-   if (atomic_dec_and_test(&ref->refs)) {
+   WARN_ON(refcount_read(&ref->refs) == 0);
+   if (refcount_dec_and_test(&ref->refs)) {
WARN_ON(ref->in_tree);
switch (ref->type) {
case BTRFS_TREE_BLOCK_REF_KEY:
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dca6432..913df60 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4343,7 +4343,7 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
head = rb_entry(node, struct btrfs_delayed_ref_head,
href_node);
if (!mutex_trylock(&head->mutex)) {
-   atomic_inc(&head->node.refs);
+   refcount_inc(&head->node.refs);
spin_unlock(&delayed_refs->lock);

mutex_lock(&head->mutex);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 796001b..11a1b07 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -891,7 +891,7 @@ int btrfs_lookup_extent_info(struct btrfs_trans_handle 
*trans,
head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
if (head) {
if (!mutex_trylock(&head->mutex)) {
-   atomic_inc(&head->node.refs);
+   refcount_inc(&head->node.refs);
spin_unlock(&delayed_refs->lock);
 

[PATCH 07/17] fs, btrfs: convert btrfs_delayed_node.refs from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/delayed-inode.c | 28 ++--
 fs/btrfs/delayed-inode.h |  3 ++-
 2 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 1aff676..7396c36 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -52,7 +52,7 @@ static inline void btrfs_init_delayed_node(
 {
delayed_node->root = root;
delayed_node->inode_id = inode_id;
-   atomic_set(&delayed_node->refs, 0);
+   refcount_set(&delayed_node->refs, 0);
delayed_node->ins_root = RB_ROOT;
delayed_node->del_root = RB_ROOT;
mutex_init(_node->mutex);
@@ -81,7 +81,7 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
 
node = READ_ONCE(btrfs_inode->delayed_node);
if (node) {
-   atomic_inc(&node->refs);
+   refcount_inc(&node->refs);
return node;
}
 
@@ -89,14 +89,14 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
node = radix_tree_lookup(&root->delayed_nodes_tree, ino);
if (node) {
if (btrfs_inode->delayed_node) {
-   atomic_inc(&node->refs);/* can be accessed */
+   refcount_inc(&node->refs);  /* can be accessed */
BUG_ON(btrfs_inode->delayed_node != node);
spin_unlock(&root->inode_lock);
return node;
}
btrfs_inode->delayed_node = node;
/* can be accessed and cached in the inode */
-   atomic_add(2, &node->refs);
+   refcount_add(2, &node->refs);
spin_unlock(&root->inode_lock);
return node;
}
@@ -125,7 +125,7 @@ static struct btrfs_delayed_node 
*btrfs_get_or_create_delayed_node(
btrfs_init_delayed_node(node, root, ino);
 
/* cached in the btrfs inode and can be accessed */
-   atomic_add(2, &node->refs);
+   refcount_set(&node->refs, 2);
 
ret = radix_tree_preload(GFP_NOFS);
if (ret) {
@@ -166,7 +166,7 @@ static void btrfs_queue_delayed_node(struct 
btrfs_delayed_root *root,
} else {
list_add_tail(&node->n_list, &root->node_list);
list_add_tail(&node->p_list, &root->prepare_list);
-   atomic_inc(&node->refs);/* inserted into list */
+   refcount_inc(&node->refs);  /* inserted into list */
root->nodes++;
set_bit(BTRFS_DELAYED_NODE_IN_LIST, &node->flags);
}
@@ -180,7 +180,7 @@ static void btrfs_dequeue_delayed_node(struct 
btrfs_delayed_root *root,
spin_lock(&root->lock);
if (test_bit(BTRFS_DELAYED_NODE_IN_LIST, &node->flags)) {
root->nodes--;
-   atomic_dec(&node->refs);/* not in the list */
+   refcount_dec(&node->refs);  /* not in the list */
list_del_init(&node->n_list);
if (!list_empty(&node->p_list))
list_del_init(&node->p_list);
@@ -201,7 +201,7 @@ static struct btrfs_delayed_node *btrfs_first_delayed_node(
 
p = delayed_root->node_list.next;
node = list_entry(p, struct btrfs_delayed_node, n_list);
-   atomic_inc(&node->refs);
+   refcount_inc(&node->refs);
 out:
spin_unlock(&delayed_root->lock);
 
@@ -228,7 +228,7 @@ static struct btrfs_delayed_node *btrfs_next_delayed_node(
p = node->n_list.next;
 
next = list_entry(p, struct btrfs_delayed_node, n_list);
-   atomic_inc(&next->refs);
+   refcount_inc(&next->refs);
 out:
spin_unlock(&delayed_root->lock);
 
@@ -253,11 +253,11 @@ static void __btrfs_release_delayed_node(
btrfs_dequeue_delayed_node(delayed_root, delayed_node);
mutex_unlock(&delayed_node->mutex);

-   if (atomic_dec_and_test(&delayed_node->refs)) {
+   if (refcount_dec_and_test(&delayed_node->refs)) {
bool free = false;
struct btrfs_root *root = delayed_node->root;
spin_lock(&root->inode_lock);
-   if (atomic_read(&delayed_node->refs) == 0) {
+   if (refcount_read(&delayed_node->refs) == 0) {
radix_tree_delete(&root->delayed_nodes_tree,
  delayed_node->inode_id);
free = true;
@@ -286,7 +286,7 @@ static struct btrfs_delayed_node 
*btrfs_first_prepared_delayed_node(
p = delayed_root->prepare_list.next;
list_del_init(p);
node = list_entry(p, struct btrfs_delayed_node, p_list);
-   atomic_inc(&node->refs);
+   refcount_inc(&node->refs);
 out:
spin_unlock(&delayed_root->lock);
 
@@ -1621,7 +1621,7 @@ bool 

[PATCH 02/17] fs, btrfs: convert btrfs_transaction.use_count from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/disk-io.c  |  2 +-
 fs/btrfs/extent-tree.c  |  2 +-
 fs/btrfs/ordered-data.c |  2 +-
 fs/btrfs/transaction.c  | 20 ++--
 fs/btrfs/transaction.h  |  3 ++-
 5 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5f1b08c..dca6432 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4615,7 +4615,7 @@ static int btrfs_cleanup_transaction(struct btrfs_fs_info 
*fs_info)
t = list_first_entry(&fs_info->trans_list,
 struct btrfs_transaction, list);
if (t->state >= TRANS_STATE_COMMIT_START) {
-   atomic_inc(&t->use_count);
+   refcount_inc(&t->use_count);
spin_unlock(&fs_info->trans_lock);
btrfs_wait_for_commit(fs_info, t->transid);
btrfs_put_transaction(t);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6079465..63ba295 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10849,7 +10849,7 @@ static int btrfs_trim_free_extents(struct btrfs_device 
*device,
spin_lock(&fs_info->trans_lock);
trans = fs_info->running_transaction;
if (trans)
-   atomic_inc(&trans->use_count);
+   refcount_inc(&trans->use_count);
spin_unlock(&fs_info->trans_lock);
 
ret = find_free_dev_extent_start(trans, device, minlen, start,
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 9a46878..da5399f 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -623,7 +623,7 @@ void btrfs_remove_ordered_extent(struct inode *inode,
spin_lock(&fs_info->trans_lock);
trans = fs_info->running_transaction;
if (trans)
-   atomic_inc(&trans->use_count);
+   refcount_inc(&trans->use_count);
spin_unlock(&fs_info->trans_lock);
 
ASSERT(trans);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 61b807d..a7d7a7d 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -60,8 +60,8 @@ static const unsigned int 
btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
 
 void btrfs_put_transaction(struct btrfs_transaction *transaction)
 {
-   WARN_ON(atomic_read(&transaction->use_count) == 0);
-   if (atomic_dec_and_test(&transaction->use_count)) {
+   WARN_ON(refcount_read(&transaction->use_count) == 0);
+   if (refcount_dec_and_test(&transaction->use_count)) {
BUG_ON(!list_empty(&transaction->list));
WARN_ON(!RB_EMPTY_ROOT(&transaction->delayed_refs.href_root));
if (transaction->delayed_refs.pending_csums)
@@ -207,7 +207,7 @@ static noinline int join_transaction(struct btrfs_fs_info 
*fs_info,
spin_unlock(&fs_info->trans_lock);
return -EBUSY;
}
-   atomic_inc(&cur_trans->use_count);
+   refcount_inc(&cur_trans->use_count);
atomic_inc(&cur_trans->num_writers);
extwriter_counter_inc(cur_trans, type);
spin_unlock(&fs_info->trans_lock);
@@ -257,7 +257,7 @@ static noinline int join_transaction(struct btrfs_fs_info 
*fs_info,
 * One for this trans handle, one so it will live on until we
 * commit the transaction.
 */
-   atomic_set(&cur_trans->use_count, 2);
+   refcount_set(&cur_trans->use_count, 2);
atomic_set(&cur_trans->pending_ordered, 0);
cur_trans->flags = 0;
cur_trans->start_time = get_seconds();
@@ -432,7 +432,7 @@ static void wait_current_trans(struct btrfs_fs_info 
*fs_info)
spin_lock(&fs_info->trans_lock);
cur_trans = fs_info->running_transaction;
if (cur_trans && is_transaction_blocked(cur_trans)) {
-   atomic_inc(&cur_trans->use_count);
+   refcount_inc(&cur_trans->use_count);
spin_unlock(&fs_info->trans_lock);
 
wait_event(fs_info->transaction_wait,
@@ -744,7 +744,7 @@ int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, 
u64 transid)
list_for_each_entry(t, &fs_info->trans_list, list) {
if (t->transid == transid) {
cur_trans = t;
-   atomic_inc(&cur_trans->use_count);
+   refcount_inc(&cur_trans->use_count);
ret = 0;
break;
}
@@ -773,7 +773,7 @@ int 

[PATCH 15/17] fs, btrfs: convert scrub_parity.refs from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/scrub.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 08895d8..4d15112 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -142,7 +142,7 @@ struct scrub_parity {
 
int stripe_len;
 
-   atomic_t        refs;
+   refcount_t      refs;

struct list_head    spages;
 
@@ -2822,12 +2822,12 @@ static inline int scrub_calc_parity_bitmap_len(int 
nsectors)
 
 static void scrub_parity_get(struct scrub_parity *sparity)
 {
-   atomic_inc(&sparity->refs);
+   refcount_inc(&sparity->refs);
 }
 
 static void scrub_parity_put(struct scrub_parity *sparity)
 {
-   if (!atomic_dec_and_test(&sparity->refs))
+   if (!refcount_dec_and_test(&sparity->refs))
return;
 
scrub_parity_check_and_repair(sparity);
@@ -2879,7 +2879,7 @@ static noinline_for_stack int scrub_raid56_parity(struct 
scrub_ctx *sctx,
sparity->scrub_dev = sdev;
sparity->logic_start = logic_start;
sparity->logic_end = logic_end;
-   atomic_set(&sparity->refs, 1);
+   refcount_set(&sparity->refs, 1);
INIT_LIST_HEAD(&sparity->spages);
sparity->dbitmap = sparity->bitmap;
sparity->ebitmap = (void *)sparity->bitmap + bitmap_len;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/17] fs, btrfs: convert btrfs_delayed_item.refs from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/delayed-inode.c | 18 +-
 fs/btrfs/delayed-inode.h |  2 +-
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 7396c36..8ae409b 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -308,7 +308,7 @@ static struct btrfs_delayed_item 
*btrfs_alloc_delayed_item(u32 data_len)
item->ins_or_del = 0;
item->bytes_reserved = 0;
item->delayed_node = NULL;
-   atomic_set(&item->refs, 1);
+   refcount_set(&item->refs, 1);
}
return item;
 }
@@ -483,7 +483,7 @@ static void btrfs_release_delayed_item(struct 
btrfs_delayed_item *item)
 {
if (item) {
__btrfs_remove_delayed_item(item);
-   if (atomic_dec_and_test(&item->refs))
+   if (refcount_dec_and_test(&item->refs))
kfree(item);
}
 }
@@ -1600,14 +1600,14 @@ bool btrfs_readdir_get_delayed_items(struct inode 
*inode,
mutex_lock(&delayed_node->mutex);
item = __btrfs_first_delayed_insertion_item(delayed_node);
while (item) {
-   atomic_inc(&item->refs);
+   refcount_inc(&item->refs);
list_add_tail(&item->readdir_list, ins_list);
item = __btrfs_next_delayed_item(item);
}
 
item = __btrfs_first_delayed_deletion_item(delayed_node);
while (item) {
-   atomic_inc(&item->refs);
+   refcount_inc(&item->refs);
list_add_tail(&item->readdir_list, del_list);
item = __btrfs_next_delayed_item(item);
}
@@ -1634,13 +1634,13 @@ void btrfs_readdir_put_delayed_items(struct inode 
*inode,
 
list_for_each_entry_safe(curr, next, ins_list, readdir_list) {
list_del(&curr->readdir_list);
-   if (atomic_dec_and_test(&curr->refs))
+   if (refcount_dec_and_test(&curr->refs))
kfree(curr);
}
 
list_for_each_entry_safe(curr, next, del_list, readdir_list) {
list_del(&curr->readdir_list);
-   if (atomic_dec_and_test(&curr->refs))
+   if (refcount_dec_and_test(&curr->refs))
kfree(curr);
}
 
@@ -1667,7 +1667,7 @@ int btrfs_should_delete_dir_index(struct list_head 
*del_list,
list_del(&curr->readdir_list);
ret = (curr->key.offset == index);

-   if (atomic_dec_and_test(&curr->refs))
+   if (refcount_dec_and_test(&curr->refs))
kfree(curr);
 
if (ret)
@@ -1705,7 +1705,7 @@ int btrfs_readdir_delayed_dir_index(struct dir_context 
*ctx,
list_del(&curr->readdir_list);

if (curr->key.offset < ctx->pos) {
-   if (atomic_dec_and_test(&curr->refs))
+   if (refcount_dec_and_test(&curr->refs))
kfree(curr);
continue;
}
@@ -1722,7 +1722,7 @@ int btrfs_readdir_delayed_dir_index(struct dir_context 
*ctx,
over = !dir_emit(ctx, name, name_len,
   location.objectid, d_type);
 
-   if (atomic_dec_and_test(&curr->refs))
+   if (refcount_dec_and_test(&curr->refs))
kfree(curr);
 
if (over)
diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h
index d234974..6d4f5a0 100644
--- a/fs/btrfs/delayed-inode.h
+++ b/fs/btrfs/delayed-inode.h
@@ -81,7 +81,7 @@ struct btrfs_delayed_item {
struct list_head readdir_list;  /* used for readdir items */
u64 bytes_reserved;
struct btrfs_delayed_node *delayed_node;
-   atomic_t refs;
+   refcount_t refs;
int ins_or_del;
u32 data_len;
char data[0];
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 10/17] fs, btrfs: convert extent_state.refs from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/extent_io.c | 14 +++---
 fs/btrfs/extent_io.h |  3 ++-
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3cf0b02..0702be6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -68,7 +68,7 @@ void btrfs_leak_debug_check(void)
pr_err("BTRFS: state leak: start %llu end %llu state %u in tree 
%d refs %d\n",
   state->start, state->end, state->state,
   extent_state_in_tree(state),
-  atomic_read(&state->refs));
+  refcount_read(&state->refs));
list_del(&state->leak_list);
kmem_cache_free(extent_state_cache, state);
}
@@ -238,7 +238,7 @@ static struct extent_state *alloc_extent_state(gfp_t mask)
state->failrec = NULL;
RB_CLEAR_NODE(&state->rb_node);
btrfs_leak_debug_add(&state->leak_list, &states);
-   atomic_set(&state->refs, 1);
+   refcount_set(&state->refs, 1);
init_waitqueue_head(&state->wq);
trace_alloc_extent_state(state, mask, _RET_IP_);
return state;
@@ -248,7 +248,7 @@ void free_extent_state(struct extent_state *state)
 {
if (!state)
return;
-   if (atomic_dec_and_test(&state->refs)) {
+   if (refcount_dec_and_test(&state->refs)) {
WARN_ON(extent_state_in_tree(state));
btrfs_leak_debug_del(&state->leak_list);
trace_free_extent_state(state, _RET_IP_);
@@ -641,7 +641,7 @@ static int __clear_extent_bit(struct extent_io_tree *tree, 
u64 start, u64 end,
if (cached && extent_state_in_tree(cached) &&
cached->start <= start && cached->end > start) {
if (clear)
-   atomic_dec(&cached->refs);
+   refcount_dec(&cached->refs);
state = cached;
goto hit_next;
}
@@ -793,7 +793,7 @@ static void wait_extent_bit(struct extent_io_tree *tree, 
u64 start, u64 end,
 
if (state->state & bits) {
start = state->start;
-   atomic_inc(&state->refs);
+   refcount_inc(&state->refs);
wait_on_state(tree, state);
free_extent_state(state);
goto again;
@@ -834,7 +834,7 @@ static void cache_state_if_flags(struct extent_state *state,
if (cached_ptr && !(*cached_ptr)) {
if (!flags || (state->state & flags)) {
*cached_ptr = state;
-   atomic_inc(&state->refs);
+   refcount_inc(&state->refs);
}
}
 }
@@ -1538,7 +1538,7 @@ static noinline u64 find_delalloc_range(struct 
extent_io_tree *tree,
if (!found) {
*start = state->start;
*cached_state = state;
-   atomic_inc(&state->refs);
+   refcount_inc(&state->refs);
}
found++;
*end = state->end;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index c16260c..9ea9338 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -2,6 +2,7 @@
 #define __EXTENTIO__
 
 #include <linux/rbtree.h>
+#include <linux/refcount.h>
 #include "ulist.h"
 
 /* bits for the extent state */
@@ -134,7 +135,7 @@ struct extent_state {
 
/* ADD NEW ELEMENTS AFTER THIS */
wait_queue_head_t wq;
-   atomic_t refs;
+   refcount_t refs;
unsigned state;
 
struct io_failure_record *failrec;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 09/17] fs, btrfs: convert btrfs_root.refs from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/ctree.h   | 2 +-
 fs/btrfs/disk-io.c | 2 +-
 fs/btrfs/disk-io.h | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 557af39..c01bfca 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1221,7 +1221,7 @@ struct btrfs_root {
dev_t anon_dev;
 
spinlock_t root_item_lock;
-   atomic_t refs;
+   refcount_t refs;
 
struct mutex delalloc_mutex;
spinlock_t delalloc_lock;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 913df60..ca89ae3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1340,7 +1340,7 @@ static void __setup_root(struct btrfs_root *root, struct 
btrfs_fs_info *fs_info,
atomic_set(&root->log_writers, 0);
atomic_set(&root->log_batch, 0);
atomic_set(&root->orphan_inodes, 0);
-   atomic_set(&root->refs, 1);
+   refcount_set(&root->refs, 1);
atomic_set(&root->will_be_snapshoted, 0);
atomic_set(&root->qgroup_meta_rsv, 0);
root->log_transid = 0;
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 2e0ec29..21f1ceb 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -101,14 +101,14 @@ struct btrfs_root *btrfs_alloc_dummy_root(struct 
btrfs_fs_info *fs_info);
  */
 static inline struct btrfs_root *btrfs_grab_fs_root(struct btrfs_root *root)
 {
-   if (atomic_inc_not_zero(&root->refs))
+   if (refcount_inc_not_zero(&root->refs))
return root;
return NULL;
 }
 
 static inline void btrfs_put_fs_root(struct btrfs_root *root)
 {
-   if (atomic_dec_and_test(&root->refs))
+   if (refcount_dec_and_test(&root->refs))
kfree(root);
 }
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 16/17] fs, btrfs: convert scrub_ctx.refs from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/scrub.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 4d15112..ace634e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -202,7 +202,7 @@ struct scrub_ctx {
 * doesn't free the scrub context before or while the workers are
 * doing the wakeup() call.
 */
-   atomic_t        refs;
+   refcount_t      refs;
 };
 
 struct scrub_fixup_nodatasum {
@@ -305,7 +305,7 @@ static void scrub_put_ctx(struct scrub_ctx *sctx);
 
 static void scrub_pending_bio_inc(struct scrub_ctx *sctx)
 {
-   atomic_inc(&sctx->refs);
+   refcount_inc(&sctx->refs);
atomic_inc(&sctx->bios_in_flight);
 }
 
@@ -356,7 +356,7 @@ static void scrub_pending_trans_workers_inc(struct 
scrub_ctx *sctx)
 {
struct btrfs_fs_info *fs_info = sctx->fs_info;
 
-   atomic_inc(&sctx->refs);
+   refcount_inc(&sctx->refs);
/*
 * increment scrubs_running to prevent cancel requests from
 * completing as long as a worker is running. we must also
@@ -447,7 +447,7 @@ static noinline_for_stack void scrub_free_ctx(struct 
scrub_ctx *sctx)
 
 static void scrub_put_ctx(struct scrub_ctx *sctx)
 {
-   if (atomic_dec_and_test(&sctx->refs))
+   if (refcount_dec_and_test(&sctx->refs))
scrub_free_ctx(sctx);
 }
 
@@ -462,7 +462,7 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, 
int is_dev_replace)
sctx = kzalloc(sizeof(*sctx), GFP_KERNEL);
if (!sctx)
goto nomem;
-   atomic_set(&sctx->refs, 1);
+   refcount_set(&sctx->refs, 1);
sctx->is_dev_replace = is_dev_replace;
sctx->pages_per_rd_bio = SCRUB_PAGES_PER_RD_BIO;
sctx->curr = -1;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 17/17] fs, btrfs: convert btrfs_raid_bio.refs from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/raid56.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 1571bf2..a8954f5 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -149,7 +149,7 @@ struct btrfs_raid_bio {
 
int generic_bio_cnt;
 
-   atomic_t refs;
+   refcount_t refs;
 
atomic_t stripes_pending;
 
@@ -389,7 +389,7 @@ static void __remove_rbio_from_cache(struct btrfs_raid_bio 
*rbio)
if (bio_list_empty(&rbio->bio_list)) {
if (!list_empty(&rbio->hash_list)) {
list_del_init(&rbio->hash_list);
-   atomic_dec(&rbio->refs);
+   refcount_dec(&rbio->refs);
BUG_ON(!list_empty(&rbio->plug_list));
}
}
@@ -480,7 +480,7 @@ static void cache_rbio(struct btrfs_raid_bio *rbio)
 
/* bump our ref if we were not in the list before */
if (!test_and_set_bit(RBIO_CACHE_BIT, &rbio->flags))
-   atomic_inc(&rbio->refs);
+   refcount_inc(&rbio->refs);

if (!list_empty(&rbio->stripe_cache)){
list_move(&rbio->stripe_cache, &table->stripe_cache);
@@ -689,7 +689,7 @@ static noinline int lock_stripe_add(struct btrfs_raid_bio 
*rbio)
test_bit(RBIO_CACHE_BIT, &cur->flags) &&
!test_bit(RBIO_RMW_LOCKED_BIT, &cur->flags)) {
list_del_init(&cur->hash_list);
-   atomic_dec(&cur->refs);
+   refcount_dec(&cur->refs);
 
steal_rbio(cur, rbio);
cache_drop = cur;
@@ -738,7 +738,7 @@ static noinline int lock_stripe_add(struct btrfs_raid_bio 
*rbio)
}
}
 lockit:
-   atomic_inc(&rbio->refs);
+   refcount_inc(&rbio->refs);
list_add(&rbio->hash_list, &h->hash_list);
 out:
spin_unlock_irqrestore(&h->lock, flags);
@@ -784,7 +784,7 @@ static noinline void unlock_stripe(struct btrfs_raid_bio 
*rbio)
}
 
list_del_init(&rbio->hash_list);
-   atomic_dec(&rbio->refs);
+   refcount_dec(&rbio->refs);
 
/*
 * we use the plug list to hold all the rbios
@@ -801,7 +801,7 @@ static noinline void unlock_stripe(struct btrfs_raid_bio 
*rbio)
list_del_init(&rbio->plug_list);

list_add(&next->hash_list, &h->hash_list);
-   atomic_inc(&next->refs);
+   refcount_inc(&next->refs);
spin_unlock(&rbio->bio_list_lock);
spin_unlock_irqrestore(&h->lock, flags);
 
@@ -843,8 +843,7 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 {
int i;
 
-   WARN_ON(atomic_read(&rbio->refs) < 0);
-   if (!atomic_dec_and_test(&rbio->refs))
+   if (!refcount_dec_and_test(&rbio->refs))
return;
 
WARN_ON(!list_empty(&rbio->stripe_cache));
@@ -997,7 +996,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct 
btrfs_fs_info *fs_info,
rbio->stripe_npages = stripe_npages;
rbio->faila = -1;
rbio->failb = -1;
-   atomic_set(&rbio->refs, 1);
+   refcount_set(&rbio->refs, 1);
atomic_set(&rbio->error, 0);
atomic_set(&rbio->stripes_pending, 0);
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/17] fs, btrfs refcount conversions

2017-03-03 Thread Elena Reshetova
Now that the new refcount_t type and API are finally merged
(see include/linux/refcount.h), the following
patches convert various refcounters in the btrfs filesystem from atomic_t
to refcount_t. By doing this we prevent intentional or accidental
underflows or overflows that can lead to use-after-free vulnerabilities.

The below patches are fully independent and can be cherry-picked separately.
Since we convert all kernel subsystems in the same fashion, resulting
in about 300 patches, we have to group them for sending at least in some
fashion to be manageable. Please excuse the long cc list.

These patches have been tested with xfstests by running btrfs-related tests.
btrfs debug was enabled, with warnings on refcount errors, too. No output related
to refcount errors was produced. However, the following errors were seen during
the run:
 * tests btrfs/078, btrfs/114, btrfs/115: no errors anywhere in dmesg, but the
 process hangs. They all seem to be around qgroup; sometimes an error such as
 "qgroup scan failed -4" is visible before it blocks, but not always.
 * test btrfs/104: dmesg has additional error output:
 BTRFS warning (device vdc): qgroup 258 reserved space underflow, have: 0,
 to free: 4096
 I tried looking at the code to see what causes the failure, but could not figure
 it out. It doesn't seem to be related to any refcount changes at least IMO.

The above test failures are hard for me to understand and interpret, but
they don't seem to relate to refcount conversions.
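
For completeness, the runs behind the above were plain xfstests invocations of
the btrfs tests, roughly as sketched below (this assumes a checkout already
configured with FSTYP=btrfs and test/scratch devices set in local.config;
the directory name is illustrative):

  cd xfstests-dev
  ./check btrfs/078 btrfs/104 btrfs/114 btrfs/115   # the tests mentioned above
  ./check -g auto                                   # or a full run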

Elena Reshetova (17):
  fs, btrfs: convert btrfs_bio.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_transaction.use_count from atomic_t to
refcount_t
  fs, btrfs: convert extent_map.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_ordered_extent.refs from atomic_t to
refcount_t
  fs, btrfs: convert btrfs_caching_control.count from atomic_t to
refcount_t
  fs, btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to
refcount_t
  fs, btrfs: convert btrfs_delayed_node.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_delayed_item.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_root.refs from atomic_t to refcount_t
  fs, btrfs: convert extent_state.refs from atomic_t to refcount_t
  fs, btrfs: convert compressed_bio.pending_bios from atomic_t to
refcount_t
  fs, btrfs: convert scrub_recover.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_page.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_block.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_parity.refs from atomic_t to refcount_t
  fs, btrfs: convert scrub_ctx.refs from atomic_t to refcount_t
  fs, btrfs: convert btrfs_raid_bio.refs from atomic_t to refcount_t

 fs/btrfs/backref.c   |  2 +-
 fs/btrfs/compression.c   | 18 -
 fs/btrfs/ctree.h |  5 +++--
 fs/btrfs/delayed-inode.c | 46 ++--
 fs/btrfs/delayed-inode.h |  5 +++--
 fs/btrfs/delayed-ref.c   |  8 
 fs/btrfs/delayed-ref.h   |  8 +---
 fs/btrfs/disk-io.c   |  6 +++---
 fs/btrfs/disk-io.h   |  4 ++--
 fs/btrfs/extent-tree.c   | 20 +--
 fs/btrfs/extent_io.c | 18 -
 fs/btrfs/extent_io.h |  3 ++-
 fs/btrfs/extent_map.c| 10 +-
 fs/btrfs/extent_map.h|  3 ++-
 fs/btrfs/ordered-data.c  | 20 +--
 fs/btrfs/ordered-data.h  |  2 +-
 fs/btrfs/raid56.c| 19 +-
 fs/btrfs/scrub.c | 42 
 fs/btrfs/transaction.c   | 20 +--
 fs/btrfs/transaction.h   |  3 ++-
 fs/btrfs/tree-log.c  |  2 +-
 fs/btrfs/volumes.c   | 10 +-
 fs/btrfs/volumes.h   |  2 +-
 include/trace/events/btrfs.h |  4 ++--
 24 files changed, 143 insertions(+), 137 deletions(-)

-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 13/17] fs, btrfs: convert scrub_page.refs from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/scrub.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index c9406bf..8299f64 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -79,7 +79,7 @@ struct scrub_page {
u64 logical;
u64 physical;
u64 physical_for_dev_replace;
-   atomic_t        refs;
+   refcount_t      refs;
struct {
unsigned intmirror_num:8;
unsigned inthave_csum:1;
@@ -2017,12 +2017,12 @@ static void scrub_block_put(struct scrub_block *sblock)
 
 static void scrub_page_get(struct scrub_page *spage)
 {
-   atomic_inc(&spage->refs);
+   refcount_inc(&spage->refs);
 }
 
 static void scrub_page_put(struct scrub_page *spage)
 {
-   if (atomic_dec_and_test(&spage->refs)) {
+   if (refcount_dec_and_test(&spage->refs)) {
if (spage->page)
__free_page(spage->page);
kfree(spage);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 12/17] fs, btrfs: convert scrub_recover.refs from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/scrub.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b0251eb..c9406bf 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -64,7 +64,7 @@ struct scrub_ctx;
 #define SCRUB_MAX_PAGES_PER_BLOCK  16  /* 64k per node/leaf/sector */
 
 struct scrub_recover {
-   atomic_t        refs;
+   refcount_t      refs;
struct btrfs_bio    *bbio;
u64 map_length;
 };
@@ -857,12 +857,12 @@ static void scrub_fixup_nodatasum(struct btrfs_work *work)
 
 static inline void scrub_get_recover(struct scrub_recover *recover)
 {
-   atomic_inc(&recover->refs);
+   refcount_inc(&recover->refs);
 }
 
 static inline void scrub_put_recover(struct scrub_recover *recover)
 {
-   if (atomic_dec_and_test(&recover->refs)) {
+   if (refcount_dec_and_test(&recover->refs)) {
btrfs_put_bbio(recover->bbio);
kfree(recover);
}
@@ -1343,7 +1343,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
return -ENOMEM;
}
 
-   atomic_set(&recover->refs, 1);
+   refcount_set(&recover->refs, 1);
recover->bbio = bbio;
recover->map_length = mapped_length;
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 14/17] fs, btrfs: convert scrub_block.refs from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
The refcount_t type and its corresponding API should be
used instead of atomic_t when the variable is used as a
reference counter. This helps avoid accidental refcount
overflows that might lead to use-after-free situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/scrub.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 8299f64..08895d8 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -112,7 +112,7 @@ struct scrub_block {
struct scrub_page   *pagev[SCRUB_MAX_PAGES_PER_BLOCK];
int page_count;
atomic_t    outstanding_pages;
-   atomic_t    refs; /* free mem on transition to zero */
+   refcount_t  refs; /* free mem on transition to zero */
struct scrub_ctx*sctx;
struct scrub_parity *sparity;
struct {
@@ -1998,12 +1998,12 @@ static int scrub_checksum_super(struct scrub_block *sblock)
 
 static void scrub_block_get(struct scrub_block *sblock)
 {
-   atomic_inc(&sblock->refs);
+   refcount_inc(&sblock->refs);
 }
 
 static void scrub_block_put(struct scrub_block *sblock)
 {
-   if (atomic_dec_and_test(&sblock->refs)) {
+   if (refcount_dec_and_test(&sblock->refs)) {
int i;
 
if (sblock->sparity)
@@ -2255,7 +2255,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 
/* one ref inside this function, plus one for each page added to
 * a bio later on */
-   atomic_set(&sblock->refs, 1);
+   refcount_set(&sblock->refs, 1);
sblock->sctx = sctx;
sblock->no_io_error_seen = 1;
 
@@ -2555,7 +2555,7 @@ static int scrub_pages_for_parity(struct scrub_parity *sparity,
 
/* one ref inside this function, plus one for each page added to
 * a bio later on */
-   atomic_set(&sblock->refs, 1);
+   refcount_set(&sblock->refs, 1);
sblock->sctx = sctx;
sblock->no_io_error_seen = 1;
sblock->sparity = sparity;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/17] fs, btrfs: convert btrfs_bio.refs from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
The refcount_t type and its corresponding API should be
used instead of atomic_t when the variable is used as a
reference counter. This helps avoid accidental refcount
overflows that might lead to use-after-free situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/volumes.c | 8 
 fs/btrfs/volumes.h | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b124462..a8fd2e9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5295,22 +5295,22 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
GFP_NOFS|__GFP_NOFAIL);
 
atomic_set(&bbio->error, 0);
-   atomic_set(&bbio->refs, 1);
+   refcount_set(&bbio->refs, 1);
 
return bbio;
 }
 
 void btrfs_get_bbio(struct btrfs_bio *bbio)
 {
-   WARN_ON(!atomic_read(&bbio->refs));
-   atomic_inc(&bbio->refs);
+   WARN_ON(!refcount_read(&bbio->refs));
+   refcount_inc(&bbio->refs);
 }
 
 void btrfs_put_bbio(struct btrfs_bio *bbio)
 {
if (!bbio)
return;
-   if (atomic_dec_and_test(&bbio->refs))
+   if (refcount_dec_and_test(&bbio->refs))
kfree(bbio);
 }
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 59be812..ac0bf7d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -298,7 +298,7 @@ struct btrfs_bio;
 typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
 
 struct btrfs_bio {
-   atomic_t refs;
+   refcount_t refs;
atomic_t stripes_pending;
struct btrfs_fs_info *fs_info;
u64 map_type; /* get from map_lookup->type */
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/17] fs, btrfs: convert btrfs_ordered_extent.refs from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
The refcount_t type and its corresponding API should be
used instead of atomic_t when the variable is used as a
reference counter. This helps avoid accidental refcount
overflows that might lead to use-after-free situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/ordered-data.c  | 18 +-
 fs/btrfs/ordered-data.h  |  2 +-
 include/trace/events/btrfs.h |  2 +-
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index da5399f..7b40e2e 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -212,7 +212,7 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
set_bit(BTRFS_ORDERED_DIRECT, &entry->flags);
 
/* one ref for the tree */
-   atomic_set(&entry->refs, 1);
+   refcount_set(&entry->refs, 1);
init_waitqueue_head(&entry->wait);
INIT_LIST_HEAD(&entry->list);
INIT_LIST_HEAD(&entry->root_extent_list);
@@ -358,7 +358,7 @@ int btrfs_dec_test_first_ordered_pending(struct inode *inode,
 out:
if (!ret && cached && entry) {
*cached = entry;
-   atomic_inc(&entry->refs);
+   refcount_inc(&entry->refs);
}
spin_unlock_irqrestore(&tree->lock, flags);
return ret == 0;
@@ -425,7 +425,7 @@ int btrfs_dec_test_ordered_pending(struct inode *inode,
 out:
if (!ret && cached && entry) {
*cached = entry;
-   atomic_inc(&entry->refs);
+   refcount_inc(&entry->refs);
}
spin_unlock_irqrestore(&tree->lock, flags);
return ret == 0;
@@ -456,7 +456,7 @@ void btrfs_get_logged_extents(struct btrfs_inode *inode,
if (test_and_set_bit(BTRFS_ORDERED_LOGGED, &ordered->flags))
continue;
list_add(&ordered->log_list, logged_list);
-   atomic_inc(&ordered->refs);
+   refcount_inc(&ordered->refs);
}
spin_unlock_irq(&tree->lock);
 }
@@ -565,7 +565,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry)
 
trace_btrfs_ordered_extent_put(entry->inode, entry);
 
-   if (atomic_dec_and_test(&entry->refs)) {
+   if (refcount_dec_and_test(&entry->refs)) {
ASSERT(list_empty(&entry->log_list));
ASSERT(list_empty(&entry->trans_list));
ASSERT(list_empty(&entry->root_extent_list));
@@ -690,7 +690,7 @@ int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr,
 
list_move_tail(&ordered->root_extent_list,
   &root->ordered_extents);
-   atomic_inc(&ordered->refs);
+   refcount_inc(&ordered->refs);
spin_unlock(&root->ordered_extent_lock);
 
btrfs_init_work(&ordered->flush_work,
@@ -870,7 +870,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct inode *inode,
if (!offset_in_entry(entry, file_offset))
entry = NULL;
if (entry)
-   atomic_inc(&entry->refs);
+   refcount_inc(&entry->refs);
 out:
spin_unlock_irq(&tree->lock);
return entry;
@@ -911,7 +911,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_range(
}
 out:
if (entry)
-   atomic_inc(&entry->refs);
+   refcount_inc(&entry->refs);
spin_unlock_irq(&tree->lock);
return entry;
 }
@@ -948,7 +948,7 @@ btrfs_lookup_first_ordered_extent(struct inode *inode, u64 file_offset)
goto out;
 
entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
-   atomic_inc(&entry->refs);
+   refcount_inc(&entry->refs);
 out:
spin_unlock_irq(&tree->lock);
return entry;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 195c93b..e0c1d5b 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -113,7 +113,7 @@ struct btrfs_ordered_extent {
int compress_type;
 
/* reference count */
-   atomic_t refs;
+   refcount_t refs;
 
/* the inode we belong to */
struct inode *inode;
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 9dd29e8..8f206263 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -275,7 +275,7 @@ DECLARE_EVENT_CLASS(btrfs__ordered_extent,
__entry->bytes_left = ordered->bytes_left;
__entry->flags  = ordered->flags;
__entry->compress_type  = ordered->compress_type;
-   __entry->refs   = atomic_read(&ordered->refs);
+   __entry->refs   = refcount_read(&ordered->refs);
__entry->root_objectid  =
BTRFS_I(inode)->root->root_key.objectid;
__entry->truncated_len  = ordered->truncated_len;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 03/17] fs, btrfs: convert extent_map.refs from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
The refcount_t type and its corresponding API should be
used instead of atomic_t when the variable is used as a
reference counter. This helps avoid accidental refcount
overflows that might lead to use-after-free situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/extent_io.c |  4 ++--
 fs/btrfs/extent_map.c| 10 +-
 fs/btrfs/extent_map.h|  3 ++-
 fs/btrfs/tree-log.c  |  2 +-
 fs/btrfs/volumes.c   |  2 +-
 include/trace/events/btrfs.h |  2 +-
 6 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c3abf84..3cf0b02 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2852,7 +2852,7 @@ __get_extent_map(struct inode *inode, struct page *page, size_t pg_offset,
em = *em_cached;
if (extent_map_in_tree(em) && start >= em->start &&
start < extent_map_end(em)) {
-   atomic_inc(&em->refs);
+   refcount_inc(&em->refs);
return em;
}
 
@@ -2863,7 +2863,7 @@ __get_extent_map(struct inode *inode, struct page *page, size_t pg_offset,
em = get_extent(BTRFS_I(inode), page, pg_offset, start, len, 0);
if (em_cached && !IS_ERR_OR_NULL(em)) {
BUG_ON(*em_cached);
-   atomic_inc(&em->refs);
+   refcount_inc(&em->refs);
*em_cached = em;
}
return em;
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 26f9ac7..6985015 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -55,7 +55,7 @@ struct extent_map *alloc_extent_map(void)
em->flags = 0;
em->compress_type = BTRFS_COMPRESS_NONE;
em->generation = 0;
-   atomic_set(&em->refs, 1);
+   refcount_set(&em->refs, 1);
INIT_LIST_HEAD(&em->list);
return em;
 }
@@ -71,8 +71,8 @@ void free_extent_map(struct extent_map *em)
 {
if (!em)
return;
-   WARN_ON(atomic_read(&em->refs) == 0);
-   if (atomic_dec_and_test(&em->refs)) {
+   WARN_ON(refcount_read(&em->refs) == 0);
+   if (refcount_dec_and_test(&em->refs)) {
WARN_ON(extent_map_in_tree(em));
WARN_ON(!list_empty(&em->list));
if (test_bit(EXTENT_FLAG_FS_MAPPING, &em->flags))
@@ -322,7 +322,7 @@ static inline void setup_extent_mapping(struct extent_map_tree *tree,
struct extent_map *em,
int modified)
 {
-   atomic_inc(&em->refs);
+   refcount_inc(&em->refs);
em->mod_start = em->start;
em->mod_len = em->len;
 
@@ -381,7 +381,7 @@ __lookup_extent_mapping(struct extent_map_tree *tree,
if (strict && !(end > em->start && start < extent_map_end(em)))
return NULL;
 
-   atomic_inc(&em->refs);
+   refcount_inc(&em->refs);
return em;
 }
 
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index eb8b8fa..a67b2de 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -2,6 +2,7 @@
 #define __EXTENTMAP__
 
 #include <linux/rbtree.h>
+#include <linux/refcount.h>
 
 #define EXTENT_MAP_LAST_BYTE ((u64)-4)
 #define EXTENT_MAP_HOLE ((u64)-3)
@@ -41,7 +42,7 @@ struct extent_map {
 */
struct map_lookup *map_lookup;
};
-   atomic_t refs;
+   refcount_t refs;
unsigned int compress_type;
struct list_head list;
 };
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index a59674c..ccfe9fe 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4196,7 +4196,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
if (em->generation <= test_gen)
continue;
/* Need a ref to keep it from getting evicted from cache */
-   atomic_inc(&em->refs);
+   refcount_inc(&em->refs);
set_bit(EXTENT_FLAG_LOGGING, &em->flags);
list_add_tail(&em->list, &extents);
num++;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a8fd2e9..c23a383 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4833,7 +4833,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
ret = add_extent_mapping(em_tree, em, 0);
if (!ret) {
list_add_tail(&em->list, &trans->transaction->pending_chunks);
-   atomic_inc(&em->refs);
+   refcount_inc(&em->refs);
}
write_unlock(&em_tree->lock);
if (ret) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index a3c3cab..9dd29e8 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -213,7 +213,7 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
__entry->block_start= map->block_start;
  

[PATCH 05/17] fs, btrfs: convert btrfs_caching_control.count from atomic_t to refcount_t

2017-03-03 Thread Elena Reshetova
The refcount_t type and its corresponding API should be
used instead of atomic_t when the variable is used as a
reference counter. This helps avoid accidental refcount
overflows that might lead to use-after-free situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 fs/btrfs/ctree.h   |  3 ++-
 fs/btrfs/extent-tree.c | 12 ++--
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 00e3518..557af39 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include <linux/refcount.h>
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
@@ -517,7 +518,7 @@ struct btrfs_caching_control {
struct btrfs_work work;
struct btrfs_block_group_cache *block_group;
u64 progress;
-   atomic_t count;
+   refcount_t count;
 };
 
 /* Once caching_thread() finds this much free space, it will wake up waiters. */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 63ba295..796001b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -315,14 +315,14 @@ get_caching_control(struct btrfs_block_group_cache *cache)
}
 
ctl = cache->caching_ctl;
-   atomic_inc(&ctl->count);
+   refcount_inc(&ctl->count);
spin_unlock(&cache->lock);
return ctl;
 }
 
 static void put_caching_control(struct btrfs_caching_control *ctl)
 {
-   if (atomic_dec_and_test(&ctl->count))
+   if (refcount_dec_and_test(&ctl->count))
kfree(ctl);
 }
 
@@ -598,7 +598,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
init_waitqueue_head(&caching_ctl->wait);
caching_ctl->block_group = cache;
caching_ctl->progress = cache->key.objectid;
-   atomic_set(&caching_ctl->count, 1);
+   refcount_set(&caching_ctl->count, 1);
btrfs_init_work(&caching_ctl->work, btrfs_cache_helper,
caching_thread, NULL, NULL);
 
@@ -619,7 +619,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
struct btrfs_caching_control *ctl;
 
ctl = cache->caching_ctl;
-   atomic_inc(&ctl->count);
+   refcount_inc(&ctl->count);
prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
spin_unlock(&cache->lock);
 
@@ -706,7 +706,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
}
 
down_write(&fs_info->commit_root_sem);
-   atomic_inc(&caching_ctl->count);
+   refcount_inc(&caching_ctl->count);
list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
up_write(&fs_info->commit_root_sem);
 
@@ -10415,7 +10415,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
&fs_info->caching_block_groups, list)
if (ctl->block_group == block_group) {
caching_ctl = ctl;
-   atomic_inc(&caching_ctl->count);
+   refcount_inc(&caching_ctl->count);
break;
}
}
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Kai Krakow
Am Thu, 2 Mar 2017 11:37:53 +0100
schrieb Adam Borowski :

> On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote:
> > [1717713.408675] BTRFS warning (device dm-8): missing devices (1)
> > exceeds the limit (0), writeable mount is not allowed
> > [1717713.446453] BTRFS error (device dm-8): open_ctree failed
> > 
> > [chris@f25s ~]$ uname -r
> > 4.9.8-200.fc25.x86_64
> > 
> > I thought this was fixed. I'm still getting a one time degraded rw
> > mount, after that it's no longer allowed, which really doesn't make
> > any sense because those single chunks are on the drive I'm trying to
> > mount.  
> 
> Well, there's Qu's patch at:
> https://www.spinics.net/lists/linux-btrfs/msg47283.html
> but it doesn't apply cleanly nor is easy to rebase to current kernels.
> 
> > I don't understand what problem this proscription is trying to
> > avoid. If it's OK to mount rw,degraded once, then it's OK to allow
> > it twice. If it's not OK twice, it's not OK once.  
> 
> Well, yeah.  The current check is naive and wrong.  It does have a
> purpose, just fails in this, very common, case.

I guess the reasoning behind this is: creating any more chunks on this
drive will create raid1 chunks with only one copy, and adding another
drive later will not recreate the missing copies without user
interaction. Is that true?

If yes, this may leave you with a mixed case: a raid1 filesystem where
some chunks are mirrored and some are not. If the other drive goes
missing later, you lose data or even the whole filesystem, although you
were left with the (wrong) impression of having a mirrored setup...

Is this how it works?

If yes, a real fix would also need to recreate the missing copies after
a new drive has been added.
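
As far as I know that does not happen automatically today; the usual
manual workaround, once the missing device has been re-added or
replaced, is a filtered balance followed by a scrub. A rough sketch
(the mount point /mnt and the raid1 target profile are placeholders
for a two-device setup, not commands the kernel runs for you):

# look for chunks that were written with only one copy while degraded:
# "Data, single" / "Metadata, single" lines are the leftovers
btrfs filesystem df /mnt

# after 'btrfs device add' or 'btrfs replace' has restored the second
# device, convert the leftover single chunks back to raid1; the "soft"
# modifier skips chunks that already have the target profile
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt

# then verify that both copies are present and readable
btrfs scrub start -Bd /mnt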

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html