Re: runtime btrfsck

2017-05-05 Thread Stefan Priebe - Profihost AG
It's still running. Is this the normal behaviour? Is there any other way
to fix the bad key ordering?

Greets,
Stefan

Am 02.05.2017 um 08:29 schrieb Stefan Priebe - Profihost AG:
> Hello list,
> 
> I wanted to check an fs because it has bad key ordering.
> 
> But btrfsck has now been running for 7 days. Current output:
> # btrfsck -p --repair /dev/mapper/crypt_md0
> enabling repair mode
> Checking filesystem on /dev/mapper/crypt_md0
> UUID: 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9
> bad key ordering 39 40
> checking extents [O]
> 
> FS is a 12TB BTRFS Raid 0 over 3 mdadm Raid 5 devices. How long should
> btrfsck run, and is there any way to speed it up? btrfs-progs is 4.8.5.
> 
> Thanks!
> 
> Greets,
> Stefan
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs: add framework to handle device flush error as a volume

2017-05-05 Thread Anand Jain
This adds comments to the flush error handling part of
the code and, while maintaining the same logic, introduces
a framework which can be used to handle the errors at the
volume level.

Signed-off-by: Anand Jain 
---
v2:
 fix -ENOMEM at two places
 add code readability changes in check_barrier_error()

 fs/btrfs/disk-io.c | 57 ++
 fs/btrfs/volumes.h |  1 +
 2 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e064913fa62f..2f0d0688e0a6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3625,6 +3625,10 @@ static int write_dev_flush(struct btrfs_device *device, int wait)
if (wait) {
bio = device->flush_bio;
if (!bio)
+   /*
+* This means the alloc has failed with ENOMEM, however
+* here we return 0, as it's not a device error.
+*/
return 0;
 
wait_for_completion(&device->flush_wait);
@@ -3664,6 +3668,32 @@ static int write_dev_flush(struct btrfs_device *device, int wait)
return 0;
 }
 
+static int check_barrier_error(struct btrfs_fs_devices *fsdevs)
+{
+   int submit_flush_error = 0;
+   int dev_flush_error = 0;
+   struct btrfs_device *dev;
+   int tolerance;
+
+   list_for_each_entry_rcu(dev, &fsdevs->devices, dev_list) {
+   if (!dev->bdev) {
+   submit_flush_error++;
+   dev_flush_error++;
+   continue;
+   }
+   if (dev->last_flush_error == -ENOMEM)
+   submit_flush_error++;
+   if (dev->last_flush_error && dev->last_flush_error != -ENOMEM)
+   dev_flush_error++;
+   }
+
+   tolerance = fsdevs->fs_info->num_tolerated_disk_barrier_failures;
+   if (submit_flush_error > tolerance || dev_flush_error > tolerance)
+   return -EIO;
+
+   return 0;
+}
+
 /*
  * send an empty flush down to each device in parallel,
  * then wait for them
@@ -3691,6 +3721,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
ret = write_dev_flush(dev, 0);
if (ret)
errors_send++;
+   dev->last_flush_error = ret;
}
 
/* wait for all the barriers */
@@ -3705,12 +3736,30 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
continue;
 
ret = write_dev_flush(dev, 1);
-   if (ret)
+   if (ret) {
+   dev->last_flush_error = ret;
errors_wait++;
+   }
+   }
+
+   /*
+* Try hard in case of flush. Let's say, in RAID1 we have
+* the following situation
+*  dev1: EIO dev2: ENOMEM
+* this is not a fatal error as we hope to recover from
+* ENOMEM in the next attempt to flush.
+* But the following is considered as fatal
+*  dev1: ENOMEM dev2: ENOMEM
+*  dev1: bdev == NULL dev2: ENOMEM
+*/
+   if (errors_send || errors_wait) {
+   /*
+* At some point we need the status of all disks
+* to arrive at the volume status. So error checking
+* is being pushed to a separate loop.
+*/
+   return check_barrier_error(info->fs_devices);
}
-   if (errors_send > info->num_tolerated_disk_barrier_failures ||
-   errors_wait > info->num_tolerated_disk_barrier_failures)
-   return -EIO;
return 0;
 }
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index cfb817384cb5..9bb248b3fa0e 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -87,6 +87,7 @@ struct btrfs_device {
int offline;
int can_discard;
int is_tgtdev_for_dev_replace;
+   int last_flush_error;
 
 #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED
seqcount_t data_seqcount;
-- 
2.10.0



Re: BTRFS converted from EXT4 becomes read-only after reboot

2017-05-05 Thread Chris Murphy
Please reply to all. You keep dropping Qu and the Btrfs list off your responses.

On Fri, May 5, 2017 at 2:22 PM, Alexandru Guzu  wrote:
> I booted a Live CD with btrfs-progs 4.10.2 and ran a check on the
> partition, the regular btrfs check did not find any errors, but the
> lowmem one did find errors:
>
> btrfs check --mode lowmem /dev/sda1
> Checking filesystem on /dev/sda1
> UUID: bc8af368-8565-48d8-a16b-e2d25061f6ac
> checking extents
> checking free space cache
> checking fs roots
> ERROR: root 5 EXTENT_DATA[627 12288] csum missing, have: 0, expected: 4096
> ERROR: root 5 EXTENT_DATA[627 20480] csum missing, have: 0, expected: 45056
> ERROR: root 5 EXTENT_DATA[627 73728] csum missing, have: 0, expected: 4096
> ERROR: root 5 INODE[627] nbytes(2928640) not equal to extent_size(2981888)
> ERROR: root 5 EXTENT_DATA[5759 4096] csum missing, have: 0, expected: 4096
> ERROR: root 5 INODE[5759] nbytes(12288) not equal to extent_size(16384)
> ERROR: root 5 EXTENT_DATA[5760 4096] csum missing, have: 0, expected: 12288
> ERROR: root 5 INODE[5760] nbytes(53248) not equal to extent_size(65536)
> ERROR: root 5 EXTENT_DATA[284832 4096] csum missing, have: 0, expected: 24576
> ERROR: root 5 INODE[284832] nbytes(8192) not equal to extent_size(32768)
> ERROR: root 5 EXTENT_DATA[284834 36864] csum missing, have: 0, expected: 253952
> ERROR: root 5 INODE[284834] nbytes(40960) not equal to extent_size(294912)
> ERROR: root 5 EXTENT_DATA[839436 8192] interrupt
> ERROR: errors found in fs roots
> found 4971778048 bytes used, error(s) found
> total csum bytes: 4042156
> total tree bytes: 831029248
> total fs tree bytes: 815988736
> total extent tree bytes: 10010624
> btree space waste bytes: 221535838
> file data blocks allocated: 4330766336
>  referenced 5587816448
>
> I suppose I should attempt a repair?

No, the lowmem mode doesn't support repair yet. I'd expect normal mode
won't repair anything if it doesn't find problems.

> I have no important data on this drive.
> I haven't deleted the Firefox cache yet.
> btrfs scrub did not find any errors

I think you've found a bug in kernel handling; we'll just have to see
what Qu says.

I would go with the deleting of the Firefox cache first, and then if
necessary reinstall just Firefox. Are there other things that trigger
it?


-- 
Chris Murphy


Re: help converting btrfs to new writeback error tracking?

2017-05-05 Thread Jeff Layton
On Fri, 2017-05-05 at 12:21 -0700, Liu Bo wrote:
> Hi Jeff,
> 
> On Thu, May 04, 2017 at 07:26:17AM -0400, Jeff Layton wrote:
> > I've been working on set of patches to clean up how writeback errors are
> > tracked and handled in the kernel:
> > 
> > http://marc.info/?l=linux-fsdevel&m=149304074111261&w=2
> > 
> > The basic idea is that rather than having a set of flags that are
> > cleared whenever they are checked, we have a sequence counter and error
> > that are tracked on a per-mapping basis, and can then use that sequence
> > counter to tell whether the error should be reported.
> > 
> > This changes the way that things like filemap_write_and_wait work.
> > Rather than having to ensure that AS_EIO/AS_ENOSPC are not cleared
> > inappropriately (and thus losing errors that should be reported), you
> > can now tell whether there has been a writeback error since a certain
> > point in time, irrespective of whether anyone else is checking for
> > errors.
> > 
> > I've been doing some conversions of the existing code to the new scheme,
> > but btrfs has _really_ complicated error handling. I think it could
> > probably be simplified with this new scheme, but I could use some help
> > here.
> > 
> > What I think we probably want to do is to sample the error sequence in
> > the mapping at well-defined points in time (probably when starting a
> > transaction?) and then use that to determine whether writeback errors
> > have occurred since then. Is there anyone in the btrfs community who
> > could help me here?
> > 
> 
> I went through the patch set and reviewed the btrfs part particular in
> [PATCH v3 14/20] fs: retrofit old error reporting API onto new infrastructure
> 
> It looks good to me.
> 
> In btrfs ->writepage(), it sets PG_error whenever an error
> (-EIO/-ENOSPC/-ENOMEM) occurs and it sets mapping's error as well in
> end_extent_writepage().  And the special case is the compression code, where it
> only sets mapping's error when there is any error during processing compression
> bytes.
> 
> Similar to ext4, btrfs tracks the IO error by setting mapping's error in
> writepage_endio and other places (eg. compression code), and around tree-log.c
> it's checking BTRFS_ORDERED_IOERR from ordered_extent->flags, which is usually
> set in writepage_endio and sometimes in some error handling code where it
> couldn't call endio.
> 
> So the conversion in btrfs's fsync() seems to be good enough, did I miss
> anything?
> 

Many thanks for taking a look:

There are a number of calls in btrfs to filemap_fdatawait_range that
check the return code. That function will wait for writeback on all of
the pages in the mapping range and return an error if there has been
one. Note too that there are also some places that ignore the return
code.

These patches change how filemap_fdatawait_range (and some similar
functions) work. Before this set, you'd get an error if one had occurred
since anyone last checked it. Now, you only get an error there if one
occurred since you started waiting. If the failed writeback occurred
before that function was called, you won't get an error back.

For fsync, it shouldn't matter. You'll get an error back there if one
occurred since the last fsync since you're setting it in the mapping.
The bigger question is whether other callers in this code do anything
with that error return.

If they do, then the next question is: from what point do you want to
detect errors that have occurred?

What sort of makes sense to me (in a handwavy way) would be to sample
the errseq_t in the mapping when you start a transaction, and then check
vs. that for errors. Then, even if you have parallel transactions going
on the same inode (is that even possible?) then you can tell whether
they all succeeded or not.

Thoughts?
-- 
Jeff Layton 


Re: [PATCH ping] btrfs: warn about RAID5/6 being experimental at mount time

2017-05-05 Thread David Sterba
On Wed, Apr 26, 2017 at 04:14:16AM +0200, Adam Borowski wrote:
> On Wed, Apr 19, 2017 at 11:07:45PM +0200, Adam Borowski wrote:
> > Too many people come complaining about losing their data -- and indeed,
> > there's no warning outside a wiki and the mailing list tribal knowledge.
> > Message severity chosen for consistency with XFS -- "alert" makes dmesg
> > produce nice red background which should get the point across.
> ...
> > I intend to ask for inclusion of this one (or an equivalent) in 4.9, either
> > in Debian or via GregKH -- while for us kernels "that old" are history,
> > regular users expect stable releases to be free of known serious data loss
> > bugs.
> 
> Hi guys, could you please comment?  While there's only relatively little
> urgency for mainline (heck, it'd be best if the warning was not needed at
> all!), there's a Debian release close by, and it'd be grossly irresponsible
> to not let people know that a feature advertised in the documentation is in
> an unusable state (especially as of 4.9).  For you, filesystem developers,
> a way of thinking that "the user should do research" might be acceptable,
> but once it filters down to a stable release, the user expects no known
> serious bugs.

There are some raid56 fixes in 4.12, but IIRC the write hole is still
unfixed so the warning is still valid even for 4.12. It would be easier
to get the patch to 4.4 or 4.9 once it's in Linus tree.

> 
> And here the severity is "critical -- causes serious data loss".
> 
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index e54844767fe5..e7f91f70e149 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -3083,6 +3083,14 @@ int open_ctree(struct super_block *sb,
> > btrfs_set_opt(fs_info->mount_opt, SSD);
> > }
> >  
> > +   if ((fs_info->avail_data_alloc_bits |
> > +fs_info->avail_metadata_alloc_bits |
> > +fs_info->avail_system_alloc_bits) &
> > +   BTRFS_BLOCK_GROUP_RAID56_MASK) {
> > +   btrfs_alert(fs_info,
> > +   "btrfs RAID5/6 is EXPERIMENTAL and has known data-loss bugs");
> > +   }
> > +
> > /*
> >  * Mount does not set all options immediately, we can do it now and do
> >  * not have to wait for transaction commit
> > -- 
> 
> Doing this in the kernel should be better than in userspace (like
> https://patchwork.kernel.org/patch/9450035/) as it can deal with a future
> kernel with working RAID5/6 on old -progs; but if you prefer, I can finish
> that patch and request its inclusion in Debian stretch -progs instead or in
> addition to the above warning in the kernel.

I'll have a look again.


Re: help converting btrfs to new writeback error tracking?

2017-05-05 Thread Liu Bo
Hi Jeff,

On Thu, May 04, 2017 at 07:26:17AM -0400, Jeff Layton wrote:
> I've been working on set of patches to clean up how writeback errors are
> tracked and handled in the kernel:
> 
> http://marc.info/?l=linux-fsdevel&m=149304074111261&w=2
> 
> The basic idea is that rather than having a set of flags that are
> cleared whenever they are checked, we have a sequence counter and error
> that are tracked on a per-mapping basis, and can then use that sequence
> counter to tell whether the error should be reported.
> 
> This changes the way that things like filemap_write_and_wait work.
> Rather than having to ensure that AS_EIO/AS_ENOSPC are not cleared
> inappropriately (and thus losing errors that should be reported), you
> can now tell whether there has been a writeback error since a certain
> point in time, irrespective of whether anyone else is checking for
> errors.
> 
> I've been doing some conversions of the existing code to the new scheme,
> but btrfs has _really_ complicated error handling. I think it could
> probably be simplified with this new scheme, but I could use some help
> here.
> 
> What I think we probably want to do is to sample the error sequence in
> the mapping at well-defined points in time (probably when starting a
> transaction?) and then use that to determine whether writeback errors
> have occurred since then. Is there anyone in the btrfs community who
> could help me here?
>

I went through the patch set and reviewed the btrfs part particular in
[PATCH v3 14/20] fs: retrofit old error reporting API onto new infrastructure

It looks good to me.

In btrfs ->writepage(), it sets PG_error whenever an error
(-EIO/-ENOSPC/-ENOMEM) occurs and it sets mapping's error as well in
end_extent_writepage().  And the special case is the compression code, where it
only sets mapping's error when there is any error during processing compression
bytes.

Similar to ext4, btrfs tracks the IO error by setting mapping's error in
writepage_endio and other places (eg. compression code), and around tree-log.c
it's checking BTRFS_ORDERED_IOERR from ordered_extent->flags, which is usually
set in writepage_endio and sometimes in some error handling code where it
couldn't call endio.

So the conversion in btrfs's fsync() seems to be good enough, did I miss
anything?

Thanks,

-liubo


Re: [PATCH 2/2] btrfs: Add quota_override knob into sysfs

2017-05-05 Thread David Sterba
On Fri, Apr 21, 2017 at 11:13:06PM +, Sargun Dhillon wrote:
> This patch adds the read-write attribute quota_override into sysfs.
> Any process which has cap_sys_resource can set this flag to on, and
> once it is set to true, processes with cap_sys_resource can exceed
> the quota.

So we've moved to the hardest part, where to put the sysfs file and how
to name it.

My first thought was against putting it right under the filesystem UUID,
preferring instead a directory with all tunables, but we already
have a read-write label there. Ok then.

The valid values are 0 and 1. And no locking is needed as fs_info::flags
are updated atomically (once patch 1 is updated).


Re: [PATCH 1/2] btrfs: add quota override attribute

2017-05-05 Thread David Sterba
On Sun, Apr 23, 2017 at 09:48:34PM -0700, Sargun Dhillon wrote:
> On Sun, Apr 23, 2017 at 8:42 PM, Qu Wenruo  wrote:
> > At 04/22/2017 07:12 AM, Sargun Dhillon wrote:
> >>
> >> This patch introduces the quota override flag to btrfs_fs_info, and
> >> a change to quota limit checking code to temporarily allow for quota
> >> to be overridden for processes with cap_sys_resource.
> >>
> >> It's useful for administrative programs, such as log rotation,
> >> that may need to temporarily use more disk space in order to free up
> >> a greater amount of overall disk space without yielding more disk
> >> space to the rest of userland.
> >>
> >> Eventually, we may want to add the idea of an operator-specific
> >> quota, operator reserved space, or something else to allow for
> >> administrative override, but this is perhaps the simplest
> >> solution.
> >
> >
> > Indeed the simplest method yet.
> >
> > But considering that reserved data space can be used by a non-privileged
> > user, I'm not sure if it's a good idea.
> >
> > For example:
> >
> > If root wants to write a 64M file with new data, then it reserves 64M of
> > data space.
> >
> > And if another non-privileged user also wants to write that 64M file with
> > different data, the user won't need to reserve data space.
> > (Although metadata space is still needed.)
> >
> > Won't this open a way of escaping the qgroup limit?
> This is more of a failure-avoidance mechanism. We run containers that
> don't have cap_sys_resource. The log rotator, on the other hand, has a
> full-set of capabilities in the root user namespace. Given that we'd
> only flip quota_override if the system gets into a state where the log
> rotator cannot run, I don't see it being particularly problematic.

So this usecase sounds valid to me. CAP_SYS_RESOURCE is documented to
allow quota overrides, no surprise here. The extra step to enable the
override per filesystem should put enough barriers against unintentional
behaviour.

> At least looking at my systems, none of my users have cap_sys_resource
> in their capabilities set, and it seems to be the closest capability
> that maps to disk quota logic. I'd hate to drop this into the bucket
> of cap_sys_admin. Are you perhaps suggesting per-uid and per-gid
> qgroups? Or being able to have a quota_override_uid value? I thought
> about adding an extended attribute, but that would require the attr to
> be set at file creation time, not necessarily when I need it for an
> escape. Another idea was to do what ext does, in adding a special
> "operator" reserved space which processes with uid == 0 &&
> cap_sys_resource can use. This would require changing the on-disk
> qgroup format.
> 
> You're right, we would (intentionally) "escape" the qgroup limit. A
> process with cap_sys_resource would be able to allocate more disk
> space temporarily. If I'm understanding you correctly, I don't think
> that they would be able to rewrite the file's contents because that'd
> be considered changed extents, no?
> 
> >
> >>
> >> Signed-off-by: Sargun Dhillon 
> >> ---
> >>   fs/btrfs/ctree.h  | 3 +++
> >>   fs/btrfs/qgroup.c | 9 +++--
> >>   2 files changed, 10 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> >> index c411590..01a095b 100644
> >> --- a/fs/btrfs/ctree.h
> >> +++ b/fs/btrfs/ctree.h
> >> @@ -1098,6 +1098,9 @@ struct btrfs_fs_info {
> >> u32 nodesize;
> >> u32 sectorsize;
> >> u32 stripesize;
> >> +
> >> +   /* Allow tasks with cap_sys_resource to override the quota */
> >> +   bool quota_override;
> >
> >
> > Why not use existing fs_info->qgroup_flags?
> Isn't that persisted? I don't want this to surprise users across reboots.

Yes it's persisted, but we can use a bitmask for the in-memory bits
before we sync the qgroup_flags to disk.


Re: [PATCH] btrfs: allow processes with cap_sys_resource to exceed quota

2017-05-05 Thread Sargun Dhillon
If you see my follow-on patch, it allows disabling the quota limit for
folks with cap_sys_resource per filesystem. I don't want any process
to be able to turn off quota limits, just the process that is the
log rotator (and has the proper capabilities). Unfortunately,
most folks don't lock down their capabilities, so I agree, making it
blindly based on capabilities seems like a poor idea.

On Fri, May 5, 2017 at 11:22 AM, David Sterba  wrote:
> On Fri, Apr 21, 2017 at 07:27:23AM -0500, Sargun Dhillon wrote:
>> What do you think about putting this behaviour behind a sysctl? Seems
>> better than to start introducing a new mechanism of marking tasks?
>
> Technically it's easy to add our own btrfs-specific ioctl, temporarily
> turning off quota limits, but I'm still not sure about all the
> consequences as this would apply to the whole system, no? The
> capabilities are per process, much more fine grained. I don't have other
> ideas how to address the problem though.


Re: [PATCH] btrfs: allow processes with cap_sys_resource to exceed quota

2017-05-05 Thread David Sterba
On Fri, Apr 21, 2017 at 07:27:23AM -0500, Sargun Dhillon wrote:
> What do you think about putting this behaviour behind a sysctl? Seems
> better than to start introducing a new mechanism of marking tasks?

Technically it's easy to add our own btrfs-specific ioctl, temporarily
turning off quota limits, but I'm still not sure about all the
consequences as this would apply to the whole system, no? The
capabilities are per process, much more fine grained. I don't have other
ideas how to address the problem though.


Re: btrfsck lowmem mode shows corruptions

2017-05-05 Thread Kai Krakow
Am Fri, 5 May 2017 08:55:10 +0800
schrieb Qu Wenruo :

> At 05/05/2017 01:29 AM, Kai Krakow wrote:
> > Hello!
> > 
> > Since I saw a few kernel freezes lately (due to experimenting with
> > ck-sources) including some filesystem-related backtraces, I booted
> > my rescue system to check my btrfs filesystem.
> > 
> > Luckily, it showed no problems. It said, everything's fine. But I
> > also thought: Okay, let's try lowmem mode. And that showed a
> > frightening long list of extent corruptions und unreferenced
> > chunks. Should I worry?  
> 
> Thanks for trying lowmem mode.
> 
> Would you please provide the version of btrfs-progs?

Sorry... I realized it myself the moment I hit the "send" button.

Here it is:

# btrfs version
btrfs-progs v4.10.2

# uname -a
Linux jupiter 4.10.13-ck #2 SMP PREEMPT Thu May 4 23:44:09 CEST 2017
x86_64 Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz GenuineIntel GNU/Linux

> IIRC "ERROR: data extent[96316809216 2097152] backref lost" bug has
> been fixed in recent release.

Is there a patch I could apply?

> And for reference, would you please provide the tree dump of your
> chunk and device tree?
> 
> This can be done by running:
> # btrfs-debug-tree -t device 
> # btrfs-debug-tree -t chunk 

I'll attach those...

I'd like to note that between the OP and these dumps, I scrubbed and
rebalanced the whole device. I think that would scramble up some
numbers. Also I took the dumps while the fs was online.

If you want me to do clean dumps of the offline device without
intermediate fs processing, let me know.

Thanks,
Kai

> And these 2 dumps only contain the btrfs chunk mapping info, so
> nothing sensitive is contained.
> 
> Thanks,
> Qu
> > 
> > PS: The freezes seem to be related to bfq, switching to deadline
> > solved these.
> > 
> > Full log attached, here's an excerpt:
> > 
> > ---8<---
> > 
> > checking extents
> > ERROR: chunk[256 4324327424) stripe 0 did not find the related dev extent
> > ERROR: chunk[256 4324327424) stripe 1 did not find the related dev extent
> > ERROR: chunk[256 4324327424) stripe 2 did not find the related dev extent
> > ERROR: chunk[256 7545552896) stripe 0 did not find the related dev extent
> > ERROR: chunk[256 7545552896) stripe 1 did not find the related dev extent
> > ERROR: chunk[256 7545552896) stripe 2 did not find the related dev extent
> > [...]
> > ERROR: device extent[1, 1094713344, 1073741824] did not find the related chunk
> > ERROR: device extent[1, 2168455168, 1073741824] did not find the related chunk
> > ERROR: device extent[1, 3242196992, 1073741824] did not find the related chunk
> > [...]
> > ERROR: device extent[2, 608854605824, 1073741824] did not find the related chunk
> > ERROR: device extent[2, 609928347648, 1073741824] did not find the related chunk
> > ERROR: device extent[2, 611002089472, 1073741824] did not find the related chunk
> > [...]
> > ERROR: device extent[3, 64433946624, 1073741824] did not find the related chunk
> > ERROR: device extent[3, 65507688448, 1073741824] did not find the related chunk
> > ERROR: device extent[3, 66581430272, 1073741824] did not find the related chunk
> > [...]
> > ERROR: data extent[96316809216 2097152] backref lost
> > ERROR: data extent[96316809216 2097152] backref lost
> > ERROR: data extent[96316809216 2097152] backref lost
> > ERROR: data extent[686074396672 13737984] backref lost
> > ERROR: data extent[686074396672 13737984] backref lost
> > ERROR: data extent[686074396672 13737984] backref lost
> > [...]
> > ERROR: errors found in extent allocation tree or chunk allocation
> > checking free space cache
> > checking fs roots
> > ERROR: errors found in fs roots
> > Checking filesystem on /dev/disk/by-label/system
> > UUID: bc201ce5-8f2b-4263-995a-6641e89d4c88
> > found 1960075935744 bytes used, error(s) found
> > total csum bytes: 1673537040
> > total tree bytes: 4899094528
> > total fs tree bytes: 2793914368
> > total extent tree bytes: 190398464
> > btree space waste bytes: 871743708
> > file data blocks allocated: 6907169177600
> >   referenced 1979268648960
> >   
> 
> 



-- 
Regards,
Kai

Replies to list-only preferred.

chunk-tree.txt.gz
Description: application/gzip


device-tree.txt.gz
Description: application/gzip


Re: [PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-05-05 Thread David Sterba
On Thu, Apr 13, 2017 at 08:36:24AM +0800, Qu Wenruo wrote:
> 
> 
> At 04/12/2017 11:05 PM, David Sterba wrote:
> > On Fri, Apr 07, 2017 at 10:43:15AM +0800, Qu Wenruo wrote:
> >> [BUG]
> >> Cycle mount btrfs can cause fiemap to return different result.
> >> Like:
> >>   # mount /dev/vdb5 /mnt/btrfs
> >>   # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
> >>   # xfs_io -c "fiemap -v" /mnt/btrfs/file
> >>   /mnt/test/file:
> >>   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
> >> 0: [0..127]:   25088..25215   128   0x1
> >>   # umount /mnt/btrfs
> >>   # mount /dev/vdb5 /mnt/btrfs
> >>   # xfs_io -c "fiemap -v" /mnt/btrfs/file
> >>   /mnt/test/file:
> >>   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
> >> 0: [0..31]:    25088..25119   32    0x0
> >> 1: [32..63]:   25120..25151   32    0x0
> >> 2: [64..95]:   25152..25183   32    0x0
> >> 3: [96..127]:  25184..25215   32    0x1
> >> But after above fiemap, we get correct merged result if we call fiemap
> >> again.
> >>   # xfs_io -c "fiemap -v" /mnt/btrfs/file
> >>   /mnt/test/file:
> >>   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
> >> 0: [0..127]:   25088..25215   128   0x1
> >>
> >> [REASON]
> >> Btrfs will try to merge extent map when inserting new extent map.
> >>
> >> btrfs_fiemap(start=0 len=(u64)-1)
> >> |- extent_fiemap(start=0 len=(u64)-1)
> >> |- get_extent_skip_holes(start=0 len=64k)
> >> |  |- btrfs_get_extent_fiemap(start=0 len=64k)
> >> | |- btrfs_get_extent(start=0 len=64k)
> >> ||  Found on-disk (ino, EXTENT_DATA, 0)
> >> ||- add_extent_mapping()
> >> ||- Return (em->start=0, len=16k)
> >> |
> >> |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
> >> |
> >> |- get_extent_skip_holes(start=0 len=64k)
> >> |  |- btrfs_get_extent_fiemap(start=0 len=64k)
> >> | |- btrfs_get_extent(start=16k len=48k)
> >> ||  Found on-disk (ino, EXTENT_DATA, 16k)
> >> ||- add_extent_mapping()
> >> ||  |- try_merge_map()
> >> || Merge with previous em start=0 len=16k
> >> || resulting em start=0 len=32k
> >> ||- Return (em->start=0, len=32K)<< Merged result
> >> |- Stripe off the unrelated range (0~16K) of return em
> >> |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
> >>^^^ Causing split fiemap extent.
> >>
> >> And since in add_extent_mapping(), em is already merged, in next
> >> fiemap() call, we will get merged result.
> >>
> >> [FIX]
> >> Here we introduce a new structure, fiemap_cache, which records previous
> >> fiemap extent.
> >>
> >> And will always try to merge current fiemap_cache result before calling
> >> fiemap_fill_next_extent().
> >> Only when we failed to merge current fiemap extent with cached one, we
> >> will call fiemap_fill_next_extent() to submit cached one.
> >>
> >> So by this method, we can merge all fiemap extents.
> > 
> > The cache gets reset on each call to extent_fiemap, so if fi_extents_max
> > is 1, the cache will always be unset and we'll never merge anything. The
> > same can happen if the number of extents reaches the limit
> > (FIEMAP_MAX_EXTENTS or any other depending on the ioctl caller). And
> > this leads to the unmerged extents.
> 
> Nope, extents will still be merged, as long as they can be merged.
> 
> The fiemap extent is only submitted if we found an unmergeable extent.
> 
> Even if fi_extents_max is 1, it's still possible for us to merge extents.
> 
> File A:
> Extent 1: offset=0 len=4k phys=X
> Extent 2: offset=4k len=4k phys=X+4
> Extent 3: offset=8k len=4k phys=Y
> 
> 1) Found Extent 1
> Cache it, not submitted yet.
> 2) Found Extent 2
> Merge it with cached one, not submitted yet.
> 3) Found Extent 3
> Can't merge, submit cached first.
> Submitted count reaches fi_extents_max, so exit the current extent_fiemap.
> 
> 4) Next fiemap call starts from offset 8K,
> Extent 3 is the last extent, no need to cache just submit.
> 
> So we still got merged fiemap extent, without anything wrong.
> 
> The point is, fi_extents_max or any other limit only matters when we
> submit a fiemap extent, and at that point either we have found an
> unmergeable extent, or we have already hit the last extent.

Thanks for the explanation.

> >> It can also be done in fs/ioctl.c, however the problem is if
> >> fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
> >> extent.
> > 
> > I don't see why, it's the same code path, no?
> 
> My original design in VFS was to check whether we can merge the current
> fiemap extent with the last one in fiemap_info.
> 
> But for fi_extents_max == 0 case, fiemap_info doesn't store any extent 
> so that's not possible.
> 
> 
> So for fi_extents_max == 0 case, either do it in each fs like what we 
> are doing, or introduce a new function like fiemap_cache_next_extent() 
> with reference to cached structure.
> 
> 

Re: [PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-05-05 Thread David Sterba
On Thu, Apr 20, 2017 at 12:52:55PM -0700, Liu Bo wrote:
> On Thu, Apr 20, 2017 at 10:09:39AM +0800, Qu Wenruo wrote:
> > 
> > 
> > At 04/20/2017 09:58 AM, Liu Bo wrote:
> > > On Thu, Apr 20, 2017 at 09:52:00AM +0800, Qu Wenruo wrote:
> > > > 
> > > > 
> [...]
> > > > > 
> > > > > If I understand it correctly, what it's missing currently is 'merge', 
> > > > > a
> > > > > straightforward idea is to make use of the 'merge' ability of 
> > > > > btrfs_get_extent. >
> > > > > Since btrfs_get_extent_fiemap is a wrapper of btrfs_get_extent, does 
> > > > > it make
> > > > > sense if we make btrfs_get_extent_fiemap iterate all extent maps 
> > > > > within the
> > > > > range passing to it or just one more extent map to check if 
> > > > > btrfs_get_extent
> > > > > could return a merged extent map before returning?
> > > > 
> > > > So your idea is to do the merging inside btrfs_get_extent(), instead of
> > > > merging it in extent_fiemap()?
> > > > 
> > > 
> > > No, merge ems inside the wrapper, ie. btrfs_get_extent_fiemap().
> > 
> > Then llseek() with SEEK_HOLE/SEEK_DATA also gets affected.
> > 
> > So limiting the extra time to merging extent maps in fiemap is still valid.
> >
> 
> Per my test, the v3 patch doesn't make big difference on the side of
> performance.

I think the point was not to improve performance, but to make the fiemap
output correct. But anyway it's good that the performance does not degrade.


Re: [PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-05-05 Thread David Sterba
On Thu, Apr 20, 2017 at 10:09:39AM +0800, Qu Wenruo wrote:
> >>> If I understand it correctly, what it's missing currently is 'merge', a
> >>> straightforward idea is to make use of the 'merge' ability of 
> >>> btrfs_get_extent. >
> >>> Since btrfs_get_extent_fiemap is a wrapper of btrfs_get_extent, does it 
> >>> make
> >>> sense if we make btrfs_get_extent_fiemap iterate all extent maps within 
> >>> the
> >>> range passing to it or just one more extent map to check if 
> >>> btrfs_get_extent
> >>> could return a merged extent map before returning?
> >>
> >> So your idea is to do the merging inside btrfs_get_extent(), instead of
> >> merging it in extent_fiemap()?
> >>
> > 
> > No, merge ems inside the wrapper, ie. btrfs_get_extent_fiemap().
> 
> Then llseek() with SEEK_HOLE/SEEK_DATA also gets affected.

Agreed, unconditionally reading all extents inside seek would need to
be evaluated first. Limiting the scope to the fiemap ioctl is safer.


Re: [PATCH v2 8/9] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions

2017-05-05 Thread David Sterba
On Tue, Apr 11, 2017 at 09:44:03AM +0800, Qu Wenruo wrote:
> > I get that we cannot easily avoid using the extent_changeset, so we'll
> > end up one way or another, the stack conservation has slight preference.
> 
> Yes, I understand dynamic allocation can complicate the error handler.
> 
> But doing the allocation inside btrfs_qgroup_reserve_data() could reduce
> the impact, and bring less impact to the qgroup-disabled case.
> (So btrfs_qgroup_reserve_data() accepts struct extent_changeset ** rather
> than struct extent_changeset *.)
> 
> The code to be modified should be at the same level as current patch.
> Mostly change extent_changeset_release() to extent_changeset_free().
> 
> 
> So overall the impact is not that huge, unless current error handlers 
> already have some problems.
> 
> Personally speaking I'm OK either way.
> 
> Do I need to build a temporary branch/patchset for your evaluation?

Yes please.


Re: [PATCH 5/6] Btrfs: change check-integrity to use bvec_iter

2017-05-05 Thread David Sterba
On Mon, Apr 17, 2017 at 06:16:26PM -0700, Liu Bo wrote:
> Some check-integrity code depends on bio->bi_vcnt, this changes it to use
> bio segments because some bios passing here may not have a reliable
> bi_vcnt.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/check-integrity.c | 27 +++
>  1 file changed, 15 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index ab14c2e..8e7ce48 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -2822,44 +2822,47 @@ static void __btrfsic_submit_bio(struct bio *bio)
>   dev_state = btrfsic_dev_state_lookup(bio->bi_bdev);
>   if (NULL != dev_state &&
>   (bio_op(bio) == REQ_OP_WRITE) && bio_has_data(bio)) {
> - unsigned int i;
> + unsigned int i = 0;
>   u64 dev_bytenr;
>   u64 cur_bytenr;
> - struct bio_vec *bvec;
> + struct bio_vec bvec;
> + struct bvec_iter iter;
>   int bio_is_patched;
>   char **mapped_datav;
> + int segs = bio_segments(bio);

Type mismatch, bio_segments return unsigned.

>  
>   dev_bytenr = 512 * bio->bi_iter.bi_sector;
>   bio_is_patched = 0;
>   if (dev_state->state->print_mask &
>   BTRFSIC_PRINT_MASK_SUBMIT_BIO_BH)
>   pr_info("submit_bio(rw=%d,0x%x, bi_vcnt=%u, 
> bi_sector=%llu (bytenr %llu), bi_bdev=%p)\n",
> -bio_op(bio), bio->bi_opf, bio->bi_vcnt,
> +bio_op(bio), bio->bi_opf, segs,
>  (unsigned long long)bio->bi_iter.bi_sector,
>  dev_bytenr, bio->bi_bdev);
>  
> - mapped_datav = kmalloc_array(bio->bi_vcnt,
> + mapped_datav = kmalloc_array(segs,
>sizeof(*mapped_datav), GFP_NOFS);
>   if (!mapped_datav)
>   goto leave;
>   cur_bytenr = dev_bytenr;
>  
> - bio_for_each_segment_all(bvec, bio, i) {
> - BUG_ON(bvec->bv_len != PAGE_SIZE);
> - mapped_datav[i] = kmap(bvec->bv_page);
> + bio_for_each_segment(bvec, bio, iter) {
> + BUG_ON(bvec.bv_len != PAGE_SIZE);
> + mapped_datav[i] = kmap(bvec.bv_page);
> + i++;
>  
>   if (dev_state->state->print_mask &
>   BTRFSIC_PRINT_MASK_SUBMIT_BIO_BH_VERBOSE)
>   pr_info("#%u: bytenr=%llu, len=%u, offset=%u\n",
> -i, cur_bytenr, bvec->bv_len, 
> bvec->bv_offset);
> - cur_bytenr += bvec->bv_len;
> +i, cur_bytenr, bvec.bv_len, 
> bvec.bv_offset);
> + cur_bytenr += bvec.bv_len;
>   }
>   btrfsic_process_written_block(dev_state, dev_bytenr,
> -   mapped_datav, bio->bi_vcnt,
> +   mapped_datav, segs,
> bio, &bio_is_patched,
> NULL, bio->bi_opf);
> - bio_for_each_segment_all(bvec, bio, i)
> - kunmap(bvec->bv_page);
> + bio_for_each_segment(bvec, bio, iter)
> + kunmap(bvec.bv_page);
>   kfree(mapped_datav);
>   } else if (NULL != dev_state && (bio->bi_opf & REQ_PREFLUSH)) {
>   if (dev_state->state->print_mask &
> -- 
> 2.5.5
> 


Wrong received UUIDs after backup restore and onward

2017-05-05 Thread Lewis Diamond

Hi,

I ran into a situation where my incremental backups using send|receive 
are failing after a full file system failure followed by restore from an 
external backup. The reason for the failures seems to be that the 
Received UUID of snapshots and backups is not properly updated. In 
short, everything (snapshots and backups) from the time of restore 
onward has the same Received UUID. Here is an example:


   [root@ldiamond-desktop test_btrfs]# btrfs sub create test1
   Create subvolume './test1'
   [root@ldiamond-desktop test_btrfs]# echo '1' > test1/test
   [root@ldiamond-desktop test_btrfs]# btrfs sub snap -r test1/ test1_snap
   Create a readonly snapshot of 'test1/' in './test1_snap'
   [root@ldiamond-desktop test_btrfs]# btrfs send test1_snap/ | btrfs receive 
/media/backup/test_btrfs/
   At subvol test1_snap/
   At subvol test1_snap
   [root@ldiamond-desktop test_btrfs]# echo '2' >> test1/test
   [root@ldiamond-desktop test_btrfs]# btrfs sub snap -r test1 test1_snap2
   Create a readonly snapshot of 'test1' in './test1_snap2'
   [root@ldiamond-desktop test_btrfs]# btrfs send test1_snap2 -p test1_snap | 
btrfs receive /media/backup/test_btrfs/
   At subvol test1_snap2
   At snapshot test1_snap2
   [root@ldiamond-desktop test_btrfs]# btrfs sub show test1
   /media/btr0/test_btrfs/test1
Name:   test1
UUID:   6eb0318e-f036-7840-a39f-f9f56b3769f4
Parent UUID:-
Received UUID:  -
Creation time:  2017-05-05 12:29:16 -0400
Subvolume ID:   698
Generation: 11566
Gen at creation:11558
Parent ID:  5
Top level ID:   5
Flags:  -
Snapshot(s):
test_btrfs/test1_snap
test_btrfs/test1_snap2
   [root@ldiamond-desktop test_btrfs]# btrfs sub show test1_snap
   /media/btr0/test_btrfs/test1_snap
Name:   test1_snap
UUID:   8afd7fec-5141-7b46-af05-77c78b051977
Parent UUID:6eb0318e-f036-7840-a39f-f9f56b3769f4
Received UUID:  -
Creation time:  2017-05-05 12:30:17 -0400
Subvolume ID:   700
Generation: 11561
Gen at creation:11561
Parent ID:  5
Top level ID:   5
Flags:  readonly
Snapshot(s):
   [root@ldiamond-desktop test_btrfs]# btrfs sub show test1_snap2
   /media/btr0/test_btrfs/test1_snap2
Name:   test1_snap2
UUID:   fc8a9859-962d-954b-92df-32ef31b88bc9
Parent UUID:6eb0318e-f036-7840-a39f-f9f56b3769f4
Received UUID:  -
Creation time:  2017-05-05 12:32:46 -0400
Subvolume ID:   702
Generation: 11566
Gen at creation:11566
Parent ID:  5
Top level ID:   5
Flags:  readonly
Snapshot(s):
   [root@ldiamond-desktop test_btrfs]# btrfs sub d test1 test1_snap*
   Delete subvolume (no-commit): '/media/btr0/test_btrfs/test1'
   Delete subvolume (no-commit): '/media/btr0/test_btrfs/test1_snap'
   Delete subvolume (no-commit): '/media/btr0/test_btrfs/test1_snap2'
   [root@ldiamond-desktop test_btrfs]# btrfs send 
/media/backup/test_btrfs/test1_snap2/ | btrfs receive ./
   At subvol /media/backup/test_btrfs/test1_snap2/
   At subvol test1_snap2
   [root@ldiamond-desktop test_btrfs]# btrfs property set -ts test1_snap2/ ro 
false
   [root@ldiamond-desktop test_btrfs]# echo '3' >> test1_snap2/test
   [root@ldiamond-desktop test_btrfs]# cat test1_snap2/test
   1
   2
   3
   [root@ldiamond-desktop test_btrfs]# btrfs sub snap -r test1_snap2/ 
test1_snap3
   Create a readonly snapshot of 'test1_snap2/' in './test1_snap3'
   [root@ldiamond-desktop test_btrfs]# btrfs send test1_snap3/ | btrfs receive 
/media/backup/test_btrfs/
   At subvol test1_snap3/
   At subvol test1_snap3
   [root@ldiamond-desktop test_btrfs]# echo '4' >> test1_snap2/test
   [root@ldiamond-desktop test_btrfs]# btrfs sub snap -r test1_snap2/ 
test1_snap4
   Create a readonly snapshot of 'test1_snap2/' in './test1_snap4'
   [root@ldiamond-desktop test_btrfs]# btrfs send test1_snap4/ -p test1_snap3/ 
| btrfs receive /media/backup/test_btrfs/
   At subvol test1_snap4/
   At snapshot test1_snap4
   [root@ldiamond-desktop test_btrfs]# btrfs sub show test1_snap2
   /media/btr0/test_btrfs/test1_snap2
Name:   test1_snap2
UUID:   5f505e2f-2e0a-8840-8132-08ba3bdb4484
Parent UUID:-
Received UUID:  

Re: [PATCH] Btrfs: tolerate errors if we have retried successfully

2017-05-05 Thread David Sterba
On Thu, Apr 13, 2017 at 06:11:56PM -0700, Liu Bo wrote:
> With raid1 profile, dio read isn't tolerating IO errors if read length is
> less than the stripe length (64K).

Can you please write more details why this is true? Some pointers to
code etc, I'm lost. Eg. where the errors is tolerated. Thanks.

> This fixes the problem by setting bio's error to 0 if a good copy has been
> found.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/inode.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 632b616..4e1398e 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8113,8 +8113,11 @@ static void btrfs_endio_direct_read(struct bio *bio)
>   struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
>   int err = bio->bi_error;
>  
> - if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED)
> + if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED) {
>   err = btrfs_subio_endio_read(inode, io_bio, err);
> + if (!err)
> + bio->bi_error = 0;
> + }
>  
> unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset,
> dip->logical_offset + dip->bytes - 1);
> -- 
> 2.5.5
> 


Re: [PATCH] btrfs: cleanup qgroup trace event

2017-05-05 Thread David Sterba
On Fri, May 05, 2017 at 10:09:36AM +0800, Anand Jain wrote:
> Commit 81fb6f77a026 (btrfs: qgroup: Add new trace point for
> qgroup data reserve) added the following events which aren't used.
>   btrfs__qgroup_data_map
>   btrfs_qgroup_init_data_rsv_map
>   btrfs_qgroup_free_data_rsv_map
> So remove them.
> 
> Signed-off-by: Anand Jain 
>   cc: quwen...@cn.fujitsu.com
> Reviewed-by: Qu Wenruo 

Patch queued, thanks.


[PATCH 3/3] Btrfs: don't pass the inode through clean_io_failure

2017-05-05 Thread Josef Bacik
Instead pass around the failure tree and the io tree.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent_io.c | 51 ---
 fs/btrfs/extent_io.h | 10 +++---
 fs/btrfs/inode.c | 37 ++---
 3 files changed, 61 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index de6f890..f693de1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1939,11 +1939,12 @@ static void check_page_uptodate(struct extent_io_tree 
*tree, struct page *page)
SetPageUptodate(page);
 }
 
-int free_io_failure(struct btrfs_inode *inode, struct io_failure_record *rec)
+int free_io_failure(struct extent_io_tree *failure_tree,
+   struct extent_io_tree *io_tree,
+   struct io_failure_record *rec)
 {
int ret;
int err = 0;
-   struct extent_io_tree *failure_tree = &inode->io_failure_tree;
 
set_state_failrec(failure_tree, rec->start, NULL);
ret = clear_extent_bits(failure_tree, rec->start,
@@ -1952,7 +1953,7 @@ int free_io_failure(struct btrfs_inode *inode, struct 
io_failure_record *rec)
if (ret)
err = ret;
 
-   ret = clear_extent_bits(&inode->io_tree, rec->start,
+   ret = clear_extent_bits(io_tree, rec->start,
rec->start + rec->len - 1,
EXTENT_DAMAGED);
if (ret && !err)
@@ -2068,24 +2069,24 @@ int repair_eb_io_failure(struct btrfs_fs_info *fs_info,
  * each time an IO finishes, we do a fast check in the IO failure tree
  * to see if we need to process or clean up an io_failure_record
  */
-int clean_io_failure(struct btrfs_inode *inode, u64 start, struct page *page,
-unsigned int pg_offset)
+int clean_io_failure(struct btrfs_fs_info *fs_info,
+struct extent_io_tree *failure_tree,
+struct extent_io_tree *io_tree, u64 start,
+struct page *page, u64 ino, unsigned int pg_offset)
 {
u64 private;
struct io_failure_record *failrec;
-   struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct extent_state *state;
int num_copies;
int ret;
 
private = 0;
-   ret = count_range_bits(&inode->io_failure_tree, &private,
-   (u64)-1, 1, EXTENT_DIRTY, 0);
+   ret = count_range_bits(failure_tree, &private, (u64)-1, 1,
+  EXTENT_DIRTY, 0);
if (!ret)
return 0;
 
-   ret = get_state_failrec(&inode->io_failure_tree, start,
-   &failrec);
+   ret = get_state_failrec(failure_tree, start, &failrec);
if (ret)
return 0;
 
@@ -2101,25 +2102,25 @@ int clean_io_failure(struct btrfs_inode *inode, u64 
start, struct page *page,
if (fs_info->sb->s_flags & MS_RDONLY)
goto out;
 
-   spin_lock(&inode->io_tree.lock);
-   state = find_first_extent_bit_state(&inode->io_tree,
+   spin_lock(&io_tree->lock);
+   state = find_first_extent_bit_state(io_tree,
 failrec->start,
 EXTENT_LOCKED);
-   spin_unlock(&inode->io_tree.lock);
+   spin_unlock(&io_tree->lock);
 
if (state && state->start <= failrec->start &&
state->end >= failrec->start + failrec->len - 1) {
num_copies = btrfs_num_copies(fs_info, failrec->logical,
  failrec->len);
if (num_copies > 1)  {
-   repair_io_failure(fs_info, btrfs_ino(inode), start,
- failrec->len, failrec->logical, page,
- pg_offset, failrec->failed_mirror);
+   repair_io_failure(fs_info, ino, start, failrec->len,
+ failrec->logical, page, pg_offset,
+ failrec->failed_mirror);
}
}
 
 out:
-   free_io_failure(inode, failrec);
+   free_io_failure(failure_tree, io_tree, failrec);
 
return 0;
 }
@@ -2360,6 +2361,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 
phy_offset,
struct io_failure_record *failrec;
struct inode *inode = page->mapping->host;
struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+   struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
struct bio *bio;
int read_mode = 0;
int ret;
@@ -2372,7 +2374,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 
phy_offset,
 
ret = btrfs_check_repairable(inode, failed_bio, failrec, failed_mirror);
if (!ret) {
-   free_io_failure(BTRFS_I(inode), failrec);
+   free_io_failure(failure_tree, tree, failrec);
return -EIO;
}
 
@@ -2385,7 +2387,7 @@ static int 

[PATCH 2/3] btrfs: remove inode argument from repair_io_failure

2017-05-05 Thread Josef Bacik
Once we remove the btree_inode we won't have an inode to pass anymore, just pass
the fs_info directly and the inum since we use that to print out the repair
message.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent_io.c | 16 +++-
 fs/btrfs/extent_io.h |  6 +++---
 fs/btrfs/scrub.c |  2 +-
 3 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index acc7c68..de6f890 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1972,11 +1972,10 @@ int free_io_failure(struct btrfs_inode *inode, struct 
io_failure_record *rec)
  * currently, there can be no more than two copies of every data bit. thus,
  * exactly one rewrite is required.
  */
-int repair_io_failure(struct btrfs_inode *inode, u64 start, u64 length,
-   u64 logical, struct page *page,
-   unsigned int pg_offset, int mirror_num)
+int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
+ u64 length, u64 logical, struct page *page,
+ unsigned int pg_offset, int mirror_num)
 {
-   struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct bio *bio;
struct btrfs_device *dev;
u64 map_length = 0;
@@ -2035,7 +2034,7 @@ int repair_io_failure(struct btrfs_inode *inode, u64 
start, u64 length,
 
btrfs_info_rl_in_rcu(fs_info,
"read error corrected: ino %llu off %llu (dev %s sector %llu)",
- btrfs_ino(inode), start,
+ ino, start,
  rcu_str_deref(dev->name), sector);
btrfs_bio_counter_dec(fs_info);
bio_put(bio);
@@ -2055,8 +2054,7 @@ int repair_eb_io_failure(struct btrfs_fs_info *fs_info,
for (i = 0; i < num_pages; i++) {
struct page *p = eb->pages[i];
 
-   ret = repair_io_failure(BTRFS_I(fs_info->btree_inode), start,
-   PAGE_SIZE, start, p,
+   ret = repair_io_failure(fs_info, 0, start, PAGE_SIZE, start, p,
start - page_offset(p), mirror_num);
if (ret)
break;
@@ -2114,8 +2112,8 @@ int clean_io_failure(struct btrfs_inode *inode, u64 
start, struct page *page,
num_copies = btrfs_num_copies(fs_info, failrec->logical,
  failrec->len);
if (num_copies > 1)  {
-   repair_io_failure(inode, start, failrec->len,
- failrec->logical, page,
+   repair_io_failure(fs_info, btrfs_ino(inode), start,
+ failrec->len, failrec->logical, page,
  pg_offset, failrec->failed_mirror);
}
}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index e4e01d1..fd04124 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -467,9 +467,9 @@ struct bio *btrfs_bio_clone(struct bio *bio, gfp_t 
gfp_mask);
 struct btrfs_fs_info;
 struct btrfs_inode;
 
-int repair_io_failure(struct btrfs_inode *inode, u64 start, u64 length,
-   u64 logical, struct page *page,
-   unsigned int pg_offset, int mirror_num);
+int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
+ u64 length, u64 logical, struct page *page,
+ unsigned int pg_offset, int mirror_num);
 int clean_io_failure(struct btrfs_inode *inode, u64 start,
struct page *page, unsigned int pg_offset);
 void end_extent_writepage(struct page *page, int err, u64 start, u64 end);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b0251eb..594d5c4 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -731,7 +731,7 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 
root, void *fixup_ctx)
ret = -EIO;
goto out;
}
-   ret = repair_io_failure(BTRFS_I(inode), offset, PAGE_SIZE,
+   ret = repair_io_failure(fs_info, inum, offset, PAGE_SIZE,
fixup->logical, page,
offset - page_offset(page),
fixup->mirror_num);
-- 
2.7.4



[PATCH 1/3] Btrfs: replace tree->mapping with tree->private_data

2017-05-05 Thread Josef Bacik
For extent_io trees we have carried the address_mapping of the inode around in
the io tree in order to pull the inode back out for calling into various tree
ops hooks.  This works fine when everything that has an extent_io_tree has an
inode.  But we are going to remove the btree_inode, so we need to change this.
Instead just have a generic void * for private data that we can initialize with,
and have all the tree ops use that instead.  This had a lot of cascading changes
but should be relatively straightforward.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/disk-io.c   | 51 ++---
 fs/btrfs/disk-io.h   |  6 +--
 fs/btrfs/extent_io.c | 52 -
 fs/btrfs/extent_io.h | 21 +-
 fs/btrfs/inode.c | 82 +++-
 fs/btrfs/relocation.c|  3 +-
 fs/btrfs/tests/extent-io-tests.c |  2 +-
 fs/btrfs/transaction.c   |  2 +-
 9 files changed, 128 insertions(+), 92 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3e21211..80b3383 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3154,6 +3154,7 @@ int btrfs_create_subvol_root(struct btrfs_trans_handle 
*trans,
 int btrfs_merge_bio_hook(struct page *page, unsigned long offset,
 size_t size, struct bio *bio,
 unsigned long bio_flags);
+void btrfs_set_range_writeback(void *private_data, u64 start, u64 end);
 int btrfs_page_mkwrite(struct vm_fault *vmf);
 int btrfs_readpage(struct file *file, struct page *page);
 void btrfs_evict_inode(struct inode *inode);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3a25a16..7c8ec5a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -118,7 +118,8 @@ void btrfs_end_io_wq_exit(void)
  * just before they are sent down the IO stack.
  */
 struct async_submit_bio {
-   struct inode *inode;
+   void *private_data;
+   struct btrfs_fs_info *fs_info;
struct bio *bio;
struct list_head list;
extent_submit_bio_hook_t *submit_bio_start;
@@ -871,7 +872,7 @@ static void run_one_async_start(struct btrfs_work *work)
int ret;
 
async = container_of(work, struct  async_submit_bio, work);
-   ret = async->submit_bio_start(async->inode, async->bio,
+   ret = async->submit_bio_start(async->private_data, async->bio,
  async->mirror_num, async->bio_flags,
  async->bio_offset);
if (ret)
@@ -885,7 +886,7 @@ static void run_one_async_done(struct btrfs_work *work)
int limit;
 
async = container_of(work, struct  async_submit_bio, work);
-   fs_info = BTRFS_I(async->inode)->root->fs_info;
+   fs_info = async->fs_info;
 
limit = btrfs_async_submit_limit(fs_info);
limit = limit * 2 / 3;
@@ -904,7 +905,7 @@ static void run_one_async_done(struct btrfs_work *work)
return;
}
 
-   async->submit_bio_done(async->inode, async->bio, async->mirror_num,
+   async->submit_bio_done(async->private_data, async->bio, 
async->mirror_num,
   async->bio_flags, async->bio_offset);
 }
 
@@ -916,10 +917,9 @@ static void run_one_async_free(struct btrfs_work *work)
kfree(async);
 }
 
-int btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct inode *inode,
-   struct bio *bio, int mirror_num,
-   unsigned long bio_flags,
-   u64 bio_offset,
+int btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
+   int mirror_num, unsigned long bio_flags,
+   u64 bio_offset, void *private_data,
extent_submit_bio_hook_t *submit_bio_start,
extent_submit_bio_hook_t *submit_bio_done)
 {
@@ -929,7 +929,8 @@ int btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, 
struct inode *inode,
if (!async)
return -ENOMEM;
 
-   async->inode = inode;
+   async->private_data = private_data;
+   async->fs_info = fs_info;
async->bio = bio;
async->mirror_num = mirror_num;
async->submit_bio_start = submit_bio_start;
@@ -975,7 +976,7 @@ static int btree_csum_one_bio(struct bio *bio)
return ret;
 }
 
-static int __btree_submit_bio_start(struct inode *inode, struct bio *bio,
+static int __btree_submit_bio_start(void *private_data, struct bio *bio,
int mirror_num, unsigned long bio_flags,
u64 bio_offset)
 {
@@ -986,10 +987,11 @@ static int __btree_submit_bio_start(struct inode *inode, 
struct bio *bio,
return btree_csum_one_bio(bio);
 }
 
-static int __btree_submit_bio_done(struct inode *inode, struct bio *bio,
+static int __btree_submit_bio_done(void *private_data, struct 

[PATCH 0/3] Kill btree inode prep patches

2017-05-05 Thread Josef Bacik
These three patches are just prep patches for the kill btree inode patch, they
just move some stuff around so we don't depend on struct inode in places where
we won't have one.  Once the other supporting generic code goes in I'll submit
the kill btree inode patch as well.  Thanks,

Josef


Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-05 Thread Marc MERLIN
Thanks again for your answer. Obviously even if my filesystem is toast,
it's useful to learn from what happened.

On Fri, May 05, 2017 at 01:03:02PM +0800, Qu Wenruo wrote:
> > > So unfortunately, your fs/subvolume trees are also corrupted.
> > > And almost no chance to do a graceful recovery.
> > So I'm confused here. You're saying my metadata is not corrupted (and in
> > my case, I have DUP, so I should have 2 copies),
> 
> Nope, here I'm all talking about metadata (tree blocks).
> Difference is the owner, either extent tree or fs/subvolume tree.
 
I see. I didn't realize that my filesystem managed to corrupt both
copies of its metadata.

> The fsck doesn't check data blocks.

Right, that's what scrub does, fair enough.

> The problem is, tree blocks (metadata) that refers these data blocks are
> corrupted.
> 
> And they are corrupted in such a way that both extent tree (tree contains
> extent allocation info) and fs tree (tree contains real fs info, like inode
> and data location) are corrupted.
> 
> So graceful recovery is not possible now.

I see, thanks for explaining.

> Unfortunately, no, even you have 2 copies, a lot of tree blocks are
> corrupted that neither copy matches checksum.
 
Thanks for confirming. I guess if I'm having corruption due to a bad
card, it makes sense that both get updated after one another and both
got corrupted for the same reason.

> Corrupted blocks are corrupted, that command is just trying to corrupt it
> again.
> It won't do the black magic to adjust tree blocks to avoid them.
 
I see. You may have seen the earlier message from Kai Krakow, who was
able to recover his FS by trying this trick, but I understand it
can't work in all cases.

Thanks again for your answers.
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: BTRFS converted from EXT4 becomes read-only after reboot

2017-05-05 Thread Chris Murphy
On Fri, May 5, 2017 at 8:56 AM, Alexandru Guzu  wrote:
> btrfs-progs is 4.4

That's before the rewrite of convert.


> I upgraded the kernel to 4.8.0-51 and the issue persists.
> However, I noticed that the issue is triggered when I start Firefox. I
> think Firefox starts writing something to some specific files and that
> triggers the issue because I was able to upgrade the kernel without
> getting the disk in RO mode.

When Btrfs becomes aware of its confusion, it'll go readonly to
prevent corrupting the file system. It might already have some
isolated corruption. It doesn't really tell us more than that.

>
> I can try to get a newer version of btrfs-progs, but what should I do
> with it? Should I check/scrub? I don't see how simply having a newer
> version of the btrfs-progs would prevent the kernel from throwing the
> error and putting the disk in read-only mode.

It won't change anything now. The version question is to establish
whether you have old or new style converted ext4. The call traces you
have come up in the list only with new style convert. But you have old
style. That's why I included Qu in the cc, I have no idea what the
problem is and if it's fixable with btrfs check from a recent
btrfs-progs or if it needs an even newer kernel, or if it's an unfixed
bug still.

In the meantime I would do this:

1. Backup what you care about. Decent chance the only way forward is
to create a new file system, so you might as well take advantage of
the fact at least you can mount it read only right now.

2. Boot some live media and use btrfs-image to capture an image of the
file system as it it. The newer progs the better, as they get a bit
more tolerant of file system errors and hopefully you can capture a
complete image. btrfs-image -c9 -t4 -s /dev/sdX filename.bin  \\ This
is metadata only, and will sanitize filenames.

3. Output from both of these
btrfs check 
btrfs check --mode=lowmem   ## this will spit out more
problems if it's older than 4.10.2

Do this without --repair.

What seems safe to me, is while booted with live media, mount the
problem Btrfs file system, and find the Firefox cache,
~/.cache/mozilla/firefox/*.default and delete that whole default
directory. Now you can reboot the installed system, and see if the
problem still happens. If it does, you can try to reinstall Firefox.

If that doesn't help, it's a coin toss whether it's worth trying btrfs
check --repair, or whether you're better off taking a read-only snapshot
and using btrfs send/receive to a new Btrfs file system. The
send/receive process is easier to use than rsync -a or cp -a, and things
get allocated natively rather than however convert did it.
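The snapshot-plus-send/receive route, sketched the same dry-run way
(the mount points are placeholders; drop the echos to run it for real,
with a freshly made file system mounted at $DST):

```shell
#!/bin/sh
# Dry-run sketch of the read-only snapshot + send/receive migration.
SRC=/mnt/old    # the converted file system, placeholder path
DST=/mnt/new    # a new btrfs file system, placeholder path

echo btrfs subvolume snapshot -r "$SRC" "$SRC/migrate-ro"
echo "btrfs send $SRC/migrate-ro | btrfs receive $DST"
```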



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: add framework to handle device flush error as a volume

2017-05-05 Thread David Sterba
On Tue, Apr 25, 2017 at 04:58:31PM +0800, Anand Jain wrote:
> This adds comments to the flush error handling part of
> the code, and hopes to maintain the same logic with a
> framework which can be used to handle the errors at the
> volume level.
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/disk-io.c | 58 ++
>  fs/btrfs/volumes.h |  1 +
>  2 files changed, 55 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index eb1ee7b6f532..dafcb6bb2d5d 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3527,6 +3527,10 @@ static int write_dev_flush(struct btrfs_device *device, int wait)
>   if (wait) {
>   bio = device->flush_bio;
>   if (!bio)
> + /*
> +  * This means the alloc has failed with ENOMEM, however
> +  * here we return 0, as its not a device error.
> +  */
>   return 0;
>  
>   wait_for_completion(>flush_wait);
> @@ -3566,6 +3570,33 @@ static int write_dev_flush(struct btrfs_device *device, int wait)
>   return 0;
>  }
>  
> +static int check_barrier_error(struct btrfs_fs_devices *fsdevs)
> +{
> + int submit_flush_error = 0;
> + int dev_flush_error = 0;
> + struct btrfs_device *dev;
> +
> + list_for_each_entry_rcu(dev, >devices, dev_list) {
> + if (!dev->bdev) {
> + submit_flush_error++;
> + dev_flush_error++;
> + continue;
> + }
> + if (dev->last_flush_error == ENOMEM)

That's -ENOMEM

> + submit_flush_error++;
> + if (dev->last_flush_error && dev->last_flush_error != ENOMEM)

also here.

> + dev_flush_error++;
> + }
> +
> + if (submit_flush_error >
> + fsdevs->fs_info->num_tolerated_disk_barrier_failures ||
> + dev_flush_error >
> + fsdevs->fs_info->num_tolerated_disk_barrier_failures)
> + return -EIO;

Can you please reformat this so it's clear what's the condition and
what's the statement?

> +
> + return 0;
> +}
> +
>  /*
>   * send an empty flush down to each device in parallel,
>   * then wait for them
> @@ -3593,6 +3624,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
>   ret = write_dev_flush(dev, 0);
>   if (ret)
>   errors_send++;
> + dev->last_flush_error = ret;

Here the error is set unconditionally, so it always tracks the return
code, not only the error ...

>   }
>  
>   /* wait for all the barriers */
> @@ -3607,12 +3639,30 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
>   continue;
>  
>   ret = write_dev_flush(dev, 1);
> - if (ret)
> + if (ret) {
> + dev->last_flush_error = ret;

... while this tracks only the errors. Unless I'm missing something,
both should be set in a consistent way.

>   errors_wait++;
> + }
> + }
> +
> + /*
> +  * Try hard in case of flush. Lets say, in RAID1 we have
> +  * the following situation
> +  *  dev1: EIO dev2: ENOMEM
> +  * this is not a fatal error as we hope to recover from
> +  * ENOMEM in the next attempt to flush.

This could still be problematic under some very rare conditions, but I
don't deem it important at the moment as the memory allocation will be
gone. Then the comment reflects the current state, which is fine.

> +  * But the following is considered as fatal
> +  *  dev1: ENOMEM dev2: ENOMEM
> +  *  dev1: bdev == NULL dev2: ENOMEM
> +  */
> + if (errors_send || errors_wait) {
> + /*
> +  * At some point we need the status of all disks
> +  * to arrive at the volume status. So error checking
> +  * is being pushed to a separate loop.
> +  */
> + return check_barrier_error(info->fs_devices);
>   }
> - if (errors_send > info->num_tolerated_disk_barrier_failures ||
> - errors_wait > info->num_tolerated_disk_barrier_failures)
> - return -EIO;
>   return 0;
>  }
>  
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 59be81206dd7..9c09dcd96e5d 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -74,6 +74,7 @@ struct btrfs_device {
>   int missing;
>   int can_discard;
>   int is_tgtdev_for_dev_replace;
> + int last_flush_error;
>  
>  #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED
>   seqcount_t data_seqcount;
> -- 
> 2.10.0
> 

Re: [PATCH 0/6 RFC] utilize bio_clone_fast to clean up

2017-05-05 Thread David Sterba
On Mon, Apr 17, 2017 at 06:16:21PM -0700, Liu Bo wrote:
> This attempts to use bio_clone_fast() in the places where we clone bio,
> such as when bio got cloned for multiple disks and when bio got split
> during dio submit.
> 
> One benefit is to simplify dio submit to avoid calling bio_add_page one by
> one.
> 
> Another benefit is that comparing to bio_clone_bioset, bio_clone_fast is
> faster because of copying the vector pointer directly, and bio_clone_fast
> doesn't modify bi_vcnt, so the extra work is to fix up bi_vcnt usage we
> currently have to use bi_iter to iterate bvec.
> 
> Liu Bo (6):
>   Btrfs: use bio_clone_fast to clone our bio

Please extend the changelog of this patch, use the text in the cover
letter.

>   Btrfs: use bio_clone_bioset_partial to simplify DIO submit

This patch is too big, can you split it into smaller chunks? I was not
able to review it; it seems to touch several things at once and it's
hard to keep the context.

>   Btrfs: change how we iterate bios in endio
>   Btrfs: record error if one block has failed to retry
>   Btrfs: change check-integrity to use bvec_iter
>   Btrfs: unify naming of btrfs_io_bio

The rest looks ok.

Have you done performance tests? Not that it's necessary, but it would
be interesting to see the effects. The effects of the simplified code
are likely unmeasurable, but the _fast version skips some mempool
exercises, so this could lead to improvements under memory pressure.
Those are hardly deterministic conditions, though, so it could be hard
to measure. I'm expecting some latency improvements.