Re: btrfs_direct_IO oops

2016-10-10 Thread Dave Jones
On Mon, Oct 10, 2016 at 04:43:57AM +0100, Al Viro wrote:
 
 > Very interesting.  Could you slap something like
 > diff --git a/lib/iov_iter.c b/lib/iov_iter.c
 > index 0ce3411..1ef00e7 100644
 > --- a/lib/iov_iter.c
 > +++ b/lib/iov_iter.c
 > @@ -682,8 +682,9 @@ static void pipe_advance(struct iov_iter *i, size_t size)
 >  {
 >  struct pipe_inode_info *pipe = i->pipe;
 >  struct pipe_buffer *buf;
 > -int idx = i->idx;
 > -size_t off = i->iov_offset;
 > +int idx = i->idx, old_idx = idx;
 > +size_t off = i->iov_offset, old_off = off;
 > +size_t old_size = size;
 >  
 >  if (unlikely(i->count < size))
 >  size = i->count;
 > @@ -713,6 +714,9 @@ static void pipe_advance(struct iov_iter *i, size_t size)
 >  pipe->nrbufs--;
 >  }
 >  }
 > +if (!sanity(i))
 > +printk(KERN_ERR "buggered pipe_advance(%zd) [%d,%zd]",
 > +old_size, old_idx, old_off);
 >  }
 >  
 >  void iov_iter_advance(struct iov_iter *i, size_t size)
 > 
 > in there and try to reproduce that one?

My compiler choked on that as posted (idx and off end up used without
being declared), but I fixed it up. The printk didn't trigger, though...


idx = 0, offset = 0
curbuf = 0, nrbufs = 1, buffers = 16
[b9a21100 ea00126019c0 0 4096]
[  (null) ea0013746440 0 0]
[  (null) ea00132956c0 0 0]
[  (null) ea0013295700 0 0]
[  (null) ea0013295740 0 0]
[  (null) ea0013295780 0 0]
[  (null) ea00132957c0 0 0]
[  (null) ea0013595c00 0 0]
[  (null) ea0012b1b6c0 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
------------[ cut here ]------------
WARNING: CPU: 1 PID: 13581 at lib/iov_iter.c:327 sanity+0x102/0x150
CPU: 1 PID: 13581 Comm: trinity-c17 Not tainted 4.8.0-think+ #9 
 c9963ae8 b93e22d1
 b9c1e1cb b93fa5b2 c9963b28 b908b010
 0147d43c0e7f b9c1e1cb 0147 0010

Call Trace:
 [] dump_stack+0x6c/0x9b
 [] ? sanity+0x102/0x150
 [] __warn+0x110/0x130
 [] warn_slowpath_null+0x2c/0x40
 [] sanity+0x102/0x150
 [] copy_page_to_iter+0x2be/0x480
 [] ? workingset_activation+0xc1/0x120
 [] ? workingset_refault+0x190/0x190
 [] generic_file_read_iter+0x245/0xd30
 [] generic_file_splice_read+0xfd/0x230
 [] ? pipe_to_user+0x40/0x40
 [] do_splice_to+0x98/0xd0
 [] splice_direct_to_actor+0xd4/0x2c0
 [] ? generic_pipe_buf_nosteal+0x10/0x10
 [] do_splice_direct+0xc5/0x110
 [] do_sendfile+0x242/0x470
 [] SyS_sendfile64+0x7d/0xf0
 [] do_syscall_64+0x7f/0x200
 [] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace 6773f425063b7f84 ]---
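
The invariant that fires at lib/iov_iter.c:327 can be modelled outside the
kernel. The sketch below is a simplified, standalone version of the 4.8-era
sanity() check (struct and field names here are illustrative, not the
kernel's): with iov_offset == 0 the iterator's idx must be the slot just past
the last occupied pipe buffer, which is exactly what the dumped state above
(idx = 0, curbuf = 0, nrbufs = 1) violates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of the pipe-iterator invariant checked by sanity().
 * curbuf/nrbufs/buffers mirror the fields printed in the dump above. */
struct pipe_model {
	int curbuf;   /* index of the first occupied slot */
	int nrbufs;   /* number of occupied slots */
	int buffers;  /* ring size, a power of two */
};

static bool idx_sane(const struct pipe_model *p, int idx, size_t iov_offset)
{
	int mask = p->buffers - 1;
	int next = (p->curbuf + p->nrbufs) & mask;  /* slot past the last one */

	if (iov_offset == 0)
		return idx == next;
	/* mid-buffer position: must sit on the last occupied slot */
	return p->nrbufs != 0 && idx == ((next - 1 + p->buffers) & mask);
}
```

Feeding it the dumped state reproduces the failure: idx should have been 1,
not 0.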


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is is possible to submit binary image as fstest test case?

2016-10-09 Thread Dave Chinner
On Fri, Oct 07, 2016 at 06:05:51PM +0200, David Sterba wrote:
> On Fri, Oct 07, 2016 at 08:18:38PM +1100, Dave Chinner wrote:
> > On Thu, Oct 06, 2016 at 04:12:56PM +0800, Qu Wenruo wrote:
> > > So I'm wondering if I can just upload a zipped raw image as part of
> > > the test case?
> > 
> > Preferably not. We've managed to avoid pre-built images in xfstests
> > for 15 years, so there'd have to be a really good reason to start
> > doing this, especially as once we open that floodgate we'll end up
> > with everyone wanting to do this and it will blow out the size of
> > the repository in no time.
> > 
> > If the issue is just timing or being unable to trigger an error
> > at the right time, this is what error injection frameworks or
> > debug-only sysfs hooks are for. The XFS kernel code has both,
> > xfstests use both, and they pretty much do away with the need for
> > custom binary filesystem images for such testing...
> 
> An error injection framework may not always be available, e.g. in kernels
> built for production. Yet I'd like to make the test possible.

Why would you ever enable error injection on a production kernel?
We simply don't run the error injection tests when the
infrastructure is not there, just like we do with all other tests
that are dependent on optional kernel/fs features.

> Also, I find it useful to keep the exact images that are attached to a
> report and not necessarily try to recreate it to trigger a bug. If the
> images are small, hosting them with the sources makes testing easy.
> Big images would probably go to own repository and be linked.
> 
> I don't really expect fstests to store crafted images and would advise
> Qu to not even try to propose that, because the answer was quite
> predictable.

You say that like it's a bad thing?  What's the guarantee that a
specific corrupt image will always be sufficient to trigger the
problem the test is supposed to check? i.e. we could change a
different part of the FS code and that could mask the bug the image
used to trigger. The test then passes, but we haven't actually fixed
the bug that the test used to exercise. i.e. we've got a false "we
fixed the problem" report, when all we did is prevent a specific
vector from tripping over it.

Error injection and sysfs hooks into debug code are much more
reliable ways of testing that the root cause of a problem is fixed
and stays fixed.  The reproducing trigger cannot be masked by
changes in other code layers, so we know that if the error
injection/sysfs hook test handles the problem correctly, we have
actually fixed the underlying bug and not just masked it...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Making statfs return qgroup info (for container use case etc.)

2016-10-09 Thread Dave Chinner
On Fri, Oct 07, 2016 at 06:58:47PM +0200, David Sterba wrote:
> On Fri, Oct 07, 2016 at 09:40:11AM +1100, Dave Chinner wrote:
> > XFS does this with directory tree quotas. It was implemented 10 years
> > ago or so, IIRC...
> 
> Sometimes, the line between a historical remark and trolling is thin

All I've done is quickly point out that we've already got an
implementation of the exact functionality being asked for and given
a rough indication of when the commits hit the tree. i.e. so anyone
wanting to implement this in btrfs can search for it easily and work
out how to use the existing VFS infrastructure to report this
information.

You can take that as trolling if you want, but if you think I have
time for playing stupid, petty, idiotic games like that then you
need to ask yourself why you have had such a defensive and
paranoid reaction...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: btrfs_direct_IO oops

2016-10-09 Thread Dave Jones
On Sat, Oct 08, 2016 at 07:20:08PM -0400, Dave Jones wrote:
 > On Sat, Oct 08, 2016 at 07:29:03PM +0100, Al Viro wrote:
 >  > On Sat, Oct 08, 2016 at 02:08:06PM -0400, Dave Jones wrote:
 >  > > That code matches this disassembly:
 >  > > 
 >  > > for (i = seg + 1; i < iter->nr_segs; i++) {
 >  > 
 >  > *whoa*
 >  > 
 >  > OK, that loop in check_direct_IO() should be done *ONLY* for
 >  > iovec iter - even for a bvec one it's completely bogus, and
 >  > for pipe ones it blows up immediately.
 >  > 
 >  > Sorry, I'd missed that bogosity - replace
 >  > if (iov_iter_rw(iter) == WRITE)
 >  >   return 0;
 >  > with
 >  >   if (iov_iter_rw(iter) != READ || !iter_is_iovec(iter))
 >  >   return 0;
 >  > in there; that should fix the damn thing.
 > 
 > Yep, seems to do the trick. Have been running the last six hours
 > without seeing it or anything similar since.

Overnight, I did hit another iov related warning..

idx = 0, offset = 0
curbuf = 0, nrbufs = 1, buffers = 16
[9fa21100 ea00065f6d80 0 4096]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]
[  (null)   (null) 0 0]

WARNING: CPU: 0 PID: 29610 at lib/iov_iter.c:327 sanity+0x102/0x150
CPU: 0 PID: 29610 Comm: trinity-c9 Not tainted 4.8.0-think+ #8 
 c97b7ae8 9f3e2011
 9fc1e22b 9f3fa2f2 c97b7b28 9f08b010
 014734c3d60f 9fc1e22b 0147 0010

Call Trace:
 [] dump_stack+0x6c/0x9b
 [] ? sanity+0x102/0x150
 [] __warn+0x110/0x130
 [] warn_slowpath_null+0x2c/0x40
 [] sanity+0x102/0x150
 [] copy_page_to_iter+0x2be/0x480
 [] ? pagecache_get_page+0xba/0x4f0
 [] generic_file_read_iter+0x245/0xd30
 [] generic_file_splice_read+0xfd/0x230
 [] ? pipe_to_user+0x40/0x40
 [] do_splice_to+0x98/0xd0
 [] splice_direct_to_actor+0xd4/0x2c0
 [] ? generic_pipe_buf_nosteal+0x10/0x10
 [] do_splice_direct+0xc5/0x110
 [] do_sendfile+0x242/0x470
 [] SyS_sendfile64+0x7d/0xf0
 [] do_syscall_64+0x7f/0x200
 [] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace 2c7bd411d4aa0491 ]---



Re: btrfs_direct_IO oops

2016-10-08 Thread Dave Jones
On Sat, Oct 08, 2016 at 07:29:03PM +0100, Al Viro wrote:
 > On Sat, Oct 08, 2016 at 02:08:06PM -0400, Dave Jones wrote:
 > > That code matches this disassembly:
 > > 
 > > for (i = seg + 1; i < iter->nr_segs; i++) {
 > 
 > *whoa*
 > 
 > OK, that loop in check_direct_IO() should be done *ONLY* for
 > iovec iter - even for a bvec one it's completely bogus, and
 > for pipe ones it blows up immediately.
 > 
 > Sorry, I'd missed that bogosity - replace
 > if (iov_iter_rw(iter) == WRITE)
 >  return 0;
 > with
 >  if (iov_iter_rw(iter) != READ || !iter_is_iovec(iter))
 >  return 0;
 > in there; that should fix the damn thing.

Yep, seems to do the trick. Have been running the last six hours
without seeing it or anything similar since.
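
For readers following along: the guard makes the iovec-only duplicate-segment
scan unreachable for every other iterator flavour (the splice path builds a
pipe-backed iterator, which is what blew up). A minimal standalone sketch of
that logic, with all types mocked here rather than taken from
include/linux/uio.h:

```c
#include <assert.h>
#include <stdbool.h>

/* Mocked iterator flavours; the real definitions live in include/linux/uio.h. */
enum mock_iter_type { MOCK_ITER_IOVEC, MOCK_ITER_BVEC, MOCK_ITER_PIPE };
enum mock_rw { MOCK_READ, MOCK_WRITE };

struct mock_iter {
	enum mock_iter_type type;
	enum mock_rw rw;
};

/* Mirrors "iov_iter_rw(iter) != READ || !iter_is_iovec(iter)": only READ
 * iovec iterators may reach the scan that dereferences iter->iov[]. */
static bool may_scan_segments(const struct mock_iter *iter)
{
	return iter->rw == MOCK_READ && iter->type == MOCK_ITER_IOVEC;
}
```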

Dave


btrfs_direct_IO oops

2016-10-08 Thread Dave Jones
Found this in logs this morning. First time I've seen this one.
Might be related to some direct I/O changes I made in Trinity
that are tickling some new path.

Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 2 PID: 25313 Comm: trinity-c18 Not tainted 4.8.0-think+ #7 
task: 88040f7b1c00 task.stack: c976c000
 RIP: 0010:[] 
  [] btrfs_direct_IO+0x13c/0x480 [btrfs]
RSP: 0018:c976fb40  EFLAGS: 00010202
RAX: 2580 RBX: 0258 RCX: 0018
RDX:  RSI: 8804f4d16868 RDI: c976fc40
RBP: c976fbd8 R08: 0001 R09: 
R10: 0001 R11: 0006 R12: 8803
R13: 8803643e1a88 R14: 0001 R15: 0258
FS:  7efc0af88b40() GS:880507c0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 8803643e4008 CR3: 0003ad69e000 CR4: 001406e0
DR0: 006f0020 DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0600
Stack:
  88050174dda0 c976fb60 8804ffaab3f8 c976fcd8
   0001 0001c93b8050 8804f4d16868
  c976fc40   
 
Call Trace:
 [] generic_file_read_iter+0x33d/0xce0
 [] ? ___slab_alloc.constprop.86+0x277/0x5c0
 [] generic_file_splice_read+0xfd/0x230
 [] ? pipe_to_user+0x40/0x40
 [] do_splice_to+0x98/0xd0
 [] splice_direct_to_actor+0xd4/0x2c0
 [] ? generic_pipe_buf_nosteal+0x10/0x10
 [] do_splice_direct+0xc5/0x110
 [] do_sendfile+0x242/0x470
 [] SyS_sendfile64+0x7d/0xf0
 [] do_syscall_64+0x7f/0x200
 [] entry_SYSCALL64_slow_path+0x25/0x25
 Code:  4d 8b 74 1d 00 4d 3b 74 1d 10 74 26 44 89 fb e8 3b de eb c8 83 c3 01 4c 63 fb 4d 39 e7 73 a5 e8 2b de eb c8 4c 89 f8 48 c1 e0 04 <4d> 3b 74 05 00 75 dd 31 db e8 16 de eb c8 48 89 d8 48 8b 7d d0
 
 RIP 
  [] btrfs_direct_IO+0x13c/0x480 [btrfs]
 RSP 
CR2: 8803643e4008


That code matches this disassembly:

for (i = seg + 1; i < iter->nr_segs; i++) {
   41d40:   e8 00 00 00 00  callq  41d45
   41d45:   83 c3 01        add    $0x1,%ebx
   41d48:   4c 63 fb        movslq %ebx,%r15
   41d4b:   4d 39 e7        cmp    %r12,%r15
   41d4e:   73 a5           jae    41cf5
if (iter->iov[seg].iov_base == iter->iov[i].iov_base)
   41d50:   e8 00 00 00 00  callq  41d55
   41d55:   4c 89 f8        mov    %r15,%rax
   41d58:   48 c1 e0 04     shl    $0x4,%rax
   41d5c:   4d 3b 74 05 00  cmp    0x0(%r13,%rax,1),%r14
   41d61:   75 dd           jne    41d40
return 0;

 




Re: [PATCH] generic: make 17[1-4] work well when btrfs compression is enabled

2016-10-07 Thread Dave Chinner
On Fri, Oct 07, 2016 at 03:00:42PM +0800, Wang Xiaoguang wrote:
> When btrfs compression is enabled, the original code cannot fill the fs
> correctly; fix this.
> 
> Signed-off-by: Wang Xiaoguang 
> ---
>  tests/generic/171 | 4 +---
>  tests/generic/172 | 2 +-
>  tests/generic/173 | 4 +---
>  tests/generic/174 | 4 +---
>  4 files changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/tests/generic/171 b/tests/generic/171
> index f391685..d2ae8e4 100755
> --- a/tests/generic/171
> +++ b/tests/generic/171
> @@ -75,9 +75,7 @@ _cp_reflink $testdir/bigfile $testdir/clonefile
>  sync
>  
>  echo "Allocate the rest of the space"
> -nr_free=$(stat -f -c '%f' $testdir)
> -touch $testdir/file0 $testdir/file1
> -_pwrite_byte 0x61 0 $((blksz * nr_free)) $testdir/eat_my_space >> $seqres.full 2>&1
> +dd if=/dev/zero of=$testdir/eat_my_space >> $seqres.full 2>&1

Please don't replace xfs_io writes using a specific data pattern
with dd calls that write zeros. Indeed, we don't use dd for new
tests anymore - xfs_io should be used.

Write a function that fills all the remaining free space (one may
already exist) and call it in all these places.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Is is possible to submit binary image as fstest test case?

2016-10-07 Thread Dave Chinner
On Fri, Oct 07, 2016 at 05:26:27PM +0800, Qu Wenruo wrote:
> 
> 
> At 10/07/2016 05:18 PM, Dave Chinner wrote:
> >On Thu, Oct 06, 2016 at 04:12:56PM +0800, Qu Wenruo wrote:
> >>Hi,
> >>
> >>Just as the title says, for some cases (OK, btrfs again) we need to
> >>catch a file system at a specific point in time.
> >>
> >>In this specific case, we need to grab a btrfs image undergoing
> >>balancing, just before the balance finished.
> >>
> >>Although we can use flakey to drop all writes, we still don't have a
> >>method to catch the timing of that transaction.
> >>
> >>
> >>On the other hand, we can tweak our local kernel, adding an
> >>msleep()/message, and dump the disk during the sleep.
> >>The image I dumped can easily trigger btrfs kernel and user-space bugs.
> >>
> >>So I'm wondering if I can just upload a zipped raw image as part of
> >>the test case?
> >
> >Preferably not. We've managed to avoid pre-built images in xfstests
> >for 15 years, so there'd have to be a really good reason to start
> >doing this, especially as once we open that floodgate we'll end up
> >with everyone wanting to do this and it will blow out the size of
> >the repository in no time.
> 
> Makes sense.
> For btrfs-progs, which includes test images, it already takes about
> 77M, even though we have tried our best to reduce the image size.
> 
> >
> >If the issue is just timing or being unable to trigger an error
> >at the right time, this is what error injection frameworks or
> >debug-only sysfs hooks are for. The XFS kernel code has both,
> >xfstests use both, and they pretty much do away with the need for
> >custom binary filesystem images for such testing...
> 
> So again, btrfs is lacking infrastructure for debug.
> It seems that we can only rely on images out of xfstest tree,

That's the /wrong answer/. Go and implement the debug infrastructure
that btrfs needs - if you wait for someone else to do it, it will
never get done and btrfs will never stabilise.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Is is possible to submit binary image as fstest test case?

2016-10-07 Thread Dave Chinner
On Thu, Oct 06, 2016 at 04:12:56PM +0800, Qu Wenruo wrote:
> Hi,
> 
> Just as the title says, for some cases (OK, btrfs again) we need to
> catch a file system at a specific point in time.
> 
> In this specific case, we need to grab a btrfs image undergoing
> balancing, just before the balance finished.
> 
> Although we can use flakey to drop all writes, we still don't have a
> method to catch the timing of that transaction.
> 
> 
> On the other hand, we can tweak our local kernel, adding an
> msleep()/message, and dump the disk during the sleep.
> The image I dumped can easily trigger btrfs kernel and user-space bugs.
> 
> So I'm wondering if I can just upload a zipped raw image as part of
> the test case?

Preferably not. We've managed to avoid pre-built images in xfstests
for 15 years, so there'd have to be a really good reason to start
doing this, especially as once we open that floodgate we'll end up
with everyone wanting to do this and it will blow out the size of
the repository in no time.

If the issue is just timing or being unable to trigger an error
at the right time, this is what error injection frameworks or
debug-only sysfs hooks are for. The XFS kernel code has both,
xfstests use both, and they pretty much do away with the need for
custom binary filesystem images for such testing...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Making statfs return qgroup info (for container use case etc.)

2016-10-06 Thread Dave Chinner
On Thu, Oct 06, 2016 at 02:51:58PM +0100, Tim Small wrote:
> Hello,
> 
> I use btrfs for containers (e.g. full OS style containers using lxc/lxd
> etc. rather than application containers).  btrfs de-duplication and
> send/receive are very useful features for this use-case.
> 
> Each container runs in its own subvolume, and I use quotas to stop one
> container exhausting the disk space of the others.
> 
> df shows the total space + free space for the entire filesystem - which
> is confusing for users within the containers.  Worse - they don't have
> any way of knowing how much quota they actually have left (other than du
> -xs / vs. whatever I've told them).
> 
> It'd be really nice if running e.g. statfs("/home", ...) within a
> container could be made to return the qgroup usage + limit for the path
> (e.g. by passing in an option at mount time - such as qgroup level
> maybe?) , instead of the global filesystem data in f_bfree f_blocks etc.

XFS does this with directory tree quotas. It was implemented 10 years
ago or so, IIRC...
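
For context, what statfs(2) reports today is per-filesystem, not
per-subvolume, which is the root of the complaint above. A short userspace
sketch of reading those fields via the portable statvfs(3) wrapper (the path
is just an example):

```c
#include <assert.h>
#include <sys/statvfs.h>

/* Returns the bytes available to unprivileged users on the filesystem
 * containing path, or -1 on error.  On btrfs today this describes the
 * whole filesystem, not the calling container's subvolume/qgroup limit. */
static long long avail_bytes(const char *path)
{
	struct statvfs s;

	if (statvfs(path, &s) != 0)
		return -1;
	/* f_bavail is counted in f_frsize units */
	return (long long)s.f_bavail * (long long)s.f_frsize;
}
```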

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Problem with btrfs_create() getting EBUSY; cause is stale highest_objectid at mount - kernel.org bug 173001

2016-10-06 Thread Dave Olson
Dave Olson  wrote:

> Dave Olson  wrote:
> 
> > The full details are in
> > https://bugzilla.kernel.org/show_bug.cgi?id=173001
> > 
> > > Basically, logrotate has rotated a file to a new name, tries to open a new
> > file with the original name, and gets EBUSY.  The file is not created.

The cause is that the highest_objectid field in the btrfs_root structure
is set incorrectly (an old stale value) at filesystem mount in some
cases.   See the writeup in the bug for more details.

I'm guessing that some of the journal information is not flushed to disk
after btrfs_create(), even when the filesystem is mounted with
flushoncommit, so if a powercycle is done within a few seconds of the
file being created, there is a good chance that this problem will occur
on a subsequent reboot.

btrfs check (up through 4.7.3) does not seem to detect this stale
condition.
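
The failure mode described above can be illustrated with a toy model (all
names and the collision behaviour here are simplified assumptions for
illustration; the real logic lives in btrfs's inode-number allocation at
mount and create time):

```c
#include <assert.h>

#define NOBJ 16

/* Toy root: btrfs hands out new inode numbers as highest_objectid + 1.
 * If the value cached at mount is stale (behind what is on disk), the
 * "new" objectid collides with an existing inode and the create fails,
 * which is what logrotate observes as EBUSY. */
struct toy_root {
	unsigned long long highest_objectid;  /* cached at mount */
	int used[NOBJ];                       /* objectids present on disk */
};

/* Returns the newly created objectid, or 0 on a collision. */
static unsigned long long toy_create(struct toy_root *root)
{
	unsigned long long objectid = root->highest_objectid + 1;

	if (objectid < NOBJ && root->used[objectid])
		return 0;  /* stale cache: objectid already exists */
	root->highest_objectid = objectid;
	if (objectid < NOBJ)
		root->used[objectid] = 1;
	return objectid;
}
```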

Dave Olson
ol...@cumulusnetworks.com


Re: Problem with btrfs fs in 4.1.25 (also 4.8.0) - getting EBUSY on file create - kernel.org bug 173001

2016-10-03 Thread Dave Olson
Dave Olson  wrote:

> The full details are in
> https://bugzilla.kernel.org/show_bug.cgi?id=173001
> 
> Basically, logrotate has rotated a file to a new name, tries to open a new
> file with the original name, and gets EBUSY.  The file is not created.
> 
> Later on the file can be created with no problems.
> 
> I've turned on most of the BTRFS config options, and there are no BTRFS
> messages related to the failure.
> 
> The file is on the root filesystem.

This problem is also reproducible on a 4.8 kernel.org kernel with no
modifications, built today from the 4.8 tag.

It happened on the 77th powercycle, after about 3 hours.

Dave Olson
ol...@cumulusnetworks.com


Problem with btrfs fs in 4.1.25 - getting EBUSY on file create - kernel.org bug 173001

2016-09-27 Thread Dave Olson
The full details are in
https://bugzilla.kernel.org/show_bug.cgi?id=173001

Basically, logrotate has rotated a file to a new name, tries to open a new
file with the original name, and gets EBUSY.  The file is not created.

Later on the file can be created with no problems.

I've turned on most of the BTRFS config options, and there are no BTRFS
messages related to the failure.

The file is on the root filesystem.


Dave Olson
ol...@cumulusnetworks.com


Re: [RFC] Preliminary BTRFS Encryption

2016-09-15 Thread Dave Chinner
On Tue, Sep 13, 2016 at 09:39:46PM +0800, Anand Jain wrote:
> 
> This patchset adds btrfs encryption support.
> 
> The main objective of this series is to have bugs fixed and stability.
> I have verified with fstests to confirm that there is no regression.
> 
> A design write-up is coming next, however here below is the quick example
> on the cli usage. Please try out, let me know if I have missed something.

Yup - best practice says "do not roll your own encryption
infrastructure".

This is just my 2c worth - take it or leave it, don't bother flaming.
Keep in mind that I'm not picking on btrfs here - I asked similar
hard questions about the proposed f2fs encryption implementation.
That was a "copy and snowflake" version of the ext4 encryption code -
they made changes and now we have generic code and common
functionality between ext4 and f2fs.

> Also would like to mention that a review from the security experts is due,
> which is important and I believe those review comments can be accommodated
> without major changes from here.

That's a fairly significant red flag to me - security reviews need
to be done at the design phase against specific threat models -
security review is not a code/implementation review...

The ext4 developers got this right by publishing threat models and
design docs, which got quite a lot of review and feedback before
code was published for review.

https://docs.google.com/document/d/1ft26lUQyuSpiu6VleP70_npaWdRfXFoNnB8JYnykNTg/edit#heading=h.qmnirp22ipew

[small reorder of comments]

> As of now this patch set supports encryption per subvolume, as
> managing properties per subvolume is kind of core to btrfs, which is
> easier for data center solution-ing, seamlessly persistent and easy to
> manage.

We've got dmcrypt for this sort of transparent "device level"
encryption. Do we really need another btrfs layer that re-implements
generic, robust, widely deployed, stable functionality?

What concerns me the most here is that it seems like that nothing
has been learnt from the btrfs RAID5/6 debacle. i.e. the btrfs
reimplementation of existing robust, stable, widely deployed
infrastructure was fatally flawed and despite regular corruption
reports they were ignored for, what, 2 years? And then a /user/
spent the time to isolate the problem, and now several months later
it still hasn't been fixed. I haven't seen any developer interest in
fixing it, either.

This meets the definition of unmaintained software, and it sets a
poor example for how complex new btrfs features might be maintained
in the long term. Encryption simply cannot be treated like this - it
has to be right, and it has to be well maintained.

So what is being done differently to the RAID5/6 review process
this time that will make the new btrfs-specific encryption
implementation solid and have minimal risk of zero-day fatal flaws?
And how are you going to guarantee that it will be adequately
maintained several years down the track?

> Also, yes, thanks for the emails. I hear that per-file encryption, inline
> with the vfs layer, is also important; that is WIP among other things on
> the list.

The generic file encryption code is solid, reviewed, tested and
already widely deployed via two separate filesystems. There is a
much wider pool of developers who will maintain it, review changes
and know all the traps that a new implementation might fall into.
There's a much bigger safety net here, which significantly lowers
the risk of zero-day fatal flaws in a new implementation and of
flaws in future modifications and enhancements.

Hence, IMO, the first thing to do is implement and make the generic
file encryption support solid and robust, not tack it on as an
afterthought for the magic btrfs encryption pixies to take care of.

Indeed, with the generic file encryption, btrfs may not even need
the special subvolume encryption pixies. i.e. you can effectively
implement subvolume encryption via configuration of a multi-user
encryption key for each subvolume and apply it to the subvolume tree
root at creation time. Then only users with permission to unlock the
subvolume key can access it.

Once the generic file encryption is solid and fulfils the needs of
most users, then you can look to solving the less common threat
models that neither dmcrypt or per-file encryption address. Only if
the generic code cannot be expanded to address specific threat
models should you then implement something that is unique to
btrfs

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v6 1/6] fstests: common: Introduce _post_mount_hook for btrfs

2016-09-14 Thread Dave Chinner
On Wed, Sep 14, 2016 at 09:55:22AM +0800, Qu Wenruo wrote:
> Introduce _post_mount_hook(), which will be executed after mounting
> scratch/test.
> 
> It's quite useful for filesystems (OK, only btrfs yet, again) which need to
> use an ioctl rather than a mount option to enable some of their features.

Just implement a _btrfs_mount() function (similar to
_overlay_mount()) to do btrfs specific things at mount time.
> Signed-off-by: Qu Wenruo 
> ---
>  common/rc | 23 +++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/common/rc b/common/rc
> index 23c007a..631397f 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -321,6 +321,27 @@ _overlay_scratch_unmount()
>   $UMOUNT_PROG $SCRATCH_MNT
>  }
>  
> +_run_btrfs_post_mount_hook()
> +{
> + mnt_point=$1
> + for n in $ALWAYS_ENABLE_BTRFS_FEATURE; do

What's this magic, undefined, undocumented variable?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/3] writeback: allow for dirty metadata accounting

2016-09-12 Thread Dave Chinner
On Mon, Sep 12, 2016 at 10:56:04AM -0400, Josef Bacik wrote:
> I think that looping through all the sb's in the system would be
> kinda shitty for this tho, we want the "get number of dirty pages"
> part to be relatively fast.  What if I do something like the
> shrinker_control only for dirty objects. So the fs registers some
> dirty_objects_control, we call into each of those and get the counts
> from that.  Does that sound less crappy?  Thanks,

Hmmm - just an off-the-wall thought on this

If you're going to do that, then why wouldn't you simply use a
"shrinker" to do the metadata writeback rather than having a hook to
count dirty objects to pass to some other writeback code that calls
a hook to write the metadata?

That way filesystems can also implement dirty accounting and
"writers" for each cache of objects they currently implement
shrinkers for. i.e. just expanding shrinkers to be able to "count
dirty objects" and "write dirty objects" so that we can tell
filesystems to write back all their different metadata caches
proportionally to the size of the page cache and its dirty state.
The existing file data and inode writeback could then just be new
generic "superblock shrinker" operations, and the fs could have it's
own private metadata writeback similar to the private sb shrinker
callout we currently have...

And, in doing so, we might be able to completely hide memcg from the
writeback implementations similar to the way memcg is completely
hidden from the shrinker reclaim implementations...
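
A rough sketch of the interface being floated here: extend the existing
shrinker callbacks with dirty-object counterparts. These names and callbacks
are invented for illustration; nothing like this exists in the kernel as of
this thread.

```c
#include <assert.h>

struct shrink_control {
	unsigned long nr_to_scan;
};

/* Hypothetical extension of the shrinker ops: alongside counting and
 * scanning reclaimable objects, a cache could report how many of its
 * objects are dirty and write some of them back on request. */
struct dirty_shrinker_ops {
	unsigned long (*count_objects)(struct shrink_control *sc);
	unsigned long (*count_dirty_objects)(struct shrink_control *sc);
	unsigned long (*write_dirty_objects)(struct shrink_control *sc);
};

/* A toy cache with 8 objects, 3 of them dirty, to exercise the hooks. */
static unsigned long toy_count(struct shrink_control *sc)
{
	(void)sc;
	return 8;
}

static unsigned long toy_count_dirty(struct shrink_control *sc)
{
	(void)sc;
	return 3;
}

static unsigned long toy_write_dirty(struct shrink_control *sc)
{
	/* "write back" at most the number of dirty objects */
	return sc->nr_to_scan < 3 ? sc->nr_to_scan : 3;
}
```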

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/3] writeback: allow for dirty metadata accounting

2016-09-12 Thread Dave Chinner
On Mon, Sep 12, 2016 at 10:24:12AM -0400, Josef Bacik wrote:
> Dave, your reply got eaten somewhere along the way for me, so all I
> have is this email.  I'm going to respond to your stuff here.

No worries, I'll do a 2-in-1 reply :P

> On 09/12/2016 03:34 AM, Jan Kara wrote:
> >On Mon 12-09-16 10:46:56, Dave Chinner wrote:
> >>On Fri, Sep 09, 2016 at 10:17:43AM +0200, Jan Kara wrote:
> >>>On Mon 22-08-16 13:35:01, Josef Bacik wrote:
> >>>>Provide a mechanism for file systems to indicate how much dirty
> >>>>metadata they are holding.  This introduces a few things:
> >>>>
> >>>>1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
> >>>>2) WB stat for dirty metadata.  This way we know if we need to try and
> >>>>call into the file system to write out metadata.  This could potentially
> >>>>be used in the future to make balancing of dirty pages smarter.
> >>>
> >>>So I'm curious about one thing: In the previous posting you have mentioned
> >>>that the main motivation for this work is to have a simple support for
> >>>sub-pagesize dirty metadata blocks that need tracking in btrfs. However you
> >>>do the dirty accounting at page granularity. What are your plans to handle
> >>>this mismatch?
> >>>
> >>>The thing is you actually shouldn't miscount by too much as that could
> >>>upset some checks in mm checking how much dirty pages a node has directing
> >>>how reclaim should be done... But it's a question whether NR_METADATA_DIRTY
> >>>should be actually used in the checks in node_limits_ok() or in
> >>>node_pagecache_reclaimable() at all because once you start accounting dirty
> >>>slab objects, you are really on a thin ice...
> >>
> >>The other thing I'm concerned about is that it's a btrfs-only thing,
> >>which means having dirty btrfs metadata on a system with different
> >>filesystems (e.g. btrfs root/home, XFS data) is going to affect how
> >>memory balance and throttling is run on other filesystems. i.e. it's
going to make a filesystem specific issue into a problem that
> >>affects global behaviour.
> >
> >Umm, I don't think it will be different than currently. Currently btrfs
> >dirty metadata is accounted as dirty page cache because they have this
> >virtual fs inode which owns all metadata pages.

Yes, so effectively they are treated the same as file data pages
w.r.t. throttling, writeback and reclaim

> >It is pretty similar to
> >e.g. ext2 where you have bdev inode which effectively owns all metadata
> >pages and these dirty pages account towards the dirty limits. For ext4
> >things are more complicated due to journaling and thus ext4 hides the fact
> >that a metadata page is dirty until the corresponding transaction is
> >committed.  But from that moment on dirty metadata is again just a dirty
> >pagecache page in the bdev inode.

Yeah, though those filesystems don't suffer from the uncontrolled
explosion of metadata that btrfs is suffering from, so simply
treating them as another dirty inode that needs flushing works just
fine.

> >So current Josef's patch just splits the counter in which btrfs metadata
> >pages would be accounted but effectively there should be no change in the
> >behavior.

Yup, I missed the addition to node_pagecache_reclaimable() that
ensures reclaim sees the same number of dirty pages...

> >It is just a question whether this approach is workable in the
> >future when they'd like to track different objects than just pages in the
> >counter.

I don't think it can. Because the counter directly influences the
page LRU reclaim scanning algorithms, it can only be used to
account for pages that are on the LRUs. Other objects, like slab
objects, need to be accounted for and reclaimed by the shrinker
infrastructure.

Accounting for metadata writeback is a different issue - it could
track slab objects if we wanted to, but it is often difficult to
determine the amount of IO needed to clean such objects, so generic
balancing is hard (e.g. due to the effect of inode write
clustering).

> +1 to what Jan said.  Btrfs's dirty metadata is always going to
> affect any other file systems in the system, no matter how we deal
> with it.  In fact it's worse with our btree_inode approach as the
> dirtied_when thing will likely screw somebody and make us skip
> writing out dirty metadata when we want to.

XFS takes care of metadata flushing with a periodic background writeback of its own.

Re: [PATCH 2/3] writeback: allow for dirty metadata accounting

2016-09-11 Thread Dave Chinner
On Fri, Sep 09, 2016 at 10:17:43AM +0200, Jan Kara wrote:
> On Mon 22-08-16 13:35:01, Josef Bacik wrote:
> > Provide a mechanism for file systems to indicate how much dirty
> > metadata they are holding.  This introduces a few things
> > 
> > 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
> > 2) WB stat for dirty metadata.  This way we know if we need to try and
> > call into the file system to write out metadata.  This could potentially
> > be used in the future to make balancing of dirty pages smarter.
> 
> So I'm curious about one thing: In the previous posting you have mentioned
> that the main motivation for this work is to have a simple support for
> sub-pagesize dirty metadata blocks that need tracking in btrfs. However you
> do the dirty accounting at page granularity. What are your plans to handle
> this mismatch?
> 
> The thing is you actually shouldn't miscount by too much as that could
> upset some checks in mm checking how much dirty pages a node has directing
> how reclaim should be done... But it's a question whether NR_METADATA_DIRTY
> should be actually used in the checks in node_limits_ok() or in
> node_pagecache_reclaimable() at all because once you start accounting dirty
> slab objects, you are really on a thin ice...

The other thing I'm concerned about is that it's a btrfs-only thing,
which means having dirty btrfs metadata on a system with different
filesystems (e.g. btrfs root/home, XFS data) is going to affect how
memory balance and throttling is run on other filesystems. i.e. it's
going to make a filesystem specific issue into a problem that
affects global behaviour.
> 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 56c8fda..d329f89 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -1809,6 +1809,7 @@ static unsigned long get_nr_dirty_pages(void)
> >  {
> > return global_node_page_state(NR_FILE_DIRTY) +
> > global_node_page_state(NR_UNSTABLE_NFS) +
> > +   global_node_page_state(NR_METADATA_DIRTY) +
> > get_nr_dirty_inodes();
> 
> With my question is also connected this - when we have NR_METADATA_DIRTY,
> we could just account dirty inodes there and get rid of this
> get_nr_dirty_inodes() hack...

Accounting of dirty inodes would have to be applied to every
filesystem before that could be done, but...

> But actually getting this to work right to be able to track dirty inodes would
> be useful on its own - some throttling of creation of dirty inodes would be
> useful for several filesystems (ext4, xfs, ...).

... this relies on the VFS being able to track and control all
dirtying of inodes and metadata.

Which, it should be noted, cannot be done unconditionally because
some filesystems /explicitly avoid/ dirtying VFS inodes for anything
other than dirty data and provide no mechanism to the VFS for
writeback inodes or their related metadata. e.g. XFS, where all
metadata changes are transactional and so all dirty inode tracking
and writeback control is internal to the XFS transaction
subsystem.

Adding an external throttle to dirtying of metadata doesn't make any
sense in this sort of architecture - in XFS we already have all the
throttles and expedited writeback triggers integrated into the
transaction subsystem (e.g transaction reservation limits, log space
limits, periodic background writeback, memory reclaim triggers,
etc). It's all so tightly integrated around the physical structure
of the filesystem I can't see any way to sanely abstract it to work
with a generic "dirty list" accounting and writeback engine at this
point...

I can see how tracking of information such as the global amount of
dirty metadata is useful for diagnostics, but I'm not convinced we
should be using it for globally scoped external control of deeply
integrated and highly specific internal filesystem functionality.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 3/3] ioctl_xfs_ioc_getfsmap.2: document XFS_IOC_GETFSMAP ioctl

2016-09-09 Thread Dave Chinner
On Thu, Sep 08, 2016 at 11:07:16PM -0700, Darrick J. Wong wrote:
> On Fri, Sep 09, 2016 at 09:38:06AM +1000, Dave Chinner wrote:
> > On Tue, Aug 30, 2016 at 12:09:49PM -0700, Darrick J. Wong wrote:
> > > > I recall for FIEMAP that some filesystems may not have files aligned
> > > > to sector offsets, and we just used byte offsets.  Storage like
> > > > NVDIMMs are cacheline granular, so I don't think it makes sense to
> > > > tie this to old disk sector sizes.  Alternately, the units could be
> > > > in terms of fs blocks as returned by statvfs.st_bsize, but mixing
> > > > units for fmv_block, fmv_offset, fmv_length is unneeded complexity.
> > > 
> > > Ugh.  I'd rather just change the units to bytes rather than force all
> > > the users to multiply things. :)
> > 
> > Yup, units need to be either in disk addresses (i.e. 512 byte units)
> > or bytes. If people can't handle disk addresses (seems to be the
> > case), then bytes it should be.
> 
> 
> 
> > > I'd much rather just add more special owner codes for any other
> > > filesystem that has distinguishable metadata types that are not
> > > covered by the existing OWN_ codes.  We /do/ have 2^64 possible
> > > values, so it's not like we're going to run out.
> > 
> > This is diagnostic information as much as anything, just like
> > fiemap is diagnostic information. So if we have specific type
> > information, it needs to be reported accurately to be useful.
> > 
> > Hence I really don't care if the users and developers of other fs
> > types don't understand what the special owner codes that a specific
> > filesystem returns mean. i.e. it's not useful user information -
> > only a tool that groks the specific filesystem is going to be able
> > to do anything useful with special owner codes. So, IMO, there's little
> > point trying to make them generic or even trying to define and
> > explain them in the man page.
> 
>  I'm ok with describing generally what each special owner code
> means.  Maybe the manpage could be more explicit about "None of these
> codes are useful unless you're a low level filesystem tool"?

You can add that, but it doesn't address the underlying problem.
i.e.  that we can add/change the codes, their name, meaning, etc,
and now there's a third party man page that is incorrect and out of
date. It's the same problem with documenting filesystem specific
mount options in mount(8). Better, IMO, is to simply say "refer to
filesystem specific documentation for a description of these special
values". e.g. refer them to the XFS Filesystem Structure
document where this is all spelled out in enough detail to be useful
for someone thinking that they might want to use them.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: lockdep warning in btrfs in 4.8-rc3

2016-09-08 Thread Dave Jones
On Thu, Sep 08, 2016 at 08:58:48AM -0400, Chris Mason wrote:
 > On 09/08/2016 07:50 AM, Christian Borntraeger wrote:
 > > On 09/08/2016 01:48 PM, Christian Borntraeger wrote:
 > >> Chris,
 > >>
 > >> with 4.8-rc3 I get the following on an s390 box:
 > >
 > > Sorry for the noise, just saw the fix in your pull request.
 > >
 > 
 > The lockdep splat is still there, we'll need to annotate this one a little.

Here's another one (unrelated?) that I've not seen before today:

WARNING: CPU: 1 PID: 10664 at kernel/locking/lockdep.c:704 
register_lock_class+0x33f/0x510
CPU: 1 PID: 10664 Comm: kworker/u8:5 Not tainted 4.8.0-rc5-think+ #2 
Workqueue: writeback wb_workfn (flush-btrfs-1)
 0097 b97fbad3 88013b8c3770 a63d3ab1
   a6bf1792 a60df22f
 88013b8c37b0 a60897a0 02c0b97fbad3 a6bf1792
Call Trace:
 [] dump_stack+0x6c/0x9b
 [] ? register_lock_class+0x33f/0x510
 [] __warn+0x110/0x130
 [] warn_slowpath_null+0x2c/0x40
 [] register_lock_class+0x33f/0x510
 [] ? bio_add_page+0x7e/0x120
 [] __lock_acquire.isra.32+0x5b/0x8c0
 [] lock_acquire+0x58/0x70
 [] ? btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs]
 [] _raw_write_lock+0x38/0x70
 [] ? btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs]
 [] btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs]
 [] lock_extent_buffer_for_io+0x28/0x2e0 [btrfs]
 [] btree_write_cache_pages+0x231/0x550 [btrfs]
 [] ? btree_set_page_dirty+0x20/0x20 [btrfs]
 [] btree_writepages+0x74/0x90 [btrfs]
 [] do_writepages+0x3e/0x80
 [] __writeback_single_inode+0x42/0x220
 [] writeback_sb_inodes+0x351/0x730
 [] ? __wb_update_bandwidth+0x1c1/0x2b0
 [] wb_writeback+0x138/0x2a0
 [] wb_workfn+0x10e/0x340
 [] ? __lock_acquire.isra.32+0x1cf/0x8c0
 [] process_one_work+0x24f/0x5d0
 [] ? process_one_work+0x1e0/0x5d0
 [] worker_thread+0x53/0x5b0
 [] ? process_one_work+0x5d0/0x5d0
 [] kthread+0x120/0x140
 [] ? finish_task_switch+0x6a/0x200
 [] ret_from_fork+0x1f/0x40
 [] ? kthread_create_on_node+0x270/0x270
---[ end trace 7b39395c07435bf1 ]---


 700 /*
 701  * Huh! same key, different name? Did someone trample
 702  * on some memory? We're most confused.
 703  */
 704 WARN_ON_ONCE(class->name != lock->name);

That seems kinda scary. There was a trinity run going on at the same time,
so this _might_ be a random scribble from something unrelated to btrfs,
but just in case..

It would be nice if that code printed out both cases so I could see if this was
corruption or two unrelated keys. I'll make it do that in case it
happens again.

Dave


Re: [PATCH 3/3] ioctl_xfs_ioc_getfsmap.2: document XFS_IOC_GETFSMAP ioctl

2016-09-08 Thread Dave Chinner
On Tue, Aug 30, 2016 at 12:09:49PM -0700, Darrick J. Wong wrote:
> > I recall for FIEMAP that some filesystems may not have files aligned
> > to sector offsets, and we just used byte offsets.  Storage like
> > NVDIMMs are cacheline granular, so I don't think it makes sense to
> > tie this to old disk sector sizes.  Alternately, the units could be
> > in terms of fs blocks as returned by statvfs.st_bsize, but mixing
> > units for fmv_block, fmv_offset, fmv_length is unneeded complexity.
> 
> Ugh.  I'd rather just change the units to bytes rather than force all
> the users to multiply things. :)

Yup, units need to be either in disk addresses (i.e. 512 byte units)
or bytes. If people can't handle disk addresses (seems to be the
case), then bytes it should be.

> I'd much rather just add more special owner codes for any other
> filesystem that has distinguishable metadata types that are not
> covered by the existing OWN_ codes.  We /do/ have 2^64 possible
> values, so it's not like we're going to run out.

This is diagnostic information as much as anything, just like
fiemap is diagnostic information. So if we have specific type
information, it needs to be reported accurately to be useful.

Hence I really don't care if the users and developers of other fs
types don't understand what the special owner codes that a specific
filesystem returns mean. i.e. it's not useful user information -
only a tool that groks the specific filesystem is going to be able
to do anything useful with special owner codes. So, IMO, there's little
point trying to make them generic or even trying to define and
explain them in the man page.

> > It seems like there are several fields in the structure that are used for
> > only input or only output?  Does it make more sense to have one structure
> > used only for the input request, and then the array of values returned be
> > in a different structure?  I'm not necessarily requesting that it be 
> > changed,
> > but it definitely is something I noticed a few times while reading this doc.
> 
> I've been thinking about rearranging this a bit, since the flags
> handling is very awkward with the current array structure.  Each
> rmap has its own flags; we may someday want to pass operation flags
> into the ioctl; and we currently have one operation flag to pass back
> to userspace.  Each of those flags can be a separate field.  I think
> people will get confused about FMV_OF_* and FMV_HOF_* being referenced
> in oflags, and iflags has no meaning for returned records.

Yup, that's what I initially noticed when I glanced at this. The XFS
getbmap interface is just plain nasty, and we shouldn't be copying
that API pattern if we can help it.

> So, this instead?
> 
> struct getfsmap_rec {
>   u32 device; /* device id */
>   u32 flags;  /* mapping flags */
>   u64 block;  /* physical addr, bytes */
>   u64 owner;  /* inode or special owner code */
>   u64 offset; /* file offset of mapping, bytes */
>   u64 length; /* length of segment, bytes */
>   u64 reserved;   /* will be set to zero */
> }; /* 48 bytes */
> 
> struct getfsmap_head {
>   u32 iflags; /* none defined yet */
>   u32 oflags; /* FMV_HOF_DEV_T */
>   u32 count;  /* # entries in recs array */
>   u32 entries;/* # entries filled in (output) */
>   u64 reserved[2];/* must be zero */
> 
>   struct getfsmap_rec keys[2]; /* low and high keys for the mapping search */
>   struct getfsmap_rec recs[0];
> }; /* 32 bytes + 2*48 = 128 bytes */
> 
> #define XFS_IOC_GETFSMAP  _IOWR('X', 59, struct getfsmap_head)
> 
> This also means that userspace can set up for the next ioctl
> invocation with memcpy(&head->keys[0], &head->recs[head->entries - 1]).
> 
> Yes, I think I like this better.  Everyone else, please chime in. :)

That's pretty much the structure I was going to suggest - it matches
the fiemap pattern. i.e control parameters are separated from record
data. I'd dump a bit more reserved space in the structure, though;
we've got heaps of flag space for future expansion, but if we need
to pass new parameters into/out of the kernel we'll quickly use the
reserved space.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: common: Enhance _exclude_scratch_mount_option to handle multiply options and generic fs type

2016-09-06 Thread Dave Chinner
On Tue, Sep 06, 2016 at 01:06:39PM +0800, Qu Wenruo wrote:
> Considering not every contributor will add comment about excluded
> mount options, and in case generic test cases needs to exclude one
> mount option for given fstype, it will be quite hard to find the
> reason.

That is why we review changes. If it's not obvious to the reviewer
why the mount option is excluded, or it's not documented in the
commit message, then the reviewer should be asking for it to be
added.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: common: Enhance _exclude_scratch_mount_option to handle multiply options and generic fs type

2016-09-06 Thread Dave Chinner
On Tue, Sep 06, 2016 at 12:20:39PM +0800, Eryu Guan wrote:
> On Mon, Sep 05, 2016 at 03:13:33PM +0800, Qu Wenruo wrote:
> > Enhance _exclude_scratch_mount_option() function to get real mount
> > options from $MOUNT_OPTIONS.
> 
> This seems unnecessarily complex to me.
> 
> > 
> > Now it can understand and extract real mount option from string like
> > "-o opt1,opt2 -oopt3".
> > Furthermore, it doesn't use grep method, which can lead to false alert
> > for options like inode_cache and noinode_cache.
> It now does a comparison with the first n characters of the prohibited
> list, so it can handle "data=" and the above "no" prefix well.
> 
> I think we can fix it by adding "-w" option to grep, and replacing
> "data=" with "data", "=" seems not necessary.
> 
> > 
> > And add a new parameter, 'fstype' for _exclude_scratch_mount_option().
> > So for generic test cases, it can still prohibit mount options for given
> > fs(mainly for btrfs though)
> 
> This requires every caller of this helper provides an additional fstype
> argument, and in most cases this argument is not useful (generic or
> current FSTYP). If btrfs needs to be handled differently, how about
> checking the fstype in the test and adding additional mount option rules
> if fstype is btrfs?
> 
> > 
> > Finally, allow it to accept multiple options at the same time.
> > No need for multiple _exclude_scratch_mount_option lines now
> 
> So _exclude_scratch_mount_option is simply:
> 
> # skip test if MOUNT_OPTIONS contains the given mount options
> _exclude_scratch_mount_option()
> {
> 	for opt in $*; do
> 		if echo $MOUNT_OPTIONS | grep -qw "$opt"; then
> 			_notrun "mount option \"$opt\" not allowed in this test"
> 		fi
> 	done
> }
> 
> (Note that the comment in current code is wrong, MKFS_OPTIONS should be
> MOUNT_OPTIONS)
> 
> What do you and/or other people think?

Much simpler, easier to understand and tell why the test did not
run.
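Eryu's point about `grep -w` avoiding the inode_cache/noinode_cache false match is easy to verify standalone (the MOUNT_OPTIONS value below is just an example, and this is a demonstration of the grep behaviour, not the fstests helper itself):

```shell
#!/bin/sh
# grep -w matches whole words only: "inode_cache" inside
# "noinode_cache" is preceded by the word character "o", so it is
# not word-delimited and does not match.
MOUNT_OPTIONS="-o noinode_cache,compress=lzo"

if echo "$MOUNT_OPTIONS" | grep -qw "inode_cache"; then
	echo "inode_cache: matched (would wrongly _notrun)"
else
	echo "inode_cache: no match (correct)"
fi

if echo "$MOUNT_OPTIONS" | grep -qw "noinode_cache"; then
	echo "noinode_cache: matched (correct)"
fi
```

Note that `-w` treats underscores as word characters, so option names containing `_` still match as whole tokens against comma-separated option strings.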

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Will Btrfs have an official command to "uncow" existing files?

2016-08-22 Thread Dave Chinner
On Mon, Aug 22, 2016 at 04:06:13PM -0700, Darrick J. Wong wrote:
> [add Dave and Christoph to cc]
> 
> On Mon, Aug 22, 2016 at 04:14:19PM -0400, Jeff Mahoney wrote:
> > On 8/21/16 2:59 PM, Tomokhov Alexander wrote:
> > > Btrfs wiki FAQ gives a link to example Python script: 
> > > https://github.com/stsquad/scripts/blob/master/uncow.py
> > > 
> > > But such a crucial and fundamental tool must exist in stock btrfs-progs. 
> > > Filesystem with CoW technology at it's core must provide user sufficient 
> > > control over CoW aspects. Running 3rd-party or manually written scripts 
> > > for filesystem properties/metadata manipulation is not convenient, not 
> > > safe and definitely not the way it must be done.
> > > 
> > > Also is it possible (at least in theory) to "uncow" files being currently 
> > > opened in-place? Without the trickery with creation & renaming of files 
> > > or directories. So that running "chattr +C" on a file would be 
> > > sufficient. If possible, is it going to be implemented?
> > 
> > XFS is looking to do this via fallocate using a flag that all file
> > systems can choose to honor.  Once that lands, it would make sense for
> > btrfs to use it as well.  The idea is that when you pass the flag in, we
> > examine the range and CoW anything that has a refcount != 1.
> 
> There /was/ a flag to do that -- FALLOC_FL_UNSHARE_RANGE.  However,
> Christoph and Dave felt[1] that the fallocate call didn't need to have
> an explicit 'unshare' mode because unsharing shared blocks is
> necessary to guarantee that a subsequent write will not ENOSPC.  I
> felt that was sufficient justification to withdraw the unshare mode
> flag.  If you fallocate the entire length of a shared file on XFS, it
> will turn off CoW for that file until you reflink/dedupe it again.

From the XFS POV that's all good...

> At the time I wondered whether or not the btrfs developers (the list
> was cc'd) would pipe up in support of the unshare flag, but nobody
> did.  Consequently it remains nonexistent.  Christoph commented a few
> months ago about unsharing fallocate over NFS atop XFS blocking for a
> long time, though nobody asked for 'unshare' to be reinstated as a
> separate fallocate mode, much less a 'don't unshare' flag for regular
> fallocate mode.

If there are other use cases, then we can easily implement it in
XFS. However, let's not overload the XFS reflink code with things
other fs devs have once said "that'd be nice to do".

> (FWIW I'm ok with not having to fight for more VFS changes. :))
> 
> > That code hasn't landed yet though.  The last time I saw it posted was
> > June.  I don't speak with knowledge of the integration plan, but it
> > might just be queued up for the next merge window now that the reverse
> > mapping patches have landed in 4.8.
> 
> I am going to try to land XFS reflink in 4.9; I hope to have an eighth
> patchset out for review at the end of the week.
> 
> So... if the btrfs folks really want an unshare flag I can trivially
> re-add it to the VFS headers and re-enable it in the XFS
implementation, but y'all better speak up now and hammer out an
> acceptable definition.  I don't think XFS needs a new flag.

It's not urgent - it can be added at any time, so I'd say it's
something we should ignore on the XFS side of things until someone
actually requires an explicit "unshare" operation for another
filesystem.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: checksum error in metadata node - best way to move root fs to new drive?

2016-08-11 Thread Dave T
What I have gathered so far is the following:

1. my RAM is not faulty and I feel comfortable ruling out a memory
error as having anything to do with the reported problem.

2. my storage device does not seem to be faulty. I have not figured
out how to do more definitive testing, but smartctl reports it as
healthy.

3. this problem first happened on a normally running system in light
use. It had not recently crashed. But the root fs went read-only for
an unknown reason.

4. the aftermath of the initial problem may have been exacerbated by
hard resetting the system, but that's only a guess

> The compression-related problem is this:  Btrfs is considerably less tolerant 
> of checksum-related errors on btrfs-compressed data

I'm an unsophisticated user. The argument in support of this statement
sounds convincing to me. Therefore, I think I should discontinue using
compression. Anyone disagree?

Is there anything else I should change? (Do I need to provide
additional information?)

What can I do to find out more about what caused the initial problem.
I have heard memory errors mentioned, but that's apparently not the
case here. I have heard crash recovery mentioned, but that isn't how
my problem initially happened.

I also have a few general questions:

1. Can one discontinue using the compress mount option if it has been
used previously? What happens to existing data if the compress mount
option is 1) added when it wasn't used before, or 2) dropped when it
had been used.

2. I understand that the compress option generally improves btrfs
performance (via a Phoronix article I read in the past; I can't find the
link). Since encryption has some characteristics in common with
compression, would one expect any decrease in performance from
dropping compression when using btrfs on dm-crypt? (For more context,
with an i7 6700K which has aes-ni, CPU performance should not be a
bottleneck on my computer.)

3. How do I find out if it is appropriate to use dup metadata on a
Samsung 950 Pro NVMe drive? I don't see deduplication mentioned in the
drive's datasheet:
http://www.samsung.com/semiconductor/minisite/ssd/downloads/document/Samsung_SSD_950_PRO_Data_Sheet_Rev_1_2.pdf

4. Given that my drive is not reporting problems, does it seem
reasonable to re-use this drive after the errors I reported? If so,
how should I do that? Can I simply make a new btrfs filesystem and
copy my data back? Should I start at a lower level and re-do the
dm-crypt layer?

5. Would most of you guys use btrfs + dm-crypt on a production file
server (with spinning disks in JBOD configuration -- i.e., no RAID)?
In this situation, the data is very important, of course. My past
experience indicated that RAID only improves uptime, which is not so
critical in our environment. Our main criterion is that we should never
ever have data loss. As far as I understand it, we do have to use
encryption.

Thanks for the discussion so far. It's very educational for me.


Re: checksum error in metadata node - best way to move root fs to new drive?

2016-08-10 Thread Dave T
Apologies. I have to make a correction to the message I just sent.
Disregard that message and use this one:


On Wed, Aug 10, 2016 at 5:15 PM, Chris Murphy  wrote:

> 1. Report 'btrfs check' without --repair, let's see what it complains
> about and if it might be able to plausibly fix this.

First, a small part of the dmesg output:

[  172.772283] Btrfs loaded
[  172.772632] BTRFS: device label top_level devid 1 transid 103495 /dev/dm-0
[  274.320762] BTRFS info (device dm-0): use lzo compression
[  274.320764] BTRFS info (device dm-0): disk space caching is enabled
[  274.320764] BTRFS: has skinny extents
[  274.322555] BTRFS info (device dm-0): bdev /dev/mapper/cryptroot
errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[  274.329965] BTRFS: detected SSD devices, enabling SSD mode


Now, full output of btrfs check without repair option.


checking extents
bad metadata [292414541824, 292414558208) crossing stripe boundary
bad metadata [292414607360, 292414623744) crossing stripe boundary
bad metadata [292414672896, 292414689280) crossing stripe boundary
bad metadata [292414738432, 292414754816) crossing stripe boundary
bad metadata [292415787008, 292415803392) crossing stripe boundary
bad metadata [292415918080, 292415934464) crossing stripe boundary
bad metadata [292416376832, 292416393216) crossing stripe boundary
bad metadata [292418015232, 292418031616) crossing stripe boundary
bad metadata [292419325952, 292419342336) crossing stripe boundary
bad metadata [292419588096, 292419604480) crossing stripe boundary
bad metadata [292419915776, 292419932160) crossing stripe boundary
bad metadata [292422930432, 292422946816) crossing stripe boundary
bad metadata [292423061504, 292423077888) crossing stripe boundary
ref mismatch on [292423155712 16384] extent item 1, found 0
Backref 292423155712 root 258 not referenced back 0x2280a20
Incorrect global backref count on 292423155712 found 1 wanted 0
backpointer mismatch on [292423155712 16384]
owner ref check failed [292423155712 16384]
bad metadata [292423192576, 292423208960) crossing stripe boundary
bad metadata [292423323648, 292423340032) crossing stripe boundary
bad metadata [292429549568, 292429565952) crossing stripe boundary
bad metadata [292439904256, 292439920640) crossing stripe boundary
bad metadata [292440297472, 292440313856) crossing stripe boundary
bad metadata [292442525696, 292442542080) crossing stripe boundary
bad metadata [292443770880, 292443787264) crossing stripe boundary
bad metadata [292443967488, 292443983872) crossing stripe boundary
bad metadata [292444033024, 292444049408) crossing stripe boundary
bad metadata [292444098560, 292444114944) crossing stripe boundary
bad metadata [292444164096, 292444180480) crossing stripe boundary
bad metadata [292444229632, 292444246016) crossing stripe boundary
bad metadata [292444688384, 292444704768) crossing stripe boundary
bad metadata [292444884992, 292444901376) crossing stripe boundary
bad metadata [292445081600, 292445097984) crossing stripe boundary
bad metadata [29244672, 292446736384) crossing stripe boundary
bad metadata [292448948224, 292448964608) crossing stripe boundary
Error: could not find btree root extent for root 258
Checking filesystem on /dev/mapper/cryptroot


Re: checksum error in metadata node - best way to move root fs to new drive?

2016-08-10 Thread Dave T
see below

On Wed, Aug 10, 2016 at 5:15 PM, Chris Murphy  wrote:

> 1. Report 'btrfs check' without --repair, let's see what it complains
> about and if it might be able to plausibly fix this.

First, a small part of the dmesg output:

[  172.772283] Btrfs loaded
[  172.772632] BTRFS: device label top_level devid 1 transid 103495 /dev/dm-0
[  274.320762] BTRFS info (device dm-0): use lzo compression
[  274.320764] BTRFS info (device dm-0): disk space caching is enabled
[  274.320764] BTRFS: has skinny extents
[  274.322555] BTRFS info (device dm-0): bdev /dev/mapper/sysluks
errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[  274.329965] BTRFS: detected SSD devices, enabling SSD mode


Now, full output of btrfs check without repair option.


checking extents
bad metadata [292414541824, 292414558208) crossing stripe boundary
bad metadata [292414607360, 292414623744) crossing stripe boundary
bad metadata [292414672896, 292414689280) crossing stripe boundary
bad metadata [292414738432, 292414754816) crossing stripe boundary
bad metadata [292415787008, 292415803392) crossing stripe boundary
bad metadata [292415918080, 292415934464) crossing stripe boundary
bad metadata [292416376832, 292416393216) crossing stripe boundary
bad metadata [292418015232, 292418031616) crossing stripe boundary
bad metadata [292419325952, 292419342336) crossing stripe boundary
bad metadata [292419588096, 292419604480) crossing stripe boundary
bad metadata [292419915776, 292419932160) crossing stripe boundary
bad metadata [292422930432, 292422946816) crossing stripe boundary
bad metadata [292423061504, 292423077888) crossing stripe boundary
ref mismatch on [292423155712 16384] extent item 1, found 0
Backref 292423155712 root 258 not referenced back 0x2280a20
Incorrect global backref count on 292423155712 found 1 wanted 0
backpointer mismatch on [292423155712 16384]
owner ref check failed [292423155712 16384]
bad metadata [292423192576, 292423208960) crossing stripe boundary
bad metadata [292423323648, 292423340032) crossing stripe boundary
bad metadata [292429549568, 292429565952) crossing stripe boundary
bad metadata [292439904256, 292439920640) crossing stripe boundary
bad metadata [292440297472, 292440313856) crossing stripe boundary
bad metadata [292442525696, 292442542080) crossing stripe boundary
bad metadata [292443770880, 292443787264) crossing stripe boundary
bad metadata [292443967488, 292443983872) crossing stripe boundary
bad metadata [292444033024, 292444049408) crossing stripe boundary
bad metadata [292444098560, 292444114944) crossing stripe boundary
bad metadata [292444164096, 292444180480) crossing stripe boundary
bad metadata [292444229632, 292444246016) crossing stripe boundary
bad metadata [292444688384, 292444704768) crossing stripe boundary
bad metadata [292444884992, 292444901376) crossing stripe boundary
bad metadata [292445081600, 292445097984) crossing stripe boundary
bad metadata [29244672, 292446736384) crossing stripe boundary
bad metadata [292448948224, 292448964608) crossing stripe boundary
Error: could not find btree root extent for root 258
Checking filesystem on /dev/mapper/cryptroot
UUID:


Re: checksum error in metadata node - best way to move root fs to new drive?

2016-08-10 Thread Dave T
Thanks for all the responses, guys! I really appreciate it. This
information is very helpful. I will be working through the suggestions
(e.g., check without repair) for the next hour or so. I'll report back
when I have something to report.

My drive is a Samsung 950 Pro NVMe drive, which in most respects is
treated like an SSD. (The only difference I am aware of is that trim
isn't needed.)

> But until recently dup mode data on single device was impossible, so I
> doubt you were using that, and while dup mode metadata was the normal
> default, on ssd that changes to single mode as well.

Your assumptions are correct: single mode for data and metadata.

Does anyone have any thoughts about using dup mode for metadata on a
Samsung 950 Pro (or any NVMe drive)?
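(For anyone wanting to experiment with this: the metadata profile can be
converted on an already-mounted filesystem with a balance filter. The mount
point below is a placeholder -- this is the generic procedure, not advice
specific to NVMe.)

```shell
# Convert existing metadata chunks from single to dup on a mounted
# btrfs filesystem; /mnt is a placeholder for your mount point.
btrfs balance start -mconvert=dup /mnt

# Confirm the new metadata profile afterwards:
btrfs filesystem df /mnt
```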

I will be very disappointed if I cannot use btrfs + dm-crypt. As far
as I can see, there is no alternative given that I need to use
snapshots (and LVM, as good as it is, has severe performance penalties
for its snapshots). I'm required to use crypto. I cannot risk doing
without snapshots. Therefore, btrfs + dm-crypt seems like my only
viable solution. Plus it is my preferred solution. I like both tools.

If all goes well, we are planning to implement a production file
server for our office with dm-crypt + btrfs (and a lot of spinning
disks).

In the office we currently have another system identical to mine
running the same drive with dm-crypt + btrfs, the same operating
system, and the same NVIDIA GPU with the proprietary driver, and it is
running fine. One difference is that it is overclocked substantially
(mine isn't), so I would have expected it to give problems before mine
did. But it seems to be rock solid. I just ran btrfs scrub on it and
it finished in a few seconds with no errors.

On my computer I have run two extensive memory tests (8 CPU cores in
parallel, all tests). The current test has been running for 14 hrs
with no errors. (I think that 8 cores in parallel make this equivalent
to a much longer test with the default single-CPU settings.)
Therefore, I do not believe this issue is caused by RAM.

I'm hoping there is no configuration error or other mistake I made in
setting these systems up that would lead to the problems I'm
experiencing.

BTW, I was able to copy all the files to another drive with no
problem. I used "cp -a" to copy, then I ran "rsync -a" twice to make
sure nothing was missed. My guess is that I'll be able to copy this
right back onto the root filesystem after I resolve whatever the
problem is, and my operating system will be back to the same state it
was in prior to this problem.

OK, I'm off to try btrfs check without --repair... thanks again!

For reference:

btrfs-progs v4.6.1
Linux 4.6.4-1-ARCH #1 SMP PREEMPT Mon Jul 11 19:12:32 CEST 2016 x86_64 GNU/Linux



On Wed, Aug 10, 2016 at 5:21 PM, Chris Murphy  wrote:
> I'm using LUKS, aes xts-plain64, on six devices. One is using mixed-bg
> single device. One is data single / metadata dup. And then 2x2 metadata
> raid1 / data raid1. I've had zero problems. The two computers these run
> on do have aesni support. Aging wise, they're all at least a year old.
> But I've been using Btrfs on LUKS for much longer than that.
>
>
> Chris Murphy


checksum error in metadata node - best way to move root fs to new drive?

2016-08-09 Thread Dave T
btrfs scrub returned with uncorrectable errors. Searching in dmesg
returns the following information:

BTRFS warning (device dm-0): checksum error at logical N on
/dev/mapper/[crypto] sector: y metadata node (level 2) in tree 250

it also says:

unable to fixup (regular) error at logical NN on /dev/mapper/[crypto]


I assume I have a bad block device. Does that seem correct? The
important data is backed up.

However, it would save me a lot of time reinstalling the operating
system and setting up my work environment if I can copy this root
filesystem to another storage device.

Can I do that, considering the errors I have mentioned? With the
uncorrectable error being in a metadata node, what (if anything) does
that imply about restoring from this drive?

If I can copy this entire root filesystem, what is the best way to do
it? The btrfs restore tool? cp? rsync? Some cloning tool? Other
options?

If I use the btrfs restore tool, should I use options x, m and S? In
particular I wonder exactly what the S option does. If I leave S out,
are all symlinks ignored?
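For reference, an invocation with those options would look like the
following (the device and destination paths are placeholders; per the
btrfs-restore man page, -x restores extended attributes, -m restores file
metadata such as ownership and timestamps, and -S restores symlinks, which
are otherwise skipped):

```shell
# Placeholders: adjust the source device and destination to your setup.
mkdir -p /mnt/recovery
btrfs restore -x -m -S /dev/mapper/cryptroot /mnt/recovery
```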

I'm trying to save time and clone this so that I get the operating
system and all my tweaks / configurations back. As I said, the really
important data is separately backed up.

I appreciate all suggestions.


btrfs check --repair error: check_owner_ref: Assertion `rec->is_root` failed.

2016-08-09 Thread Dave T
New problems keep happening...

Now btrfs check --repair is giving an error:

check_owner_ref: Assertion `rec->is_root` failed.

I do not know what that means, but I did run the command as root (as
shown in the images of the screen).

images of errors are here: http://imgur.com/a/ZwkUR


SUMMARY of the past 2 days:

* 2 days ago a computer which had not given any prior problems (and is
only a few months old) experienced the root fs going read-only under
light normal usage (browsing the web).
* I was able to run btrfs rescue zero-log to get the root fs to mount.
* immediately upon rebooting and mounting the fs, I ran btrfs check
--repair. However, that gave some messages that looked concerning and
I posted them here. (The full details are copied lower in this
message. Screen pictures are here: http://imgur.com/a/fT1RV)
* dmesg showed BTRFS error (device dm-0): qgroup generation mismatch,
marked as inconsistent
* I've been trying to research the issue for the last two days but I
have found absolutely zero information. (Maybe I'm asking in the wrong
places?)
* the system worked normally yesterday (as far as I could tell)
* this morning my system locked up with btrfs-transaction consuming
100% CPU and NMI watchdog reporting BUG: soft lockup with
btrfs-transaction:314.
* I ran btrfs rescue zero-log. It took two reboots to get the root fs to mount.
* the system appeared to be operating normally for about 30 minutes,
then the root fs went read-only again
* I ran btrfs rescue zero-log (again!)
* btrfs continues to report errors. I have linked to screen pictures (imgur.com)
* btrfs check --repair is giving an error (mentioned above and in the
screen pictures)

All previous information is copied below in this message for easy reference.

What do I do? Replace the disk? Stop using btrfs with dm-crypt? Look
elsewhere for help?


On Tue, Aug 9, 2016 at 6:54 PM, Dave T  wrote:
> The original problem from 2 days ago just happened again. I ran btrfs
> rescue zero-log (again) and the root filesystem mounted but it was
> read-only on first boot. I rebooted again and everything seems normal.
> But clearly there is a problem that needs to be resolved.
>
> Problem:
> The root file system becomes read-only during normal usage. See copied
> message below for more information about the error.
>
> I'm happy to provide more info upon request. I appreciate any help. Is
> this btrfs? A bad disk? Something else?
>
> Linux x99 4.6.4-1-ARCH #1 SMP PREEMPT Mon Jul 11 19:12:32 CEST 2016
> x86_64 GNU/Linux
>
> btrfs-progs v4.6.1
>
>
> On Tue, Aug 9, 2016 at 2:07 PM, Dave T  wrote:
>> My system locked up with btrfs-transaction consuming 100% CPU and NMI
>> watchdog reporting BUG: soft lockup with btrfs-transaction:314.
>>
>> This comes 2 days after a serious event involving BTRFS where my
>> system would not mount the root fs. (I gave details in an email to the
>> list two days ago and copied again below.)
>>
 >> Here are full details of today's "bug" (or whatever it was).
>>
 >> When I left work last night I left my system running and I locked the
>> session. The only things open were KDE Plasma, some terminal windows
>> and some plain text documents in Kate editor. No real work was running
>> on the local machine.
>>
>> This morning I came to work and noticed that my computer was slightly
>> warm and the fans were running at higher than normal RPM.
>>
>> I logged in and opened top in an existing terminal. I saw that
>> btrfs-transaction was consuming 100% of a CPU core and kworker was
 >> consuming 100% of another CPU core.
>>
>> I tried to run a command (to view logs) in another terminal window,
>> but the system became unresponsive. I was able to switch to another
>> virtual console, but it was very slow. I took photos with my phone.
>> See link below for two images (top and virtual console):
>>
>> http://imgur.com/a/fT1RV
>>
>> These photos show what I reported above:
>> * btrfs-transaction consuming 100% CPU
>> * NMI watchdog reporting BUG: soft lockup with btrfs-transaction:314
>>
>> I hard reset my system, expecting the worst, but it rebooted normally.
>> journalctl -xb -p3 showed no entries.
>>
>> Obviously I have a serious problem. However, I have no clue about what
>> the problem might be (except that it seemingly involves btrfs). What
>> other information can I provide?
>>
>> On Sun, Aug 7, 2016 at 6:44 PM, Dave  wrote:
>>> I am running Arch Linux on a system with full disk encryption and the
>>> storage is a Samsung 950 Pro NVMe drive (512 GB). The computer is a
>>> couple months old. No bad behavior until now. (I'm only using 21 GB of
>>> the 512 space on the disk.)
>>>
>>> btrfs-pro

Re: system locked up with btrfs-transaction consuming 100% CPU

2016-08-09 Thread Dave T
The original problem from 2 days ago just happened again. I ran btrfs
rescue zero-log (again) and the root filesystem mounted but it was
read-only on first boot. I rebooted again and everything seems normal.
But clearly there is a problem that needs to be resolved.

Problem:
The root file system becomes read-only during normal usage. See copied
message below for more information about the error.

I'm happy to provide more info upon request. I appreciate any help. Is
this btrfs? A bad disk? Something else?

Linux x99 4.6.4-1-ARCH #1 SMP PREEMPT Mon Jul 11 19:12:32 CEST 2016
x86_64 GNU/Linux

btrfs-progs v4.6.1


On Tue, Aug 9, 2016 at 2:07 PM, Dave T  wrote:
> My system locked up with btrfs-transaction consuming 100% CPU and NMI
> watchdog reporting BUG: soft lockup with btrfs-transaction:314.
>
> This comes 2 days after a serious event involving BTRFS where my
> system would not mount the root fs. (I gave details in an email to the
> list two days ago and copied again below.)
>
> Here are full details of today's "bug" (or whatever it was).
>
> When I left work last night I left my system running and I locked the
> session. The only things open were KDE Plasma, some terminal windows
> and some plain text documents in Kate editor. No real work was running
> on the local machine.
>
> This morning I came to work and noticed that my computer was slightly
> warm and the fans were running at higher than normal RPM.
>
> I logged in and opened top in an existing terminal. I saw that
> btrfs-transaction was consuming 100% of a CPU core and kworker was
> consuming 100% of another CPU core.
>
> I tried to run a command (to view logs) in another terminal window,
> but the system became unresponsive. I was able to switch to another
> virtual console, but it was very slow. I took photos with my phone.
> See link below for two images (top and virtual console):
>
> http://imgur.com/a/fT1RV
>
> These photos show what I reported above:
> * btrfs-transaction consuming 100% CPU
> * NMI watchdog reporting BUG: soft lockup with btrfs-transaction:314
>
> I hard reset my system, expecting the worst, but it rebooted normally.
> journalctl -xb -p3 showed no entries.
>
> Obviously I have a serious problem. However, I have no clue about what
> the problem might be (except that it seemingly involves btrfs). What
> other information can I provide?
>
> On Sun, Aug 7, 2016 at 6:44 PM, Dave  wrote:
>> I am running Arch Linux on a system with full disk encryption and the
>> storage is a Samsung 950 Pro NVMe drive (512 GB). The computer is a
>> couple months old. No bad behavior until now. (I'm only using 21 GB of
>> the 512 space on the disk.)
>>
>> btrfs-progs v4.5.1
>>
>> Today I was using my system normally and browsing the web. Firefox
>> stopped responding suddenly and for no apparent reason. Then (KDE)
>> Plasma stopped responding. I could not log out of KDE.
>>
>> I killed my user session (pkill -u me), then I tried to startx. At
>> that point I noticed my root filesystem was read-only.
>>
>> As a first step, I rebooted. That didn't help anything. I tried
>> rebooting several more times -- no change.
>>
>> The root filesystem (btrfs) would not mount. (See error below.) I
>> booted into a LiveUSB environment and ran this command:
>>
>> cryptsetup open --type luks /dev/xxx cryptroot
>>
>> It opens. Then I ran:
>>
>> mount -t btrfs -o
>> noatime,nodiratime,ssd,compress=lzo,defaults,space_cache,subvolid=257
>> /dev/mapper/cryptroot /mnt
>>
>> The error message is shown here:
>>
>> [ 2300.967048] BTRFS info (device dm-0): use ssd allocation scheme
>> [ 2300.967058] BTRFS info (device dm-0): use lzo compression
>> [ 2300.967066] BTRFS info (device dm-0): disk space caching is enabled
>> [ 2300.967069] BTRFS: has skinny extents
>> [ 2300.995393] BTRFS: error (device dm-0) in
>> btrfs_replay_log:2413: errno=-22 unknown (Failed to recover log tree)
>> [ 2300.997617] BTRFS info (device dm-0): delayed_refs has NO entry
>> [ 2300.997673] BTRFS error (device dm-0): cleaner transaction
>> attach returned -30
>> [ 2301.035405] BTRFS: open_ctree failed
>>
>> It is exactly the same error I saw when trying to boot normally as
>> mentioned above.
>>
>> Based on these two links:
>>
>>> https://btrfs.wiki.kernel.org/index.php/Problem_FAQ
>>> https://btrfs.wiki.kernel.org/index.php/Btrfs-zero-log
>>
>> I decided to take a chance on running this command:
>>
>> btrfs rescue zero-log
>>
>> That worked and I can mount the filesystem.
>>

Re: system locked up with btrfs-transaction consuming 100% CPU

2016-08-09 Thread Dave T
Thank you for the info, Duncan.

I will use Alt-sysrq-s alt-sysrq-u alt-sysrq-b. This is the best
description / recommendation I've read on the subject. I had read
about these special key sequences before but I could never remember
them and I didn't fully understand what they did. Now you have given
me the understanding as well as an easy-to-remember method. I'll use
it.
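
(For my own notes: the same s/u/b actions can also be triggered from a
surviving shell, e.g. over ssh when the console is dead, by writing to
/proc/sysrq-trigger -- assuming kernel.sysrq permits it. Obviously the
last command reboots the machine immediately, so this is illustrative,
not something to paste blindly.)

```shell
# Requires root; enable all sysrq functions first if the distro
# restricts them via kernel.sysrq.
echo 1 > /proc/sys/kernel/sysrq

echo s > /proc/sysrq-trigger   # sync dirty data to storage
echo u > /proc/sysrq-trigger   # emergency remount read-only
echo b > /proc/sysrq-trigger   # immediate reboot -- last resort
```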

I launch KDE the same way you do (no DM). I also run a triple-monitor
setup, but I am using an NVIDIA GTX 1070 (and the proprietary driver),
for the time being.

My system does not have any issues when the monitors go to sleep. That
happens many times a day as I have a short timeout set.

I am very concerned about this primary problem (or problems) and I
hope I can find some understanding of what is going on. BTRFS has
worked well for me since 2012. While that's fantastic, it also means I
haven't had to troubleshoot it in the past. Now (because of 4 years of
problem-free operation) I'm using it on a critical production system.
I have backups, but I cannot allow these problems to go unresolved.

On Tue, Aug 9, 2016 at 5:32 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Dave T posted on Tue, 09 Aug 2016 14:07:46 -0400 as excerpted:
>
>> I hard reset my system, expecting the worst, but it rebooted normally.
>> journalctl -xb -p3 showed no entries.
>
> I don't have any suggestions for your primary problem, tho I do have a
> comment down below, but I do have a suggestion regarding your "hard
> reset".
>
> Consider doing some reading on "magic sysrequest", aka sysrq aka srq.
>
> $KERNDIR/Documentation/sysrq.txt , and there's lots of googlable articles
> about it as well.
>
> Basically, when you'd otherwise do a hard reset, try a series of triple-
> key chords, alt-sysrq-<key>, first.  (Sysrq is printscreen when alt
> isn't pressed with it, so each chord is alt-sysrq-<thirdkey>.)
>
> The longer form of the emergency sequence is reisub -- you can read what
> the r-e-i keys do in the documentation -- but from my own experience, I
> find when the system's in bad enough shape I need to do an emergency
> reboot, these keys don't do much for me, while the last three, sub, often
> (but not always) do, and they're much easier to remember, so...
>
> Alt-sysrq-s alt-sysrq-u alt-sysrq-b
>
> s=Sync.  If the kernel is still alive and believes it's still stable
> enough to write to permanent storage without risking writing somewhere it
> shouldn't, this will force all write-cached "dirty" data to be written
> out.
>
> You can safely do an alt-srq-s at any time, and continue working, as it
> forces cached writes to be written out, but doesn't otherwise interfere
> with the running system.  As such, alt-srq-s is a useful sequence to use
> right before you do anything you suspect /might/ crash the system, like
> starting X with a new graphics driver.
>
> u=remoUnt-read-only.  Again, if the kernel is alive and stable, this will
> remount all filesystems read-only, allowing them to safely clean up in
> the process.  The action carries down to sub-filesystem layers like
> dmcrypt as well.
>
> Note that this is an emergency remount-read-only, so it's a bit more
> forceful regarding open files that would block an ordinary remount-
> readonly.  As such, consider the system unusable after doing an alt-srq-
> u, and shutdown or reboot immediately.
>
> b=reBoot.  This forces the kernel to do an immediate reboot, without
> syncing or remounting, etc.  Thus the s and u come first, to sync and remount.
>
>
> Besides being a bit safer than a hard reset, since when it works it
> allows the system to sync and cleanup the filesystems before the reboot,
> this also serves as a crude but effective method of finding out just how
> severely the system was locked up.  If the sync and remount steps light
> up your storage I/O activity LED, you know the kernel considered itself
> in pretty good shape, even if userspace was lost and there was no display
> at all.  If there's no response to them but the reboot step works, you
> know the kernel was still alive enough to respond, but either there
> wasn't anything dirty to write out, or more likely, the kernel believed
> itself to be corrupted, and thus didn't trust its ability to write to
> permanent storage without risking scribbling on other parts of the device
> (other files, perhaps even other partitions).  And of course if none of
> them work and you /do/ have to do a hard reboot, then you know the kernel
> itself was dead, at least to the point it could no longer respond at all
> to magic srq.
>
>
> As to the comment... I'm running plasma/kde5 on gentoo, here, but I'm
> running upstream-kde's live-git versi

system locked up with btrfs-transaction consuming 100% CPU

2016-08-09 Thread Dave T
My system locked up with btrfs-transaction consuming 100% CPU and NMI
watchdog reporting BUG: soft lockup with btrfs-transaction:314.

This comes 2 days after a serious event involving BTRFS where my
system would not mount the root fs. (I gave details in an email to the
list two days ago and copied again below.)

Here are full details of today's "bug" (or whatever it was).

When I left work last night I left my system running and I locked the
session. The only things open were KDE Plasma, some terminal windows
and some plain text documents in Kate editor. No real work was running
on the local machine.

This morning I came to work and noticed that my computer was slightly
warm and the fans were running at higher than normal RPM.

I logged in and opened top in an existing terminal. I saw that
btrfs-transaction was consuming 100% of a CPU core and kworker was
consuming 100% of another CPU core.

I tried to run a command (to view logs) in another terminal window,
but the system became unresponsive. I was able to switch to another
virtual console, but it was very slow. I took photos with my phone.
See link below for two images (top and virtual console):

http://imgur.com/a/fT1RV

These photos show what I reported above:
* btrfs-transaction consuming 100% CPU
* NMI watchdog reporting BUG: soft lockup with btrfs-transaction:314

I hard reset my system, expecting the worst, but it rebooted normally.
journalctl -xb -p3 showed no entries.

Obviously I have a serious problem. However, I have no clue about what
the problem might be (except that it seemingly involves btrfs). What
other information can I provide?

On Sun, Aug 7, 2016 at 6:44 PM, Dave  wrote:
> I am running Arch Linux on a system with full disk encryption and the
> storage is a Samsung 950 Pro NVMe drive (512 GB). The computer is a
> couple months old. No bad behavior until now. (I'm only using 21 GB of
> the 512 space on the disk.)
>
> btrfs-progs v4.5.1
>
> Today I was using my system normally and browsing the web. Firefox
> stopped responding suddenly and for no apparent reason. Then (KDE)
> Plasma stopped responding. I could not log out of KDE.
>
> I killed my user session (pkill -u me), then I tried to startx. At
> that point I noticed my root filesystem was read-only.
>
> As a first step, I rebooted. That didn't help anything. I tried
> rebooting several more times -- no change.
>
> The root filesystem (btrfs) would not mount. (See error below.) I
> booted into a LiveUSB environment and ran this command:
>
> cryptsetup open --type luks /dev/xxx cryptroot
>
> It opens. Then I ran:
>
> mount -t btrfs -o
> noatime,nodiratime,ssd,compress=lzo,defaults,space_cache,subvolid=257
> /dev/mapper/cryptroot /mnt
>
> The error message is shown here:
>
> [ 2300.967048] BTRFS info (device dm-0): use ssd allocation scheme
> [ 2300.967058] BTRFS info (device dm-0): use lzo compression
> [ 2300.967066] BTRFS info (device dm-0): disk space caching is enabled
> [ 2300.967069] BTRFS: has skinny extents
> [ 2300.995393] BTRFS: error (device dm-0) in
> btrfs_replay_log:2413: errno=-22 unknown (Failed to recover log tree)
> [ 2300.997617] BTRFS info (device dm-0): delayed_refs has NO entry
> [ 2300.997673] BTRFS error (device dm-0): cleaner transaction
> attach returned -30
> [ 2301.035405] BTRFS: open_ctree failed
>
> It is exactly the same error I saw when trying to boot normally as
> mentioned above.
>
> Based on these two links:
>
>> https://btrfs.wiki.kernel.org/index.php/Problem_FAQ
>> https://btrfs.wiki.kernel.org/index.php/Btrfs-zero-log
>
> I decided to take a chance on running this command:
>
> btrfs rescue zero-log
>
> That worked and I can mount the filesystem.
>
> I ran btrfs check --repair. Here is the output:
>
> root@broken / # umount /mnt
> root@broken / # btrfs check --repair /dev/mapper/cryptroot
> enabling repair mode
> Checking filesystem on /dev/mapper/cryptroot
> checking extents
> bad metadata [292414476288, 292414492672) crossing stripe boundary
> bad metadata [292414541824, 292414558208) crossing stripe boundary
> bad metadata [292414672896, 292414689280) crossing stripe boundary
> bad metadata [292414869504, 292414885888) crossing stripe boundary
> bad metadata [292415000576, 292415016960) crossing stripe boundary
> bad metadata [292415066112, 292415082496) crossing stripe boundary
> bad metadata [292415131648, 292415148032) crossing stripe boundary
> bad metadata [292415262720, 292415279104) crossing stripe boundary
> bad metadata [292415328256, 292415344640) crossing stripe boundary
> bad metadata [292415393792, 292415410176) crossing stripe boundary
> repaired damaged extent references
> 

Re: [PATCH 13/17] xfs: test swapext with reflink

2016-08-08 Thread Dave Chinner
On Mon, Aug 08, 2016 at 10:41:32AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 08, 2016 at 04:13:59PM +0800, Eryu Guan wrote:
> > On Thu, Jul 21, 2016 at 04:47:38PM -0700, Darrick J. Wong wrote:
> > > Add a few tests to stress the new swapext code for reflink and rmap.
> > > +_reflink_range "$testdir/file1" 0 "$testdir/file2" 0 $bytes >> 
> > > "$seqres.full"
> > > +
> > > +echo "Defrag the big file"
> > > +old_nextents=$(xfs_io -c 'stat -v' $testdir/file1 | grep 'nextents' | 
> > > cut -d ' ' -f 3)
> > 
> > There's a "_count_extents" helper, does that work for this case?
> 
> It can, though stat -v reports GETFSXATTR results, which should be faster than
> _count_extents because the latter FIEMAPs the entire file and counts lines.
> Seeing as XFS records the extent count in the inode, we might as well use it.

Perhaps put a special xfs case in _count_extents() that does this
rather than FIEMAP?
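
Something like this untested sketch, perhaps (helper and variable names
follow xfstests conventions; the non-XFS branch keeps FIEMAP-based
counting):

```shell
_count_extents()
{
	local file=$1

	if [ "$FSTYP" = "xfs" ]; then
		# XFS records the extent count in the inode, so ask the
		# kernel directly instead of FIEMAPing the whole file.
		$XFS_IO_PROG -c 'stat -v' "$file" | \
			grep 'nextents' | cut -d ' ' -f 3
	else
		# Count data extents reported by FIEMAP, skipping holes.
		$XFS_IO_PROG -c "fiemap" "$file" | tail -n +2 | \
			grep -v hole | wc -l
	fi
}
```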

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


btrfs open_ctree failed - root fs will not mount

2016-08-07 Thread Dave
I am running Arch Linux on a system with full disk encryption and the
storage is a Samsung 950 Pro NVMe drive (512 GB). The computer is a
couple months old. No bad behavior until now. (I'm only using 21 GB of
the 512 space on the disk.)

btrfs-progs v4.5.1

Today I was using my system normally and browsing the web. Firefox
stopped responding suddenly and for no apparent reason. Then (KDE)
Plasma stopped responding. I could not log out of KDE.

I killed my user session (pkill -u me), then I tried to startx. At
that point I noticed my root filesystem was read-only.

As a first step, I rebooted. That didn't help anything. I tried
rebooting several more times -- no change.

The root filesystem (btrfs) would not mount. (See error below.) I
booted into a LiveUSB environment and ran this command:

cryptsetup open --type luks /dev/xxx cryptroot

It opens. Then I ran:

mount -t btrfs -o
noatime,nodiratime,ssd,compress=lzo,defaults,space_cache,subvolid=257
/dev/mapper/cryptroot /mnt

The error message is shown here:

[ 2300.967048] BTRFS info (device dm-0): use ssd allocation scheme
[ 2300.967058] BTRFS info (device dm-0): use lzo compression
[ 2300.967066] BTRFS info (device dm-0): disk space caching is enabled
[ 2300.967069] BTRFS: has skinny extents
[ 2300.995393] BTRFS: error (device dm-0) in
btrfs_replay_log:2413: errno=-22 unknown (Failed to recover log tree)
[ 2300.997617] BTRFS info (device dm-0): delayed_refs has NO entry
[ 2300.997673] BTRFS error (device dm-0): cleaner transaction
attach returned -30
[ 2301.035405] BTRFS: open_ctree failed

It is exactly the same error I saw when trying to boot normally as
mentioned above.

Based on these two links:

> https://btrfs.wiki.kernel.org/index.php/Problem_FAQ
> https://btrfs.wiki.kernel.org/index.php/Btrfs-zero-log

I decided to take a chance on running this command:

btrfs rescue zero-log

That worked and I can mount the filesystem.

I ran btrfs check --repair. Here is the output:

root@broken / # umount /mnt
root@broken / # btrfs check --repair /dev/mapper/cryptroot
enabling repair mode
Checking filesystem on /dev/mapper/cryptroot
checking extents
bad metadata [292414476288, 292414492672) crossing stripe boundary
bad metadata [292414541824, 292414558208) crossing stripe boundary
bad metadata [292414672896, 292414689280) crossing stripe boundary
bad metadata [292414869504, 292414885888) crossing stripe boundary
bad metadata [292415000576, 292415016960) crossing stripe boundary
bad metadata [292415066112, 292415082496) crossing stripe boundary
bad metadata [292415131648, 292415148032) crossing stripe boundary
bad metadata [292415262720, 292415279104) crossing stripe boundary
bad metadata [292415328256, 292415344640) crossing stripe boundary
bad metadata [292415393792, 292415410176) crossing stripe boundary
repaired damaged extent references
Fixed 0 roots.
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
checking csums
checking root refs
checking quota groups
Ignoring qgroup relation key 258
Ignoring qgroup relation key 263
Ignoring qgroup relation key 71776119061217538
Ignoring qgroup relation key 71776119061217543
Counts for qgroup id: 257 are different
our:referenced 10412273664 referenced compressed 10412273664
disk:   referenced 10411311104 referenced compressed 10411311104
diff:   referenced 962560 referenced compressed 962560
our:exclusive 10412273664 exclusive compressed 10412273664
disk:   exclusive 10412273664 exclusive compressed 10412273664
found 21570773057 bytes used err is 0
total csum bytes: 19563456
total tree bytes: 403767296
total fs tree bytes: 349667328
total extent tree bytes: 27328512
btree space waste bytes: 66313360
file data blocks allocated: 39882014720
referenced 28043988992
extent buffer leak: start 20987904 len 16384
extent buffer leak: start 292688068608 len 16384
extent buffer leak: start 60915712 len 16384
extent buffer leak: start 29569581056 len 16384
extent buffer leak: start 29569597440 len 16384
extent buffer leak: start 292412063744 len 16384
extent buffer leak: start 292405870592 len 16384
extent buffer leak: start 292405936128 len 16384
extent buffer leak: start 292413964288 len 16384

Then I check dmesg and I see this error information:

[ 4925.562422] BTRFS info (device dm-0): use ssd allocation scheme
[ 4925.562432] BTRFS info (device dm-0): use lzo compression
[ 4925.562440] BTRFS info (device dm-0): disk space caching is enabled
[ 4925.562444] BTRFS: has skinny extents
[ 4925.578705] BTRFS error (device dm-0): qgroup generation mismatch, marked as inconsistent
[ 4925.584033] BTRFS: checking UUID tree

What should I do next? I'm a simple 

Re: [4.8] btrfs heats my room with lock contention

2016-08-04 Thread Dave Chinner
On Thu, Aug 04, 2016 at 10:28:44AM -0400, Chris Mason wrote:
> 
> 
> On 08/04/2016 02:41 AM, Dave Chinner wrote:
> >
> >Simple test. 8GB pmem device on a 16p machine:
> >
> ># mkfs.btrfs /dev/pmem1
> ># mount /dev/pmem1 /mnt/scratch
> ># dbench -t 60 -D /mnt/scratch 16
> >
> >And heat your room with the warm air rising from your CPUs. Top
> >half of the btrfs profile looks like:
.
 > >Performance vs CPU usage is:
> >
> >nprocs   throughput  cpu usage
> >1440MB/s  50%
> >2770MB/s 100%
> >4880MB/s 250%
> >8690MB/s 450%
> >16   280MB/s 950%
> >
 > >In comparison, at 8-16 threads ext4 is running at ~2600MB/s and
> >XFS is running at ~3800MB/s. Even if I throw 300-400 processes at
> >ext4 and XFS, they only drop to ~1500-2000MB/s as they hit internal
> >limits.
> >
> Yes, with dbench btrfs does much much better if you make a subvol
> per dbench dir.  The difference is pretty dramatic.  I'm working on
> it this month, but focusing more on database workloads right now.

You've been giving this answer to lock contention reports for the
past 6-7 years, Chris.  I really don't care about getting big
benchmark numbers with contrived setups - the "use multiple
subvolumes" solution is simply not practical for users or their
workloads.  The default config should behave sanely and not
contribute to global warming like this.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[4.8] btrfs heats my room with lock contention

2016-08-03 Thread Dave Chinner

Simple test. 8GB pmem device on a 16p machine:

# mkfs.btrfs /dev/pmem1
# mount /dev/pmem1 /mnt/scratch
# dbench -t 60 -D /mnt/scratch 16

And heat your room with the warm air rising from your CPUs. Top
half of the btrfs profile looks like:

  36.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
  32.29%  [kernel]  [k] native_queued_spin_lock_slowpath
   5.14%  [kernel]  [k] queued_write_lock_slowpath
   2.46%  [kernel]  [k] _raw_spin_unlock_irq
   2.15%  [kernel]  [k] queued_read_lock_slowpath
   1.54%  [kernel]  [k] _find_next_bit.part.0
   1.06%  [kernel]  [k] __crc32c_le
   0.82%  [kernel]  [k] btrfs_tree_lock
   0.79%  [kernel]  [k] steal_from_bitmap.part.29
   0.70%  [kernel]  [k] __copy_user_nocache
   0.69%  [kernel]  [k] btrfs_tree_read_lock
   0.69%  [kernel]  [k] delay_tsc
   0.64%  [kernel]  [k] btrfs_set_lock_blocking_rw
   0.63%  [kernel]  [k] copy_user_generic_string
   0.51%  [kernel]  [k] do_raw_read_unlock
   0.48%  [kernel]  [k] do_raw_spin_lock
   0.47%  [kernel]  [k] do_raw_read_lock
   0.46%  [kernel]  [k] btrfs_clear_lock_blocking_rw
   0.44%  [kernel]  [k] do_raw_write_lock
   0.41%  [kernel]  [k] __do_softirq
   0.28%  [kernel]  [k] __memcpy
   0.24%  [kernel]  [k] map_private_extent_buffer
   0.23%  [kernel]  [k] find_next_zero_bit
   0.22%  [kernel]  [k] btrfs_tree_read_unlock

Performance vs CPU usage is:

nprocs  throughput  cpu usage
1   440MB/s  50%
2   770MB/s 100%
4   880MB/s 250%
8   690MB/s 450%
16  280MB/s 950%

In comparison, at 8-16 threads ext4 is running at ~2600MB/s and
XFS is running at ~3800MB/s. Even if I throw 300-400 processes at
ext4 and XFS, they only drop to ~1500-2000MB/s as they hit internal
limits.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: A lot warnings in dmesg while running thunderbird

2016-07-24 Thread Dave Chinner
On Fri, Jul 08, 2016 at 12:02:35PM -0400, Chris Mason wrote:
> Can you please run the attached test program:
> 
> gcc -o short-write short-write.c -lpthread
> ./short-write some-new-file-on-btrfs

Hi Chris, this seems like a useful thing to be testing on a regular
basis - can you turn this into an xfstests regression test and
submit it?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] generic/371: run write(2) and fallocate(2) in parallel

2016-07-21 Thread Dave Chinner
On Thu, Jul 21, 2016 at 06:41:00PM +0800, Eryu Guan wrote:
> On Thu, Jul 21, 2016 at 03:30:25PM +0800, Wang Xiaoguang wrote:
> > +
> > +run_time=$((180 * $TIME_FACTOR))
> 
> 180s is too long time, I can reproduce it in around 10s on my test vm,
> just loop for 100 times for each operation (pwrite and falloc)

Or simply do:

run_time=$((20 * $TIME_FACTOR * $TIME_FACTOR))

So that the time factor scales up quickly when you want to run long
running tests.

> > --- a/tests/generic/group
> > +++ b/tests/generic/group
> > @@ -373,3 +373,4 @@
> >  368 auto quick richacl
> >  369 auto quick richacl
> >  370 auto quick richacl
> > +371 auto enospc prealloc stress
> 
> So we can add 'quick' group and remove 'stress'.

I'd leave the test in stress if we can run it based on time as per
my above suggestion.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: btrfs: Test send on heavily deduped file

2016-07-21 Thread Dave Chinner
On Thu, Jul 21, 2016 at 10:05:25AM +0800, Qu Wenruo wrote:
> 
> 
> At 07/21/2016 07:37 AM, Dave Chinner wrote:
> >On Wed, Jul 20, 2016 at 03:40:29PM +0800, Qu Wenruo wrote:
> >>At 07/20/2016 03:01 PM, Eryu Guan wrote:
> >>>On Tue, Jul 19, 2016 at 01:42:03PM +0800, Qu Wenruo wrote:
> >>>>>
> >>>>>This test uses $LOAD_FACTOR, so it should be in 'stress' group. And it
> >>>>>hangs the latest kernel, stop other tests from running, I think we can
> >>>>>add it to 'dangerous' group as well.
> >>>>>
> >>>>
> >>>>Thanks for this info.
> >>>>I'm completely OK to add this group to 'stress' and 'dangerous'.
> >>>>
> >>>>
> >>>>However I'm a little curious about the meaning/standard of these groups.
> >>>>
> >>>>Does 'dangerous' conflict with 'auto'?
> >>>>Since under most case, tester would just execute './check -g auto' and the
> >>>>system hangs at the test case.
> >>>>So I'm a little confused with the 'auto' group.
> >>>
> >>>I quote my previous email here to explain the 'auto' group
> >>>http://www.spinics.net/lists/fstests/msg03262.html
> >>>
> >>>"
> >>>I searched for Dave's explanations on 'auto' group in his reviews, and
> >>>got the following definitions:
> >>>
> >>>- it should be a valid & reliable test (it's finished and has
> >>> deterministic output) [1]
> >>>- it passes on current upstream kernels, if it fails, it's likely to be
> >>> resolved in forseeable future [2]
> >>>- it should take no longer than 5 minutes to finish [3]
> >>>"
> >>>
> >>>And "The only difference between quick and auto group criteria is the
> >>>test runtime." Usually 'quick' tests finish within 30s.
> >>>
> >>>For the 'dangerous' group, it was added in commit 3f28d55c3954 ("add
> >>>freeze and dangerous groups"), and seems that it didn't have a very
> >>>clear definition[*]. But I think any test that could hang/crash recent
> >>>kernels is considered as dangerous.
> >>>
> >>>* http://oss.sgi.com/archives/xfs/2012-03/msg00073.html
> >>
> >>Thanks for all the info, really helps a lot.
> >>
> >>Especially for quick and auto difference.
> >>
> >>Would you mind me applying this standard to current btrfs test cases?
> >
> >It should be applied to all test cases, regardless of the filesystem
> >type.
> 
> It's straightforward for specific fs test cases.
> 
> But for generic, I'm a little concerned of the quick/auto standard.
> Should we use result of one single fs(and which fs? I assume xfs
> though) or all fs, to determine quick/auto group?

It's up to the person introducing the new test to determine how it
should be classified. If this causes problems for other people, then
they can send patches to reclassify it appropriately based on their
runtime numbers and configuration.

Historically speaking, we've tended to ignore btrfs performance
because it's been so randomly terrible. It's so often been a massive
outlier that we've generally considered btrfs performance to be a
bug that needs fixing and not something that requires the test to be
reclassified for everyone else.

> So it makes quick/auto tag quite hard to determine.

It's quite straightforward, really. Send patches with numbers for
the tests you want reclassified, and lots of people will say "yes, I
see that too" or "no, that only takes 2s on my smallest, slowest
test machine, it should remain as a quick test". And that's about
it.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: btrfs: Test send on heavily deduped file

2016-07-20 Thread Dave Chinner
On Wed, Jul 20, 2016 at 03:40:29PM +0800, Qu Wenruo wrote:
> At 07/20/2016 03:01 PM, Eryu Guan wrote:
> >On Tue, Jul 19, 2016 at 01:42:03PM +0800, Qu Wenruo wrote:
> >>>
> >>>This test uses $LOAD_FACTOR, so it should be in 'stress' group. And it
> >>>hangs the latest kernel, stop other tests from running, I think we can
> >>>add it to 'dangerous' group as well.
> >>>
> >>
> >>Thanks for this info.
> >>I'm completely OK to add this group to 'stress' and 'dangerous'.
> >>
> >>
> >>However I'm a little curious about the meaning/standard of these groups.
> >>
> >>Does 'dangerous' conflict with 'auto'?
> >>Since under most case, tester would just execute './check -g auto' and the
> >>system hangs at the test case.
> >>So I'm a little confused with the 'auto' group.
> >
> >I quote my previous email here to explain the 'auto' group
> >http://www.spinics.net/lists/fstests/msg03262.html
> >
> >"
> >I searched for Dave's explanations on 'auto' group in his reviews, and
> >got the following definitions:
> >
> >- it should be a valid & reliable test (it's finished and has
> >  deterministic output) [1]
> >- it passes on current upstream kernels, if it fails, it's likely to be
> >  resolved in forseeable future [2]
> >- it should take no longer than 5 minutes to finish [3]
> >"
> >
> >And "The only difference between quick and auto group criteria is the
> >test runtime." Usually 'quick' tests finish within 30s.
> >
> >For the 'dangerous' group, it was added in commit 3f28d55c3954 ("add
> >freeze and dangerous groups"), and seems that it didn't have a very
> >clear definition[*]. But I think any test that could hang/crash recent
> >kernels is considered as dangerous.
> >
> >* http://oss.sgi.com/archives/xfs/2012-03/msg00073.html
> 
> Thanks for all the info, really helps a lot.
> 
> Especially for quick and auto difference.
> 
> Would you mind me applying this standard to current btrfs test cases?

It should be applied to all test cases, regardless of the filesystem
type.

> BTW, does the standard apply to *ALL* possible mount options or just
> *default* mount option?

Generally it applies to the default case.

> For example, btrfs/011 can finish in about 5min with default mount
> option, but for 'nodatasum' it can take up to 2 hours.
> So should it belong to 'auto'?

Yes. Also, keep in mind that runtime is dependent on the type of
storage you are testing on. The general idea is that the
"< 30s quick, < 5m auto" rule is based on how long the test takes to
run on a local single spindle SATA drive, as that is the basic
hardware we'd expect people to be testing against. This means that a
test that takes 20s on your SSD might not be a "quick" test because
it takes 5 minutes on spinning rust.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: btrfs: Test send on heavily deduped file

2016-07-20 Thread Dave Chinner
On Wed, Jul 20, 2016 at 03:01:00PM +0800, Eryu Guan wrote:
> For running tests, "./check -g auto -x dangerous" might fit your need.

Yes, that's precisely the way the dangerous group is intended to be
used: as a exclusion filter that gets applied to other test group
definitions.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 19/20] xfs: run xfs_repair at the end of each test

2016-07-06 Thread Dave Chinner
On Mon, Jul 04, 2016 at 09:11:34PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 05, 2016 at 11:56:17AM +0800, Eryu Guan wrote:
> > On Thu, Jun 16, 2016 at 06:48:01PM -0700, Darrick J. Wong wrote:
> > > Run xfs_repair twice at the end of each test -- once to rebuild
> > > the btree indices, and again with -n to check the rebuild work.
> > > 
> > > Signed-off-by: Darrick J. Wong 
> > > ---
> > >  common/rc |3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > 
> > > diff --git a/common/rc b/common/rc
> > > index 1225047..847191e 100644
> > > --- a/common/rc
> > > +++ b/common/rc
> > > @@ -2225,6 +2225,9 @@ _check_xfs_filesystem()
> > >  ok=0
> > >  fi
> > >  
> > > +$XFS_REPAIR_PROG $extra_options $extra_log_options $extra_rt_options $device >$tmp.repair 2>&1
> > > +cat $tmp.repair | _fix_malloc>>$seqres.full
> > > +
> > 
> > Won't this hide fs corruptions? Did I miss anything?
> 
> I could've sworn it did:
> 
> xfs_repair -n
> (complain if corrupt)
> 
> xfs_repair
> 
> xfs_repair -n
> (complain if still corrupt)
> 
> But that first xfs_repair -n hunk disappeared. :(
> 
> Ok, will fix and resend.

Not sure this is the best idea - when repair on an aged test device
takes 10s, this means the test harness overhead increases by a
factor of 3. i.e. test takes 1s to run, checking the filesystem
between tests now takes 30s. i.e. this will badly blow out the run
time of the test suite on aged test devices.

What does this overhead actually gain us that we couldn't encode
explicitly into a single test or two? e.g. the test itself runs
repair on the aged test device.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: sharing page cache pages between multiple mappings

2016-05-19 Thread Dave Chinner
On Thu, May 19, 2016 at 12:17:14PM +0200, Miklos Szeredi wrote:
> On Thu, May 19, 2016 at 11:05 AM, Michal Hocko  wrote:
> > On Thu 19-05-16 10:20:13, Miklos Szeredi wrote:
> >> Has anyone thought about sharing pages between multiple files?
> >>
> >> The obvious application is for COW filesytems where there are
> >> logically distinct files that physically share data and could easily
> >> share the cache as well if there was infrastructure for it.
> >
> > FYI this has been discussed at LSFMM this year[1]. I wasn't at the
> > session so cannot tell you any details but the LWN article covers it at
> > least briefly.
> 
> Cool, so it's not such a crazy idea.

Oh, it most certainly is crazy. :P

> Darrick, would you mind briefly sharing your ideas regarding this?

The current line of thought is that we'll only attempt this in XFS on
inodes that are known to share underlying physical extents. i.e.
files that have blocks that have been reflinked or deduped.  That
way we can overload the breaking of reflink blocks (via copy on
write) with unsharing the pages in the page cache for that inode.
i.e. shared pages can propagate upwards in overlay if it uses
reflink for copy-up and writes will then break the sharing with the
underlying source without overlay having to do anything special.

Right now I'm not sure what mechanism we will use - we want to
support files that have a mix of private and shared pages, so that
implies we are not going to be sharing mappings but sharing pages
instead.  However, we've been looking at this as being completely
encapsulated within the filesystem because it's tightly linked to
changes in the physical layout of the filesystem, not as general
"share this mapping between two unrelated inodes" infrastructure.
That may change as we dig deeper into it...

> The use case I have is fixing overlayfs weird behavior. The following
> may result in "buf" not matching "data":
> 
> int fr = open("foo", O_RDONLY);
> int fw = open("foo", O_RDWR);
> write(fw, data, sizeof(data));
> read(fr, buf, sizeof(data));
> 
> The reason is that "foo" is on a read-only layer, and opening it for
> read-write triggers copy-up into a read-write layer.  However the old,
> read-only open still refers to the unmodified file.
>
> Fixing this properly requires that when opening a file, we don't
> delegate operations fully to the underlying file, but rather allow
> sharing of pages from underlying file until the file is copied up.  At
> that point we switch to sharing pages with the read-write copy.

Unless I'm missing something here (quite possible!), I'm not sure
we can fix that problem with page cache sharing or reflink. It
implies we are sharing pages in a downwards direction - private
overlay pages/mappings from multiple inodes would need to be shared
with a single underlying shared read-only inode, and I lack the
imagination to see how that works...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: btrfs: Test fiemap ioctl on completely deduped file

2016-05-11 Thread Dave Chinner
On Thu, May 12, 2016 at 08:46:41AM +0800, Qu Wenruo wrote:
> >Filesystem type is: 58465342
> >File size of a is 262144 (64 blocks of 4096 bytes)
> > ext: logical_offset:physical_offset: length:   expected: flags:
> >   0:0..  31: 24..55: 32: shared
> >   1:   32..  63: 24..55: 32: 56: 
> > last,shared,eof
> 
> Also the "shared" flag is different from btrfs, where btrfs is
> wrong, and the btrfs routine to check shared extent caused the soft
> lockup.
> 
> I originally planned to check "shared" flag, but the soft lockup is
> more important, and 8000+ output seems not suitable as golden
> output.

If that's what the test produces for correct behaviour, then there
isn't any problem with having golden output that large. e.g.
tests/xfs/136.out has 7800 lines in its golden output file. There
are quite a few tests with large amounts of output:

$ find . -name *.out -exec ls -s {} \; |sort -nr |head -5
144 ./tests/xfs/136.out
124 ./tests/generic/324.out
120 ./tests/xfs/165.out
116 ./tests/xfs/107.out
92 ./tests/btrfs/034.out
$

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 0/2] scop GFP_NOFS api

2016-05-03 Thread Dave Chinner
On Sun, May 01, 2016 at 08:19:44AM +1000, NeilBrown wrote:
> On Sat, Apr 30 2016, Dave Chinner wrote:
> > Indeed, blocking the superblock shrinker in reclaim is a key part of
> > balancing inode cache pressure in XFS. If the shrinker starts
> > hitting dirty inodes, it blocks on cleaning them, thereby slowing
> > the rate of allocation to that which inodes can be cleaned and
> > reclaimed. There are also background threads that walk ahead freeing
> > clean inodes, but we have to throttle direct reclaim in this manner
> > otherwise the allocation pressure vastly outweighs the ability to
> > reclaim inodes. if we don't balance this, inode allocation triggers
> > the OOM killer because reclaim keeps reporting "no progress being
 > > made" because dirty inodes are skipped. By blocking on such inodes,
> > the shrinker makes progress (slowly) and reclaim sees that memory is
> > being freed and so continues without invoking the OOM killer...
> 
> I'm very aware of the need to throttle allocation based on IO.  I
> remember when NFS didn't quite get this right and filled up memory :-)
> 
> balance_dirty_pages() used to force threads to wait on the write-out of
> one page for every page that they dirtied (or wait on 128 pages for every 128
> dirtied or whatever).  This was exactly to provide the sort of
> throttling you are talking about.
> 
> We don't do that any more.  It was problematic.  I don't recall all the
> reasons but I think that different backing devices having different
> clearance rates was part of the problem.

As the original architect of those changes (the IO-less dirty page
throttling was an evolution of the algorithm I developed for Irix
years before it was done in linux) I remember the reasons very well.

Mostly it was to prevent what we termed "IO breakdown" where so many
different foreground threads were dispatching writeback that it
turned dirty page writeback into random IO instead of being nice,
large sequential IOs.

> So now we monitor clearance rates and wait for some number of blocks to
> be written, rather than waiting for some specific blocks to be written.

Right - we have a limited pool of flusher threads dispatching IO
efficiently as possible, and foreground dirtying processes wait for
that to do the work of cleaning pages.

> We should be able to do the same thing to balance dirty inodes as
> we do to balance dirty pages.

No, we can't. Dirty inode state is deeply tied into filesystem
implementations - it's intertwined with the journal operation in
many filesystems and can't be separated easily.  Indeed, some
filesystems don't even use the VFS for tracking dirty inode state,
nor do they implement the ->write_inode method. Hence neither the VFS
nor the inode cache shrinkers are able to determine if an inode is
dirty or, if it is, trigger writeback of it.

IOWs, inode caches are not unified in allocation, behaviour, design
or implementation like the page cache is, and so balancing dirty
inodes is likely not to be possible.

> >>If it could be changed
> >>to just schedule the IO without waiting for it then I think this
> >>would be safe to be called in any FS allocation context.  It already
> >>uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
> >>if the lock is held.
> >
> > We could, but then we have the same problem as the inode cache -
> > there's no indication of progress going back to the memory reclaim
> > subsystem, nor is reclaim able to throttle memory allocation back to
> > the rate at which reclaim is making progress.
> >
> > There's feedback loops all throughout the XFS reclaim code - it's
> > designed specifically that way - I made changes to the shrinker
> > infrastructure years ago to enable this. It's no different to the
> > dirty page throttling that was done at roughly the same time -
> > that's also one big feedback loop controlled by the rate at which
> > pages can be cleaned. Indeed, it was designed was based on the same
> > premise as all the XFS shrinker code: in steady state conditions
> > we can't allocate a resource faster than we can reclaim it, so we
 > > need to make reclaim as efficient as possible...
> 
> You seem to be referring here to the same change that I was referred to
> above, but seem to be seeing it from a different perspective.

Sure, not many people have the same viewpoint as the person who had
to convince everyone else that it was a sound idea even before
prototypes were written...

> Waiting for inodes to be freed in important.  Waiting for any one
> specific inode to be freed is dangerous.

Sure. But there's a difference between waitin

Re: [PATCH 0/2] scop GFP_NOFS api

2016-04-29 Thread Dave Chinner
On Fri, Apr 29, 2016 at 02:04:18PM +0200, Michal Hocko wrote:
> I would also like to revisit generic inode/dentry shrinker and see
> whether it could be more __GFP_FS friendly. As you say many FS might
> even not depend on some FS internal locks so pushing GFP_FS check down
> the layers might make a lot of sense and allow to clean some [id]cache
> even for __GFP_FS context.

That's precisely my point about passing a context to the shrinker.
It's recursion within a single superblock context that makes up the
majority of cases GFP_NOFS is used for, so passing the superblock
immediately allows for reclaim to run the superblock shrinker on
every other superblock.

We can refine it further from there, but I strongly suspect that
further refinement is going to require filesystems to specifically
configure the superblock shrinker.

e.g. in XFS, we can't allow evict() even on clean VFS inodes in a
PF_FSTRANS context, because we may run a transaction on a clean
VFS inode to prepare it for reclaim.  We can, however,
allow the fs-specific shrinker callouts to run (i.e. call into
.free_cached_objects) so that it can reclaim clean XFS inodes
because that doesn't require transactions.

i.e. the infrastructure I suggested we use is aimed directly at
providing the mechanism required for finer-grained inode/dentry
cache reclaim in contexts that it is currently disallowed
completely. I was also implying that once the infrastructure to pass
contexts is in place, I'd then make the changes to the generic
superblock shrinker code to enable finer grained reclaim and
optimise the XFS shrinkers to make use of it...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 0/2] scop GFP_NOFS api

2016-04-29 Thread Dave Chinner
On Fri, Apr 29, 2016 at 03:35:42PM +1000, NeilBrown wrote:
> On Tue, Apr 26 2016, Michal Hocko wrote:
> 
> > Hi,
> > we have discussed this topic at LSF/MM this year. There was a general
> > interest in the scope GFP_NOFS allocation context among some FS
> > developers. For those who are not aware of the discussion or the issue
> > I am trying to sort out (or at least start in that direction) please
> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> > which basically copies what we have for the scope GFP_NOIO allocation
> > context. I haven't converted any of the FS myself because that is way
> > beyond my area of expertise but I would be happy to help with further
> > changes on the MM front as well as in some more generic code paths.
> >
> > Dave had an idea on how to further improve the reclaim context to be
> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> > and FS specific cookie set in the FS allocation context and consumed
> > by the FS reclaim context to allow doing some provably save actions
> > that would be skipped due to GFP_NOFS normally.  I like this idea and
> > I believe we can go that direction regardless of the approach taken here.
> > Many filesystems simply need to cleanup their NOFS usage first before
> > diving into a more complex changes.>
> 
> This strikes me as over-engineering to work around an unnecessarily
> burdensome interface but without details it is hard to be certain.
> 
> Exactly what things happen in "FS reclaim context" which may, or may
> not, be safe depending on the specific FS allocation context?  Do they
> need to happen at all?
> 
> My research suggests that for most filesystems the only thing that
> happens in reclaim context that is at all troublesome is the final
> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
> inode to storage.  Some time ago we moved most dirty-page writeout out
> of the reclaim context and into kswapd.  I think this was an excellent
> advance in simplicity.

No, we didn't move dirty page writeout to kswapd - we moved it back
to the background writeback threads where it can be done
efficiently.  kswapd should almost never do single page writeback
because of how inefficient it is from an IO perspective, even though
it can. i.e. if we are doing any significant amount of dirty page
writeback from memory reclaim (direct, kswapd or otherwise) then
we've screwed something up.

> If we could similarly move evict() into kswapd (and I believe we can)
> then most file systems would do nothing in reclaim context that
> interferes with allocation context.

When lots of GFP_NOFS allocation is being done, this already
happens. The shrinkers that can't run due to context accumulate the
work on the shrinker structure, and when the shrinker can next run
(e.g. run from kswapd) it runs all the deferred work from GFP_NOFS
reclaim contexts.

IOWs, we already move shrinker work from direct reclaim to kswapd
when appropriate.

> The exceptions include:
>  - nfs and any filesystem using fscache can block for up to 1 second
>in ->releasepage().  They used to block waiting for some IO, but that
>caused deadlocks and wasn't really needed.  I left the timeout because
>it seemed likely that some throttling would help.  I suspect that a
>careful analysis will show that there is sufficient throttling
>elsewhere.
> 
>  - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
>for IO so it can free some quotainfo things. 

No it's not. evict() can block on IO - waiting for data or inode
writeback to complete, or even for filesystems to run transactions
on the inode. Hence the superblock shrinker can and does block in
inode cache reclaim.

Indeed, blocking the superblock shrinker in reclaim is a key part of
balancing inode cache pressure in XFS. If the shrinker starts
hitting dirty inodes, it blocks on cleaning them, thereby slowing
the rate of allocation to that which inodes can be cleaned and
reclaimed. There are also background threads that walk ahead freeing
clean inodes, but we have to throttle direct reclaim in this manner
otherwise the allocation pressure vastly outweighs the ability to
reclaim inodes. if we don't balance this, inode allocation triggers
the OOM killer because reclaim keeps reporting "no progress being
made" because dirty inodes are skipped. By blocking on such inodes,
the shrinker makes progress (slowly) and reclaim sees that memory is
being freed and so continues without invoking the OOM killer...

>If it could be changed
>to just schedule the IO without waiting for it then I think this
>would be safe to be called in any FS allocation context

Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context

2016-04-27 Thread Dave Chinner
On Wed, Apr 27, 2016 at 10:03:11AM +0200, Michal Hocko wrote:
> On Wed 27-04-16 08:58:45, Dave Chinner wrote:
> > On Tue, Apr 26, 2016 at 01:56:12PM +0200, Michal Hocko wrote:
> > > From: Michal Hocko 
> > > 
> > > THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE
> > > 
> > > It is desirable to reduce the direct GFP_NO{FS,IO} usage at minimum and
> > > prefer scope usage defined by memalloc_no{fs,io}_{save,restore} API.
> > > 
> > > Let's help this process and add a debugging tool to catch when an
> > > explicit allocation request for GFP_NO{FS,IO} is done from the scope
> > > context. The printed stacktrace should help to identify the caller
> > > and evaluate whether it can be changed to use a wider context or whether
> > > it is called from another potentially dangerous context which needs
> > > a scope protection as well.
> > 
> > You're going to get a large number of these from XFS. There are call
> > paths in XFS that get called both inside and outside transaction
> > context, and many of them are marked with GFP_NOFS to prevent issues
> > that have cropped up in the past.
> > 
> > Often these are to silence lockdep warnings (e.g. commit b17cb36
> > ("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
> > lockdep gets very unhappy about the same functions being called with
> > different reclaim contexts. e.g.  directory block mapping might
> > occur from readdir (no transaction context) or within transactions
> > (create/unlink). Hence paths like this are tagged with GFP_NOFS to
> > stop lockdep emitting false positive warnings.
> 
> I would much rather see lockdep being fixed than abusing GFP_NOFS to
> workaround its limitations. GFP_NOFS has a real consequences to the
> memory reclaim. I will go and check the commit you mentioned and try
> to understand why that is a problem. From what you described above
> I would like to get rid of exactly this kind of usage which is not
> really needed for the recursion protection.

The problem is that every time we come across this, the answer is
"use lockdep annotations". Our lockdep annotations are freakin'
complex because of this, and more often than not lockdep false
positives occur due to bugs in the annotations. e.g. see
fs/xfs/xfs_inode.h for all the inode locking annotations we have to
use and the hoops we have to jump through because we are limited to
8 subclasses and we have to be able to annotate nested inode locks
5 deep in places (RENAME_WHITEOUT, thanks).

At one point, we had to reset lockdep classes for inodes in reclaim
so that they didn't throw lockdep false positives the moment an
inode was locked in a memory reclaim context. We had to change
locking to remove that problem (commit 4f59af7 ("xfs: remove iolock
lock classes")). Then there were all the problems with reclaim
triggering lockdep warnings on directory inodes - we had to add a
separate directory inode class for them, and even then we still need
GFP_NOFS in places to minimise reclaim noise (as per the above
commit).

Put simply: we've had to resort to designing locking and allocation
strategies around the limitations of lockdep annotations, as opposed
to what is actually possible or even optimal. i.e. when the choice
is a two-minute fix to add GFP_NOFS in cases like this, versus another
week-long effort to rewrite the inode annotations (again) like this
one last year:

commit 0952c8183c1575a78dc416b5e168987ff98728bb
Author: Dave Chinner 
Date:   Wed Aug 19 10:32:49 2015 +1000

xfs: clean up inode lockdep annotations

Lockdep annotations are a maintenance nightmare. Locking has to be
modified to suit the limitations of the annotations, and we're
always having to fix the annotations because they are unable to
express the complexity of locking hierarchies correctly.
.

It's a no-brainer to see why GFP_NOFS will be added to the
allocation in question. I've been saying for years that I consider
lockdep harmful - if you want to get rid of GFP_NOFS, then you're
going to need to sort out the lockdep reclaim annotation mess at the
same time...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
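For context on the subclass limit mentioned above: lockdep distinguishes otherwise-identical locks only by a small subclass number passed at acquire time (MAX_LOCKDEP_SUBCLASSES is 8). The following is a generic pseudocode sketch of the annotation style using mutex_lock_nested(); the struct and subclass names are invented for illustration and are not XFS's actual annotations:

```c
/* Generic sketch of lockdep subclass annotation (not XFS code). */
enum { LOCK_SUBCLASS_PARENT = 0, LOCK_SUBCLASS_CHILD = 1 };

static void lock_parent_child(struct inode *parent, struct inode *child)
{
	/*
	 * Both locks belong to the same lockdep class, so without
	 * distinct subclasses lockdep would report a false AB-BA
	 * deadlock the first time two inodes are locked together.
	 */
	mutex_lock_nested(&parent->i_mutex, LOCK_SUBCLASS_PARENT);
	mutex_lock_nested(&child->i_mutex, LOCK_SUBCLASS_CHILD);
}
```

With only 8 subclasses available per class, deep nesting (e.g. the five-deep inode locking RENAME_WHITEOUT requires) quickly exhausts the annotation space, which is the complexity being complained about.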


Re: [PATCH 1/2] mm: add PF_MEMALLOC_NOFS

2016-04-26 Thread Dave Chinner
On Tue, Apr 26, 2016 at 01:56:11PM +0200, Michal Hocko wrote:
> From: Michal Hocko 
> 
> GFP_NOFS context is used for the following 4 reasons currently
>   - to prevent from deadlocks when the lock held by the allocation
> context would be needed during the memory reclaim
>   - to prevent from stack overflows during the reclaim because
> the allocation is performed from a deep context already
>   - to prevent lockups when the allocation context depends on
> other reclaimers to make a forward progress indirectly
>   - just in case because this would be safe from the fs POV

- silencing lockdep false positives

> Introduce PF_MEMALLOC_NOFS task specific flag and memalloc_nofs_{save,restore}
> API to control the scope. This is basically copying
> memalloc_noio_{save,restore} API we have for other restricted allocation
> context GFP_NOIO.
> 
> XFS already has similar functionality, PF_FSTRANS, so let's just
> give it a more generic name, make it usable for others as well, and
> move the GFP_NOFS context tracking to the page allocator. XFS has its
> own accessor functions, but let's keep them for now to keep this patch
> to a minimum.

Can you split this into two patches? The first simply does this:

#define PF_MEMALLOC_NOFS PF_FSTRANS

and changes only the XFS code to use PF_MEMALLOC_NOFS.

The second patch can then do the rest of the mm API changes that we
don't actually care about in XFS at all.  That way I can carry all
the XFS changes in the XFS tree and not have to worry about when
this stuff gets merged or conflicts with the rest of the work that
is being done to the mm/ code and whatever tree that eventually
lands in...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
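For reference, the scope API being proposed mirrors the existing memalloc_noio_{save,restore} pair. A pseudocode sketch of the intended usage pattern, with the nofs names taken from the patch description:

```c
/* Sketch of the proposed scope API (kernel context, not runnable here). */
unsigned int flags;

flags = memalloc_nofs_save();	/* enter NOFS scope */
/*
 * Every allocation inside this scope is implicitly treated as
 * GFP_NOFS by the page allocator, so callers and callees can use
 * plain GFP_KERNEL without risking filesystem reclaim recursion.
 */
ptr = kmalloc(size, GFP_KERNEL);
...
memalloc_nofs_restore(flags);	/* leave scope, restore previous state */
```

The save/restore pair nests safely: restore puts back whatever flag state was in effect before, so a scope opened inside another scope does not clear the outer one.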


Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context

2016-04-26 Thread Dave Chinner
On Tue, Apr 26, 2016 at 01:56:12PM +0200, Michal Hocko wrote:
> From: Michal Hocko 
> 
> THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE
> 
> It is desirable to reduce the direct GFP_NO{FS,IO} usage at minimum and
> prefer scope usage defined by memalloc_no{fs,io}_{save,restore} API.
> 
> Let's help this process and add a debugging tool to catch when an
> explicit allocation request for GFP_NO{FS,IO} is done from the scope
> context. The printed stacktrace should help to identify the caller
> and evaluate whether it can be changed to use a wider context or whether
> it is called from another potentially dangerous context which needs
> a scope protection as well.

You're going to get a large number of these from XFS. There are call
paths in XFS that get called both inside and outside transaction
context, and many of them are marked with GFP_NOFS to prevent issues
that have cropped up in the past.

Often these are to silence lockdep warnings (e.g. commit b17cb36
("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
lockdep gets very unhappy about the same functions being called with
different reclaim contexts. e.g.  directory block mapping might
occur from readdir (no transaction context) or within transactions
(create/unlink). Hence paths like this are tagged with GFP_NOFS to
stop lockdep emitting false positive warnings.

Removing the GFP_NOFS flags in situations like this is simply going
to restart the flood of false positive lockdep warnings we've
silenced over the years, so perhaps lockdep needs to be made smarter
as well...

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com
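The debug tool described in the patch amounts to a check in the allocator path along the following lines. This is a pseudocode sketch of the idea, not the patch's exact code; PF_MEMALLOC_NOFS is the flag proposed in the companion patch, while PF_MEMALLOC_NOIO already exists:

```c
/*
 * Sketch: warn when an explicit GFP_NO{FS,IO} request is made inside
 * a memalloc_no{fs,io} scope, where the restriction is already implied
 * and the explicit flag is therefore redundant (or hiding a problem).
 */
static inline void warn_explicit_gfp_in_scope(gfp_t gfp_mask)
{
	if ((current->flags & PF_MEMALLOC_NOFS) && !(gfp_mask & __GFP_FS))
		WARN_ONCE(1, "explicit GFP_NOFS inside NOFS scope\n");
	if ((current->flags & PF_MEMALLOC_NOIO) && !(gfp_mask & __GFP_IO))
		WARN_ONCE(1, "explicit GFP_NOIO inside NOIO scope\n");
}
```

The WARN_ONCE stack trace is what identifies the caller, so each offending call site can be audited and either converted to the scope API or documented as genuinely needing the explicit flag. As Dave notes below, XFS would trip this check frequently.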


BTRFS: assertion failed: num_extents, file: fs/btrfs/extent-tree.c, line: 5584

2016-04-20 Thread Dave Jones
Don't think I've reported this one before. It's on the same box I've been
seeing the btrfs_destroy_inode WARN_ON's on though.

Dave

BTRFS: assertion failed: num_extents, file: fs/btrfs/extent-tree.c, line: 5584
[ cut here ]
kernel BUG at fs/btrfs/ctree.h:4320!
invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN
CPU: 1 PID: 9320 Comm: trinity-c21 Not tainted 4.6.0-rc4-think+ #5
task: 880453bdb7c0 ti: 88045893 task.ti: 88045893
RIP: 0010:[]  [] 
assfail.constprop.88+0x1c/0x1e [btrfs]
RSP: :880458937860  EFLAGS: 00010282
RAX: 004e RBX:  RCX: 
RDX:  RSI: 0001 RDI: ed008b126f02
RBP: 880458937860 R08: 0001 R09: 0001
R10: 0003 R11: 0001 R12: 88045ef14548
R13: 88045066e048 R14: 88045066dc58 R15: 008f
FS:  7f6465c7b700() GS:88046880() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f6465c81000 CR3: 000454e05000 CR4: 001406e0
Stack:
 8804589378a0 c02407f3 008f 
 88045ef14548 88045066e048 88045066dc58 008f
 8804589378f0 c0253a3c 880453bdc048 880453bdb7c0
Call Trace:
 [] drop_outstanding_extent+0x153/0x1a0 [btrfs]
 [] btrfs_delalloc_release_metadata+0x9c/0x5d0 [btrfs]
 [] btrfs_delalloc_release_space+0x1f/0x50 [btrfs]
 [] __btrfs_buffered_write+0xb05/0xea0 [btrfs]
 [] ? btrfs_dirty_pages+0x2d0/0x2d0 [btrfs]
 [] ? local_clock+0x1c/0x20
 [] ? local_clock+0x1c/0x20
 [] ? debug_lockdep_rcu_enabled+0x77/0x90
 [] ? btrfs_file_write_iter+0xa07/0x1570 [btrfs]
 [] btrfs_file_write_iter+0x631/0x1570 [btrfs]
 [] ? __might_fault+0x166/0x1b0
 [] ? __might_fault+0xcb/0x1b0
 [] do_iter_readv_writev+0x134/0x230
 [] ? vfs_iter_read+0x260/0x260
 [] ? rcu_sync_lockdep_assert+0x78/0xb0
 [] ? percpu_down_read+0x5c/0xa0
 [] ? __sb_start_write+0xb4/0xf0
 [] do_readv_writev+0x39c/0x6a0
 [] ? btrfs_sync_file+0xd00/0xd00 [btrfs]
 [] ? vfs_write+0x4a0/0x4a0
 [] ? local_clock+0x1c/0x20
 [] ? __context_tracking_exit.part.6+0x52/0x220
 [] ? enter_from_user_mode+0x50/0x50
 [] vfs_writev+0x75/0xb0
 [] do_pwritev+0x12a/0x170
 [] ? SyS_pwritev+0x20/0x20
 [] SyS_pwritev2+0x17/0x30
 [] do_syscall_64+0x19b/0x4a0
 [] ? trace_hardirqs_on_thunk+0x1b/0x1d
 [] entry_SYSCALL64_slow_path+0x25/0x25
Code: d3 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 e0 c0 43 
c0 48 89 fe 48 89 e5 48 c7 c7 20 c4 43 c0 e8 92 06 01 d3 <0f> 0b 55 48 89 e5 41 
57 41 56 41 55 41 54 53 48 83 ec 18 48 89 



Re: btrfs_destroy_inode WARN_ON.

2016-04-01 Thread Dave Jones
On Fri, Apr 01, 2016 at 02:12:27PM -0400, Dave Jones wrote:
 > BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 30s!
 > Showing busy workqueues and worker pools:
 > workqueue events: flags=0x0
 >   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
 > pending: vmstat_shepherd
 >   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
 > pending: check_corruption
 >   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=3/256
 > pending: usb_serial_port_work, lru_add_drain_per_cpu BAR(17230), 
 > e1000_watchdog_task
 > workqueue events_power_efficient: flags=0x82
 >   pwq 8: cpus=0-3 flags=0x4 nice=0 active=3/256
 > pending: fb_flashcursor, neigh_periodic_work, neigh_periodic_work
 > workqueue events_freezable_power_: flags=0x86
 >   pwq 8: cpus=0-3 flags=0x4 nice=0 active=1/256
 > pending: disk_events_workfn
 > workqueue netns: flags=0x6000a
 >   pwq 8: cpus=0-3 flags=0x4 nice=0 active=1/1
 > in-flight: 10038:cleanup_net
 > workqueue writeback: flags=0x4e
 >   pwq 8: cpus=0-3 flags=0x4 nice=0 active=2/256
 > pending: wb_workfn, wb_workfn
 > workqueue kblockd: flags=0x18
 >   pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=2/256
 > pending: blk_mq_timeout_work, blk_mq_timeout_work
 > workqueue vmstat: flags=0xc
 >   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
 > pending: vmstat_update
 >   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
 > pending: vmstat_update
 >   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
 > pending: vmstat_update
 > pool 8: cpus=0-3 flags=0x4 nice=0 hung=0s workers=11 idle: 11638 10276 609 
 > 17937 606 9237 605 891 15998 14100
 > note: trinity-c13[18815] exited with preempt_count 1

This has wedged userspace too:

23082 pts/2SN+0:00  |   \_ /bin/bash scripts/test-multi.sh
14140 pts/2SNL+   0:15  |   \_ ../trinity -q -l off -N 100 -a64 -x 
fsync -x fdatasync
16900 ?DNs0:04  |   \_ ../trinity -q -l off -N 100 -a64 
-x fsync -x fdata
18894 ?DNs0:02  |   \_ ../trinity -q -l off -N 100 -a64 
-x fsync -x fdata

(14:16:02:davej@think:trinity[master])$ stack 16900
[] wait_on_page_bit_killable+0x156/0x1b0
[] __lock_page_or_retry+0x112/0x1b0
[] filemap_fault+0x367/0xb30
[] __do_fault+0x167/0x3d0
[] handle_mm_fault+0x1837/0x2520
[] __do_page_fault+0x248/0x770
[] do_page_fault+0x39/0xa0
[] page_fault+0x1f/0x30
[] mm_release+0x1ec/0x230
[] do_exit+0x5d0/0x18c0
[] do_group_exit+0xac/0x190
[] get_signal+0x48f/0xeb0
[] do_signal+0xa0/0xb50
[] exit_to_usermode_loop+0xd9/0x100
[] do_syscall_64+0x238/0x2b0
[] return_from_SYSCALL_64+0x0/0x7a
[] 0x

(14:16:09:davej@think:trinity[master])$ stack 18894
[] btrfs_file_write_iter+0xe8/0x9a0 [btrfs]
[] __vfs_write+0x279/0x2e0
[] vfs_write+0x11e/0x2b0
[] SyS_write+0xd2/0x1a0
[] do_syscall_64+0x103/0x2b0
[] return_from_SYSCALL_64+0x0/0x7a
[] 0x

I tried to ftrace the latter process, and the box completely hung.

Dave


Re: btrfs_destroy_inode WARN_ON.

2016-04-01 Thread Dave Jones
On Sun, Mar 27, 2016 at 09:14:00PM -0400, Dave Jones wrote:
 
 >  > WARNING: CPU: 2 PID: 32570 at fs/btrfs/inode.c:9261 
 > btrfs_destroy_inode+0x389/0x3f0 [btrfs]
 >  > CPU: 2 PID: 32570 Comm: rm Not tainted 4.5.0-think+ #14
 >  >  c039baf9 ef721ef0 88025966fc08 8957bcdb
 >  >    88025966fc50 890b41f1
 >  >  88045d918040 242d4eed6048 88024eed6048 88024eed6048
 >  > Call Trace:
 >  >  [] ? btrfs_destroy_inode+0x389/0x3f0 [btrfs]
 >  >  [] dump_stack+0x68/0x9d
 >  >  [] __warn+0x111/0x130
 >  >  [] warn_slowpath_null+0x1d/0x20
 >  >  [] btrfs_destroy_inode+0x389/0x3f0 [btrfs]
 >  >  [] destroy_inode+0x67/0x90
 >  >  [] evict+0x1b7/0x240
 >  >  [] iput+0x3ae/0x4e0
 >  >  [] ? dput+0x20e/0x460
 >  >  [] do_unlinkat+0x256/0x440
 >  >  [] ? do_rmdir+0x350/0x350
 >  >  [] ? syscall_trace_enter_phase1+0x87/0x260
 >  >  [] ? enter_from_user_mode+0x50/0x50
 >  >  [] ? __lock_is_held+0x25/0xd0
 >  >  [] ? mark_held_locks+0x22/0xc0
 >  >  [] ? syscall_trace_enter_phase2+0x12d/0x3d0
 >  >  [] ? SyS_rmdir+0x20/0x20
 >  >  [] SyS_unlinkat+0x1b/0x30
 >  >  [] do_syscall_64+0xf4/0x240
 >  >  [] entry_SYSCALL64_slow_path+0x25/0x25
 >  > ---[ end trace a48ce4e6a1b5e409 ]---
 >  > 
 >  > That's WARN_ON(BTRFS_I(inode)->csum_bytes);
 >  > 
 >  > *maybe* it's a bad disk, but there's no indication in dmesg of anything 
 > awry.
 >  > Spinning rust on SATA, nothing special.
 > 
 > Same WARN_ON is reachable from umount too..
 > 
 > WARNING: CPU: 2 PID: 20092 at fs/btrfs/inode.c:9261 
 > btrfs_destroy_inode+0x40c/0x480 [btrfs]
 > CPU: 2 PID: 20092 Comm: umount Tainted: GW   4.5.0-think+ #1
 >   a32c482b 8803cd187b60 9d63af84
 >    c05c5e40 c04d316c
 >  8803cd187ba8 9d0c4c27 880460d80040 242dcd187bb0
 > Call Trace:
 >  [] dump_stack+0x95/0xe1
 >  [] ? btrfs_destroy_inode+0x40c/0x480 [btrfs]
 >  [] __warn+0x147/0x170
 >  [] warn_slowpath_null+0x31/0x40
 >  [] btrfs_destroy_inode+0x40c/0x480 [btrfs]
 >  [] ? btrfs_test_destroy_inode+0x40/0x40 [btrfs]
 >  [] destroy_inode+0x77/0xb0
 >  [] evict+0x20e/0x2c0
 >  [] dispose_list+0x70/0xb0
 >  [] evict_inodes+0x26f/0x2c0
 >  [] ? inode_add_lru+0x60/0x60
 >  [] ? fsnotify_unmount_inodes+0x215/0x2c0
 >  [] generic_shutdown_super+0x76/0x1c0
 >  [] kill_anon_super+0x29/0x40
 >  [] btrfs_kill_super+0x31/0x130 [btrfs]
 >  [] deactivate_locked_super+0x6f/0xb0
 >  [] deactivate_super+0x99/0xb0
 >  [] cleanup_mnt+0x70/0xd0
 >  [] __cleanup_mnt+0x1b/0x20
 >  [] task_work_run+0xef/0x130
 >  [] exit_to_usermode_loop+0xf9/0x100
 >  [] do_syscall_64+0x238/0x2b0
 >  [] entry_SYSCALL64_slow_path+0x25/0x25

Additional fallout:

BTRFS: assertion failed: num_extents, file: fs/btrfs/extent-tree.c, line: 5584
[ cut here ]
kernel BUG at fs/btrfs/ctree.h:4320!
invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
CPU: 1 PID: 18815 Comm: trinity-c13 Tainted: GW   4.6.0-rc1-think+ 
#1
task: 88045de10040 ti: 8803afa38000 task.ti: 8803afa38000
RIP: 0010:[]  [] 
assfail.constprop.88+0x2b/0x2d [btrfs]
RSP: 0018:8803afa3f838  EFLAGS: 00010282
RAX: 004e RBX: c046e200 RCX: 
RDX:  RSI: 0003 RDI: ed0075f47efb
RBP: 8803afa3f848 R08: 0001 R09: 0001
R10:  R11: 0001 R12: 15d0
R13: 8803fda0e048 R14: 8803fda0dc38 R15: 8803fda0dc58
FS:  7fa0566d6700() GS:880468a0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fa0566d9000 CR3: 000333bc4000 CR4: 001406e0
DR0: 7fa0554fb000 DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0600
Stack:
  8803fda0e048 8803afa3f880 c032288b
  880460bb33f8 8803fda0e048 8803fda0dc38
 8803fda0dc58 8803afa3f8c8 c032f851 0001
Call Trace:
 [] drop_outstanding_extent+0x10b/0x130 [btrfs]
 [] btrfs_delalloc_release_metadata+0x71/0x480 [btrfs]
 [] ? __btrfs_buffered_write+0xa6f/0xb50 [btrfs]
 [] btrfs_delalloc_release_space+0x27/0x50 [btrfs]
 [] __btrfs_buffered_write+0xa28/0xb50 [btrfs]
 [] ? btrfs_dirty_pages+0x1c0/0x1c0 [btrfs]
 [] ? filemap_fdatawait_range+0x3e/0x50
 [] ? generic_file_direct_write+0x237/0x2f0
 [] ? filemap_write_and_wait_range+0xa0/0xa0
 [] ? btrfs_file_write_iter+0x670/0x9a0 [btrfs]
 [] btrfs_file_write_iter+0x74d/0x9a0 [btrfs]
 [] do_iter_readv_writev+0x153/0x1f0
 

Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Dave Chinner
On Thu, Mar 31, 2016 at 06:34:17PM -0400, J. Bruce Fields wrote:
> On Fri, Apr 01, 2016 at 09:20:23AM +1100, Dave Chinner wrote:
> > On Thu, Mar 31, 2016 at 01:47:50PM -0600, Andreas Dilger wrote:
> > > On Mar 31, 2016, at 12:08 PM, J. Bruce Fields  
> > > wrote:
> > > > 
> > > > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
> > > >> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> > > >>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> > > >>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > > >>>>> Or is it ok that fallocate could block, potentially for a long time 
> > > >>>>> as
> > > >>>>> we stream cows through the page cache (or however unshare works
> > > >>>>> internally)?  Those same programs might not be expecting fallocate 
> > > >>>>> to
> > > >>>>> take a long time.
> > > >>>> 
> > > >>>> Yes, it's perfectly fine for fallocate to block for long periods of
> > > >>>> time. See what gfs2 does during preallocation of blocks - it ends up
> > > >>>> calling sb_issue_zerout() because it doesn't have unwritten
> > > >>>> extents, and hence can block for long periods of time
> > > >>> 
> > > >>> gfs2 fallocate is an implementation that will cause all but the most
> > > >>> trivial users real pain.  Even the initial XFS implementation just
> > > >>> marking the transactions synchronous made it unusable for all kinds
> > > >>> of applications, and this is much worse.  E.g. a NFS ALLOCATE 
> > > >>> operation
> > > >>> to gfs2 will probably hang your connection for extended periods of
> > > >>> time.
> > > >>> 
> > > >>> If we need to support something like what gfs2 does we should have a
> > > >>> separate flag for it.
> > > >> 
> > > >> Using fallocate() for preallocation was always intended to
> > > >> be a faster, more efficient method of allocating zeroed space
> > > >> than having userspace write blocks of data. Faster, more efficient
> > > >> does not mean instantaneous, and gfs2 using sb_issue_zerout() means
> > > >> that if the hardware has zeroing offloads (deterministic trim, write
> > > >> same, etc) it will use them, and that will be much faster than
> > > >> writing zeros from userspace.
> > > >> 
> > > >> IMO, what gfs2 does is definitely within the intended usage of
> > > >> fallocate() for accelerating the preallocation of blocks.
> > > >> 
> > > >> Yes, it may not be optimal for things like NFS servers which haven't
> > > >> considered that a fallocate based offload operation might take some
> > > >> time to execute, but that's not a problem with fallocate. i.e.
> > > >> that's a problem with the nfs server ALLOCATE implementation not
> > > >> being prepared to return NFSERR_JUKEBOX to prevent client side hangs
> > > >> and timeouts while the operation is run
> > > > 
> > > > That's an interesting idea, but I don't think it's really legal.  I take
> > > > JUKEBOX to mean "sorry, I'm failing this operation for now, try again
> > > > later and it might succeed", not "OK, I'm working on it, try again and
> > > > you may find out I've done it".
> > > > 
> > > > So if the client gets a JUKEBOX error but the server goes ahead and does
> > > > the operation anyway, that'd be unexpected.
> > > 
> > > Well, the tape continued to be mounted in the background and/or the file
> > > restored from the tape into the filesystem...
> > 
> > Right, and SGI have been shipping a DMAPI-aware Linux NFS server for
> > many years, using the above NFSERR_JUKEBOX behaviour for operations
> > that may block for a long time due to the need to pull stuff into
> > the filesystem from the slow backing store. Best explanation is in
> > the relevant commit in the last published XFS+DMAPI branch from SGI,
> > for example:
> > 
> > http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=28b171cf2b64167826474efbb82ad9d471a05f75
> 
> I haven't looked at the code, but I ass

Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Dave Chinner
On Thu, Mar 31, 2016 at 01:47:50PM -0600, Andreas Dilger wrote:
> On Mar 31, 2016, at 12:08 PM, J. Bruce Fields  wrote:
> > 
> > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
> >> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> >>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> >>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> >>>>> Or is it ok that fallocate could block, potentially for a long time as
> >>>>> we stream cows through the page cache (or however unshare works
> >>>>> internally)?  Those same programs might not be expecting fallocate to
> >>>>> take a long time.
> >>>> 
> >>>> Yes, it's perfectly fine for fallocate to block for long periods of
> >>>> time. See what gfs2 does during preallocation of blocks - it ends up
> >>>> calling sb_issue_zerout() because it doesn't have unwritten
> >>>> extents, and hence can block for long periods of time
> >>> 
> >>> gfs2 fallocate is an implementation that will cause all but the most
> >>> trivial users real pain.  Even the initial XFS implementation just
> >>> marking the transactions synchronous made it unusable for all kinds
> >>> of applications, and this is much worse.  E.g. a NFS ALLOCATE operation
> >>> to gfs2 will probably hang your connection for extended periods of
> >>> time.
> >>> 
> >>> If we need to support something like what gfs2 does we should have a
> >>> separate flag for it.
> >> 
> >> Using fallocate() for preallocation was always intended to
> >> be a faster, more efficient method of allocating zeroed space
> >> than having userspace write blocks of data. Faster, more efficient
> >> does not mean instantaneous, and gfs2 using sb_issue_zerout() means
> >> that if the hardware has zeroing offloads (deterministic trim, write
> >> same, etc) it will use them, and that will be much faster than
> >> writing zeros from userspace.
> >> 
> >> IMO, what gfs2 does is definitely within the intended usage of
> >> fallocate() for accelerating the preallocation of blocks.
> >> 
> >> Yes, it may not be optimal for things like NFS servers which haven't
> >> considered that a fallocate based offload operation might take some
> >> time to execute, but that's not a problem with fallocate. i.e.
> >> that's a problem with the nfs server ALLOCATE implementation not
> >> being prepared to return NFSERR_JUKEBOX to prevent client side hangs
> >> and timeouts while the operation is run
> > 
> > That's an interesting idea, but I don't think it's really legal.  I take
> > JUKEBOX to mean "sorry, I'm failing this operation for now, try again
> > later and it might succeed", not "OK, I'm working on it, try again and
> > you may find out I've done it".
> > 
> > So if the client gets a JUKEBOX error but the server goes ahead and does
> > the operation anyway, that'd be unexpected.
> 
> Well, the tape continued to be mounted in the background and/or the file
> restored from the tape into the filesystem...

Right, and SGI have been shipping a DMAPI-aware Linux NFS server for
many years, using the above NFSERR_JUKEBOX behaviour for operations
that may block for a long time due to the need to pull stuff into
the filesystem from the slow backing store. Best explanation is in
the relevant commit in the last published XFS+DMAPI branch from SGI,
for example:

http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=28b171cf2b64167826474efbb82ad9d471a05f75

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Dave Chinner
On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> > On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > > Or is it ok that fallocate could block, potentially for a long time as
> > > we stream cows through the page cache (or however unshare works
> > > internally)?  Those same programs might not be expecting fallocate to
> > > take a long time.
> > 
> > Yes, it's perfectly fine for fallocate to block for long periods of
> > time. See what gfs2 does during preallocation of blocks - it ends up
> > calling sb_issue_zerout() because it doesn't have unwritten
> > extents, and hence can block for long periods of time
> 
> gfs2 fallocate is an implementation that will cause all but the most
> trivial users real pain.  Even the initial XFS implementation just
> marking the transactions synchronous made it unusable for all kinds
> of applications, and this is much worse.  E.g. a NFS ALLOCATE operation
> to gfs2 will probably hang your connection for extended periods of
> time.
> 
> If we need to support something like what gfs2 does we should have a
> separate flag for it.

Using fallocate() for preallocation was always intended to
be a faster, more efficient method of allocating zeroed space
than having userspace write blocks of data. Faster, more efficient
does not mean instantaneous, and gfs2 using sb_issue_zerout() means
that if the hardware has zeroing offloads (deterministic trim, write
same, etc) it will use them, and that will be much faster than
writing zeros from userspace.

IMO, what gfs2 does is definitely within the intended usage of
fallocate() for accelerating the preallocation of blocks.

Yes, it may not be optimal for things like NFS servers which haven't
considered that a fallocate based offload operation might take some
time to execute, but that's not a problem with fallocate. i.e.
that's a problem with the nfs server ALLOCATE implementation not
being prepared to return NFSERR_JUKEBOX to prevent client side hangs
and timeouts while the operation is run

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: fallocate mode flag for "unshare blocks"?

2016-03-30 Thread Dave Chinner
On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> Or is it ok that fallocate could block, potentially for a long time as
> we stream cows through the page cache (or however unshare works
> internally)?  Those same programs might not be expecting fallocate to
> take a long time.

Yes, it's perfectly fine for fallocate to block for long periods of
time. See what gfs2 does during preallocation of blocks - it ends up
calling sb_issue_zerout() because it doesn't have unwritten
extents, and hence can block for long periods of time....

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
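To make the point above concrete from userspace: a plain fallocate(2) call simply waits for however long the filesystem's preallocation takes (journal transactions, zeroing offloads, etc.). A minimal sketch; the EOPNOTSUPP fallback to posix_fallocate() is an assumption about how a portable caller might handle filesystems without native fallocate support, not anything mandated by the discussion:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Preallocate `len` bytes at the start of `path`, creating it if needed.
 * fallocate() may block for a long time depending on the filesystem --
 * e.g. gfs2 zeroes the whole range via sb_issue_zerout() in the kernel.
 * Returns 0 on success, -1 on error.
 */
int preallocate(const char *path, off_t len)
{
	int fd = open(path, O_CREAT | O_RDWR, 0600);
	int ret;

	if (fd < 0)
		return -1;

	ret = fallocate(fd, 0, 0, len);	/* mode 0: extend file size too */
	if (ret < 0 && (errno == EOPNOTSUPP || errno == ENOSYS))
		/* posix_fallocate() returns an errno value, not -1 */
		ret = posix_fallocate(fd, 0, len) == 0 ? 0 : -1;

	close(fd);
	return ret;
}
```

A caller that cannot tolerate an unbounded wait (an NFS server thread, for instance) has no way to bound this call, which is exactly the tension in the thread.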


Re: good documentation on btrfs internals and on disk layout

2016-03-30 Thread Dave Stevens

Quoting Liu Bo :


On Wed, Mar 30, 2016 at 01:58:03PM +, sri wrote:

Hi,

I could find very limited documentation related to on disk layout of btrfs
and how all trees are related to each other. Apart from the wiki, which has
only top-level details, I wasn't able to find more details on btrfs.

For filesystems such as ZFS, ext3/4 and XFS there are documents which
explain the on-disk layout of the file system.

Could anybody please provide pointers for the same for better
understanding of btrfs on disk layout and how each tree interacts provided
multiple disks are configured for btrfs.


There is a paper[1] about btrfs filesystem which covers all the details.

[1]: BTRFS: The Linux B-Tree Filesystem


and this is where it is:

http://domino.watson.ibm.com/library/CyberDig.nsf/papers/6E1C5B6A1B6EDD9885257A38006B6130/$File/rj10501.pdf

D



Thanks,

-liubo





--
"As long as politics is the shadow cast on society by big business,
the attenuation of the shadow will not change the substance."

-- John Dewey







Re: btrfs_destroy_inode WARN_ON.

2016-03-27 Thread Dave Jones
On Thu, Mar 24, 2016 at 06:54:11PM -0400, Dave Jones wrote:
 > Just hit this on a tree from earlier this morning, v4.5-11140 or so.
 > 
 > WARNING: CPU: 2 PID: 32570 at fs/btrfs/inode.c:9261 
 > btrfs_destroy_inode+0x389/0x3f0 [btrfs]
 > CPU: 2 PID: 32570 Comm: rm Not tainted 4.5.0-think+ #14
 >  c039baf9 ef721ef0 88025966fc08 8957bcdb
 >    88025966fc50 890b41f1
 >  88045d918040 242d4eed6048 88024eed6048 88024eed6048
 > Call Trace:
 >  [] ? btrfs_destroy_inode+0x389/0x3f0 [btrfs]
 >  [] dump_stack+0x68/0x9d
 >  [] __warn+0x111/0x130
 >  [] warn_slowpath_null+0x1d/0x20
 >  [] btrfs_destroy_inode+0x389/0x3f0 [btrfs]
 >  [] destroy_inode+0x67/0x90
 >  [] evict+0x1b7/0x240
 >  [] iput+0x3ae/0x4e0
 >  [] ? dput+0x20e/0x460
 >  [] do_unlinkat+0x256/0x440
 >  [] ? do_rmdir+0x350/0x350
 >  [] ? syscall_trace_enter_phase1+0x87/0x260
 >  [] ? enter_from_user_mode+0x50/0x50
 >  [] ? __lock_is_held+0x25/0xd0
 >  [] ? mark_held_locks+0x22/0xc0
 >  [] ? syscall_trace_enter_phase2+0x12d/0x3d0
 >  [] ? SyS_rmdir+0x20/0x20
 >  [] SyS_unlinkat+0x1b/0x30
 >  [] do_syscall_64+0xf4/0x240
 >  [] entry_SYSCALL64_slow_path+0x25/0x25
 > ---[ end trace a48ce4e6a1b5e409 ]---
 > 
 > 
 > That's WARN_ON(BTRFS_I(inode)->csum_bytes);
 > 
 > *maybe* it's a bad disk, but there's no indication in dmesg of anything awry.
 > Spinning rust on SATA, nothing special.

Same WARN_ON is reachable from umount too..

WARNING: CPU: 2 PID: 20092 at fs/btrfs/inode.c:9261 
btrfs_destroy_inode+0x40c/0x480 [btrfs]
CPU: 2 PID: 20092 Comm: umount Tainted: GW   4.5.0-think+ #1
  a32c482b 8803cd187b60 9d63af84
   c05c5e40 c04d316c
 8803cd187ba8 9d0c4c27 880460d80040 242dcd187bb0
Call Trace:
 [] dump_stack+0x95/0xe1
 [] ? btrfs_destroy_inode+0x40c/0x480 [btrfs]
 [] __warn+0x147/0x170
 [] warn_slowpath_null+0x31/0x40
 [] btrfs_destroy_inode+0x40c/0x480 [btrfs]
 [] ? btrfs_test_destroy_inode+0x40/0x40 [btrfs]
 [] destroy_inode+0x77/0xb0
 [] evict+0x20e/0x2c0
 [] dispose_list+0x70/0xb0
 [] evict_inodes+0x26f/0x2c0
 [] ? inode_add_lru+0x60/0x60
 [] ? fsnotify_unmount_inodes+0x215/0x2c0
 [] generic_shutdown_super+0x76/0x1c0
 [] kill_anon_super+0x29/0x40
 [] btrfs_kill_super+0x31/0x130 [btrfs]
 [] deactivate_locked_super+0x6f/0xb0
 [] deactivate_super+0x99/0xb0
 [] cleanup_mnt+0x70/0xd0
 [] __cleanup_mnt+0x1b/0x20
 [] task_work_run+0xef/0x130
 [] exit_to_usermode_loop+0xf9/0x100
 [] do_syscall_64+0x238/0x2b0
 [] entry_SYSCALL64_slow_path+0x25/0x25

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs_destroy_inode WARN_ON.

2016-03-24 Thread Dave Jones
Just hit this on a tree from earlier this morning, v4.5-11140 or so.

WARNING: CPU: 2 PID: 32570 at fs/btrfs/inode.c:9261 btrfs_destroy_inode+0x389/0x3f0 [btrfs]
CPU: 2 PID: 32570 Comm: rm Not tainted 4.5.0-think+ #14
 c039baf9 ef721ef0 88025966fc08 8957bcdb
   88025966fc50 890b41f1
 88045d918040 242d4eed6048 88024eed6048 88024eed6048
Call Trace:
 [] ? btrfs_destroy_inode+0x389/0x3f0 [btrfs]
 [] dump_stack+0x68/0x9d
 [] __warn+0x111/0x130
 [] warn_slowpath_null+0x1d/0x20
 [] btrfs_destroy_inode+0x389/0x3f0 [btrfs]
 [] destroy_inode+0x67/0x90
 [] evict+0x1b7/0x240
 [] iput+0x3ae/0x4e0
 [] ? dput+0x20e/0x460
 [] do_unlinkat+0x256/0x440
 [] ? do_rmdir+0x350/0x350
 [] ? syscall_trace_enter_phase1+0x87/0x260
 [] ? enter_from_user_mode+0x50/0x50
 [] ? __lock_is_held+0x25/0xd0
 [] ? mark_held_locks+0x22/0xc0
 [] ? syscall_trace_enter_phase2+0x12d/0x3d0
 [] ? SyS_rmdir+0x20/0x20
 [] SyS_unlinkat+0x1b/0x30
 [] do_syscall_64+0xf4/0x240
 [] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace a48ce4e6a1b5e409 ]---


That's WARN_ON(BTRFS_I(inode)->csum_bytes);

*maybe* it's a bad disk, but there's no indication in dmesg of anything awry.
Spinning rust on SATA, nothing special.

Dave



Re: [PATCH 06/12] xfs/030: fix output on newer filesystems

2016-03-22 Thread Dave Chinner
On Sat, Mar 05, 2016 at 12:20:50PM -0800, Christoph Hellwig wrote:
> Still fails for me:
> 
> --- tests/xfs/030.out 2016-03-03 07:55:58.556427678 +
> +++ /root/xfstests/results//xfs/030.out.bad   2016-03-05 20:20:17.561433837 
> +
> @@ -231,8 +231,6 @@
>  bad agbno AGBNO in agfl, agno 0
>  bad agbno AGBNO in agfl, agno 0
>  bad agbno AGBNO in agfl, agno 0
> -bad agbno AGBNO in agfl, agno 0
> -bad agbno AGBNO in agfl, agno 0

That's because the free lists are of different lengths on the
different fs configs. Not sure how best to handle it - maybe just
filter the entire "bad agbno in agfl" line out?
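
One way to do that filtering, sketched as a hypothetical helper (not an existing xfstests function):

```shell
# Hypothetical filter: drop every "bad agbno ... in agfl" line so the
# golden output no longer depends on the free-list length of the fs config.
_filter_agfl_agbno()
{
	sed -e '/bad agbno .* in agfl/d'
}

# Demonstrate on faked xfs_repair output.
out=$(printf 'kept line\nbad agbno AGBNO in agfl, agno 0\nanother kept line\n' \
	| _filter_agfl_agbno)
echo "$out"
```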

I'm going to commit the change anyway, as this is a separate issue
that needs to be solved.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: _fail the tests if _scratch_mount failed to avoid fully filling root fs

2016-03-22 Thread Dave Chinner
On Mon, Mar 21, 2016 at 03:23:41PM +0800, Eryu Guan wrote:
> btrfs failed to mount small fs on ppc64 host with error ENOSPC, even
> creating such small fs succeeded, then generic/027 consumed all free
> space on root fs not on SCRATCH_DEV and test harness cannot create tmp
> files and continue other tests.
> 
> Though I think it's a btrfs bug, it's still worth preventing this
> situation from happening in the harness, as such tests usually aim to
> exercise fs on ENOSPC conditions, there's no point to continue if the
> small fs is not mounted.

I think the btrfs bug should be fixed. At minimum, the workaround to
see if the filesystem can be mounted should be in btrfs's
implementation of scratch_mkfs_sized.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


qgroup code slowing down rebalance

2016-03-19 Thread Dave Hansen
I have a medium-sized multi-device btrfs filesystem (4 disks, 16TB
total) running under 4.5.0-rc5.  I recently added a disk and needed to
rebalance.  I started a rebalance operation three days ago.  It was on
the order of 20% done after those three days. :)

During this rebalance, the disks were pretty lightly used.  I would see
a small burst of tens of MB/s, then it would go back to no activity for
a few minutes, small burst, no activity, etc...  During the quiet times
(for the disk) one processor would be pegged inside the kernel and would
have virtually no I/O wait time.  Also during this time, the filesystem
was pretty unbearably slow.  An ls of a small directory would hang for
minutes.

A perf profile shows 92% of the cpu time is being spend in
btrfs_find_all_roots(), called under this call path:

btrfs_commit_transaction
 -> btrfs_qgroup_prepare_account_extents
   -> btrfs_find_all_roots

So I tried disabling quotas by doing:

btrfs quota disable /mnt/foo

which took a few minutes to complete, but once it did, the disks went
back up to doing ~200MB/s, the kernel time went down to ~20%, and the
system now has lots of I/O wait time.  It looks to be behaving nicely.

Is this expected?  From my perspective, it makes quotas pretty much
unusable at least during a rebalance.  I have a full 'perf record'
profile with call graphs if it would be helpful.


Re: qgroup code slowing down rebalance

2016-03-19 Thread Dave Hansen
On 03/17/2016 06:02 PM, Qu Wenruo wrote:
> Dave Hansen wrote on 2016/03/17 09:36 -0700:
>> On 03/16/2016 06:36 PM, Qu Wenruo wrote:
>>> Dave Hansen wrote on 2016/03/16 13:53 -0700:
>>>> I have a medium-sized multi-device btrfs filesystem (4 disks, 16TB
>>>> total) running under 4.5.0-rc5.  I recently added a disk and needed to
>>>> rebalance.  I started a rebalance operation three days ago.  It was on
>>>> the order of 20% done after those three days. :)
>> ...
>> Data, RAID1: total=4.53TiB, used=4.53TiB
>> System, RAID1: total=32.00MiB, used=720.00KiB
>> Metadata, RAID1: total=17.00GiB, used=15.77GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Considering the size and the amount of metadata, even doing a quota
> rescan will be quite slow.
> 
> Would you please try to do a quota rescan and see the CPU/IO usage?

I did a quota rescan.  It uses about 80% of one CPU core, but also has
some I/O wait time and pulls 1-20MB/s of data off the disk (the balance
with quotas on was completely CPU-bound, and had very low I/O rates).

It would seem that the "quota rescan" *does* have the same issue as the
balance with quotas on, but to a much smaller extent than what I saw
with the "balance" operation.

I have a full profile recorded from the "quota rescan", but the most
relevant parts are pasted below.  Basically btrfs_search_slot() and
radix tree lookups are eating all the CPU time, but they're still doing
enough I/O to see _some_ idle time on the processor.

> 74.55% 3.10%  kworker/u8:0 [btrfs]  [k] find_parent_nodes
>|
>---find_parent_nodes
>   |  
>   |--99.95%-- __btrfs_find_all_roots
>   |  btrfs_find_all_roots
>   |  btrfs_qgroup_rescan_worker
>   |  normal_work_helper
>   |  btrfs_qgroup_rescan_helper
>   |  process_one_work
>   |  worker_thread
>   |  kthread
>   |  ret_from_fork
>--0.05%-- [...]
> 
> 32.14% 4.16%  kworker/u8:0 [btrfs]  [k] btrfs_search_slot
>|
>---btrfs_search_slot
>   |  
>   |--87.90%-- find_parent_nodes
>   |  __btrfs_find_all_roots
>   |  btrfs_find_all_roots
>   |  btrfs_qgroup_rescan_worker
>   |  normal_work_helper
>   |  btrfs_qgroup_rescan_helper
>   |  process_one_work
>   |  worker_thread
>   |  kthread
>   |  ret_from_fork
>   |  
>   |--11.70%-- btrfs_search_old_slot
>   |  __resolve_indirect_refs
>   |  find_parent_nodes
>   |  __btrfs_find_all_roots
>   |  btrfs_find_all_roots
>   |  btrfs_qgroup_rescan_worker
>   |  normal_work_helper
>   |  btrfs_qgroup_rescan_helper
>   |  process_one_work
>   |  worker_thread
>   |  kthread
>   |  ret_from_fork
>--0.39%-- [...]
> 





Re: [PATCH v2] fstest: btrfs: test single 4k extent after subpagesize buffered writes

2016-03-19 Thread Dave Chinner
On Wed, Mar 16, 2016 at 09:13:57PM -0700, Liu Bo wrote:
> On Tue, Mar 15, 2016 at 02:39:41PM +1100, Dave Chinner wrote:
> > On Mon, Mar 07, 2016 at 04:27:59PM -0800, Liu Bo wrote:
> > > This is to test if COW enabled btrfs can end up with single 4k extents
> > > when doing subpagesize buffered writes.
> > > 
> > > Signed-off-by: Liu Bo 
> > > ---
> > 
> > > +_scratch_mkfs >> $seqres.full 2>&1
> > > +_scratch_mount
> > > +
> > > +default_expire=`cat /proc/sys/vm/dirty_expire_centisecs`
> > > +echo 50 > /proc/sys/vm/dirty_expire_centisecs
> > 
> > why?
> 
> Setting it to 50 is to flush dirty pages more frequently so that it's
> more likely to reproduce this bug.

Explainations should be in comments.


> > > + $XFS_IO_PROG -c "pwrite $toff $tlen" $tfile > /dev/null 2>&1
> > > + toff=$((toff + tlen))
> > > +done
> > > +
> > > +sync
> > > +
> > > +# check for single PAGESIZE extent
> > > +$XFS_IO_PROG -c "fiemap -v" $tfile >> $seqres.full 2>&1
> > > +$XFS_IO_PROG -c "fiemap -v" $tfile | awk '{if ($4 == 8) print $4}'
> > 
> > Assumes page size is 4k. Also assumes that no individual extent in
> > the file is 4k. Likely broken.
> 
> My mistake; I'll try to come up with a way to tell awk PAGE_SIZE, maybe call 
> getpagesize() ?
> (if you already know how to do it, please do me a favor :-) )

awk -v pgsize=$page_size \
	'{ if ($4 == pgsize / 512) print $4; }'

Assuming that fiemap is reporting in 512 byte block size, not sector
sizes...
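
A fuller sketch of the page-size-aware check along those lines; getconf stands in for the xfstests get_page_size helper, and the fiemap line is faked for illustration:

```shell
# Derive the page size at runtime instead of hard-coding 4k, then match
# fiemap's length column (in 512-byte units) against a single page.
page_size=$(getconf PAGE_SIZE)
blocks=$((page_size / 512))

# A faked single-page "fiemap -v" data line; a real test would pipe
# "$XFS_IO_PROG -c 'fiemap -v' $tfile" here instead.
out=$(printf '0: [0..%d]: 24..31 %d\n' $((blocks - 1)) "$blocks" | \
	awk -v pgsize="$page_size" '{ if ($4 == pgsize / 512) print $4 }')
echo "$out"
```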

> > > --- /dev/null
> > > +++ b/tests/btrfs/027.out
> > > @@ -0,0 +1 @@
> > > +QA output created by 027
> > 
> > So are we expecting no output or not?
> 
> We don't expect any 4k single extent if btrfs is doing correctly, so I leave 
> the output empty.

Convention is that we echo a line to the golden output file to
document the test is expected to give no output. That's why you see

echo "Silence is golden"

in many tests...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: qgroup code slowing down rebalance

2016-03-19 Thread Dave Hansen
On 03/16/2016 06:36 PM, Qu Wenruo wrote:
> Dave Hansen wrote on 2016/03/16 13:53 -0700:
>> I have a medium-sized multi-device btrfs filesystem (4 disks, 16TB
>> total) running under 4.5.0-rc5.  I recently added a disk and needed to
>> rebalance.  I started a rebalance operation three days ago.  It was on
>> the order of 20% done after those three days. :)
...
> Perf record profile will help a lot.
> Please upload it if it's OK for you.

I'll send it privately.

But, I do see basically the same behavior when balancing data and
metadata.  I profiled both.

> Also, the following data would help a lot:
> 1) btrfs fi df output
>To determine the metadata/data ratio
>Balancing metadata should be quite slow.

Data, RAID1: total=4.53TiB, used=4.53TiB
System, RAID1: total=32.00MiB, used=720.00KiB
Metadata, RAID1: total=17.00GiB, used=15.77GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


> 2) btrfs subvolume list output
>To determine how many tree blocks are shared with each other.
>The more shared tree blocks, the slower the quota code runs.
> 
>Feel free to mask the output to avoid information leaking.

Here you go (pasted at the end).  You can pretty clearly see that I'm
using this volume for incremental backups.  My plan was to keep the old
snapshots around until the filesystem fills up.

> 3) perf record for balancing metadata and data respectively
>Although this is optional. Just to prove some assumption.
> 
> btrfs_find_all_roots() is quite a slow operation, and in its worst case
> (tree blocks are shared by a lot of trees) it may be near O(2^n).

Yikes! But, that does look to be consistent with what I'm seeing.

>> ID 15183 gen 207620 top level 5 path blackbird_backups.enc
>> ID 15547 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455779116
>> ID 15548 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455779185
>> ID 15559 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455814382
>> ID 15560 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455814540
>> ID 15561 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455814739
>> ID 15562 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455814783
>> ID 15563 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455815180
>> ID 15564 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455815259
>> ID 15565 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455815314
>> ID 15566 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455815948
>> ID 15567 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455816104
>> ID 15568 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455816127
>> ID 15569 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455816213
>> ID 15570 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455816305
>> ID 15571 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455816591
>> ID 15572 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455816622
>> ID 15573 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455816632
>> ID 15574 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455816638
>> ID 15575 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1455816837
>> ID 15576 gen 207620 top level 5 path home-backup-on-btrfs.enc
>> ID 16634 gen 207620 top level 5 path homes-backup-nimitz-o2.enc
>> ID 17389 gen 207620 top level 5 path snapshots/root/1456271568
>> ID 17390 gen 207620 top level 5 path snapshots/root/1456272221
>> ID 17391 gen 207620 top level 5 path snapshots/root/1456416369
>> ID 17392 gen 207620 top level 5 path snapshots/root/1456435609
>> ID 17393 gen 207620 top level 5 path snapshots/root/1456539873
>> ID 17394 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1456553298
>> ID 17395 gen 207620 top level 5 path snapshots/root/1456553448
>> ID 17396 gen 207620 top level 5 path snapshots/home-backup-on-btrfs.enc/1456553460
>> ID 17397 gen 207620 top level 5 path snapshots/home-backup-on-btrfs.enc/1456553461
>> ID 17398 gen 207620 top level 5 path snapshots/home-backup-on-btrfs.enc/1456602362
>> ID 17399 gen 207620 top level 5 path snapshots/home-backup-on-btrfs.enc/1456602405
>> ID 17400 gen 207620 top level 5 path snapshots/blackbird_backups.enc/1456602405
>> ID 17401 gen 207620 top level 5 path snapshots/home-backup-on-btrfs.enc/1456628682
>

Re: [PATCH 12/12] block: test fallocate for block devices

2016-03-14 Thread Dave Chinner
On Sat, Mar 05, 2016 at 10:25:08AM -0800, Christoph Hellwig wrote:
> I'm not sure xfstests is the right fit, as it does not test a file
> system, but rather block devices.
> 
> If people think it should go into xfstests we should at least not
> add it to the default group, but just to a new bdev group.

I think it's the right place to test it - we have all the
infrastructure available to do it (i.e. xfs_io and various block
devices) and we really need to make sure this stuff works,
especially if we start to write filesystem code that depends on
correct behaviour...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] fstest: btrfs: test single 4k extent after subpagesize buffered writes

2016-03-14 Thread Dave Chinner
On Mon, Mar 07, 2016 at 04:27:59PM -0800, Liu Bo wrote:
> This is to test if COW enabled btrfs can end up with single 4k extents
> when doing subpagesize buffered writes.
> 
> Signed-off-by: Liu Bo 
> ---

> +_scratch_mkfs >> $seqres.full 2>&1
> +_scratch_mount
> +
> +default_expire=`cat /proc/sys/vm/dirty_expire_centisecs`
> +echo 50 > /proc/sys/vm/dirty_expire_centisecs

why?

> +tfile=$SCRATCH_MNT/testfile
> +
> +PAGE_SIZE=$(get_page_size)
> +SUBLEN=$((RANDOM % PAGE_SIZE))

local variables are lower case.

> +$XFS_IO_PROG -f -c "pwrite 0 $PAGE_SIZE" $tfile > /dev/null 2>&1
> +$XFS_IO_PROG -c "pwrite $PAGE_SIZE $SUBLEN" $tfile > /dev/null 2>&1
> +
> +toff=$((PAGE_SIZE + SUBLEN))
> +for ((i=0; i<1; i++))
> +do

# Add comment on what the loop is trying to do to the layout of the
# file being written.
for ((i = 0; i < 1; i++)); do

> + tlen=$PAGE_SIZE
> + if [ $((i % 2)) = 0 ]; then
> + tlen=$((PAGE_SIZE * 3))
> + fi
> + if [ $((i % 1000)) = 0 ]; then
> + tlen=$((RANDOM % PAGE_SIZE))
> + fi
> +
> + $XFS_IO_PROG -c "pwrite $toff $tlen" $tfile > /dev/null 2>&1
> + toff=$((toff + tlen))
> +done
> +
> +sync
> +
> +# check for single PAGESIZE extent
> +$XFS_IO_PROG -c "fiemap -v" $tfile >> $seqres.full 2>&1
> +$XFS_IO_PROG -c "fiemap -v" $tfile | awk '{if ($4 == 8) print $4}'

Assumes page size is 4k. Also assumes that no individual extent in
the file is 4k. Likely broken.

> +# success, all done
> +status=0
> +exit
> diff --git a/tests/btrfs/027.out b/tests/btrfs/027.out
> new file mode 100644
> index 000..e8291ab
> --- /dev/null
> +++ b/tests/btrfs/027.out
> @@ -0,0 +1 @@
> +QA output created by 027

So are we expecting no output or not?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4 0/6] Btrfs in-band de-duplication test cases

2016-03-14 Thread Dave Chinner
On Mon, Mar 14, 2016 at 10:24:30AM -0700, Darrick J. Wong wrote:
> On Mon, Mar 14, 2016 at 03:56:47PM +0800, Qu Wenruo wrote:
> > Please don't merge this patchset.
> > 
> > As there is some naming still undecided.
> > 
> > The abbreviation 'dedup' may be changed to 'dedupe'.
> > I'll update them when all the related parts are settled.
> 
> There's already a 'dedupe' group in xfstests for testing the out-of-band ioctl
> that duperemove uses.  I wondered if that factored into your decision to use
> 'dedup' as the group name for the inband tests.
> 
> Seeing as other filesystems are beginning to support the OOB ioctls and might
> never support the in-band stuff btrfs is doing, what do people think about
> keeping the out-of and in-band dedup tests in separate groups to make it clear
> which dedupe feature each test is aiming to validate?

So just name the two groups appropriately: "ib-dedupe" and
"oob-dedupe" or something like that.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 RESEND 2/5] fstests: btrfs: Add basic test for btrfs in-band de-duplication

2016-02-28 Thread Dave Chinner
On Mon, Feb 29, 2016 at 10:04:35AM +0800, Qu Wenruo wrote:
> Hi Dave,
> 
> Thanks for the review.
> 
> All comment are correct and I'll update the patchset soon.
> 
> Only one small question below
> 
> Dave Chinner wrote on 2016/02/29 09:26 +1100:
> ...
> >>+# File size is twice the maximum file extent of btrfs
> >>+# So even fallbacked to non-dedup, it will have at least 2 extents
> >>+file_size=$(( 256 * 1024 * 1024 ))
> >
> >Used for xfs_io, so "file_size=256m" is all that is needed here.
> 
> Super nice feature, supporting unit suffixes. I checked the man page of
> xfs_io, but only the extsize value mentioned support for such
> suffixes.
> 
> I assume all offset/length/bsize/value arguments support suffixes, right?

Yes, they do, always have, originally came from other XFS commands
(i.e see the mkfs.xfs for the "usual units suffixes" description).

> Hope man page get updated.

Can you send a patch?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 1/2] fstests: btrfs/011 use replace_options

2016-02-28 Thread Dave Chinner
On Wed, Feb 24, 2016 at 01:42:08PM +0100, David Sterba wrote:
> On Fri, Feb 19, 2016 at 09:25:28PM +0800, Anand Jain wrote:
> > This patch fixes test btrfs/011 which intended to use -r option
> > but was never used since its associated args 'replace_options'
> > didn't make it to the cli.
> 
> After this patch the tests fails due to extra output. I've tested
> several kernels (4.5 rcs), HDD and SSD so it's not caused by my testing
> setup.

I'll use this as an opportunity to point out that nobody is
reviewing most btrfs test changes. I'll commit the changes if no-one
says anything about them, they look sane to me and don't cause me
any problems, but that doesn't mean they will work. Mostly I'm
concerned about changes to the common code and code patterns, not
what the test does or whether it is even a valid test.

IOWs, it is up to the developers who use xfstests to make sure the
tests relevant to their areas of expertise work correctly and are
valid/viable tests, not me. To avoid this sort of issue in
future, please review changes promptly.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 RESEND 5/5] fstests: btrfs: Test inband dedup with balance.

2016-02-28 Thread Dave Chinner
On Wed, Feb 24, 2016 at 04:06:36PM +0800, Qu Wenruo wrote:
> Btrfs balance will relocate data extents, but their hashes are removed too
> late, at run_delayed_ref() time, which causes extent refs to increase
> during balance, so either find_data_references() hits its WARN_ON() or
> run_delayed_refs() fails and aborts the transaction.
> 
> Add such concurrency test for inband dedup and balance.
> 
> Signed-off-by: Qu Wenruo 
> ---
...
> +for n in $(seq 1 $nr); do
> + $XFS_IO_PROG -f -c "pwrite -b $dedup_bs 0 $dedup_bs" \
> + ${file}_${n} > /dev/null 2>&1
> +done

_populate_fs(), please.

> +
> +kill $balance_pid &> /dev/null
> +wait
> +
> +# Sometimes even we killed $balance_pid and wait returned,
> +# balance may still be running, use balance cancel to wait it.
> +_run_btrfs_util_prog balance cancel $SCRATCH_MNT &> /dev/null

This needs to be in the cleanup function.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 RESEND 4/5] fstests: btrfs: Add per inode dedup flag test

2016-02-28 Thread Dave Chinner
On Wed, Feb 24, 2016 at 04:06:35PM +0800, Qu Wenruo wrote:
> +# File size is twice the maximum file extent of btrfs
> +# So even fallbacked to non-dedup, it will have at least 2 extents
> +file_size=$(( 256 * 1024 * 1024 ))
> +dedup_bs=$(( 64 * 1024 ))

256m, 64k.

> +_scratch_mkfs "-O dedup" >> $seqres.full 2>&1
> +_scratch_mount
> +
> +# Return 0 for not deduped at all , return 1 for part or full deduped
> +test_file_deduped () {
> + file=$1
> +

echo -n "$file: "

> + nr_uniq_extents=$(_uniq_extent_count $file)
> + nr_total_extents=$(_extent_count $file)
> +
> + if [ $nr_uniq_extents -eq $nr_total_extents ]; then
> + echo "not de-duplicated"
> + else
> + echo "de-duplicated"
> + fi

> +}
> +
> +dedup_write_file () {
> + file=$1
> + size=$2
> +
> + $XFS_IO_PROG -f -c "pwrite -b $dedup_bs 0 $size" $file | _filter_xfs_io
> +}
> +
> +print_result () {
> + file=$1
> +
> + echo "$(basename $file): $(test_file_deduped $file)"
> +}
> +_run_btrfs_util_prog dedup enable -b $dedup_bs $SCRATCH_MNT
> +touch $SCRATCH_MNT/dedup_file
> +touch $SCRATCH_MNT/no_dedup_file
> +mkdir $SCRATCH_MNT/dedup_dir
> +mkdir $SCRATCH_MNT/no_dedup_dir
> +
> +_run_btrfs_util_prog property set $SCRATCH_MNT/no_dedup_file dedup disable
> +_run_btrfs_util_prog property set $SCRATCH_MNT/no_dedup_dir dedup disable
> +
> +dedup_write_file $SCRATCH_MNT/tmp $dedup_bs

Just call xfs_io directly - wrapper is not necessary for a single
line command.

> +# sync to ensure hash is added to dedup tree
> +sync
> +
> +dedup_write_file $SCRATCH_MNT/dedup_file $file_size
> +dedup_write_file $SCRATCH_MNT/no_dedup_file $file_size
> +dedup_write_file $SCRATCH_MNT/dedup_dir/dedup_dir_default_file $file_size
> +dedup_write_file $SCRATCH_MNT/no_dedup_dir/no_dedup_dir_default_file $file_size
> +
> +print_result $SCRATCH_MNT/dedup_file

test_file_deduped $SCRATCH_MNT/dedup_file | _filter_scratch

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 RESEND 3/5] fstests: btrfs: Add testcase for btrfs dedup enable disable race test

2016-02-28 Thread Dave Chinner
On Wed, Feb 24, 2016 at 04:06:34PM +0800, Qu Wenruo wrote:
> +
> +fsstress_work &
> +fsstress_pid=$!
> +
> +trigger_work &
> +trigger_pid=$!
> +
> +wait $fsstress_pid
> +kill $trigger_pid
> +wait

Maximum bound runtime based on $TIME_FACTOR would be better, rather
than having to wait for fsstress to complete.
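
A sketch of the bounded-runtime variant Dave suggests; the stub workers stand in for the patch's fsstress_work/trigger_work loops, and the 2-second base runtime is arbitrary (a real test would use something larger):

```shell
# Stub workers stand in for the fsstress/dedup-trigger loops under review.
fsstress_work() { while :; do sleep 1; done; }
trigger_work()  { while :; do sleep 1; done; }

# Scale the deadline by TIME_FACTOR instead of waiting for fsstress to
# finish on its own.
runtime=$(( 2 * ${TIME_FACTOR:-1} ))

fsstress_work & fsstress_pid=$!
trigger_work & trigger_pid=$!

sleep "$runtime"
kill "$fsstress_pid" "$trigger_pid" 2>/dev/null
wait 2>/dev/null || true
echo "stopped after ${runtime}s"
```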

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 RESEND 2/5] fstests: btrfs: Add basic test for btrfs in-band de-duplication

2016-02-28 Thread Dave Chinner
On Wed, Feb 24, 2016 at 04:06:33PM +0800, Qu Wenruo wrote:
> Add basic test for btrfs in-band de-duplication, including:
> 1) Enable
> 2) Re-enable
> 3) On-disk extents are referring to the same bytenr
> 4) Disable
> 
> Signed-off-by: Qu Wenruo 
> ---
>  common/defrag   |   8 
>  tests/btrfs/200 | 125 
> 
>  tests/btrfs/200.out |  19 
>  tests/btrfs/group   |   1 +
>  4 files changed, 153 insertions(+)
>  create mode 100755 tests/btrfs/200
>  create mode 100644 tests/btrfs/200.out
> 
> diff --git a/common/defrag b/common/defrag
> index 942593e..34cc822 100644
> --- a/common/defrag
> +++ b/common/defrag
> @@ -47,6 +47,14 @@ _extent_count()
   $XFS_IO_PROG -c "fiemap" $1 | tail -n +2 | grep -v hole | wc -l | $AWK_PROG '{print $1}'
>  }
>  
> +_uniq_extent_count()
> +{
> + file=$1
> + $XFS_IO_PROG -c "fiemap" $file >> $seqres.full 2>&1
> + $XFS_IO_PROG -c "fiemap" $file | tail -n +2 | grep -v hole |\
> + $AWK_PROG '{print $3}' | sort | uniq | wc -l
> +}

This needs comments to explain how it is different from _extent_count.
Also should probably be named _extent_count_unique()

> +
>   min=$1
> diff --git a/tests/btrfs/200 b/tests/btrfs/200
> new file mode 100755
> index 000..f2ff542
> --- /dev/null
> +++ b/tests/btrfs/200
> @@ -0,0 +1,125 @@
> +#! /bin/bash
> +# FS QA Test 200
> +#
> +# Basic btrfs inband dedup test, including:
> +# 1) Enable
> +# 2) Uniq file extent number
Unique.

> +# 3) Re-enable
> +# 4) Disable

I don't understand what 2-4 are describing. As a test summary,
"Basic btrfs inband dedup test" is sufficient.

> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_btrfs_subcommand dedup
> +_require_btrfs_fs_feature dedup
> +_require_btrfs_mkfs_feature dedup

> +
> +# File size is twice the maximum file extent of btrfs
> +# So even fallbacked to non-dedup, it will have at least 2 extents
> +file_size=$(( 256 * 1024 * 1024 ))

Used for xfs_io, so "file_size=256m" is all that is needed here.

> +_scratch_mkfs "-O dedup" >> $seqres.full 2>&1
> +_scratch_mount
> +
> +do_dedup_test()
> +{
> + backend=$1
> + dedup_bs=$2
> +
> + _run_btrfs_util_prog dedup enable -s $backend -b $dedup_bs $SCRATCH_MNT
> + $XFS_IO_PROG -f -c "pwrite -b $dedup_bs 0 $dedup_bs" \
> + $SCRATCH_MNT/initial_block | _filter_xfs_io
> +
> + # sync to ensure dedup hash is added into dedup pool
> + sync

xfs_io -fs  or xfs_io ... -c "fsync" ... ?

> + $XFS_IO_PROG -f -c "pwrite -b $dedup_bs 0 $file_size" \
> + $SCRATCH_MNT/real_file | _filter_xfs_io
> + # sync again to ensure data are all written to disk and
> + # we can get stable extent map
> + sync

Again, why not just do a sync write or fsync from xfs_io?

> +
> + # Test if real_file is de-duplicated
> + nr_uniq_extents=$(_uniq_extent_count $SCRATCH_MNT/real_file)
> + nr_total_extents=$(_extent_count $SCRATCH_MNT/real_file)
> +
> + echo "uniq/total: $nr_uniq_extents/$nr_total_extents" >> $seqres.full
> + # Allow a small amount of dedup miss, as commit interval or
> + # memory pressure may break a dedup_bs block and cause
> + # smalll extent which won't go through dedup routine
> + if [ $nr_uniq_extents -ge $(( $nr_total_extents * 5 / 100 )) ]; then
> + echo "Too high dedup failure rate"
> + fi

_within_tolerance

> +
> + # Also check the md5sum to ensure data is not corrupted
> + md5=$(_md5_checksum $SCRATCH_MNT/real_file)
> + if [ $md5 != $init_md5 ]; then
> + echo "File after in-band de-duplication is corrupted"
> + fi

Nope. Just echo the md5sum to the golden output file.


> +}
> +
> +# Create the initial file and calculate its checksum without dedup
> +$XFS_IO_PROG -f -c "pwrite 0 $file_size" $SCRATCH_MNT/csum_file | \
> + _filter_xfs_io
> +init_md5=$(_md5_checksum $SCRATCH_MNT/csum_file)
> +echo "md5 of the initial file is $init_md5" >> $seqres.full

Just echo the md5sum to the golden output file.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 1/2] fstests: generic test for directory fsync after rename operation

2016-02-17 Thread Dave Chinner
On Mon, Feb 15, 2016 at 10:54:23AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Test that if we move one file between directories, fsync the parent
> directory of the old directory, power fail and remount the filesystem,
> the file is not lost and it's located at the destination directory.
> 
> This is motivated by a bug found in btrfs, which is fixed by the patch
> (for the linux kernel) titled:
> 
>   "Btrfs: fix file loss on log replay after renaming a file and fsync"
> 
> Tested against ext3, ext4, xfs, f2fs and reiserfs.
> 
> Signed-off-by: Filipe Manana 

> +# We expect our file foo to exist, have an entry in the new parent
> +# directory (c/) and not have anymore an entry in the old parent directory
> +# (a/b/).
> +[ -e $SCRATCH_MNT/a/b/foo ] && echo "File foo is still at directory a/b/"
> +[ -e $SCRATCH_MNT/c/foo ] || echo "File foo is not at directory c/"
> +
> +# The new file named bar should also exist.
> +[ -e $SCRATCH_MNT/a/bar ] || echo "File bar is missing"

This can all be replaced simply by:

ls -R $SCRATCH_MNT | _filter_scratch

Because the golden image match will tell us if files are missing or
in the wrong place.
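
A self-contained illustration of that approach; mktemp stands in for the scratch mount, and a plain sed for the _filter_scratch helper:

```shell
# Build the post-replay layout the test expects, then emit the whole tree;
# the golden-output diff catches any missing or misplaced file.
scratch=$(mktemp -d)
mkdir -p "$scratch/a/b" "$scratch/c"
touch "$scratch/c/foo" "$scratch/a/bar"

# Rewrite the real mount path so output is stable across machines,
# as _filter_scratch does in xfstests.
listing=$(ls -R "$scratch" | sed "s,$scratch,SCRATCH_MNT,")
echo "$listing"
rm -rf "$scratch"
```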

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4.2 00/32] xfstests: test the nfs/cifs/btrfs/xfs reflink/dedupe ioctls

2016-02-12 Thread Dave Chinner
On Thu, Feb 11, 2016 at 03:39:16PM -0800, Darrick J. Wong wrote:
> Dave Chinner: I've renumbered the new tests and pushed to github[3] if
> you'd like to pull.  See the pull request at the end of this message.
> 
> This is a patch set against the reflink/dedupe test cases in xfstests.
> The first three patches fix errors in the existing reflink tests, some
> of which are from Christoph Hellwig.
> 
> The next eight patches fix style errors, refactor commonly used code,
> remove unnecessary clutter, and add missing _require* checks to the
> existing reflink tests.
> 
> Patch 12 adds a test checking that unwritten extent conversion does
> NOT happen after a directio write to an unwritten extent hits a disk
> error.   Due to a bug in the VFS directio code, ext4 can disclose
> stale disk contents if an aio dio write fails; XFS suffers this
> problem for any failing dio write to an unwritten extent.  Christoph's
> kernel patchset titled "vfs/xfs: directio updates to ease COW handling
> V2" (and a separate ext4 warning cleanup) is needed to fix this.
> 
> Patches 13-31 add more reflink tests focusing on correct CoW behavior
> particularly with the CoW extent size hint enabled.  It also provides
> a few regression tests for bugs that have been hit while running XFS
> reflink, a few tests of the quota accounting when various reflink
> operations happen, and a few tests for get_bmapx to ensure that what
> it reports is at least somewhat accurate.
> 
> Patch 25 adds a few basic reverse-mapping tests for XFS.
> 
> If you're going to start using this mess, you probably ought to just
> pull from my github trees for kernel[1], xfsprogs[2], xfstests[3],
> xfs-docs[4], and man-pages[5].  All tests should pass on XFS, YMMV on
> btrfs and ocfs2.
> 
> Comments and questions are, as always, welcome.

I haven't worked out which patch causes this, but:

xfs/246  - output mismatch (see /home/dave/src/xfstests-dev/results//xfs_1k/xfs/246.out.bad)
--- tests/xfs/246.out   2016-02-13 09:46:01.419169115 +1100
+++ /home/dave/src/xfstests-dev/results//xfs_1k/xfs/246.out.bad 2016-02-13 11:36:05.205851863 +1100
@@ -3,4 +3,5 @@
 Create the original files
 Dump extents after sync
 Hole CoW extents:
-SCRATCH_MNT/test-246/file1: no extents
+bmap: invalid option -- 'c'
+bmap [-adlpv] [-n nx] -- print block mapping for an XFS file
...
(Run 'diff -u tests/xfs/246.out /home/dave/src/xfstests-dev/results//xfs_1k/xfs/246.out.bad' to see the entire diff)

This test needs a check that the "-c" option in the bmap command
exists. Not sure if you can use '_require_xfs_io_command "bmap -c"'
here or whether this requires more help...
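One possible shape for such a probe. The helper name and the "invalid option" detection below are illustrative assumptions, not the stock fstests implementation (which has its own _require_xfs_io_command):

```shell
# Sketch only: probe whether an io tool supports a given sub-command by
# looking for the usage error an older binary prints.
_notrun() { echo "[not run] $*"; }    # stub: the real _notrun exits the test

_require_io_subcommand()
{
    cmd="$1"; shift
    if "$cmd" -c "$*" /dev/null 2>&1 | grep -q "invalid option"; then
        _notrun "$cmd does not support '$*'"
        return 1
    fi
}

# Demonstrate against a stub standing in for an old xfs_io that lacks
# "bmap -c"; a real test would pass "$XFS_IO_PROG" instead.
stub=$(mktemp)
printf '#!/bin/sh\necho "bmap: invalid option -- c" >&2\n' > "$stub"
chmod +x "$stub"
_require_io_subcommand "$stub" "bmap -c" || echo "test would be skipped"
rm -f "$stub"
```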

Followup patch, anyway, because I'm about to push out everything
that was in your second pull req (i.e. with the aiocp fix).

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 12/32] dio: unwritten conversion bug tests

2016-02-12 Thread Dave Chinner
On Fri, Feb 12, 2016 at 10:22:31AM -0500, Theodore Ts'o wrote:
> On Fri, Feb 12, 2016 at 02:52:53PM +1100, Dave Chinner wrote:
> > On Thu, Feb 11, 2016 at 03:40:37PM -0800, Darrick J. Wong wrote:
> > > Check that we don't expose old disk contents when a directio write to
> > > an unwritten extent fails due to IO errors.  This primarily affects
> > > XFS and ext4.
> > > 
> > > Signed-off-by: Darrick J. Wong 
> > 
> > aiocp.c: In function 'main':
> > aiocp.c:407:1: warning: format '%d' expects argument of type 'int', but 
> > argument 3 has type 'off_t' [-Wformat=]
> >  printf("tocopy=%d len=%d blk=%d\n", tocopy, length, aio_blksize);
> >  ^
> > 
> > Followup patch is fine, I'll commit as is.
> 
> Dave, given that Darrick's patches are going to cause more
> _scratch_mount occurrences, how quickly do you plan to commit his
> patchset?  Should I wait until he makes his change before I do the
> s/_scratch_mount/_scratch_cycle_mount/ change?  Should I make that
> change, and let you fix up the commit?

I had to head out suddenly yesterday afternoon, just after I sent
the above email. Only just got back in front of the computer, so I'll be
pushing it out once the new pull has run through a sanity check.

If you can respin the patch once I've done that, it would be greatly
appreciated.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 12/32] dio: unwritten conversion bug tests

2016-02-11 Thread Dave Chinner
On Thu, Feb 11, 2016 at 03:40:37PM -0800, Darrick J. Wong wrote:
> Check that we don't expose old disk contents when a directio write to
> an unwritten extent fails due to IO errors.  This primarily affects
> XFS and ext4.
> 
> Signed-off-by: Darrick J. Wong 

aiocp.c: In function 'main':
aiocp.c:407:1: warning: format '%d' expects argument of type 'int', but 
argument 3 has type 'off_t' [-Wformat=]
 printf("tocopy=%d len=%d blk=%d\n", tocopy, length, aio_blksize);
 ^

Followup patch is fine, I'll commit as is.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 12/23] xfs/122: support refcount/rmap data structures

2016-02-09 Thread Dave Chinner
On Mon, Feb 08, 2016 at 11:55:06PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 09, 2016 at 06:43:30PM +1100, Dave Chinner wrote:
> > On Mon, Feb 08, 2016 at 05:13:03PM -0800, Darrick J. Wong wrote:
> > > Include the refcount and rmap structures in the golden output.
> > > 
> > > Signed-off-by: Darrick J. Wong 
> > > ---
> > >  tests/xfs/122 |3 +++
> > >  tests/xfs/122.out |4 
> > >  tests/xfs/group   |2 +-
> > >  3 files changed, 8 insertions(+), 1 deletion(-)
> > > 
> > > 
> > > diff --git a/tests/xfs/122 b/tests/xfs/122
> > > index e6697a2..758cb50 100755
> > > --- a/tests/xfs/122
> > > +++ b/tests/xfs/122
> > > @@ -90,6 +90,9 @@ xfs_da3_icnode_hdr
> > >  xfs_dir3_icfree_hdr
> > >  xfs_dir3_icleaf_hdr
> > >  xfs_name
> > > +xfs_owner_info
> > > +xfs_refcount_irec
> > > +xfs_rmap_irec
> > >  xfs_alloctype_t
> > >  xfs_buf_cancel_t
> > >  xfs_bmbt_rec_32_t
> > 
> > So this is going to cause failures on any userspace that doesn't
> > know about these new types, right?
> > 
> > Should these be conditional in some way?
> 
> I wasn't sure how to handle this -- I could just keep the patch at the head of
> my stack (unreleased) until xfsprogs pulls in the appropriate libxfs pieces?
> So long as we're not dead certain of the final format of the rmapbt and
> refcountbt, there's probably not a lot of value in putting this in (yet).

Well, I'm more concerned about running on older/current distros that
don't have support for them in userspace. My brain is mush right
now, so I don't have any brilliant ideas (hence the question, rather
than also presenting a posible solution). I'll have a think; maybe
we can make use of the configurable .out file code we have now?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 21/23] xfs: aio cow tests

2016-02-09 Thread Dave Chinner
On Mon, Feb 08, 2016 at 05:14:01PM -0800, Darrick J. Wong wrote:
....
> +
> +echo "Check for damage"
> +_dmerror_unmount
> +_dmerror_cleanup
> +_repair_scratch_fs >> "$seqres.full" 2>&1

Are you testing repair here? If so, why doesn't failure matter?
If not, why do it? Or is _require_scratch_nocheck all that is needed
here?

> +echo "CoW and unmount"
> +"$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * bsz)) 1" "$testdir/file2" >> 
> "$seqres.full"
> +"$XFS_IO_PROG" -f -c "pwrite -S 0x63 -b $((blksz * bsz)) 0 $((blksz * nr))" 
> "$TEST_DIR/moo" >> "$seqres.full"

offset = block size times block size?

I think some better names might be needed...
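As a sketch of the renaming being asked for (the names and sizes below are placeholders, not the final test):

```shell
# Placeholder names and sizes: the point is that named offset/length
# variables stop "blksz * bsz" from reading like a units error.
blksz=65536                  # filesystem block size
nr=64                        # blocks written per file
off=$((blksz * nr / 4))      # where the directio CoW write starts
len=$((blksz * nr / 2))      # how much it writes
echo "pwrite -S 0x63 -b $len $off $len"
```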

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 19/23] xfs: test rmapbt functionality

2016-02-09 Thread Dave Chinner
On Mon, Feb 08, 2016 at 05:13:48PM -0800, Darrick J. Wong wrote:
> Signed-off-by: Darrick J. Wong 
> ---
>  common/xfs|   44 ++
>  tests/xfs/233 |   78 ++
>  tests/xfs/233.out |6 +++
>  tests/xfs/234 |   89 
>  tests/xfs/234.out |6 +++
>  tests/xfs/235 |  108 
> +
>  tests/xfs/235.out |   14 +++
>  tests/xfs/236 |   93 ++
>  tests/xfs/236.out |8 
>  tests/xfs/group   |4 ++
>  10 files changed, 450 insertions(+)
>  create mode 100644 common/xfs
>  create mode 100755 tests/xfs/233
>  create mode 100644 tests/xfs/233.out
>  create mode 100755 tests/xfs/234
>  create mode 100644 tests/xfs/234.out
>  create mode 100755 tests/xfs/235
>  create mode 100644 tests/xfs/235.out
>  create mode 100755 tests/xfs/236
>  create mode 100644 tests/xfs/236.out
> 
> 
> diff --git a/common/xfs b/common/xfs
> new file mode 100644
> index 000..2d1a76f
> --- /dev/null
> +++ b/common/xfs
> @@ -0,0 +1,44 @@
> +##/bin/bash
> +# Routines for handling XFS
> +#---
> +#  Copyright (c) 2015 Oracle.  All Rights Reserved.
> +#  This program is free software; you can redistribute it and/or modify
> +#  it under the terms of the GNU General Public License as published by
> +#  the Free Software Foundation; either version 2 of the License, or
> +#  (at your option) any later version.
> +#
> +#  This program is distributed in the hope that it will be useful,
> +#  but WITHOUT ANY WARRANTY; without even the implied warranty of
> +#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +#  GNU General Public License for more details.
> +#
> +#  You should have received a copy of the GNU General Public License
> +#  along with this program; if not, write to the Free Software
> +#  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307
> +#  USA
> +#
> +#  Contact information: Oracle Corporation, 500 Oracle Parkway,
> +#  Redwood Shores, CA 94065, USA, or: http://www.oracle.com/
> +#---
> +
> +_require_xfs_test_rmapbt()
> +{
> + _require_test
> +
> + if [ "$(xfs_info "$TEST_DIR" | grep -c "rmapbt=1")" -ne 1 ]; then
> + _notrun "rmapbt not supported by test filesystem type: $FSTYP"
> + fi
> +}
> +
> +_require_xfs_scratch_rmapbt()
> +{
> + _require_scratch
> +
> + _scratch_mkfs > /dev/null
> + _scratch_mount
> + if [ "$(xfs_info "$SCRATCH_MNT" | grep -c "rmapbt=1")" -ne 1 ]; then
> + _scratch_unmount
> + _notrun "rmapbt not supported by scratch filesystem type: 
> $FSTYP"
> + fi
> + _scratch_unmount
> +}

No, not yet. :)

Wait until I get my "split common/rc" patchset out there, because it
does not require:

> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/xfs

This.

And I don't want to have to undo a bunch of stuff in tests yet. Just
lump it all in common/rc for the moment.

> +
> +# real QA test starts here
> +_supported_os Linux
> +_supported_fs xfs
> +_require_xfs_scratch_rmapbt
> +
> +echo "Format and mount"
> +_scratch_mkfs -d size=$((2 * 4096 * 4096)) -l size=4194304 > "$seqres.full" 
> 2>&1
> +_scratch_mount >> "$seqres.full" 2>&1

_scratch_mkfs_sized ?
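A sketch of what that substitution would look like; the sizes are copied from the quoted hunk, and whether the explicit log size is still wanted is left open here:

```shell
# _scratch_mkfs_sized takes one byte count and derives the mkfs
# geometry itself, so the test stops hard-coding -d/-l options.
fs_size=$((2 * 4096 * 4096))    # 32MiB, as in the quoted test
echo "_scratch_mkfs_sized $fs_size"
```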

> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +cd /
> +#rm -f $tmp.*

Another commented-out cleanup that needs uncommenting.

> +
> +echo "Check for damage"
> +umount "$SCRATCH_MNT"
> +_check_scratch_fs
> +
> +# success, all done
> +status=0
> +exit

Cull.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 18/23] xfs: test the automatic cowextsize extent garbage collector

2016-02-09 Thread Dave Chinner
On Mon, Feb 08, 2016 at 05:13:42PM -0800, Darrick J. Wong wrote:
> Signed-off-by: Darrick J. Wong 
> +
> +_cleanup()
> +{
> +cd /
> +echo $old_cow_lifetime > 
> /proc/sys/fs/xfs/speculative_cow_prealloc_lifetime
> +#rm -rf "$tmp".* "$testdir"

uncomment.

> +echo "CoW and leave leftovers"
> +echo $old_cow_lifetime > /proc/sys/fs/xfs/speculative_cow_prealloc_lifetime
> +seq 2 2 $((nr - 1)) | while read f; do
> + "$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * f)) 1" "$testdir/file2" 
> >> "$seqres.full"
> + "$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * f)) 1" 
> "$testdir/file2.chk" >> "$seqres.full"
> +done

Ok, I just realised what was bugging me about these loops: "f" is
not a typical loop iterator for a count. Normally we'd use "i" for
these

> +echo "old extents: $old_extents" >> "$seqres.full"
> +echo "new extents: $new_extents" >> "$seqres.full"
> +echo "maximum extents: $internal_blks" >> "$seqres.full"
> +test $new_extents -lt $((internal_blks / 7)) || _fail "file2 badly 
> fragmented"

I wouldn't use _fail like this; echo is sufficient to cause the test
to fail.
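A minimal sketch of the echo-based version; the variable values here are dummies standing in for the measured counts:

```shell
# Dummy values stand in for the measured extent counts; the stray echo
# line is what makes the golden-output diff fail the test, with the
# actual count visible in the .out.bad file.
new_extents=50        # dummy: measured fragmentation
internal_blks=280     # dummy: computed worst case
if [ "$new_extents" -ge $((internal_blks / 7)) ]; then
    echo "file2 badly fragmented: $new_extents extents"
fi
```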

> +echo "Check for damage"
> +umount "$SCRATCH_MNT"
> +
> +# success, all done
> +status=0
> +exit

As would getting rid of the unmount and just setting status
appropriately...

/repeat

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 17/23] reflink: test CoW across a mixed range of block types with cowextsize set

2016-02-09 Thread Dave Chinner
lksz))
> +"$XFS_IO_PROG" -c "cowextsize $((blksz * 16))" "$testdir" >> "$seqres.full"
> +_pwrite_byte 0x61 0 $((blksz * nr)) "$testdir/file1" >> "$seqres.full"
> +$XFS_IO_PROG -f -c "falloc 0 $((blksz * nr))" "$testdir/file3" >> 
> "$seqres.full"
> +_pwrite_byte 0x00 0 $((blksz * nr)) "$testdir/file3.chk" >> "$seqres.full"
> +seq 0 2 $((nr-1)) | while read f; do
> + _reflink_range "$testdir/file1" $((blksz * f)) "$testdir/file3" 
> $((blksz * f)) $blksz >> "$seqres.full"
> + _pwrite_byte 0x61 $((blksz * f)) $blksz "$testdir/file3.chk" >> 
> "$seqres.full"
> +done

This looks like several tests use this setup. Factor?

> +_scratch_remount
> +
> +echo "Compare files"
> +md5sum "$testdir/file1" | _filter_scratch
> +md5sum "$testdir/file3" | _filter_scratch
> +md5sum "$testdir/file3.chk" | _filter_scratch
> +
> +echo "directio CoW across the transition"
> +"$XFS_IO_PROG" -d -f -c "pwrite -S 0x63 -b $((blksz * nr / 2)) $((blksz * nr 
> / 4)) $((blksz * nr / 2))" "$testdir/file3" >> "$seqres.full"
> +_pwrite_byte 0x63 $((blksz * nr / 4)) $((blksz * nr / 2)) 
> "$testdir/file3.chk" >> "$seqres.full"
> +_scratch_remount

These could really do with local variables to keep the verbosity
down and make it easy to change in future.

off=$((blksz * nr / 4))
iosz=$((blksz * nr / 2))

$XFS_IO_PROG -d -c "pwrite -S 0x63 -b $iosz $off $iosz" $testdir/file3
_pwrite_byte 0x63 $off $iosz $testdir/file3.chk

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 13/23] xfs: test fragmentation characteristics of copy-on-write

2016-02-09 Thread Dave Chinner
On Mon, Feb 08, 2016 at 05:13:09PM -0800, Darrick J. Wong wrote:
> Perform copy-on-writes at random offsets to stress the CoW allocation
> system.  Assess the effectiveness of the extent size hint at
> combatting fragmentation via unshare, a rewrite, and no-op after the
> random writes.
> 
> Signed-off-by: Darrick J. Wong 

> +seq=`basename "$0"`
> +seqres="$RESULT_DIR/$seq"
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +cd /
> +#rm -rf "$tmp".* "$testdir"

Now that I've noticed it, a few tests have this line commented out.
Probably should remove the tmp files, at least.

> +rm -f "$seqres.full"
> +
> +echo "Format and mount"
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount >> "$seqres.full" 2>&1
> +
> +testdir="$SCRATCH_MNT/test-$seq"
> +rm -rf $testdir
> +mkdir $testdir

Again, something that is repeated - we just mkfs'd the scratch
device, so the $testdir is guaranteed not to exist...

> +echo "Check for damage"
> +umount "$SCRATCH_MNT"

I've also noticed this in a lot of tests - the scratch device will
be unmounted by the harness, so I don't think this is necessary

> +free_blocks=$(stat -f -c '%a' "$testdir")
> +real_blksz=$(stat -f -c '%S' "$testdir")
> +space_needed=$(((blksz * nr * 3) * 5 / 4))
> +space_avail=$((free_blocks * real_blksz))
> +internal_blks=$((blksz * nr / real_blksz))
> +test $space_needed -gt $space_avail && _notrun "Not enough space. 
> $space_avail < $space_needed"

Why not:

_require_fs_space $space_needed

At minimum, it seems to be a repeated hunk of code, so it should be
factored.

> +testdir="$SCRATCH_MNT/test-$seq"
> +rm -rf $testdir
> +mkdir $testdir
> +
> +echo "Create the original files"
> +"$XFS_IO_PROG" -f -c "pwrite -S 0x61 0 0" "$testdir/file1" >> "$seqres.full"
> +"$XFS_IO_PROG" -f -c "pwrite -S 0x61 0 1048576" "$testdir/file2" >> 
> "$seqres.full"
> +_scratch_remount
> +
> +echo "Set extsz and cowextsz on zero byte file"
> +"$XFS_IO_PROG" -f -c "extsize 1048576" "$testdir/file1" | _filter_scratch
> +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file1" | _filter_scratch
> +
> +echo "Set extsz and cowextsz on 1Mbyte file"
> +"$XFS_IO_PROG" -f -c "extsize 1048576" "$testdir/file2" | _filter_scratch
> +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file2" | _filter_scratch
> +_scratch_remount
> +
> +fn() {
> + "$XFS_IO_PROG" -c "$1" "$2" | sed -e 's/.\([0-9]*\).*$/\1/g'
> +}
> +echo "Check extsz and cowextsz settings on zero byte file"
> +test $(fn extsize "$testdir/file1") -eq 1048576 || echo "file1 extsize not 
> set"
> +test $(fn cowextsize "$testdir/file1") -eq 1048576 || echo "file1 cowextsize 
> not set" 

For this sort of thing, just dump the extent size value to the
golden output. i.e.

echo "Check extsz and cowextsz settings on zero byte file"
$XFS_IO_PROG -c extsize $testdir/file1
$XFS_IO_PROG -c cowextsize $testdir/file1

is all that is needed. That way if it fails, we see what value it
had instead of the expected 1MB. This also makes the test much less
verbose and easier to read.

> +
> +echo "Check extsz and cowextsz settings on 1Mbyte file"
> +test $(fn extsize "$testdir/file2") -eq 0 || echo "file2 extsize not set"
> +test $(fn cowextsize "$testdir/file2") -eq 1048576 || echo "file2 cowextsize 
> not set" 
> +
> +echo "Set cowextsize and check flag"
> +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file3" | _filter_scratch
> +_scratch_remount
> +test $("$XFS_IO_PROG" -c "stat" "$testdir/file3" | grep 'fsxattr.xflags' | 
> awk '{print $4}' | grep -c 'C') -eq 1 || echo "file3 cowextsz flag not set"
> +test $(fn cowextsize "$testdir/file3") -eq 1048576 || echo "file3 cowextsize 
> not set"
> +"$XFS_IO_PROG" -f -c "cowextsize 0" "$testdir/file3" | _filter_scratch
> +_scratch_remount
> +test $(fn cowextsize "$testdir/file3") -eq 0 || echo "file3 cowextsize not 
> set"
> +test $("$XFS_IO_PROG" -c "stat" "$testdir/file3" | grep 'fsxattr.xflags' | 
> awk '{print $4}' | grep -c 'C') -eq 0 || echo "file3 cowextsz flag not set"

Same with all these - just grep the output for the line you want,
and the golden output matching does everything else. e.g. the flag
check simply becomes:

$XFS_IO_PROG -c "stat" $testdir/file3 | grep 'fsxattr.xflags'

Again, this tells us what the wrong flags are if it fails...

There are quite a few bits of these tests where the same thing
applies

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 06/23] dio unwritten conversion bug tests

2016-02-08 Thread Dave Chinner
On Mon, Feb 08, 2016 at 05:12:23PM -0800, Darrick J. Wong wrote:
> Check that we don't expose old disk contents when a directio write to
> an unwritten extent fails due to IO errors.  This primarily affects
> XFS and ext4.
> 
> Signed-off-by: Darrick J. Wong 
.
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -252,7 +252,9 @@
>  247 auto quick rw
>  248 auto quick rw
>  249 auto quick rw
> +250 auto quick
>  251 ioctl trim
> +252 auto quick

Also should be in the prealloc group if we are testing unwritten
extent behaviour and the rw group because it's testing IO.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 12/23] xfs/122: support refcount/rmap data structures

2016-02-08 Thread Dave Chinner
On Mon, Feb 08, 2016 at 05:13:03PM -0800, Darrick J. Wong wrote:
> Include the refcount and rmap structures in the golden output.
> 
> Signed-off-by: Darrick J. Wong 
> ---
>  tests/xfs/122 |3 +++
>  tests/xfs/122.out |4 
>  tests/xfs/group   |2 +-
>  3 files changed, 8 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/tests/xfs/122 b/tests/xfs/122
> index e6697a2..758cb50 100755
> --- a/tests/xfs/122
> +++ b/tests/xfs/122
> @@ -90,6 +90,9 @@ xfs_da3_icnode_hdr
>  xfs_dir3_icfree_hdr
>  xfs_dir3_icleaf_hdr
>  xfs_name
> +xfs_owner_info
> +xfs_refcount_irec
> +xfs_rmap_irec
>  xfs_alloctype_t
>  xfs_buf_cancel_t
>  xfs_bmbt_rec_32_t

So this is going to cause failures on any userspace that doesn't
know about these new types, right?

Should these be conditional in some way?

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 10/23] xfs: more reflink tests

2016-02-08 Thread Dave Chinner
On Mon, Feb 08, 2016 at 05:12:50PM -0800, Darrick J. Wong wrote:
> Create a couple of XFS-specific tests -- one to check that growing
> and shrinking the refcount btree works and a second one to check
> what happens when we hit maximum refcount.
> 
> Signed-off-by: Darrick J. Wong 
.
> +# real QA test starts here
> +_supported_os Linux
> +_supported_fs xfs
> +_require_scratch_reflink
> +_require_cp_reflink

> +
> +test -x "$here/src/punch-alternating" || _notrun "punch-alternating not 
> built"

I suspect we need a _require rule for checking that something in
the test src directory has been built.
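One possible shape for such a rule; the helper name is an assumption (fstests later grew a similar _require_test_program):

```shell
# _notrun unless the binary under src/ exists and is executable.
_notrun() { echo "[not run] $*"; }    # stub: the real _notrun exits the test

_require_src_program()
{
    [ -x "$here/src/$1" ] || _notrun "$1 not built"
}

# Usage, with a scratch directory standing in for the fstests tree:
here=$(mktemp -d)
mkdir -p "$here/src"
_require_src_program punch-alternating    # nothing built yet, so "[not run]"
rm -rf "$here"
```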

> +echo "Check scratch fs"
> +umount "$SCRATCH_MNT"
> +echo "check refcount after removing all files" >> "$seqres.full"
> +"$XFS_DB_PROG" -c 'agf 0' -c 'addr refcntroot' -c 'p recs[1]' "$SCRATCH_DEV" 
> >> "$seqres.full"
> +"$XFS_REPAIR_PROG" -o force_geometry -n "$SCRATCH_DEV" >> "$seqres.full" 2>&1
> +res=$?
> +if [ $res -eq 0 ]; then
> + # If repair succeeds then format the device so that the post-test
> + # check doesn't fail due to the single AG.
> + _scratch_mkfs >> "$seqres.full" 2>&1
> +else
> + _fail "xfs_repair fails"
> +fi
> +
> +# success, all done
> +status=0
> +exit

This is what _require_scratch_nocheck avoids.

i.e. do this instead:

_require_scratch_nocheck
.....

"$XFS_REPAIR_PROG" -o force_geometry -n "$SCRATCH_DEV" >> "$seqres.full" 2>&1 
status=$?
exit

Also, we really don't need the quotes around these global
variables.  They are just noise and lots of stuff will break if
those variables are set to something that requires them to be
quoted.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v4.1 00/23] xfstests: test the nfs/cifs/btrfs/xfs reflink/dedupe ioctls

2016-02-08 Thread Dave Chinner
On Mon, Feb 08, 2016 at 05:11:45PM -0800, Darrick J. Wong wrote:
> Happy New Year!
> 
> Dave Chinner: I've renumbered the new tests and pushed to github[3] if
> you'd like to pull.

Can you include the commit ID I should see at the head of the
tree so I can confirm I'm pulling the right branch?

BTW, git doesn't like this:

https://github.com/djwong/xfstests/tree/for-dave

What git really wants is the tree url with a separate branch name
like so:

https://github.com/djwong/xfstests.git for-dave

(i.e. the typical output from a git request-pull command)

> This is a (no longer) small patch set against the reflink/dedupe test
> cases in xfstests.  The first four patches fix errors in the existing
> reflink tests, some of which are from Christoph Hellwig.
> 
> Patches 5-6 refactor the dmerror code so that we can use it to
> simulate transient IO errors, then use this code to test that
> unwritten extent conversion does NOT happen after a directio write to
> an unwritten extent hits a disk error.   Due to a bug in the VFS
> directio code, ext4 can disclose stale disk contents if an aio dio
> write fails; XFS suffers this problem for any failing dio write to an
> unwritten extent.  Christoph's kernel patchset titled "vfs/xfs:
> directio updates to ease COW handling V2" (and a separate ext4 warning
> cleanup) is needed to fix this.
> 
> Patches 7-9, 13, 15, 17, 18, 20, 21, and 23 exercise various parts
> of the copy on write behavior that are necessary to support shared
> blocks.  The earlier patches focus on correct CoW behavior in the
> presence of IO errors during the copy-write, and the later patches
> focus on XFS' new cow-extent-size hint that greatly reduces
> fragmentation due to copy on write behavior by encouraging the
> allocator to allocate larger extents of replacement blocks.
> 
> Patches 10-12 and 14 perform stress testing on reflink and CoW to
> check the behaviors when we get close to maximum refcount, when we
> specify obnoxiously large offsets and lengths, and when we try to
> reflink millions of extents at a time.
> 
> Patch 16 tests quota accounting behavior when reflink is enabled.
> 
> Patch 19 adds a few tests for the XFS reverse mapping btree to ensure
> that things like metadump and growfs work correctly.
> 
> Patch 22 checks that get_bmapx and fiemap (on XFS) correctly flag
> extents as having shared blocks.  XFS now follows btrfs and ocfs2
> FIEMAP behavior such that if any blocks of a file's extent are shared,
> the whole extent is marked shared.  This is in contrast to earlier
> XFS-only behavior that reported shared and non-shared regions as
> separate extents.

This may change - xfs_bmap doesn't combine extents in its output
even if they are adjacent. For debugging purposes (which is what
xfs_bmap/fiemap is for), it's much better to be able to see the
exact extent layout and block sharing.

I suspect the solution of least surprise is to make fiemap behave
like the other filesystems, and make xfs_bmap behave in a manner
that is useful to us :P

> If you're going to start using this mess, you probably ought to just
> pull from my github trees for kernel[1], xfsprogs[2], xfstests[3],
> xfs-docs[4], and man-pages[5].  All tests should pass on XFS.   I
> tried btrfs this weekend and it failed 166, 175, 182, 266, 271, 272,
> 278, 281, 297, 298, 304, 333, and 334.  ocfs2 (when I jury-rigged it
> to run the cp_reflink tests) seemed to have a quota bug and crashes
> hard in 284 (but was otherwise fine).

Fun fun fun. I'll look through the patchs, and if there's nothing
major I'll pull it in once I get a commit ID from you.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fstests: btrfs, test for send with clone operations

2016-02-04 Thread Dave Chinner
On Thu, Feb 04, 2016 at 12:11:28AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Test that an incremental send operation which issues clone operations
> works for files that have a full path containing more than one parent
> directory component.
> 
> This used to fail before the following patch for the linux kernel:
> 
>   "[PATCH] Btrfs: send, fix extent buffer tree lock assertion failure"
> 
> Signed-off-by: Filipe Manana 

Looks ok, I've pulled it in. Something to think about:

> +# Create a bunch of small and empty files, this is just to make sure our
> +# subvolume's btree gets more than 1 leaf, a condition necessary to trigger a
> +# past bug (1000 files is enough even for a leaf/node size of 64K, the 
> largest
> +# possible size).
> +for ((i = 1; i <= 1000; i++)); do
> + echo -n > $SCRATCH_MNT/a/b/c/z_$i
> +done

We already do have a generic function for doing this called
_populate_fs(); it's just not optimised for speed with large numbers
of files being created.

i.e. the above is simply a single directory tree with a single level
with 1000 files of size 0:

_populate_fs -d 1 -n 1 -f 1000 -s 0 -r $SCRATCH_MNT/a/b/

Can you look into optimising _populate_fs() to use multiple threads
(say up to 4 by default) and "echo -n" to create files, and then
convert all our open coded "create lots of files" loops in tests to
use it?
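For illustration, a rough sketch of what that optimisation might look like; the worker count, helper name, and striding scheme are assumptions, not fstests code:

```shell
# Each worker creates every $workers-th file, so no two workers ever
# touch the same name and no locking is needed.
create_files_parallel()
{
    dir="$1"; count="$2"; workers="${3:-4}"
    for w in $(seq 0 $((workers - 1))); do
        (
            i=$((w + 1))
            while [ "$i" -le "$count" ]; do
                : > "$dir/z_$i"        # ":" redirect is the cheap "echo -n >"
                i=$((i + workers))
            done
        ) &
    done
    wait
}

dir=$(mktemp -d)
create_files_parallel "$dir" 1000 4
ls "$dir" | wc -l       # all 1000 files created
rm -rf "$dir"
```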

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 00/35 v2] separate operations from flags in the bio/request structs

2016-01-06 Thread Dave Chinner
On Wed, Jan 06, 2016 at 08:40:09PM -0500, Martin K. Petersen wrote:
> >>>>> "Mike" == mchristi   writes:
> 
> Mike> The following patches begin to cleanup the request->cmd_flags and
> Mike> bio->bi_rw mess. We currently use cmd_flags to specify the operation,
> Mike> attributes and state of the request. For bi_rw we use it for
> Mike> similar info and also the priority but then also have another
> Mike> bi_flags field for state. At some point, we abused them so much we
> Mike> just made cmd_flags 64 bits, so we could add more.
> 
> Mike> The following patches seperate the operation (read, write discard,
> Mike> flush, etc) from cmd_flags/bi_rw.
> 
> Mike> This patchset was made against linux-next from today Jan 5 2016.
> Mike> (git tag next-20160105).
> 
> Very nice work. Thanks for doing this!
> 
> I think it's a much needed cleanup. I focused mainly on the core block,
> discard, write same and sd.c pieces and everything looks sensible to me.
> 
> I wonder what the best approach is to move a patch set with this many
> stakeholders forward? Set a "speak now or forever hold your peace"
> review deadline?

I say just ask Linus to pull it immediately after the next merge
window closes

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 13/35] xfs: set bi_op to REQ_OP

2016-01-06 Thread Dave Chinner
On Tue, Jan 05, 2016 at 02:53:16PM -0600, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> This patch has xfs set the bio bi_op to a REQ_OP, and
> rq_flag_bits to bi_rw.
> 
> Note:
> I have run xfs tests on these btrfs patches. There were some failures
> with and without the patches. I have not had time to track down why
> xfstest fails without the patches.
> 
> Signed-off-by: Mike Christie 
> ---
>  fs/xfs/xfs_aops.c |  3 ++-
>  fs/xfs/xfs_buf.c  | 27 +++
>  2 files changed, 17 insertions(+), 13 deletions(-)

Not sure which patches your note is referring to here.

The XFS change here looks fine.

Acked-by: Dave Chinner 

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Ideas on unified real-ro mount option across all filesystems

2015-12-26 Thread Dave Chinner
On Thu, Dec 24, 2015 at 10:22:31AM +1100, Stewart Smith wrote:
> Eric Sandeen  writes:
> >> 3) A lot of users don't even know that mounting read-only can still
> >>modify the device. I didn't know this myself until I checked the
> >>log replay code of btrfs.
> >>Adding such a mount option alias may raise some attention from users.
> >
> > Given that nothing in the documentation implies that the block device itself
> > must remain unchanged on a read-only mount, I don't see any problem which
> > needs fixing.  MS_RDONLY rejects user IO; that's all.
> >
> > If you want to be sure your block device rejects all IO for forensics or
> > what have you, I'd suggest # blockdev --setro /dev/whatever prior to mount,
> > and take it out of the filesystem's control.  Or better yet, making an
> > image and not touching the original.
> 
> What we do for the petitboot bootloader in POWER and OpenPower firmware
> (a linux+initramfs that does kexec to boot) is that we use device mapper
> to make a snapshot in memory where we run recovery (for some
> filesystems, notably XFS is different due to journal not being endian
> safe). We also have to have an option *not* to do that, just in case
> there's a bug in journal replay... and we're lucky in the fact that we
> probably do have enough memory to complete replay, this solution could
> be completely impossible on lower memory machines.
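The `blockdev --setro` approach Eric suggests above, spelled out as a sketch. The device path is a stand-in, and the commands are wrapped in an echo mock so the sketch runs without root or a real disk:

```shell
# Mock runner: prints each command instead of executing it, so this is
# safe to run anywhere; drop the mock on a real system.
run() { echo "$@"; }
dev=/dev/sdx1   # stand-in for the device under investigation

run blockdev --setro "$dev"   # block layer now rejects ALL writes,
run mount -o ro "$dev" /mnt   # so even journal replay cannot touch the disk
run dd if="$dev" of=/tmp/sdx1.img bs=1M   # better yet: image it, work on the copy
```

With the device set read-only at the block layer, a filesystem that would normally replay its journal on a read-only mount gets an I/O error instead of silently writing; it may refuse to mount at all, which is the point for forensics.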

Which means the boot loader is going to break horribly when we
change the on-disk format and feature flags the boot loader doesn't
understand get set in the root filesystem. Then the bootloader will
refuse to mount the filesystem and the system won't boot anymore...

IOWs, developers and users can't make a root XFS filesystem with a
new/experimental feature on POWER/OpenPower machines because the
bootloader will refuse to mount it regardless of the clean/dirty
state of the journal.

> As such, I believe we're the only bit of firmware/bootloader ever that
> has correctly parsed a journalling filesystem.

Nope.  The Irix bootloader (sash) could do this 20 years ago - there
are even feature mask bits reserved specifically for SASH in the XFS
superblock. However, seeing as the bootloader was always upgraded
during the install of each new Irix release, the bootloader was
always up-to-date with the on-disk features the kernel supported and
so there was never a problem with mismatched feature support.

However, Linux users can upgrade or change the kernel at any time
independently of the bootloader, so it's pretty much guaranteed that
mismatched bootloader/kernel filesystem capabilities will cause
users problems at some point in the not-too-distant future.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 1/2] fstests: fix btrfs test failures after commit 27d077ec0bda

2015-12-23 Thread Dave Chinner
On Tue, Dec 22, 2015 at 02:22:40AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Commit 27d077ec0bda (common: use mount/umount helpers everywhere) made
> a few btrfs tests fail for 2 different reasons:
> 
> 1) Some tests (btrfs/029 and btrfs/031) use $SCRATCH_MNT as a mount
>point for some subvolume created in $TEST_DEV, therefore calling
>_scratch_unmount does not work as it passes $SCRATCH_DEV as the
>argument to the umount program. This is intentional to test reflinks
>across different mountpoints of the same filesystem but for different
>subvolumes;

The correct way to fix this is to stop abusing $SCRATCH_MNT and
to instead use a local mount point on the test device

> 2) For multiple devices filesystems (btrfs/003 and btrfs/011) that test
>the device replace feature, we need to unmount using the mount path
>($SCRATCH_MNT) because unmounting using one of the devices as an
>argument ($SCRATCH_DEV) does not always work - after replace operations
>we get in /proc/mounts a device other than $SCRATCH_DEV associated
>with the mount point $SCRATCH_MNT (this is mentioned in a comment at
>btrfs/011 for example), so we need to pass that other device to the
>umount program or pass it the mount point.

Which says to me that _scratch_unmount should be using $SCRATCH_MNT
rather than $SCRATCH_DEV. That would fix the problem without needing
to modify any of the tests, right?
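A minimal sketch of that helper change (the real function lives in xfstests' common/rc; the harness variables are shown with stand-in values and umount is mocked with echo so the sketch runs anywhere):

```shell
# Stand-in harness values; xfstests sets these from its config.
UMOUNT_PROG="echo umount"   # mocked so the sketch is runnable as-is
SCRATCH_DEV=/dev/sdb1
SCRATCH_MNT=/mnt/scratch

# Before: "$UMOUNT_PROG $SCRATCH_DEV" -- breaks after a btrfs device
# replace, when /proc/mounts lists a different device for the mount.
# After: unmount by mount point, which is stable across replaces.
_scratch_unmount()
{
    $UMOUNT_PROG $SCRATCH_MNT
}

_scratch_unmount   # prints: umount /mnt/scratch
```

Unmounting by mount point sidesteps the device-replace problem for every test at once, without touching the tests themselves.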

> Using $SCRATCH_MNT as a mountpoint for a device other than $SCRATCH_DEV is
> misleading, but that's a different problem that existed long before and
> this change attempts only to fix the regression from 27d077ec0bda.

It may be misleading, but that's the fundamental problem that needs
fixing.  As always, we should be trying to fix the root cause of the
problem, not working around it...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Ideas to do custom operation just after mount?

2015-12-23 Thread Dave Chinner
On Mon, Dec 21, 2015 at 01:18:22PM +0800, Anand Jain wrote:
> 
> 
> >BTW, any good idea for btrfs to do such operation like
> >enabling/disabling some minor features? Especially when it can be set on
> >individual file/dirs.
> >
> >Features like incoming write time deduplication, is designed to be
> >enabled/disabled for individual file/dirs, so it's not a quite good idea
> >to use mount option to do it.
> >
> >Although some feature, like btrfs quota(qgroup), should be implemented
> >by mount option though.
> >I don't understand why qgroup is enabled/disabled by ioctl. :(
> 
> 
> mount option won't persist across systems/computers unless
> remembered by human.

So record the mount option you want persistent in the filesystem at
mount time and don't turn it off until a "no-" mount option is
provided at mount time.
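That scheme -- record the option on first use, keep applying it on later mounts, clear it only on an explicit "no-" option -- can be illustrated in userspace terms. The state file stands in for persistent on-disk feature bits, and the option names are made up for the example; this is not btrfs's actual interface:

```shell
STATE=$(mktemp)   # stands in for persistent on-disk state

mount_with_opts() {
    for opt in "$@"; do
        case "$opt" in
            no-*)   # explicit negation clears the persisted option
                grep -v "^${opt#no-}$" "$STATE" > "$STATE.new" || true
                mv "$STATE.new" "$STATE" ;;
            *)      # record the option unless already persisted
                grep -qx "$opt" "$STATE" || echo "$opt" >> "$STATE" ;;
        esac
    done
    echo "active:" $(cat "$STATE")
}

mount_with_opts dedupe      # first mount enables and persists it
mount_with_opts             # later mounts need no option: still active
mount_with_opts no-dedupe   # explicit "no-" finally turns it off
```

The point of the shape is that the option travels with the filesystem, not with /etc/fstab, so it survives moving the disk to another machine.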

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Loss of connection to Half of the drives

2015-12-22 Thread Dave S
devid   18 size 931.51GiB used 4.00GiB path /dev/sdan
devid   19 size 931.51GiB used 3.01GiB path /dev/sdao
devid   20 size 931.51GiB used 3.01GiB path /dev/sdap

btrfs-progs v3.19.1

# dmesg|grep -i btrfs
[   15.938327] Btrfs loaded
[   15.938826] BTRFS: device label scratch2 devid 15 transid 7 /dev/sdak
[   15.939138] BTRFS: device label scratch2 devid 14 transid 7 /dev/sdaj
[   15.939163] BTRFS: device label scratch2 devid 11 transid 7 /dev/sdag
[   15.939960] BTRFS: device label scratch2 devid 19 transid 7 /dev/sdao
[   15.940056] BTRFS: device label scratch2 devid 18 transid 7 /dev/sdan
[   15.941805] BTRFS: device label scratch2 devid 4 transid 9 /dev/sdf
[   15.944312] BTRFS: device label scratch2 devid 10 transid 9 /dev/sdl
[   15.952288] BTRFS: device label scratch2 devid 7 transid 9 /dev/sdi
[   15.962451] BTRFS: device label scratch2 devid 2 transid 9 /dev/sdd
[   15.962995] BTRFS: device label scratch2 devid 3 transid 9 /dev/sde
[   15.963431] BTRFS: device label scratch2 devid 13 transid 7 /dev/sdai
[   15.974559] BTRFS: device label scratch2 devid 5 transid 9 /dev/sdg
[   15.979969] BTRFS: device label scratch2 devid 16 transid 7 /dev/sdal
[   15.981962] BTRFS: device label scratch2 devid 8 transid 9 /dev/sdj
[   15.988912] BTRFS: device label scratch2 devid 1 transid 9 /dev/sdc
[   15.998077] BTRFS: device label scratch2 devid 20 transid 7 /dev/sdap
[   16.001058] BTRFS: device label scratch2 devid 12 transid 7 /dev/sdah
[   16.008946] BTRFS: device label scratch2 devid 9 transid 9 /dev/sdk
[   16.010560] BTRFS: device label scratch2 devid 17 transid 7 /dev/sdam
[   16.011840] BTRFS: device label scratch2 devid 6 transid 9 /dev/sdh
[  120.971556] BTRFS info (device sdh): disk space caching is enabled
[  120.971560] BTRFS: has skinny extents
[  120.974091] BTRFS: failed to read chunk root on sdh
[  120.982591] BTRFS: open_ctree failed
[  142.749246] btrfs[2551]: segfault at 100108 ip 0044c503 sp 7fff8f261290 error 6 in btrfs[40+83000]
[  190.570074] btrfs[2614]: segfault at 100108 ip 0044c503 sp 7fff8f11aef0 error 6 in btrfs[40+83000]
[  215.734281] btrfs[2656]: segfault at 100108 ip 0044f5a3 sp 7fff4cfd23e0 error 6 in btrfs[40+88000]
[ 2545.896233] btrfs[4576]: segfault at 100108 ip 0044c503 sp 7c919400 error 6 in btrfs[40+83000]
[ 3000.106228] BTRFS info (device sdh): disk space caching is enabled
[ 3000.106233] BTRFS: has skinny extents
[ 3000.148162] BTRFS: failed to read chunk root on sdh
[ 3000.168649] BTRFS: open_ctree failed
[ 3071.430479] btrfs[4701]: segfault at 100108 ip 0046e06a sp 7fff44539688 error 6 in btrfs[40+c7000]
[ 3147.025032] btrfs[4755]: segfault at 100108 ip 0044c503 sp 7fffa79bbf40 error 6 in btrfs[40+83000]


Sincerely
-Dave


Re: [PATCH 06/11] reflink: more tests

2015-12-20 Thread Dave Chinner
On Sat, Dec 19, 2015 at 01:11:31AM -0800, Darrick J. Wong wrote:
> Add more tests for unaligned copy-on-write things, and explicitly
> test the ability to pass "len == 0" to mean "reflink/dedupe all
> the way to the end of the file".
> 
> Signed-off-by: Darrick J. Wong 
.
> +_cleanup()
> +{
> +cd /
> +rm -rf "$tmp".* "$TESTDIR"
> +}
.
> +
> +TESTDIR="$TEST_DIR/test-$seq"
> +rm -rf "$TESTDIR"
> +mkdir "$TESTDIR"

This use of TESTDIR is highly confusing. When I see TESTDIR I think
that it's the mount point of the test device, not potentially some
directory on the scratch device.

I see that this occurs through lots of tests, not just these new
ones. Can you do a pass across the tests with, say, sed and rename
all these to something less confusing? e.g. "testdir", in lower
case, makes it clear that it's a local variable, not the global test
device mount point (i.e. test local variables are lower case,
exported test harness variables are upper case ;)

Separate patch/pull req is fine.
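The rename pass Dave asks for is close to a one-liner; here is a sketch run against a sample test file (the sed expression is illustrative and would need a collision check -- e.g. against $TEST_DIR -- before being run across the whole tree):

```shell
tmptest=$(mktemp)
cat > "$tmptest" <<'EOF'
TESTDIR="$TEST_DIR/test-$seq"
rm -rf "$TESTDIR"
mkdir "$TESTDIR"
EOF

# Lower-case the test-local variable; "$TEST_DIR" (the harness's test
# device mount point) doesn't match the pattern, so it's left alone.
sed -i 's/\$TESTDIR\b/$testdir/g; s/^TESTDIR=/testdir=/' "$tmptest"
cat "$tmptest"
```

Run over the test files in bulk, that does the whole pass in one go, which suits doing it as the separate patch suggested above.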

Cheers,

dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Ideas to do custom operation just after mount?

2015-12-20 Thread Dave Chinner
On Thu, Dec 17, 2015 at 09:55:14AM +0800, Qu Wenruo wrote:
> Hi,
> 
> Will xfstests provides some API to do some operation just after
> mounting a filesystem?
> 
> Some filesystem(OK, btrfs again) has some function(now qgroup only)
> which needed to be enabled by ioctl instead of mount option.

Irk.

Seriously, I need to find some time to split all the fs-specific
stuff in common/rc into separate files and have each different
filesystem implement its own version of the relevant functions
like _scratch_mkfs, _scratch_mount, etc.

That will enable btrfs to do truly hideous stuff like this that
admins will hate you for requiring, and the rest of us won't have to
care about it.

> So is there any good idea to do some operation just after mounting
> in xfstests?

You're just going to have to wait a bit - the first piece of this
is in the overlayfs support patches, and the rest of it is in my
local work area in an incomplete state.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: !PageLocked BUG_ON hit in clear_page_dirty_for_io

2015-12-10 Thread Dave Jones
On Thu, Dec 10, 2015 at 05:57:20PM -0500, Dave Jones wrote:
 > On Thu, Dec 10, 2015 at 04:30:24PM -0500, Chris Mason wrote:
 >  > On Thu, Dec 10, 2015 at 02:35:55PM -0500, Dave Jones wrote:
 >  > > On Thu, Dec 10, 2015 at 02:02:20PM -0500, Chris Mason wrote:
 >  > >  > On Tue, Dec 08, 2015 at 11:25:28PM -0500, Dave Jones wrote:
 >  > >  > > Not sure if I've already reported this one, but I've been seeing
 >  > >  > > this a lot this last couple days.
 >  > >  > > 
 >  > >  > > kernel BUG at mm/page-writeback.c:2654!
 >  > >  > > invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
 >  > >  > > CPU: 1 PID: 2566 Comm: trinity-c1 Tainted: GW 4.4.0-rc4-think+ #14
 >  > >  > > task: 880462811b80 ti: 8800cd808000 task.ti: 8800cd808000
 >  > >  > > RIP: 0010:[]  [] clear_page_dirty_for_io+0x180/0x1d0
 >  > >  > 
 >  > >  > Huh, are you able to reproduce at will?  From this code path it
 >  > >  > should mean somebody else is unlocking a page they don't own.
 >  > > 
 >  > > pretty easily yeah. I hit it maybe a couple dozen times yesterday.
 >  > > So if you've got some idea of printk's to spray anywhere I can give
 >  > > that a shot.
 >  > 
 >  > I'd rather try to trigger it here.  Going to have to add some way to
 >  > record which stack trace last unlocked and/or freed the page.
 > 
 > I'm using..
 > 
 > trinity -q -l off -C8 -a64 -x fsync -x fdatasync -x syncfs -x sync 
 > --enable-fds=testfile,pseudo
 > 
 > interestingly, if I just use 'testfile' by itself, I can't reproduce it.
 >  > (That means "create a few files in the current dir and use their fds")
 > the "pseudo" bit means "also use fds from /proc, /sys and /dev".

Actually scratch that. I finally got it to reproduce with just 'testfile'.
Which makes sense, given it's a btrfs bug we're chasing, not anything to do
with sys/proc.

Dave

