[PATCH 1/2] btrfs-progs: Allow btrfs_leaf_free_space to accept NULL root

2015-11-05 Thread Qu Wenruo
Btrfs_leaf_free_space() function is used to determine the leaf/node
size.
It's OK to use root->nodesize to determine nodesize, but in fact,
extent_buffer->len can also be used to determine the nodesize if caller
can ensure it's a tree block.

So this patch will add support for NULL root for btrfs_leaf_free_space()
function, to allow btrfs_print_leaf() functions to be called in gdb or
to debug temporary btrfs in make_btrfs() without a valid root.

Signed-off-by: Qu Wenruo 
---
 ctree.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/ctree.c b/ctree.c
index 1434007..46153e3 100644
--- a/ctree.c
+++ b/ctree.c
@@ -1619,13 +1619,14 @@ static int leaf_space_used(struct extent_buffer *l, int 
start, int nr)
  */
 int btrfs_leaf_free_space(struct btrfs_root *root, struct extent_buffer *leaf)
 {
+   u32 nodesize = (root ? BTRFS_LEAF_DATA_SIZE(root) : leaf->len);
int nritems = btrfs_header_nritems(leaf);
int ret;
-   ret = BTRFS_LEAF_DATA_SIZE(root) - leaf_space_used(leaf, 0, nritems);
+   ret = nodesize - leaf_space_used(leaf, 0, nritems);
if (ret < 0) {
-   printk("leaf free space ret %d, leaf data size %lu, used %d 
nritems %d\n",
-  ret, (unsigned long) BTRFS_LEAF_DATA_SIZE(root),
-  leaf_space_used(leaf, 0, nritems), nritems);
+   printk("leaf free space ret %d, leaf data size %u, used %d 
nritems %d\n",
+  ret, nodesize, leaf_space_used(leaf, 0, nritems),
+  nritems);
}
return ret;
 }
-- 
2.6.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs-progs: debug-tree: Add option to show ondisk block without open_ctree

2015-11-05 Thread Qu Wenruo
Add new option '-B' to show tree block without calling open_ctree.
It's very useful to debug non-standard super or heavily damaged case.

As it needs nodesize, also adds a new option '-n' to specify nodesize.

Normal user should avoid calling it on random bytes, as it won't check
the validation of the tree block.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-debug-tree.asciidoc | 12 -
 btrfs-debug-tree.c  | 81 +
 2 files changed, 82 insertions(+), 11 deletions(-)

diff --git a/Documentation/btrfs-debug-tree.asciidoc 
b/Documentation/btrfs-debug-tree.asciidoc
index 23fc115..a5528cb 100644
--- a/Documentation/btrfs-debug-tree.asciidoc
+++ b/Documentation/btrfs-debug-tree.asciidoc
@@ -25,8 +25,16 @@ Print detailed extents info.
 Print info of btrfs device and root tree dirs only.
 -r::
 Print info of roots only.
--b ::
-Print info of the specified block only.
+-b ::
+Print info of the specified block at logical bytenr.
+-B ::
+Print info of the specified block at on-disk bytenr.
+Need to use with '-n ' option.
++
+Use with caution, as it won't do normal tree block check.
+
+-n ::
+Specify the nodesize for '-B ' option.
 
 EXIT STATUS
 ---
diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index 8adc39f..52e3de3 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -21,6 +21,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "kerncompat.h"
 #include "radix-tree.h"
@@ -41,8 +43,11 @@ static int print_usage(int ret)
fprintf(stderr, "\t-r : print info of roots only\n");
fprintf(stderr, "\t-R : print info of roots and root backups\n");
fprintf(stderr, "\t-u : print info of uuid tree only\n");
-   fprintf(stderr, "\t-b block_num : print info of the specified block"
-" only\n");
+   fprintf(stderr, "\t-b bytenr: print info of the specified block"
+" at logical bytenr only\n");
+   fprintf(stderr, "\t-B bytenr: print info of the specified block"
+" at disk bytenr only, need -n option(use with 
caution)\n");
+   fprintf(stderr, "\t-n nodesize: specify the nodesize to use with -B\n");
fprintf(stderr,
"\t-t tree_id : print only the tree with the given id\n");
fprintf(stderr, "%s\n", PACKAGE_STRING);
@@ -122,6 +127,41 @@ static void print_old_roots(struct btrfs_super_block 
*super)
}
 }
 
+static int print_ondisk_leaf(const char *device, u64 ondisk_bytenr,
+u32 nodesize)
+{
+   struct extent_buffer *buf = NULL;
+   int fd;
+   int ret;
+
+   buf = malloc(sizeof(*buf) + nodesize);
+   if (!buf) {
+   ret = -ENOMEM;
+   goto out;
+   }
+   memset(buf, 0, sizeof(buf) + nodesize);
+   buf->start = ondisk_bytenr;
+   buf->len = nodesize;
+   buf->refs = 1;
+   fd = open(device, O_RDONLY);
+   if (fd < 0) {
+   ret = -errno;
+   goto out;
+   }
+
+   ret = pread(fd, buf->data, nodesize, ondisk_bytenr);
+   if (ret < nodesize) {
+   ret = (ret < 0 ? ret : -EIO);
+   goto out;
+   }
+
+   /* We don't check anything, user should be responsible for it */
+   btrfs_print_leaf(NULL, buf);
+out:
+   free(buf);
+   return ret;
+}
+
 int main(int ac, char **av)
 {
struct btrfs_root *root;
@@ -140,7 +180,9 @@ int main(int ac, char **av)
int uuid_tree_only = 0;
int roots_only = 0;
int root_backups = 0;
-   u64 block_only = 0;
+   u64 logical_only = 0;
+   u64 ondisk_only = 0;
+   u32 nodesize = 0;
struct btrfs_root *tree_root_scan;
u64 tree_id = 0;
 
@@ -153,7 +195,7 @@ int main(int ac, char **av)
{ NULL, 0, NULL, 0 }
};
 
-   c = getopt_long(ac, av, "deb:rRut:", long_options, NULL);
+   c = getopt_long(ac, av, "deb:B:rRut:n:", long_options, NULL);
if (c < 0)
break;
switch(c) {
@@ -174,7 +216,13 @@ int main(int ac, char **av)
root_backups = 1;
break;
case 'b':
-   block_only = arg_strtou64(optarg);
+   logical_only = arg_strtou64(optarg);
+   break;
+   case 'B':
+   ondisk_only = arg_strtou64(optarg);
+   break;
+   case 'n':
+   nodesize = arg_strtou64(optarg);
break;
case 't':
tree_id = arg_strtou64(optarg);
@@ -196,6 +244,21 @@ int main(int ac, char **av)
exit(1);
}
 
+   /* Ondisk_only means we won't need to go 

[PATCH] Btrfs: fix race waiting for qgroup rescan worker

2015-11-05 Thread fdmanana
From: Filipe Manana 

We were initializing the completion (fs_info->qgroup_rescan_completion)
object after releasing the qgroup rescan lock, which gives a small time
window for a rescan waiter to not actually wait for the rescan worker
to finish. Example:

 CPU 1 CPU 2

 fs_info->qgroup_rescan_completion->done is 0

 btrfs_qgroup_rescan_worker()
   complete_all(_info->qgroup_rescan_completion)
 sets fs_info->qgroup_rescan_completion->done
 to UINT_MAX / 2

 ... do some other stuff 

 qgroup_rescan_init()
   mutex_lock(_info->qgroup_rescan_lock)
   set flag BTRFS_QGROUP_STATUS_FLAG_RESCAN
 in fs_info->qgroup_flags
   mutex_unlock(_info->qgroup_rescan_lock)

   
btrfs_qgroup_wait_for_completion()
 
mutex_lock(_info->qgroup_rescan_lock)
 sees flag 
BTRFS_QGROUP_STATUS_FLAG_RESCAN
   in 
fs_info->qgroup_flags
 
mutex_unlock(_info->qgroup_rescan_lock)

 
wait_for_completion_interruptible(
   
_info->qgroup_rescan_completion)

   
fs_info->qgroup_rescan_completion->done
   is > 0 so it returns 
immediately

  init_completion(_info->qgroup_rescan_completion)
sets fs_info->qgroup_rescan_completion->done to 0

So fix this by initializing the completion object while holding the mutex
fs_info->qgroup_rescan_lock.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/qgroup.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 75c0249..75bb4af9 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2387,12 +2387,11 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 
progress_objectid,
memset(_info->qgroup_rescan_progress, 0,
sizeof(fs_info->qgroup_rescan_progress));
fs_info->qgroup_rescan_progress.objectid = progress_objectid;
+   init_completion(_info->qgroup_rescan_completion);
 
spin_unlock(_info->qgroup_lock);
mutex_unlock(_info->qgroup_rescan_lock);
 
-   init_completion(_info->qgroup_rescan_completion);
-
memset(_info->qgroup_rescan_work, 0,
   sizeof(fs_info->qgroup_rescan_work));
btrfs_init_work(_info->qgroup_rescan_work,
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs: qgroup: exit the rescan worker during umount

2015-11-05 Thread Filipe Manana
On Wed, Nov 4, 2015 at 11:56 PM, Justin Maggard  wrote:
> I was hitting a consistent NULL pointer dereference during shutdown that
> showed the trace running through end_workqueue_bio().  I traced it back to
> the endio_meta_workers workqueue being poked after it had already been
> destroyed.
>
> Eventually I found that the root cause was a qgroup rescan that was still
> in progress while we were stopping all the btrfs workers.
>
> Currently we explicitly pause balance and scrub operations in
> close_ctree(), but we do nothing to stop the qgroup rescan.  We should
> probably be doing the same for qgroup rescan, but that's a much larger
> change.  This small change is good enough to allow me to unmount without
> crashing.
>
> v3: avoid more races by calling btrfs_qgroup_wait_for_completion()

Btw, information about what changes between versions should go after
the "---" (triple dash) below. You should also have mentioned what
changed between v2 and v1 as well (see
https://btrfs.wiki.kernel.org/index.php/Developer%27s_FAQ#Repeated_submissions).

>
> Signed-off-by: Justin Maggard 
Reviewed-by: Filipe Manana 

I've added this to my integration branch for 4.4 [1] and will prepare
later a pull request for Chris. I've removed there the v3 line from
the change log, as that's not  intended to be there, serves only for
patch reviewing in the list.

Thanks for doing this and the test case for fstests.

[1] 
http://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/log/?h=integration-4.4

> ---
>  fs/btrfs/disk-io.c | 3 +++
>  fs/btrfs/qgroup.c  | 9 ++---
>  2 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 2d46675..1eb0839 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3780,6 +3780,9 @@ void close_ctree(struct btrfs_root *root)
> fs_info->closing = 1;
> smp_mb();
>
> +   /* wait for the qgroup rescan worker to stop */
> +   btrfs_qgroup_wait_for_completion(fs_info);
> +
> /* wait for the uuid_scan task to finish */
> down(_info->uuid_tree_rescan_sem);
> /* avoid complains from lockdep et al., set sem back to initial state 
> */
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 46476c2..75c0249 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2286,7 +2286,7 @@ static void btrfs_qgroup_rescan_worker(struct 
> btrfs_work *work)
> goto out;
>
> err = 0;
> -   while (!err) {
> +   while (!err && !btrfs_fs_closing(fs_info)) {
> trans = btrfs_start_transaction(fs_info->fs_root, 0);
> if (IS_ERR(trans)) {
> err = PTR_ERR(trans);
> @@ -2307,7 +2307,8 @@ out:
> btrfs_free_path(path);
>
> mutex_lock(_info->qgroup_rescan_lock);
> -   fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
> +   if (!btrfs_fs_closing(fs_info))
> +   fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
>
> if (err > 0 &&
> fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT) {
> @@ -2336,7 +2337,9 @@ out:
> }
> btrfs_end_transaction(trans, fs_info->quota_root);
>
> -   if (err >= 0) {
> +   if (btrfs_fs_closing(fs_info)) {
> +   btrfs_info(fs_info, "qgroup scan paused");
> +   } else if (err >= 0) {
> btrfs_info(fs_info, "qgroup scan completed%s",
> err > 0 ? " (inconsistency flag cleared)" : "");
> } else {
> --
> 2.6.2
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-05 Thread Austin S Hemmelgarn

On 2015-11-04 23:06, Duncan wrote:

(Tho I should mention, while not on zfs, I've actually had my own
problems with ECC RAM too.  In my case, the RAM was certified to run at
speeds faster than it was actually reliable at, such that actually stored
data, what the ECC protects, was fine, the data was actually getting
damaged in transit to/from the RAM.  On a lightly loaded system, such as
one running many memory tests or under normal desktop usage conditions,
the RAM was generally fine, no problems.  But on a heavily loaded system,
such as when doing parallel builds (I run gentoo, which builds from
sources in ordered to get the higher level of option flexibility that
comes only when you can toggle build-time options), I'd often have memory
faults and my builds would fail.

The most common failure, BTW, was on tarball decompression, bunzip2 or
the like, since the tarballs contained checksums that were verified on
data decompression, and often they'd fail to verify.

Once I updated the BIOS to one that would let me set the memory speed
instead of using the speed the modules themselves reported, and I
declocked the memory just one notch (this was DDR1, IIRC I declocked from
the PC3200 it was rated, to PC3000 speeds), not only was the memory then
100% reliable, but I could and did actually reduce the number of wait-
states for various operations, and it was STILL 100% reliable.  It simply
couldn't handle the raw speeds it was certified to run, is all, tho it
did handle it well enough, enough of the time, to make the problem far
more difficult to diagnose and confirm than it would have been had the
problem appeared at low load as well.

As it happens, I was running reiserfs at the time, and it handled both
that hardware issue, and a number of others I've had, far better than I'd
have expected of /any/ filesystem, when the memory feeding it is simply
not reliable.  Reiserfs metadata, in particular, seems incredibly
resilient in the face of hardware issues, and I lost far less data than I
might have expected, tho without checksums and with bad memory, I imagine
I had occasional undetected bitflip corruption in files here or there,
but generally nothing I detected.  I still use reiserfs on my spinning
rust today, but it's not well suited to SSD, which is where I run btrfs.

But the point for this discussion is that just because it's ECC RAM
doesn't mean you can't have memory related errors, just that if you do,
they're likely to be different errors, "transit errors", that will tend
to be undetected by many memory checkers, at least the ones that don't
tend to run full out memory bandwidth if they're simply checking that
what was stored in a cell can be read back, unchanged.)
I've actually seen similar issues with both ECC and non-ECC memory 
myself.  Any time I'm getting RAM for a system that I can afford to 
over-spec, I get the next higher speed and under-clock it (which in turn 
means I can lower the timing parameters and usually get a faster system 
than if I was running it at the rated speed).  FWIW, I also make a point 
of doing multiple memtest86+ runs (at a minimum, one running single 
core, and one with forced SMP) when I get new RAM, and even have a 
run-level configured on my Gentoo based home server system where it 
boots Xen and fires up twice as many VM's running memtest86+ as I have 
CPU cores, which is usually enough to fully saturate memory bandwidth 
and check for the type of issues you mentioned having above (although 
the BOINC client I run usually does a good job of triggering those kind 
of issues fast, distributed computing apps tend to be memory bound and 
use a lot of memory bandwidth).




smime.p7s
Description: S/MIME Cryptographic Signature


Re: [PATCH 1/2] btrfs-progs: Allow btrfs_leaf_free_space to accept NULL root

2015-11-05 Thread David Sterba
On Thu, Nov 05, 2015 at 04:32:59PM +0800, Qu Wenruo wrote:
> Btrfs_leaf_free_space() function is used to determine the leaf/node
> size.
> It's OK to use root->nodesize to determine nodesize, but in fact,
> extent_buffer->len can also be used to determine the nodesize if caller
> can ensure it's a tree block.
> 
> So this patch will add support for NULL root for btrfs_leaf_free_space()
> function, to allow btrfs_print_leaf() functions to be called in gdb or
> to debug temporary btrfs in make_btrfs() without a valid root.
> 
> Signed-off-by: Qu Wenruo 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs-progs: debug-tree: Add option to show ondisk block without open_ctree

2015-11-05 Thread David Sterba
On Thu, Nov 05, 2015 at 04:33:00PM +0800, Qu Wenruo wrote:
> Add new option '-B' to show tree block without calling open_ctree.
> It's very useful to debug non-standard super or heavily damaged case.
> 
> As it needs nodesize, also adds a new option '-n' to specify nodesize.
> 
> Normal user should avoid calling it on random bytes, as it won't check
> the validation of the tree block.
> 
> Signed-off-by: Qu Wenruo 

I'm going to postpone this patch for next development cycle. It's
likely that we'll add more options so I'd like to not make it a mess
(similar to what btrfs-corrupt-block options are).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: clear bio reference after submit_one_bio()

2015-11-05 Thread Holger Hoffstätte
On 10/11/15 20:09, Alex Lyakas wrote:
> Hi Naota,
> 
> What happens if btrfs_bio_alloc() in submit_extent_page fails? Then we
> return -ENOMEM to the caller, but we do not set *bio_ret to NULL. And
> if *bio_ret was non-NULL upon entry into submit_extent_page, then we
> had submitted this bio before getting to btrfs_bio_alloc(). So should
> btrfs_bio_alloc() failure be handled in the same way?
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 3915c94..cd443bc 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2834,8 +2834,11 @@ static int submit_extent_page(int rw, struct
> extent_io_tree *tree,
> 
> bio = btrfs_bio_alloc(bdev, sector, BIO_MAX_PAGES,
> GFP_NOFS | __GFP_HIGH);
> -   if (!bio)
> +   if (!bio) {
> +   if (bio_ret)
> +   *bio_ret = NULL;
> return -ENOMEM;
> +   }
> 
> bio_add_page(bio, page, page_size, offset);
> bio->bi_end_io = end_io_func;
> 

Did you get any feedback on this? It seems it could cause data loss or
corruption on allocation failures, no?

-h

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unable to allocate for space usage in particular btrfs volume

2015-11-05 Thread Hugo Mills
On Thu, Nov 05, 2015 at 10:44:51AM +, OmegaPhil wrote:
> So a couple of gig still unaccountable but irrelevant. Thanks, problem
> solved! Although hopefully checksumming will be allowed on nocow files
> in the future as thats currently 17% of all data unprotected and will
> get worse...

   It's not likely to happen. If you're modifying data in-place (as is
the case on nodatacow files), then there's a window of vulnerability
between modifying the data and modifying the checksum records during
which the checksums are incorrect. This breaks the consistency
guarantees of the FS.

   Hugo.

-- 
Hugo Mills | Nostalgia isn't what it used to be.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


[GIT PULL] Btrfs fixes

2015-11-05 Thread fdmanana
From: Filipe Manana 

Hi Chris,

please consider the following fixes for the 4.4 merge window (they were
all previously sent to the mailing list already).

One fixes a sleep inside atomic context issue, introduced by one patch
in the integration-4.4 branch. Another two fix races regarding waiting
for qgroup rescan worker to finish and a race between the qgroup rescan
worker and unmounting the filesystem (leading to crashes). The remaining
patch fixes an issue with partial direct IO writes, which has been
introduced in the 4.0 kernel, and results either in an assertion failure
(BUG_ON) when CONFIG_BTRFS_ASSERT=y or arithmetic underflow of an inode's
outstanding extents counter (used for proper space reservation) when
assertions are disabled.

Two test cases for fstests were sent recently to cover the issues regarding
the races and the direct IO partial write regression.

Thanks.

The following changes since commit 2959a32a858a2c44bbbce83d19c158d54cc5998a:

  Btrfs: fix hole punching when using the no-holes feature (2015-11-03 07:44:20 
-0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git 
integration-4.4

for you to fetch changes up to 3b2ba7b31d56c3d8f57cd5d32b8fb5101ab446e4:

  Btrfs: fix sleeping inside atomic context in qgroup rescan worker (2015-11-05 
11:02:22 +)


Filipe Manana (3):
  Btrfs: fix extent accounting for partial direct IO writes
  Btrfs: fix race waiting for qgroup rescan worker
  Btrfs: fix sleeping inside atomic context in qgroup rescan worker

Justin Maggard (1):
  btrfs: qgroup: exit the rescan worker during umount

 fs/btrfs/disk-io.c |  3 +++
 fs/btrfs/inode.c   | 52 +---
 fs/btrfs/qgroup.c  | 13 +++--
 3 files changed, 47 insertions(+), 21 deletions(-)

-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: fix sleeping inside atomic context in qgroup rescan worker

2015-11-05 Thread fdmanana
From: Filipe Manana 

We are holding a btree path with spinning locks and then we attempt to
clone an extent buffer, which calls kmem_cache_alloc() and this function
can sleep, causing the following trace to be reported on a debug kernel:

[107118.218536] BUG: sleeping function called from invalid context at 
mm/slab.c:2871
[107118.224110] in_atomic(): 1, irqs_disabled(): 0, pid: 19148, name: 
kworker/u32:3
[107118.226120] INFO: lockdep is turned off.
[107118.226843] Preemption disabled at:[] 
btrfs_clear_lock_blocking_rw+0x96/0xea [btrfs]

[107118.229175] CPU: 3 PID: 19148 Comm: kworker/u32:3 Tainted: GW   
4.3.0-rc5-btrfs-next-17+ #1
[107118.231326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[107118.233687] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper 
[btrfs]
[107118.236835]   880424bf3b78 812566f4 

[107118.238369]  880424bf3ba0 81070664 817f1cd5 
0b37
[107118.239769]   880424bf3bc8 8107070a 
8850
[107118.241244] Call Trace:
[107118.241729]  [] dump_stack+0x4e/0x79
[107118.242602]  [] ___might_sleep+0x23a/0x241
[107118.243586]  [] __might_sleep+0x9f/0xa6
[107118.244532]  [] cache_alloc_debugcheck_before+0x25/0x36
[107118.245939]  [] kmem_cache_alloc+0x50/0x215
[107118.246930]  [] __alloc_extent_buffer+0x2a/0x11f [btrfs]
[107118.248121]  [] btrfs_clone_extent_buffer+0x3d/0xdd 
[btrfs]
[107118.249451]  [] btrfs_qgroup_rescan_worker+0x16d/0x434 
[btrfs]
[107118.250755]  [] ? arch_local_irq_save+0x9/0xc
[107118.251754]  [] normal_work_helper+0x14c/0x32a [btrfs]
[107118.252899]  [] ? normal_work_helper+0x14c/0x32a [btrfs]
[107118.254195]  [] btrfs_qgroup_rescan_helper+0x12/0x14 
[btrfs]
[107118.255436]  [] process_one_work+0x24a/0x4ac
[107118.263690]  [] worker_thread+0x206/0x2c2
[107118.264888]  [] ? rescuer_thread+0x2cb/0x2cb
[107118.267413]  [] kthread+0xef/0xf7
[107118.268417]  [] ? kthread_parkme+0x24/0x24
[107118.269505]  [] ret_from_fork+0x3f/0x70
[107118.270491]  [] ? kthread_parkme+0x24/0x24

So just use blocking locks for our path to solve this.
This fixes the patch titled:
  "btrfs: qgroup: Don't copy extent buffer to do qgroup rescan"

Signed-off-by: Filipe Manana 
---
 fs/btrfs/qgroup.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 75bb4af9..93e12c1 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2198,7 +2198,6 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, struct 
btrfs_path *path,
int slot;
int ret;
 
-   path->leave_spinning = 1;
mutex_lock(_info->qgroup_rescan_lock);
ret = btrfs_search_slot_for_read(fs_info->extent_root,
 _info->qgroup_rescan_progress,
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: test unmount during quota rescan

2015-11-05 Thread Filipe Manana
On Thu, Nov 5, 2015 at 12:50 AM, Justin Maggard  wrote:
> This test case tests if we are able to unmount a filesystem while
> a quota rescan is running.  Up to now (4.3) this would result
> in a kernel NULL pointer dereference.
>
> Fixed by patch (btrfs: qgroup: exit the rescan worker during umount).
>
> Signed-off-by: Justin Maggard 
Reviewed-by: Filipe Manana 

> ---

Btw, for future patches/versions, here after the --- you should
mention what changed between versions of the patch.

thanks

>  tests/btrfs/114 | 61 
> +
>  tests/btrfs/114.out |  2 ++
>  tests/btrfs/group   |  1 +
>  3 files changed, 64 insertions(+)
>  create mode 100644 tests/btrfs/114
>  create mode 100644 tests/btrfs/114.out
>
> diff --git a/tests/btrfs/114 b/tests/btrfs/114
> new file mode 100644
> index 000..0a0e8ba
> --- /dev/null
> +++ b/tests/btrfs/114
> @@ -0,0 +1,61 @@
> +#! /bin/bash
> +# FS QA Test No. btrfs/114
> +#
> +# btrfs quota scan/unmount sanity test
> +# Make sure that unmounting during a quota rescan doesn't crash
> +#
> +#---
> +# Copyright (c) 2015 NETGEAR, Inc.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_scratch_mount
> +
> +for i in `seq 0 1 45`; do
> +   echo -n > $SCRATCH_MNT/file.$i
> +done
> +echo 3 > /proc/sys/vm/drop_caches
> +$BTRFS_UTIL_PROG quota enable $SCRATCH_MNT
> +_scratch_unmount
> +
> +echo "Silence is golden"
> +status=0
> +exit
> diff --git a/tests/btrfs/114.out b/tests/btrfs/114.out
> new file mode 100644
> index 000..a2aa4a2
> --- /dev/null
> +++ b/tests/btrfs/114.out
> @@ -0,0 +1,2 @@
> +QA output created by 114
> +Silence is golden
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index 7cf7dd7..10ab26b 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -116,3 +116,4 @@
>  111 auto quick send
>  112 auto quick clone
>  113 auto quick compress clone
> +114 auto qgroup
> --
> 2.6.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unable to allocate for space usage in particular btrfs volume

2015-11-05 Thread OmegaPhil
On 05/11/15 04:18, Duncan wrote:
> OmegaPhil posted on Wed, 04 Nov 2015 21:53:09 + as excerpted:
> 
>> The volume doesn't change hugely over time, so it really ought not to
>> have broken so quickly - a quick rundown of the storage usage:
>>
>> 36% general (small files, some smallish videos)
>> 24% music 23% pr0n 17% VMs
>>
>> But in terms of 'large files changing', it could be the VM disks perhaps
>> -
>> I'll move them out, balance, and then back in again, hopefully that'd be
>> a meaningful test.
> 
> VM image files (and large database files, for the same reason) are a bit 
> of a problem on btrfs, and indeed, any COW-based filesystem, since the 
> random rewrite pattern matching that use-case is pretty much the absolute 
> worst-case match for a COW-based filesystem there is.
> 
> And that would be the worst-case in terms of the unsplit extents issue 
> Hugo was talking about as well.  So they may well be the problem, indeed.
> 
> Since you're not doing snapshotting (which conflicts with this option, 
> with an imperfect workaround), setting nocow on those files may well 
> eliminate the problem, but be aware if you aren't already that (1) nocow 
> does turn off checksumming as well, in ordered to avoid a race that could 
> easily lead to data corruption, and (2) you can't just activate nocow on 
> the existing file and expect it to work, the procedure is a bit more 
> complicated than that, since nocow is only guaranteed to work if it's set 
> at file creation.  Detailed instructions for #2 skipped as they've been 
> posted many times, but if you are interested and haven't seen them, ask.


Thankyou Hugo, Duncan - moving VMs out, then:

=

sudo chattr +C '/mnt/storage-1/Virtual Machines'
sudo btrfs balance start -mconvert=dup /mnt/storage-1
sudo btrfs fi defrag -rv /mnt/storage-1

=

And after moving VMs back, du reports 884GB used,

=

$ sudo btrfs fi usage /mnt/storage-1

Overall:
Device size: 953.87GiB
Device allocated:924.07GiB
Device unallocated:   29.80GiB
Device missing:  0.00B
Used:886.11GiB
Free (estimated): 65.72GiB  (min: 50.82GiB)
Data ratio:   1.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

Data,single: Size:918.01GiB, Used:882.09GiB
   /dev/sdb  918.01GiB

Metadata,DUP: Size:3.00GiB, Used:2.01GiB
   /dev/sdb6.00GiB

System,DUP: Size:32.00MiB, Used:128.00KiB
   /dev/sdb   64.00MiB

Unallocated:
   /dev/sdb   29.80GiB

=

So a couple of gig still unaccountable but irrelevant. Thanks, problem
solved! Although hopefully checksumming will be allowed on nocow files
in the future as thats currently 17% of all data unprotected and will
get worse...



signature.asc
Description: OpenPGP digital signature


Possible project idea

2015-11-05 Thread Austin S Hemmelgarn
I'd been looking at the wiki page with project ideas, and I realized 
that there were no listed ideas that suggested the adding support for 
arbitrary erasure coding methods.  Ceph for example has an option that 
allows you to set arbitrary erasure coding such that you use n devices 
to store the data, and can tolerate loss of any m devices out of it.


Technically, this would be covered as a special case by the whole 
copies/stripes/parity thing that was discussed a while back before 
raid56 code made it into the kernel (I've tried to find the thread for 
reference, but haven't been unable to locate it), but I didn't see that 
'project' listed anywhere on the page either.


IMHO, this would be an excellent feature to differentiate BTRFS from 
bcachefs and ZFS (although I would not be surprised if they both copied 
it themselves).




smime.p7s
Description: S/MIME Cryptographic Signature


Re: Regression in: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete

2015-11-05 Thread Mark Fasheh
On Wed, Nov 04, 2015 at 09:01:36AM +0800, Qu Wenruo wrote:
> 
> 
> Mark Fasheh wrote on 2015/11/03 11:26 -0800:
> >On Mon, Nov 02, 2015 at 09:34:24AM +0800, Qu Wenruo wrote:
> >>
> >>
> >>Stefan Priebe wrote on 2015/11/01 21:49 +0100:
> >>>Hi,
> >>>
> >>>this one: http://www.spinics.net/lists/linux-btrfs/msg47377.html
> >>>
> >>>adds a regression to my test systems with very large disks (30tb and 50tb).
> >>>
> >>>btrfs balance is super slow afterwards while heavily making use of cp
> >>>--reflink=always on big files (200gb - 500gb).
> >>>
> >>>Sorry didn't know how to correctly reply to that "old" message.
> >>>
> >>>Greets,
> >>>Stefan
> >>>--
> >>>To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >>>the body of a message to majord...@vger.kernel.org
> >>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>Thanks for the testing.
> >>
> >>Are you using qgroup or just doing normal balance with qgroup disabled?
> >>
> >>For the latter case, that's should be optimized to skip the dirty
> >>extent insert in qgroup disabled case.
> >>
> >>For qgroup enabled case, I'm afraid that's the design.
> >>As relocation will drop a subtree to relocate, and to ensure qgroup
> >>consistent, we must walk down all the tree blocks and mark them
> >>dirty for later qgroup accounting.
> >
> >Qu, we're always going to have to walk the tree when deleting it, this is
> >part of removing a subvolume. We've walked shared subtrees in this code for
> >numerous kernel releases without incident before it was removed in 4.2.
> >
> >Do you have any actual evidence that this is a major performance regression?
> > From our previous conversations you seemed convinced of this, without even
> >having a working subtree walk to test. I remember the hand wringing
> >about an individual commit being too heavy with the qgroup code (even though
> >I pointed out that tree walk is a restartable transaction).
> >
> >It seems that you are confused still about how we handle removing a volume
> >wrt qgroups.
> >
> >If you have questions or concerns I would be happy to explain them but
> >IMHO your statements there are opinion and not based in fact.
> 
> Yes, I don't deny it.
> But it's quite hard to prove it, as we need such a huge storage like Stefan.
> What I have is only several hundred GB test storage.
> Even accounting all my home NAS, I only have 2T, far from the
> storage Stefan has.
> 
> And what Stefan report should already give some hint about the
> performance issue.
> 
> In your word "it won't be doing anything (ok some kmalloc/free of a
> very tiny object)", it's already slowing down balance, since balance
> also use btrfs_drop_subtree().

When I wrote that I was under the impression that the qgroup code was doing
it's own sanity checking (it used to) and since Stephan had them disabled
they couldn't be causing the problem. I read your e-mail explaining that the
qgroup api was now intertwined with delayed ref locking after this one.

The same exact code ran in either case before and after your patches, so my
guess is that the issue is actually inside the qgroup code that shouldn't
have been run. I wonder if we even just filled up his memory but never
cleaned the objects. The only other thing I can think of is if
account_leaf_items() got run in a really tight loop for some reason.

Kmalloc in the way we are using it is not usually a performance issue,
especially if we've been reading off disk in the same process. Ask yourself
this - your own patch series does the same kmalloc for every qgroup
operation. Did you notice a complete and massive performance slowdown like
the one Stefan reported?

I will say that we never had this problem reported before, and
account_leaf_items() is always run in all kernels, even without qgroups
enabled. That will change with my new patch though.

What we can say for sure is that drop_snapshot in the qgroup case will read
more disk and obviously that will have a negative impact depending on what
the tree looks like. So IMHO we ought to be focusing on reducing the amount
of I/O involved.


> But we can't just ignore such "possible" performance issue just
> because old code did the same thing.(Although not the same now,
> we're marking all subtree blocks dirty other than shared one).

Well, I can't disagree with that - the only reason we are talking right now
is because you intentionally ignored the qgroup code in drop_snapshot(). So
let's start with this - no more 'fixing' code by tearing it out and replacing
it with /* TODO: somebody else re-implement this */   ;)
--Mark

--
Mark Fasheh
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] btrfs: qgroup: account shared subtree during snapshot delete

2015-11-05 Thread Mark Fasheh
Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup
mechanism.') removed our qgroup accounting during
btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
going bad shortly after a snapshot is removed.

Fix this by adding a dirty extent record when we encounter extents during
our shared subtree walk. This effectively restores the functionality we had
with the original shared subtree walking code in 1152651 (btrfs: qgroup:
account shared subtrees during snapshot delete).

The idea with the original patch (and this one) is that shared subtrees can
get skipped during drop_snapshot. The shared subtree walk then allows us a
chance to visit those extents and add them to the qgroup work for later
processing. This ultimately makes the accounting for drop snapshot work.

The new qgroup code nicely handles all the other extents during the tree
walk via the ref dec/inc functions so we don't have to add actions beyond
what we had originally.

Signed-off-by: Mark Fasheh 
---
 fs/btrfs/extent-tree.c | 47 ---
 fs/btrfs/qgroup.c  |  2 ++
 2 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 601d7d4..410b46d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7850,21 +7850,47 @@ reada:
 }
 
 /*
- * TODO: Modify related function to add related node/leaf to dirty_extent_root,
- * for later qgroup accounting.
- *
- * Current, this function does nothing.
+ * These may not be seen by the usual inc/dec ref code so we have to
+ * add them here.
  */
+static int record_one_subtree_extent(struct btrfs_trans_handle *trans,
+struct btrfs_root *root, u64 bytenr,
+u64 num_bytes)
+{
+   struct btrfs_qgroup_extent_record *qrecord;
+   struct btrfs_delayed_ref_root *delayed_refs;
+
+   qrecord = kmalloc(sizeof(*qrecord), GFP_NOFS);
+   if (!qrecord)
+   return -ENOMEM;
+
+   qrecord->bytenr = bytenr;
+   qrecord->num_bytes = num_bytes;
+   qrecord->old_roots = NULL;
+
+   delayed_refs = >transaction->delayed_refs;
+   spin_lock(_refs->lock);
+   if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
+   kfree(qrecord);
+   spin_unlock(_refs->lock);
+
+   return 0;
+}
+
 static int account_leaf_items(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  struct extent_buffer *eb)
 {
int nr = btrfs_header_nritems(eb);
-   int i, extent_type;
+   int i, extent_type, ret;
struct btrfs_key key;
struct btrfs_file_extent_item *fi;
u64 bytenr, num_bytes;
 
+   /* We can be called directly from walk_up_proc() */
+   if (!root->fs_info->quota_enabled)
+   return 0;
+
for (i = 0; i < nr; i++) {
btrfs_item_key_to_cpu(eb, , i);
 
@@ -7883,6 +7909,10 @@ static int account_leaf_items(struct btrfs_trans_handle 
*trans,
continue;
 
num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
+
+   ret = record_one_subtree_extent(trans, root, bytenr, num_bytes);
+   if (ret)
+   return ret;
}
return 0;
 }
@@ -7951,8 +7981,6 @@ static int adjust_slots_upwards(struct btrfs_root *root,
 
 /*
  * root_eb is the subtree root and is locked before this function is called.
- * TODO: Modify this function to mark all (including complete shared node)
- * to dirty_extent_root to allow it get accounted in qgroup.
  */
 static int account_shared_subtree(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
@@ -8030,6 +8058,11 @@ walk_down:
btrfs_tree_read_lock(eb);
btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
+
+   ret = record_one_subtree_extent(trans, root, 
child_bytenr,
+   root->nodesize);
+   if (ret)
+   goto out;
}
 
if (level == 0) {
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index b068209..ce1cdcf 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1461,6 +1461,8 @@ struct btrfs_qgroup_extent_record
struct btrfs_qgroup_extent_record *entry;
u64 bytenr = record->bytenr;
 
+   assert_spin_locked(_refs->lock);
+
trace_btrfs_qgroup_insert_dirty_extent(record);
 
while (*p) {
-- 
2.1.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] btrfs: Add qgroup tracing

2015-11-05 Thread Mark Fasheh
This patch adds tracepoints to the qgroup code on both the reporting side
(insert_dirty_extents) and the accounting side. Taken together it allows us
to see what qgroup operations have happened, and what their result was.

Signed-off-by: Mark Fasheh 
---
 fs/btrfs/qgroup.c| 10 +
 include/trace/events/btrfs.h | 88 +++-
 2 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index d904ee1..b068209 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1461,6 +1461,8 @@ struct btrfs_qgroup_extent_record
struct btrfs_qgroup_extent_record *entry;
u64 bytenr = record->bytenr;
 
+   trace_btrfs_qgroup_insert_dirty_extent(record);
+
while (*p) {
parent_node = *p;
entry = rb_entry(parent_node, struct btrfs_qgroup_extent_record,
@@ -1591,6 +1593,9 @@ static int qgroup_update_counters(struct btrfs_fs_info 
*fs_info,
cur_old_count = btrfs_qgroup_get_old_refcnt(qg, seq);
cur_new_count = btrfs_qgroup_get_new_refcnt(qg, seq);
 
+   trace_qgroup_update_counters(qg->qgroupid, cur_old_count,
+cur_new_count);
+
/* Rfer update part */
if (cur_old_count == 0 && cur_new_count > 0) {
qg->rfer += num_bytes;
@@ -1684,6 +1689,9 @@ btrfs_qgroup_account_extent(struct btrfs_trans_handle 
*trans,
goto out_free;
BUG_ON(!fs_info->quota_root);
 
+   trace_btrfs_qgroup_account_extent(bytenr, num_bytes, nr_old_roots,
+ nr_new_roots);
+
qgroups = ulist_alloc(GFP_NOFS);
if (!qgroups) {
ret = -ENOMEM;
@@ -1753,6 +1761,8 @@ int btrfs_qgroup_account_extents(struct 
btrfs_trans_handle *trans,
record = rb_entry(node, struct btrfs_qgroup_extent_record,
  node);
 
+   trace_btrfs_qgroup_account_extents(record);
+
if (!ret) {
/*
 * Use (u64)-1 as time_seq to do special search, which
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 0b73af9..9d7b545 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -23,7 +23,7 @@ struct map_lookup;
 struct extent_buffer;
 struct btrfs_work;
 struct __btrfs_workqueue;
-struct btrfs_qgroup_operation;
+struct btrfs_qgroup_extent_record;
 
 #define show_ref_type(type)\
__print_symbolic(type,  \
@@ -1117,6 +1117,92 @@ DEFINE_EVENT(btrfs__workqueue_done, 
btrfs_workqueue_destroy,
TP_ARGS(wq)
 );
 
+DECLARE_EVENT_CLASS(btrfs_qgroup_extent,
+   TP_PROTO(struct btrfs_qgroup_extent_record *rec),
+
+   TP_ARGS(rec),
+
+   TP_STRUCT__entry(
+   __field(u64,  bytenr)
+   __field(u64,  num_bytes )
+   ),
+
+   TP_fast_assign(
+   __entry->bytenr = rec->bytenr,
+   __entry->num_bytes  = rec->num_bytes;
+   ),
+
+   TP_printk("bytenr = %llu, num_bytes = %llu",
+ (unsigned long long)__entry->bytenr,
+ (unsigned long long)__entry->num_bytes)
+);
+
+DEFINE_EVENT(btrfs_qgroup_extent, btrfs_qgroup_account_extents,
+
+   TP_PROTO(struct btrfs_qgroup_extent_record *rec),
+
+   TP_ARGS(rec)
+);
+
+DEFINE_EVENT(btrfs_qgroup_extent, btrfs_qgroup_insert_dirty_extent,
+
+   TP_PROTO(struct btrfs_qgroup_extent_record *rec),
+
+   TP_ARGS(rec)
+);
+
+TRACE_EVENT(btrfs_qgroup_account_extent,
+
+   TP_PROTO(u64 bytenr, u64 num_bytes, u64 nr_old_roots, u64 nr_new_roots),
+
+   TP_ARGS(bytenr, num_bytes, nr_old_roots, nr_new_roots),
+
+   TP_STRUCT__entry(
+   __field(u64,  bytenr)
+   __field(u64,  num_bytes )
+   __field(u64,  nr_old_roots  )
+   __field(u64,  nr_new_roots  )
+   ),
+
+   TP_fast_assign(
+   __entry->bytenr = bytenr;
+   __entry->num_bytes  = num_bytes;
+   __entry->nr_old_roots   = nr_old_roots;
+   __entry->nr_new_roots   = nr_new_roots;
+   ),
+
+   TP_printk("bytenr = %llu, num_bytes = %llu, nr_old_roots = %llu, "
+ "nr_new_roots = %llu",
+ __entry->bytenr,
+ __entry->num_bytes,
+ __entry->nr_old_roots,
+ __entry->nr_new_roots)
+);
+
+TRACE_EVENT(qgroup_update_counters,
+
+   TP_PROTO(u64 qgid, u64 cur_old_count, u64 cur_new_count),
+
+   TP_ARGS(qgid, cur_old_count, cur_new_count),
+
+   TP_STRUCT__entry(
+   __field(u64,  qgid  

[PATCH 0/3] btrfs: update qgroups in drop snapshot, V2

2015-11-05 Thread Mark Fasheh
Hi,

The following 3 patches fix a regression introduced in Linux
4.2 where btrfs_drop_snapshot() wasn't updating qgroups, resulting in
them going bad.

The original e-mail pointing this out is below:

http://www.spinics.net/lists/linux-btrfs/msg46093.html

The first patch is from Josef and fix bugs in our counting of
roots (which is critical for qgroups to work correctly). It was previously
sent to the list:

http://www.spinics.net/lists/linux-btrfs/msg47035.html

Truth be told, most of the time fixing this was spent figuring out that and
another issue (which has also been fixed).  Once I realized I was seeing a
bug and we fixed it correctly, my drop snapshot patch got dramatically
smaller.

I also re-added some of the tracing in qgroup.c that we recently
lost. It is again possible to debug qgroup operations on a live
system, allowing us to find issues like the two above by narrowing
down our operations and manually walking through them via
cat sys/debug/tracing.

The entire patch series can be tested in xfstests test btrfs/104.

Thanks,
--Mark

Changes from V1, thanks to Qu for his comments:
  - lock around call to btrfs_qgroup_insert_dirty_extent()
  - check whether qgroups are enabled in account_leaf_items()
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] Btrfs: use btrfs_get_fs_root in resolve_indirect_ref

2015-11-05 Thread Mark Fasheh
From: Josef Bacik 

The backref code will look up the fs_root we're trying to resolve our indirect
refs for, unfortunately we use btrfs_read_fs_root_no_name, which returns -ENOENT
if the ref is 0.  This isn't helpful for the qgroup stuff with snapshot delete
as it won't be able to search down the snapshot we are deleting, which will
cause us to miss roots.  So use btrfs_get_fs_root and send false for check_ref
so we can always get the root we're looking for.  Thanks,

Signed-off-by: Josef Bacik 
Signed-off-by: Mark Fasheh 
---
 fs/btrfs/backref.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 9a2ec79..0e9da72 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -355,7 +355,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info 
*fs_info,
 
index = srcu_read_lock(_info->subvol_srcu);
 
-   root = btrfs_read_fs_root_no_name(fs_info, _key);
+   root = btrfs_get_fs_root(fs_info, _key, false);
if (IS_ERR(root)) {
srcu_read_unlock(_info->subvol_srcu, index);
ret = PTR_ERR(root);
-- 
2.1.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs

2015-11-05 Thread Jeff Mahoney
On 9/30/15 3:57 PM, Roman Lebedev wrote:
> Hello.
> 
> My / is btrfs.
> To do some my local stuff more cleanly i wanted to use overlayfs, 
> but it didn't quite work.
> 
> Simple non-automatic sequence to reproduce the issue:
>  mkdir lower upper work merged
>  mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
>  vi merged/file
>  :wq

Filipe and I got a chance to look into this today.  The crash is due to
commit 4bacc9c9234 (overlayfs: Make f_path always point to the overlay
and f_inode to the underlay)  Incidentally, the test case is as simple
as ":> file ; fsync file" after mounting.

The short version is that after this commit, we see:

file->f_mapping->host = 
file->f_inode = 
file->f_path.dentry->d_inode = 

So now file_operations callbacks can't assume that file->f_path.dentry
belongs to the same file system that implements the callback.  More than
that, any code that could ultimately get a dentry that comes from an
open file can't trust that it's from the same file system.

This crash is due to this issue.  Unlike xfs and ext2/3/4, we use
file->f_path.dentry->d_inode to resolve the inode.  Using file_inode()
is an easy enough fix here, but we run into trouble later.  We have
logic in the btrfs fsync() call path (check_parent_dirs_for_sync) that
walks back up the dentry chain examining the inode's last transaction
and last unlink transaction to determine whether a full transaction
commit is required.  This obviously doesn't work if we're walking the
overlayfs path instead.  Regardless of any argument over whether that's
doing the right thing, it's a pretty common pattern to assume that
file->f_path.dentry comes from the same file system when using a
file_operation.  Is it intended that that assumption is no longer valid?

-Jeff

> Results in vi being killed on exit, and the following trace appears in dmesg:
> 
> [34304.047841] BUG: unable to handle kernel paging request at 09618e56
> [34304.047846] IP: [] btrfs_sync_file+0xa6/0x350 [btrfs]
> [34304.047864] PGD 0 
> [34304.047866] Oops: 0002 [#12] SMP 
> [34304.047867] Modules linked in: overlay cpufreq_userspace cpufreq_stats 
> cpufreq_powersave cpufreq_conservative binfmt_misc nfsd auth_rpcgss 
> oid_registry nfs_acl nfs lockd grace fscache sunrpc fglrx(PO) nls_utf8 joydev 
> nls_cp437 vfat fat hid_generic usbhid kvm_amd hid kvm crct10dif_pclmul 
> crc32_pclmul ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic 
> snd_hda_codec_hdmi sha256_ssse3 sha256_generic snd_hda_intel snd_hda_codec 
> hmac drbg ansi_cprng aesni_intel snd_hda_core aes_x86_64 mxm_wmi snd_hwdep 
> lrw eeepc_wmi snd_pcm gf128mul asus_wmi sparse_keymap rfkill video snd_timer 
> glue_helper sp5100_tco evdev ablk_helper e1000e ohci_pci pcspkr snd ohci_hcd 
> xhci_pci edac_mce_amd ehci_pci serio_raw xhci_hcd soundcore fam15h_power 
> ehci_hcd cryptd edac_core ptp pps_core usbcore k10temp i2c_piix4
> [34304.047893]  sg usb_common acpi_cpufreq wmi tpm_infineon button processor 
> shpchp tpm_tis tpm thermal_sys tcp_yeah tcp_vegas it87 hwmon_vid loop 
> parport_pc ppdev lp parport autofs4 crc32c_generic btrfs xor raid6_pq sd_mod 
> crc32c_intel ahci libahci libata scsi_mod
> [34304.047905] CPU: 4 PID: 13990 Comm: vi Tainted: P  DO
> 4.2.0-1-amd64 #1 Debian 4.2.1-2
> [34304.047906] Hardware name: To be filled by O.E.M. To be filled by 
> O.E.M./CROSSHAIR V FORMULA-Z, BIOS 2201 03/23/2015
> [34304.047908] task: 8803d5f7f2c0 ti: 8806a3ec8000 task.ti: 
> 8806a3ec8000
> [34304.047909] RIP: 0010:[]  [] 
> btrfs_sync_file+0xa6/0x350 [btrfs]
> [34304.047920] RSP: 0018:8806a3ecbe88  EFLAGS: 00010246
> [34304.047921] RAX: 8803d5f7f2c0 RBX: 8807b2d46600 RCX: 
> 81a6ad00
> [34304.047922] RDX: 8000 RSI:  RDI: 
> 8807c19f8970
> [34304.047923] RBP: 8807c19f8970 R08:  R09: 
> 0001
> [34304.047924] R10:  R11: 0246 R12: 
> 8807c19f88c8
> [34304.047925] R13:  R14: 09618b22 R15: 
> 55cb20184a70
> [34304.047926] FS:  7f31c5492800() GS:88082fd0() 
> knlGS:
> [34304.047927] CS:  0010 DS:  ES:  CR0: 80050033
> [34304.047928] CR2: 09618e56 CR3: 00044af44000 CR4: 
> 000406e0
> [34304.047929] Stack:
> [34304.047930]  0001 7fff 880403d5b918 
> 8000
> [34304.047932]    55cb20186d40 
> 8807b2d46600
> [34304.047933]  0004 88044b249000 0020 
> 8807b2d46600
> [34304.047935] Call Trace:
> [34304.047939]  [] ? do_fsync+0x38/0x60
> [34304.047940]  [] ? SyS_fsync+0x10/0x20
> [34304.047943]  [] ? system_call_fast_compare_end+0xc/0x6b
> [34304.047944] Code: 49 8b 0f 48 85 c9 75 e9 eb b3 48 8b 44 24 08 49 8d ac 24 
> a8 00 00 00 48 89 ef 4c 29 e8 48 83 c0 01 48 89 44 24 18 e8 3a 59 3e e1  
> 41 ff 86 34 03 00 00 49 8b 84 24 70 ff ff 

Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-05 Thread Zoiled

Duncan wrote:

Austin S Hemmelgarn posted on Wed, 04 Nov 2015 13:45:37 -0500 as
excerpted:


On 2015-11-04 13:01, Janos Toth F. wrote:

But the worst part is that there are some ISO files which were
seemingly copied without errors but their external checksums (the one
which I can calculate with md5sum and compare to the one supplied by
the publisher of the ISO file) don't match!
Well... this, I cannot understand.
How could these files become corrupt from a single disk failure? And
more importantly: how could these files be copied without errors? Why
didn't Btrfs gave a read error when the checksums didn't add up?

If you can prove that there was a checksum mismatch and BTRFS returned
invalid data instead of a read error or going to the other disk, then
that is a very serious bug that needs to be fixed.  You need to keep in
mind also however that it's completely possible that the data was bad
before you wrote it to the filesystem, and if that's the case, there's
nothing any filesystem can do to fix it for you.

As Austin suggests, if btrfs is returning data, and you haven't turned
off checksumming with nodatasum or nocow, then it's almost certainly
returning the data it was given to write out in the first place.  Whether
that data it was given to write out was correct, however, is an
/entirely/ different matter.

If ISOs are failing their external checksums, then something is going
on.  Had you verified the external checksums when you first got the
files?  That is, are you sure the files were correct as downloaded and/or
ripped?

Where were the ISOs stored between original procurement/validation and
writing to btrfs?  Is it possible you still have some/all of them on that
media?  Do they still external-checksum-verify there?

Basically, assuming btrfs checksums are validating, there's three other
likely possibilities for where the corruption could have come from before
writing to btrfs.  Either the files were bad as downloaded or otherwise
procured -- which is why I asked whether you verified them upon receipt
-- or you have memory that's going bad, or your temporary storage is
going bad, before the files ever got written to btrfs.

The memory going bad is a particularly worrying possibility,
considering...


Now I am really considering to move from Linux to Windows and from
Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a new
motherboard with more native SATA connectors and an extra HDD). That
one seemed to actually do what it promises (abort any read operations
upon checksum errors [which always happens seamlessly on every read]
but look at the redundant data first and seamlessly "self-heal" if
possible). The only thing which made Btrfs to look as a better
alternative was the RAID-5 support. But I recently experienced two
cases of 1 drive failing of 3 and it always tuned out as a smaller or
bigger disaster (completely lost data or inconsistent data).

Have you considered looking into ZFS?  I hate to suggest it as an
alternative to BTRFS, but it's a much more mature and well tested
technology than ReFS, and has many of the same features as BTRFS (and
even has the option for triple parity instead of the double you get with
RAID6).  If you do consider ZFS, make a point to look at FreeBSD in
addition to the Linux version, the BSD one was a much better written
port of the original Solaris drivers, and has better performance in many
cases (and as much as I hate to admit it, BSD is way more reliable than
Linux in most use cases).

You should also seriously consider whether the convenience of having a
filesystem that fixes internal errors itself with no user intervention
is worth the risk of it corrupting your data.  Returning correct data
whenever possible is one thing, being 'self-healing' is completely
different.  When you start talking about things that automatically fix
internal errors without user intervention is when most seasoned system
administrators start to get really nervous.  Self correcting systems
have just as much chance to make things worse as they do to make things
better, and most of them depend on the underlying hardware working
correctly to actually provide any guarantee of reliability.

I too would point you at ZFS, but there's one VERY BIG caveat, and one
related smaller one!

The people who have a lot of ZFS experience say it's generally quite
reliable, but gobs of **RELIABLE** memory are *absolutely* *critical*!
The self-healing works well, *PROVIDED* memory isn't producing errors.
Absolutely reliable memory is in fact *so* critical, that running ZFS on
non-ECC memory is severely discouraged as a very real risk to your data.

Which is why the above hints that your memory may be bad are so
worrying.  Don't even *THINK* about ZFS, particularly its self-healing
features, if you're not absolutely sure your memory is 100% reliable,
because apparently, based on the comment's I've seen, if it's not, you

Re: Regression in: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete

2015-11-05 Thread Mark Fasheh
On Fri, Nov 06, 2015 at 09:02:13AM +0800, Qu Wenruo wrote:
> >The same exact code ran in either case before and after your patches, so my
> >guess is that the issue is actually inside the qgroup code that shouldn't
> >have been run. I wonder if we even just filled up his memory but never
> >cleaned the objects. The only other thing I can think of is if
> >account_leaf_items() got run in a really tight loop for some reason.
> >
> >Kmalloc in the way we are using it is not usually a performance issue,
> >especially if we've been reading off disk in the same process. Ask yourself
> >this - your own patch series does the same kmalloc for every qgroup
> >operation. Did you notice a complete and massive performance slowdown like
> >the one Stefan reported?
> 
> You're right, such memory allocation may impact performance but not
> so noticeable, compared to other operations which may kick disk IO,
> like btrfs_find_all_roots().
> 
> But at least, enabling qgroup will impact performance.
> 
> Yeah, this time I has test data now.
> In a environment with 100 different snapshot, sysbench shows an
> overall performance drop about 5%, and in some case, up to 7%, with
> qgroup enabled.
> 
> Not sure about the kmalloc impact, maybe less than 1% or maybe 2~3%,
> but at least it's worthy trying to use kmem cache.

Ok cool, what'd you do to generate the snapshots? I can try a similar test
on one of my machines and see what I get. I'm not surprised that the
overhead is noticable, and I agree it's easy enough to try things like
replacing the allocation once we have a test going.

Thanks,
--Mark

--
Mark Fasheh
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Regression in: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete

2015-11-05 Thread Qu Wenruo



Mark Fasheh wrote on 2015/11/05 11:23 -0800:

On Wed, Nov 04, 2015 at 09:01:36AM +0800, Qu Wenruo wrote:



Mark Fasheh wrote on 2015/11/03 11:26 -0800:

On Mon, Nov 02, 2015 at 09:34:24AM +0800, Qu Wenruo wrote:



Stefan Priebe wrote on 2015/11/01 21:49 +0100:

Hi,

this one: http://www.spinics.net/lists/linux-btrfs/msg47377.html

adds a regression to my test systems with very large disks (30tb and 50tb).

btrfs balance is super slow afterwards while heavily making use of cp
--reflink=always on big files (200gb - 500gb).

Sorry didn't know how to correctly reply to that "old" message.

Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Thanks for the testing.

Are you using qgroup or just doing normal balance with qgroup disabled?

For the latter case, that's should be optimized to skip the dirty
extent insert in qgroup disabled case.

For qgroup enabled case, I'm afraid that's the design.
As relocation will drop a subtree to relocate, and to ensure qgroup
consistent, we must walk down all the tree blocks and mark them
dirty for later qgroup accounting.


Qu, we're always going to have to walk the tree when deleting it, this is
part of removing a subvolume. We've walked shared subtrees in this code for
numerous kernel releases without incident before it was removed in 4.2.

Do you have any actual evidence that this is a major performance regression?
 From our previous conversations you seemed convinced of this, without even
having a working subtree walk to test. I remember the hand wringing
about an individual commit being too heavy with the qgroup code (even though
I pointed out that tree walk is a restartable transaction).

It seems that you are confused still about how we handle removing a volume
wrt qgroups.

If you have questions or concerns I would be happy to explain them but
IMHO your statements there are opinion and not based in fact.


Yes, I don't deny it.
But it's quite hard to prove it, as we need such a huge storage like Stefan.
What I have is only several hundred GB test storage.
Even accounting all my home NAS, I only have 2T, far from the
storage Stefan has.

And what Stefan report should already give some hint about the
performance issue.

In your word "it won't be doing anything (ok some kmalloc/free of a
very tiny object)", it's already slowing down balance, since balance
also use btrfs_drop_subtree().


When I wrote that I was under the impression that the qgroup code was doing
it's own sanity checking (it used to) and since Stephan had them disabled
they couldn't be causing the problem. I read your e-mail explaining that the
qgroup api was now intertwined with delayed ref locking after this one.


My fault, as btrfs_qgroup_mark_exntent_dirty() is an exception which 
doesn't have the qgroup status check and depend on existing locks.




The same exact code ran in either case before and after your patches, so my
guess is that the issue is actually inside the qgroup code that shouldn't
have been run. I wonder if we even just filled up his memory but never
cleaned the objects. The only other thing I can think of is if
account_leaf_items() got run in a really tight loop for some reason.

Kmalloc in the way we are using it is not usually a performance issue,
especially if we've been reading off disk in the same process. Ask yourself
this - your own patch series does the same kmalloc for every qgroup
operation. Did you notice a complete and massive performance slowdown like
the one Stefan reported?


You're right, such memory allocation may impact performance but not so 
noticeable, compared to other operations which may kick disk IO, like 
btrfs_find_all_roots().


But at least, enabling qgroup will impact performance.

Yeah, this time I has test data now.
In a environment with 100 different snapshot, sysbench shows an overall 
performance drop about 5%, and in some case, up to 7%, with qgroup enabled.


Not sure about the kmalloc impact, maybe less than 1% or maybe 2~3%, but 
at least it's worthy trying to use kmem cache.




I will say that we never had this problem reported before, and
account_leaf_items() is always run in all kernels, even without qgroups
enabled. That will change with my new patch though.

What we can say for sure is that drop_snapshot in the qgroup case will read
more disk and obviously that will have a negative impact depending on what
the tree looks like. So IMHO we ought to be focusing on reducing the amount
of I/O involved.


Totally agree.

Thanks,
Qu





But we can't just ignore such "possible" performance issue just
because old code did the same thing.(Although not the same now,
we're marking all subtree blocks dirty other than shared one).


Well, I can't disagree with that - the only reason we are talking right now
is because you intentionally ignored the qgroup code in drop_snapshot(). So

Re: Regression in: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete

2015-11-05 Thread Qu Wenruo



Mark Fasheh wrote on 2015/11/05 19:15 -0800:

On Fri, Nov 06, 2015 at 09:02:13AM +0800, Qu Wenruo wrote:

The same exact code ran in either case before and after your patches, so my
guess is that the issue is actually inside the qgroup code that shouldn't
have been run. I wonder if we even just filled up his memory but never
cleaned the objects. The only other thing I can think of is if
account_leaf_items() got run in a really tight loop for some reason.

Kmalloc in the way we are using it is not usually a performance issue,
especially if we've been reading off disk in the same process. Ask yourself
this - your own patch series does the same kmalloc for every qgroup
operation. Did you notice a complete and massive performance slowdown like
the one Stefan reported?


You're right, such memory allocation may impact performance but not
so noticeable, compared to other operations which may kick disk IO,
like btrfs_find_all_roots().

But at least, enabling qgroup will impact performance.

Yeah, this time I has test data now.
In a environment with 100 different snapshot, sysbench shows an
overall performance drop about 5%, and in some case, up to 7%, with
qgroup enabled.

Not sure about the kmalloc impact, maybe less than 1% or maybe 2~3%,
but at least it's worthy trying to use kmem cache.


Ok cool, what'd you do to generate the snapshots? I can try a similar test
on one of my machines and see what I get. I'm not surprised that the
overhead is noticable, and I agree it's easy enough to try things like
replacing the allocation once we have a test going.

Thanks,
--Mark


Doing fsstress in a subvolume with 4 threads, creating a snapshot of 
that subvolume about every 5 seconds.


And do sysbench inside the 50th snapshot.

Such test takes both overhead of btrfs_find_all_roots() and kmalloc().
So I'm not sure which overhead is bigger.

Thanks,
Qu


--
Mark Fasheh


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs

2015-11-05 Thread Jeff Mahoney
On 11/5/15 10:18 PM, Al Viro wrote:
> On Thu, Nov 05, 2015 at 09:57:35PM -0500, Jeff Mahoney wrote:
> 
>> So now file_operations callbacks can't assume that file->f_path.dentry
>> belongs to the same file system that implements the callback.  More than
>> that, any code that could ultimately get a dentry that comes from an
>> open file can't trust that it's from the same file system.
> 
> Use file_inode() for inode.
> 
>> This crash is due to this issue.  Unlike xfs and ext2/3/4, we use
>> file->f_path.dentry->d_inode to resolve the inode.  Using file_inode()
>> is an easy enough fix here, but we run into trouble later.  We have
>> logic in the btrfs fsync() call path (check_parent_dirs_for_sync) that
>> walks back up the dentry chain examining the inode's last transaction
>> and last unlink transaction to determine whether a full transaction
>> commit is required.  This obviously doesn't work if we're walking the
>> overlayfs path instead.  Regardless of any argument over whether that's
>> doing the right thing, it's a pretty common pattern to assume that
>> file->f_path.dentry comes from the same file system when using a
>> file_operation.  Is it intended that that assumption is no longer valid?
> 
> It's actually rare, and your example is a perfect demonstration of the
> reasons why it is so rare.  What's to protect btrfs_log_dentry_safe()
> from racing with rename(2)?  Sure, you do dget_parent().  Which protects
> you from having one-time parent dentry freed under you.  What it doesn't
> do is making any promises about its relationship with your file.

I suppose the irony here is that, AFAIK, that code is to ensure a file
doesn't get lost between transactions due to rename.

Isn't the file->f_path.dentry relationship stable otherwise, though? The
name might change and the parent might change but the dentry that the
file points to won't.

I did find a few other places where that assumption happens without any
questionable traversals.  Sure, all three are in file systems unlikely
to be used with overlayfs.

ocfs2_prepare_inode_for_write uses file->f_path.dentry for
should_remove_suid (due to needing to do it early since cluster locking
is unknown in setattr, according to the commit).  Having
should_remove_suid operate on an inode would solve that easily.

fat_ioctl_set_attributes uses it to call fat_setattr, but that only uses
the inode and could have the inode_operation use a wrapper.

cifs_new_fileinfo keeps a reference to the dentry but it seems to be
used mostly to access the inode except for the nasty-looking call to
build_path_from_dentry in cifs_reopen_file, which I won't be touching.
That does look like a questionable traversal, especially with the "we
can't take the rename lock here" comment.

-Jeff

-- 
Jeff Mahoney



signature.asc
Description: OpenPGP digital signature


Re: kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs

2015-11-05 Thread Al Viro
On Thu, Nov 05, 2015 at 09:57:35PM -0500, Jeff Mahoney wrote:

> So now file_operations callbacks can't assume that file->f_path.dentry
> belongs to the same file system that implements the callback.  More than
> that, any code that could ultimately get a dentry that comes from an
> open file can't trust that it's from the same file system.

Use file_inode() for inode.

> This crash is due to this issue.  Unlike xfs and ext2/3/4, we use
> file->f_path.dentry->d_inode to resolve the inode.  Using file_inode()
> is an easy enough fix here, but we run into trouble later.  We have
> logic in the btrfs fsync() call path (check_parent_dirs_for_sync) that
> walks back up the dentry chain examining the inode's last transaction
> and last unlink transaction to determine whether a full transaction
> commit is required.  This obviously doesn't work if we're walking the
> overlayfs path instead.  Regardless of any argument over whether that's
> doing the right thing, it's a pretty common pattern to assume that
> file->f_path.dentry comes from the same file system when using a
> file_operation.  Is it intended that that assumption is no longer valid?

It's actually rare, and your example is a perfect demonstration of the
reasons why it is so rare.  What's to protect btrfs_log_dentry_safe()
from racing with rename(2)?  Sure, you do dget_parent().  Which protects
you from having one-time parent dentry freed under you.  What it doesn't
do is making any promises about its relationship with your file.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Urgent Business Proposal‏‏

2015-11-05 Thread Joe Adams



Attached is the full details of the Business Proposal view it and get back to 
me immediately for more details.


Project Investment Deal.pdf
Description: Adobe PDF document


Re: [GIT PULL] Btrfs fixes

2015-11-05 Thread Chris Mason
On Thu, Nov 05, 2015 at 11:20:37AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Hi Chris,
> 
> please consider the following fixes for the 4.4 merge window (they were
> all previously sent to the mailing list already).
> 
> One fixes a sleep inside atomic context issue, introduced by one patch
> in the integration-4.4 branch. Another two fix races regarding waiting
> for qgroup rescan worker to finish and a race between the qgroup rescan
> worker and unmounting the filesystem (leading to crashes). The remaining
> patch fixes an issue with partial direct IO writes, which has been
> introduced in the 4.0 kernel, and results either in an assertion failure
> (BUG_ON) when CONFIG_BTRFS_ASSERT=y or arithmetic underflow of an inode's
> outstanding extents counter (used for proper space reservation) when
> assertions are disabled.
> 
> Two test cases for fstests were sent recently to cover the issues regarding
> the races and the direct IO partial write regression.

Great, thanks Filipe.

I'll send these for my second pull (probably next Wed).

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html