Re: btrfs btree_ctree_super fault

2016-11-16 Thread Chris Cui
We have just encountered the same bug on 4.9.0-rc2. Is there a fix available yet?

> kernel BUG at fs/btrfs/ctree.c:3172!
> invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU: 0 PID: 22702 Comm: trinity-c40 Not tainted 4.9.0-rc4-think+ #1
> task: 8804ffde37c0 task.stack: c90002188000
> RIP: 0010:[]
>   [] btrfs_set_item_key_safe+0x179/0x190 [btrfs]
> RSP: :c9000218b8a8  EFLAGS: 00010246
> RAX:  RBX: 8804fddcf348 RCX: 1000
> RDX:  RSI: c9000218b9ce RDI: c9000218b8c7
> RBP: c9000218b908 R08: 4000 R09: c9000218b8c8
> R10:  R11: 0001 R12: c9000218b8b6
> R13: c9000218b9ce R14: 0001 R15: 880480684a88
> FS:  7f7c7f998b40() GS:88050780() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2:  CR3: 00044f15f000 CR4: 001406f0
> DR0: 7f4ce439d000 DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0600
> Stack:
>  88050143 d305a00a2245 006c0002 0510
>  6c0002d3 1000 6427eebb 880480684a88
>   8804fddcf348 2000 
> Call Trace:
>  [] __btrfs_drop_extents+0xb00/0xe30 [btrfs]
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] generic: test concurrent non-overlapping direct I/O on the same extents

2016-11-16 Thread Eryu Guan
On Wed, Nov 16, 2016 at 04:29:34PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> There have been a couple of logic bugs in `btrfs_get_extent()` which
> could lead to spurious -EEXIST errors from read or write. This test
> exercises those conditions by having two threads race to add an extent
> to the extent map.
> 
> This is fixed by Linux commit 8dff9c853410 ("Btrfs: deal with duplciates
> during extent_map insertion in btrfs_get_extent") and the patch "Btrfs:
> deal with existing encompassing extent map in btrfs_get_extent()"
> (http://marc.info/?l=linux-btrfs&m=147873402311143&w=2).
> 
> Although the bug is Btrfs-specific, nothing about the test is.
> 
> Signed-off-by: Omar Sandoval 
> ---
[snip]
> +# real QA test starts here
> +
> +_supported_fs generic
> +_supported_os Linux
> +_require_test
> +_require_xfs_io_command "falloc"
> +_require_test_program "dio-interleaved"
> +
> +extent_size="$(($(stat -f -c '%S' "$TEST_DIR") * 2))"

There's a helper to get fs block size: "get_block_size".
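[Editorial note: a sketch of what the suggested change would look like. The helper is assumed to be spelled `_get_block_size` in fstests' common/rc at the time; outside fstests the same number can be read with stat(1), as the original patch does.]

```shell
# Inside the test, using the fstests helper (name assumed; adjust to your tree):
#
#   extent_size=$(($(_get_block_size "$TEST_DIR") * 2))
#
# Standalone equivalent with stat(1): %S is the fundamental block size.
blksz=$(stat -f -c '%S' /)
extent_size=$((blksz * 2))
echo "block size: $blksz, extent size: $extent_size"
```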

> +num_extents=1024
> +testfile="$TEST_DIR/$$-testfile"
> +
> +truncate -s 0 "$testfile"

I prefer using xfs_io to do the truncate:

$XFS_IO_PROG -fc "truncate 0" "$testfile"

In rare cases truncate(1) may be unavailable, e.g. on RHEL5. Usually that's
not a big issue, but xfs_io works all the time, so we have a better way --
why not :)

> +for ((off = 0; off < num_extents * extent_size; off += extent_size)); do
> + xfs_io -c "falloc $off $extent_size" "$testfile"

Use $XFS_IO_PROG not bare xfs_io.

I can fix all the tiny issues at commit time.

Thanks,
Eryu

> +done
> +
> +# To reproduce the Btrfs bug, the extent map must not be cached in memory.
> +sync
> +echo 3 > /proc/sys/vm/drop_caches
> +
> +"$here/src/dio-interleaved" "$extent_size" "$num_extents" "$testfile"
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/390.out b/tests/generic/390.out
> new file mode 100644
> index 000..3c7b405
> --- /dev/null
> +++ b/tests/generic/390.out
> @@ -0,0 +1,2 @@
> +QA output created by 390
> +Silence is golden
> diff --git a/tests/generic/group b/tests/generic/group
> index 08007d7..d137d01 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -392,3 +392,4 @@
>  387 auto clone
>  388 auto log metadata
>  389 auto quick acl
> +390 auto quick rw
> -- 
> 2.10.2
> 


[PATCH] btrfs-progs: qgroup: fix error in ASSERT condition expression

2016-11-16 Thread Tsutomu Itoh
Options -f, -F and --sort don't work because the conditional expressions in
the ASSERT calls are wrong.

Signed-off-by: Tsutomu Itoh 
---
 qgroup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/qgroup.c b/qgroup.c
index 9d10cb8..071d15e 100644
--- a/qgroup.c
+++ b/qgroup.c
@@ -480,7 +480,7 @@ int btrfs_qgroup_setup_comparer(struct btrfs_qgroup_comparer_set **comp_set,
*comp_set = set;
}
 
-   ASSERT(set->comps[set->ncomps].comp_func != NULL);
+   ASSERT(set->comps[set->ncomps].comp_func == NULL);
 
set->comps[set->ncomps].comp_func = all_comp_funcs[comparer];
set->comps[set->ncomps].is_descending = is_descending;
@@ -847,7 +847,7 @@ int btrfs_qgroup_setup_filter(struct btrfs_qgroup_filter_set **filter_set,
*filter_set = set;
}
 
-   ASSERT(set->filters[set->nfilters].filter_func != NULL);
+   ASSERT(set->filters[set->nfilters].filter_func == NULL);
set->filters[set->nfilters].filter_func = all_filter_funcs[filter];
set->filters[set->nfilters].data = data;
set->nfilters++;
-- 
2.9.3


Re: Announcing btrfs-dedupe

2016-11-16 Thread Zygo Blaxell
On Wed, Nov 16, 2016 at 11:24:33PM +0100, Niccolò Belli wrote:
> On Tuesday 15 November 2016 18:52:01 CET, Zygo Blaxell wrote:
> >Like I said, millions of extents per week...
> >
> >64K is an enormous dedup block size, especially if it comes with a 64K
> >alignment constraint as well.
> >
> >These are the top ten duplicate block sizes from a sample of 95251
> >dedup ops on a medium-sized production server with 4TB of filesystem
> >(about one machine-day of data):
> 
> Which software do you use to dedupe your data? I tried duperemove but it
> gets killed by the OOM killer because it triggers some kind of memory leak:
> https://github.com/markfasheh/duperemove/issues/163

Duperemove does use a lot of memory, but the logs at that URL only show
2G of RAM in duperemove--not nearly enough to trigger OOM under normal
conditions on an 8G machine.  There's another process with 6G of virtual
address space (although much less than that resident) that looks more
interesting (i.e. duperemove might just be the victim of some interaction
between baloo_file and the OOM killer).

On the other hand, the logs also show kernel 4.8.  100% of my test
machines failed to finish booting before they were cut down by OOM on
4.7.x kernels.  The same problem occurs on early kernels in the 4.8.x
series.  I am having good results with 4.8.6 and later, but you should
be aware that significant changes have been made to the way OOM works
in these kernel versions, and maybe you're hitting a regression for your
use case.

> Niccolò Belli


signature.asc
Description: Digital signature


[PATCH] fstests: Introduce check for explicit SHARED extent flag reporting

2016-11-16 Thread Qu Wenruo
Of the filesystems that support reflink, some (OK, btrfs again) don't split
the SHARED flag in fiemap extent reporting.

For example:
  0         4K        8K
  |<------ File1: Extent 0 ------>|
  |<------- On-disk extent ------>|
  |<- File2: Extent 0 ->|

A filesystem that supports explicit SHARED extent reporting should report
fiemap like:
File1: 2 extents
Extent 0-4K: SHARED
Extent 4-8K:
File2: 1 extent
Extent 0-4K: SHARED

A filesystem that doesn't support explicit reporting will report fiemap like:
File1: 1 extent
Extent 0-8K: SHARED
File2: 1 extent
Extent 0-4K: SHARED

Test cases like generic/372 that require explicit reporting cause false
alerts on btrfs.

Add a runtime check for that requirement.

Signed-off-by: Qu Wenruo 
---
 common/reflink| 44 
 tests/generic/372 |  1 +
 2 files changed, 45 insertions(+)

diff --git a/common/reflink b/common/reflink
index 8b34046..9ada2e8 100644
--- a/common/reflink
+++ b/common/reflink
@@ -78,6 +78,50 @@ _require_scratch_reflink()
_scratch_unmount
 }
 
+# this test requires scratch fs to report explicit SHARED flag
+# e.g.
+#   0         4K        8K
+#   |<------ File1: Extent 0 ------>|
+#   |<------- On-disk extent ------>|
+#   |<- File2: Extent 0 ->|
+#
+# A filesystem that supports explicit SHARED extent reporting should report
+# fiemap like:
+# File1: 2 extents
+# Extent 0-4K: SHARED
+# Extent 4-8K:
+# File2: 1 extent
+# Extent 0-4K: SHARED
+#
+# A filesystem that doesn't support explicit reporting will report fiemap like:
+# File1: 1 extent
+# Extent 0-8K: SHARED
+# File2: 1 extent
+# Extent 0-4K: SHARED
+_require_scratch_explicit_shared_extents()
+{
+   _require_scratch
+   _require_fiemap
+   _require_scratch_reflink
+   _require_xfs_io_command "reflink"
+   local nr_extents
+
+   _scratch_mkfs > /dev/null
+   _scratch_mount
+
+   _pwrite_byte 0x61 0 128k $SCRATCH_MNT/file1
+   _reflink_range $SCRATCH_MNT/file1 0 $SCRATCH_MNT/file2 0 64k
+
+   _scratch_cycle_mount
+
+   nr_extents=$(_count_extents $SCRATCH_MNT/file1)
+   if [ $nr_extents -eq 1 ]; then
+		_notrun "Explicit SHARED flag reporting not supported by filesystem type: $FSTYP"
+   fi
+   _scratch_unmount
+}
+
 # this test requires the test fs support dedupe...
 _require_test_dedupe()
 {
diff --git a/tests/generic/372 b/tests/generic/372
index 31dff20..51a3eca 100755
--- a/tests/generic/372
+++ b/tests/generic/372
@@ -47,6 +47,7 @@ _supported_os Linux
 _supported_fs generic
 _require_scratch_reflink
 _require_fiemap
+_require_scratch_explicit_shared_extents
 
 echo "Format and mount"
 _scratch_mkfs > $seqres.full 2>&1
-- 
2.7.4





Re: Btrfs Heatmap - v2 - block group internals!

2016-11-16 Thread Qu Wenruo



At 11/17/2016 04:30 AM, Hans van Kranenburg wrote:

In the last two days I've added the --blockgroup option to btrfs heatmap
to let it create pictures of block group internals.

Examples and more instructions are to be found in the README at:
https://github.com/knorrie/btrfs-heatmap/blob/master/README.md

To use the new functionality it needs a fairly recent python-btrfs for
the 'skinny' METADATA_ITEM_KEY to be present. Latest python-btrfs
release is v0.3, created yesterday.

Yay,


Wow, really cool!

I've always dreamed of a visualization tool to represent the chunk and
extent levels of btrfs.

This should really save me from reading the boring decimal numbers from
btrfs-debug-tree.

Although IMHO the full-fs output mixes the extent and chunk levels
together, which makes it a little hard to represent the multi-device case,
it's still an awesome tool!

And considering that the "show-block" tool in btrfs-progs is quite old,
I think if this tool gets further polished it may have a chance to get
into btrfs-progs.


Thanks,
Qu




Re: [PATCH] fstests: Block btrfs from test case generic/372

2016-11-16 Thread Qu Wenruo



At 11/17/2016 05:12 AM, Dave Chinner wrote:

(Did you forget to cc fste...@vger.kernel.org?)

On Tue, Nov 15, 2016 at 04:13:32PM +0800, Qu Wenruo wrote:

Since btrfs always returns the whole extent even when part of it is shared
with other files, the hole/extent counts differ for "file1" in this
test case.

For example:

  /-------- File 1: Extent 0 --------\
  |<------------ Extent A ----------->|
  \__File 2__/            \__File 2__/
    Ext 0~4K               Ext 64K~68K

In that case, fiemap on File 1 will only return 1 large extent A with
SHARED flag.
XFS, on the other hand, will split it into 3 extents: the first and last 4K
with the SHARED flag, and the rest without.


fiemap should behave the same across all filesystems if at all
possible. This test failure indicates btrfs doesn't report an
accurate representation of shared extents which, IMO, is a btrfs
issue that needs fixing, not a test problem

Regardless of this


Considering that only btrfs implements CoW with an extent booking
mechanism, this affects a lot of behavior, from how the SHARED flag is
represented to hole punching.

I hope there is a well-documented standard on what the flag means and how
it should be represented.

I'm not sure it's worthwhile for btrfs to change the representation,
though. Even if btrfs reported "SHARED" "NON-SHARED" "SHARED" for File 1,
hole punching the "NON-SHARED" range wouldn't free any space.
(Which I assume differs from xfs, and that's what makes things confusing.)




This makes the test case meaningless, as btrfs doesn't follow this
assumption.
So blacklist btrfs for this test case to avoid a false alert.


...  we are not going to add ad-hoc filesystem blacklists for
random tests.

Adding "blacklists" without any explanation of why something has
been blacklisted is simply a bad practice. We use _require rules
to specifically document what functionality is required for the
test and check that it is provided.  i.e. this:

_require_explicit_shared_extents()
{
if [ $FSTYP == "btrfs" ]; then
_not_run "btrfs can't report accurate shared extent ranges in fiemap"
fi
}


Right, this is much more helpful than the blabla I wrote in the commit
message.

Although I'd prefer to detect it at runtime rather than just checking the
fs type.

Maybe one day btrfs will support it.
(Although we should solve the above-mentioned behavior difference first.)



documents /exactly/ why this test is not run on btrfs.

And, quite frankly, while this is /better/ it still ignores the
fact we have functions like _within_tolerance for allowing a range
of result values to be considered valid rather than just a fixed
value. IOWs, changing the check of the extent count of file 1 post
reflink to use a _within_tolerance range would mean the test would
validate file1 on all reflink supporting filesystems and we don't
need to exclude btrfs at all...


I really agree with this idea, although in this case the difference is too
big.

For file 1, xfs reports 5 extents, while btrfs only reports 1.
If we used _within_tolerance to cover that range, and one day some
mysterious xfs bug (OK, I don't really believe it will happen, since
it's xfs, not btrfs) made it report 4 extents, we couldn't detect it.

Nor could we detect a btrfs bug (on the other hand, quite possible) that
made btrfs report 2 extents.

So I'd prefer the _require_explicit_shared_extents() method.

Thanks,
Qu



Cheers,

Dave.






Re: [PATCH] Btrfs: deal with existing encompassing extent map in btrfs_get_extent()

2016-11-16 Thread Omar Sandoval
On Thu, Nov 10, 2016 at 02:45:36PM -0800, Omar Sandoval wrote:
> On Thu, Nov 10, 2016 at 02:38:14PM -0800, Liu Bo wrote:
> > On Thu, Nov 10, 2016 at 12:24:13PM -0800, Omar Sandoval wrote:
> > > On Thu, Nov 10, 2016 at 12:09:06PM -0800, Omar Sandoval wrote:
> > > > On Thu, Nov 10, 2016 at 12:01:20PM -0800, Liu Bo wrote:
> > > > > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote:
> > > > > > From: Omar Sandoval 
> > > > > > 
> > > > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> > > > > > errors coming from the qcow2 virtual drive in the host system. The qcow2
> > > > > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> > > > > > Every once in a while, pread() or pwrite() would return EEXIST, which
> > > > > > makes no sense. This turned out to be a bug in btrfs_get_extent().
> > > > > > 
> > > > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> > > > > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> > > > > > two threads race on adding the same extent map to an inode's extent map
> > > > > > tree. However, if the added em is merged with an adjacent em in the
> > > > > > extent tree, then we'll end up with an existing extent that is not
> > > > > > identical to but instead encompasses the extent we tried to add. When we
> > > > > > call merge_extent_mapping() to find the nonoverlapping part of the new
> > > > > > em, the arithmetic overflows because there is no such thing. We then end
> > > > > > up trying to add a bogus em to the em_tree, which results in an -EEXIST
> > > > > > that can bubble all the way up to userspace.
> > > > > 
> > > > > I don't get how this could happen (even after reading commit
> > > > > 8dff9c853410); btrfs_get_extent in direct_IO is protected by
> > > > > lock_extent_direct. The assumption is that a racy thread should be
> > > > > blocked by lock_extent_direct, and when it gets the lock it finds the
> > > > > just-inserted em when going into btrfs_get_extent, if its offset is
> > > > > within [em->start, extent_map_end(em)].
> > > > > 
> > > > > I think we may also need to figure out why the above doesn't work as
> > > > > expected besides fixing another special case.
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > -liubo
> > > > 
> > > > lock_extent_direct() only protects the range you're doing I/O into, not
> > > > the entire extent. If two threads are doing two non-overlapping reads in
> > > > the same extent, then you can get this race.
> > > 
> > > More concretely, assume the extent tree on disk has:
> > > 
> > > +-------------------------+-------------------------------+
> > > |start=0,len=8192,bytenr=0|start=8192,len=8192,bytenr=8192|
> > > +-------------------------+-------------------------------+
> > > 
> > > And the extent map tree in memory has a single em cached for the second
> > > extent {start=8192, len=8192, bytenr=8192}. Then, two threads try to do
> > > direct I/O reads:
> > > 
> > > Thread 1                               | Thread 2
> > > ---------------------------------------+---------------------------------------
> > > pread(offset=0, nbyte=4096)            | pread(offset=4096, nbyte=4096)
> > > lock_extent_direct(start=0, end=4095)  | lock_extent_direct(start=4096, end=8191)
> > > btrfs_get_extent(start=0, len=4096)    | btrfs_get_extent(start=4096, len=4096)
> > >   lookup_extent_mapping() = NULL       |   lookup_extent_mapping() = NULL
> > >   reads extent from B-tree             |   reads extent from B-tree
> > >                                        |   write_lock(&em_tree->lock)
> > >                                        |   add_extent_mapping(start=0, len=8192, bytenr=0)
> > >                                        |     try_merge_map()
> > >                                        |     em_tree now has {start=0, len=16384, bytenr=0}
> > >                                        |   write_unlock(&em_tree->lock)
> > > write_lock(&em_tree->lock)             |
> > > add_extent_mapping(start=0, len=8192,  |
> > >                    bytenr=0) = -EEXIST |
> > > search_extent_mapping() = {start=0,    |
> > >                            len=16384,  |
> > >                            bytenr=0}   |
> > > merge_extent_mapping() does bogus math |
> > > and overflows, returns -EEXIST         |
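
[Editorial note: the "bogus math" in the race above can be sketched with plain
integer arithmetic. This only models the shape of the failure, not the kernel's
actual merge_extent_mapping() code: the merged em {start=0, len=16384} fully
encompasses the em Thread 1 tried to add {start=0, len=8192}, so the
"non-overlapping remainder" has a negative length, which wraps to a huge value
in the kernel's u64 arithmetic.]

```shell
# Existing (merged) extent map already in the tree, and the em we raced to add:
existing_start=0; existing_len=16384
new_start=0;      new_len=8192

# Trimming the new em to begin where the existing one ends only makes sense
# when the existing em ends inside the new one. Here it does not:
remainder_start=$(( existing_start + existing_len ))        # 16384
remainder_len=$(( new_start + new_len - remainder_start ))  # 8192 - 16384

echo "remainder_len = $remainder_len"   # -8192: wraps to a huge u64 in-kernel
```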
> > 
> > Yeah, so much fun.
> > 
> > The problem is that we lock and request [0, 4096], but we insert an em of
> > [0, 8192] instead.  So if we insert a [0, 4096] em, then we can make
> > sure that the em returned by btrfs_get_extent is protected from the race
> > by the range of lock_extent_direct.
> > 
> > I'll give it a shot and do some testing.
> > 
> > For this patch,
> > 
> > Reviewed-by: Liu Bo 
> 
> Thank you!
> 
> > Would you please make a reproducer for fstests?
> 
> Sure. Trying to trigger this with xfs_io never works because it's 

[PATCH] generic: test concurrent non-overlapping direct I/O on the same extents

2016-11-16 Thread Omar Sandoval
From: Omar Sandoval 

There have been a couple of logic bugs in `btrfs_get_extent()` which
could lead to spurious -EEXIST errors from read or write. This test
exercises those conditions by having two threads race to add an extent
to the extent map.

This is fixed by Linux commit 8dff9c853410 ("Btrfs: deal with duplciates
during extent_map insertion in btrfs_get_extent") and the patch "Btrfs:
deal with existing encompassing extent map in btrfs_get_extent()"
(http://marc.info/?l=linux-btrfs&m=147873402311143&w=2).

Although the bug is Btrfs-specific, nothing about the test is.

Signed-off-by: Omar Sandoval 
---
 .gitignore|  1 +
 src/Makefile  |  2 +-
 src/dio-interleaved.c | 98 +++
 tests/generic/390 | 76 +++
 tests/generic/390.out |  2 ++
 tests/generic/group   |  1 +
 6 files changed, 179 insertions(+), 1 deletion(-)
 create mode 100644 src/dio-interleaved.c
 create mode 100755 tests/generic/390
 create mode 100644 tests/generic/390.out

diff --git a/.gitignore b/.gitignore
index 915d2d8..b8d13a0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -44,6 +44,7 @@
 /src/bulkstat_unlink_test_modified
 /src/dbtest
 /src/devzero
+/src/dio-interleaved
 /src/dirperf
 /src/dirstress
 /src/dmiperf
diff --git a/src/Makefile b/src/Makefile
index dd51216..4056496 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -21,7 +21,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize 
preallo_rw_pattern_reader \
stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \
seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner \
renameat2 t_getcwd e4compact test-nextquota punch-alternating \
-   attr-list-by-handle-cursor-test listxattr
+   attr-list-by-handle-cursor-test listxattr dio-interleaved
 
 SUBDIRS =
 
diff --git a/src/dio-interleaved.c b/src/dio-interleaved.c
new file mode 100644
index 000..831a191
--- /dev/null
+++ b/src/dio-interleaved.c
@@ -0,0 +1,98 @@
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+static pthread_barrier_t barrier;
+
+static unsigned long extent_size;
+static unsigned long num_extents;
+
+struct dio_thread_data {
+   int fd;
+   int thread_id;
+};
+
+static void *dio_thread(void *arg)
+{
+   struct dio_thread_data *data = arg;
+   off_t off;
+   ssize_t ret;
+   void *buf;
+
+   if ((errno = posix_memalign(&buf, extent_size / 2, extent_size / 2))) {
+   perror("malloc");
+   return NULL;
+   }
+   memset(buf, 0, extent_size / 2);
+
+   off = (num_extents - 1) * extent_size;
+   if (data->thread_id)
+   off += extent_size / 2;
+   while (off >= 0) {
+   pthread_barrier_wait(&barrier);
+
+   ret = pread(data->fd, buf, extent_size / 2, off);
+   if (ret == -1)
+   perror("pread");
+
+   off -= extent_size;
+   }
+
+   free(buf);
+   return NULL;
+}
+
+int main(int argc, char **argv)
+{
+   struct dio_thread_data data[2];
+   pthread_t thread;
+   int fd;
+
+   if (argc != 4) {
+   fprintf(stderr, "usage: %s SECTORSIZE NUM_EXTENTS PATH\n",
+   argv[0]);
+   return EXIT_FAILURE;
+   }
+
+   extent_size = strtoul(argv[1], NULL, 0);
+   num_extents = strtoul(argv[2], NULL, 0);
+
+   errno = pthread_barrier_init(&barrier, NULL, 2);
+   if (errno) {
+   perror("pthread_barrier_init");
+   return EXIT_FAILURE;
+   }
+
+   fd = open(argv[3], O_RDONLY | O_DIRECT);
+   if (fd == -1) {
+   perror("open");
+   return EXIT_FAILURE;
+   }
+
+   data[0].fd = fd;
+   data[0].thread_id = 0;
+   errno = pthread_create(&thread, NULL, dio_thread, &data[0]);
+   if (errno) {
+   perror("pthread_create");
+   close(fd);
+   return EXIT_FAILURE;
+   }
+
+   data[1].fd = fd;
+   data[1].thread_id = 1;
+   dio_thread(&data[1]);
+
+   pthread_join(thread, NULL);
+
+   close(fd);
+   return EXIT_SUCCESS;
+}
diff --git a/tests/generic/390 b/tests/generic/390
new file mode 100755
index 000..0ef6537
--- /dev/null
+++ b/tests/generic/390
@@ -0,0 +1,76 @@
+#! /bin/bash
+# FS QA Test 390
+#
+# Test two threads doing non-overlapping direct I/O in the same extents.
+# Motivated by a bug in Btrfs' direct I/O get_block function which would lead
+# to spurious -EEXIST failures from direct I/O reads.
+#
+#---
+# Copyright (c) 2016 Facebook.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#

Re: Announcing btrfs-dedupe

2016-11-16 Thread Niccolò Belli

On Tuesday 15 November 2016 18:52:01 CET, Zygo Blaxell wrote:

Like I said, millions of extents per week...

64K is an enormous dedup block size, especially if it comes with a 64K
alignment constraint as well.

These are the top ten duplicate block sizes from a sample of 95251
dedup ops on a medium-sized production server with 4TB of filesystem
(about one machine-day of data):


Which software do you use to dedupe your data? I tried duperemove but it 
gets killed by the OOM killer because it triggers some kind of memory leak: 
https://github.com/markfasheh/duperemove/issues/163


Niccolò Belli


Re: [PATCH] fstests: Block btrfs from test case generic/372

2016-11-16 Thread Dave Chinner
(Did you forget to cc fste...@vger.kernel.org?)

On Tue, Nov 15, 2016 at 04:13:32PM +0800, Qu Wenruo wrote:
> Since btrfs always returns the whole extent even when part of it is shared
> with other files, the hole/extent counts differ for "file1" in this
> test case.
> 
> For example:
> 
>   /-------- File 1: Extent 0 --------\
>   |<------------ Extent A ----------->|
>   \__File 2__/            \__File 2__/
>     Ext 0~4K               Ext 64K~68K
> 
> In that case, fiemap on File 1 will only return 1 large extent A with
> SHARED flag.
> XFS, on the other hand, will split it into 3 extents: the first and last 4K
> with the SHARED flag, and the rest without.

fiemap should behave the same across all filesystems if at all
possible. This test failure indicates btrfs doesn't report an
accurate representation of shared extents which, IMO, is a btrfs
issue that needs fixing, not a test problem

Regardless of this

> This makes the test case meaningless, as btrfs doesn't follow this
> assumption.
> So blacklist btrfs for this test case to avoid a false alert.

...  we are not going to add ad-hoc filesystem blacklists for
random tests.

Adding "blacklists" without any explanation of why something has
been blacklisted is simply a bad practice. We use _require rules
to specifically document what functionality is required for the
test and check that it is provided.  i.e. this:

_require_explicit_shared_extents()
{
if [ $FSTYP == "btrfs" ]; then
_not_run "btrfs can't report accurate shared extent ranges in fiemap"
fi
}

documents /exactly/ why this test is not run on btrfs.

And, quite frankly, while this is /better/ it still ignores the
fact we have functions like _within_tolerance for allowing a range
of result values to be considered valid rather than just a fixed
value. IOWs, changing the check of the extent count of file 1 post
reflink to use a _within_tolerance range would mean the test would
validate file1 on all reflink supporting filesystems and we don't
need to exclude btrfs at all...
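
[Editorial note: for readers unfamiliar with the helper, _within_tolerance
lives in fstests' common code and accepts a value within a given range of the
expected one. Below is a minimal standalone sketch of the idea only; the real
helper's argument list is richer.]

```shell
# Toy version of the idea behind _within_tolerance: succeed if the actual
# value is within +/- tol of the expected value.
within_tolerance() {
	local expected=$1 actual=$2 tol=$3
	[ "$actual" -ge $((expected - tol)) ] && [ "$actual" -le $((expected + tol)) ]
}

within_tolerance 5 4 2 && echo "4 extents accepted"
within_tolerance 5 1 2 || echo "1 extent rejected"
```

As Qu points out in his reply, the catch is choosing the tolerance: a range
wide enough to accept both xfs' 5 extents and btrfs' single extent would also
mask genuine regressions.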

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH] btrfs-progs: check: fix missing newlines

2016-11-16 Thread Omar Sandoval
From: Omar Sandoval 

Also, the other progress messages go to stderr, so "checking extents"
probably should, as well.

Fixes: c7a1f66a205f ("btrfs-progs: check: switch some messages to common helpers")
Signed-off-by: Omar Sandoval 
---
As a side note, it seems almost completely random whether we print to
stdout or stderr for any given message. That could probably use some
cleaning up for consistency. A quick run of e2fsck indicated that it
prints almost everything on stdout except for usage and administrative
problems. xfs_repair just seems to put everything in stderr. I
personally like the e2fsck approach. Anyone have any preference?

 cmds-check.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 57c4300..3fb3bd7 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -11467,13 +11467,13 @@ int cmd_check(int argc, char **argv)
}
 
if (!ctx.progress_enabled)
-   printf("checking extents");
+   fprintf(stderr, "checking extents\n");
if (check_mode == CHECK_MODE_LOWMEM)
ret = check_chunks_and_extents_v2(root);
else
ret = check_chunks_and_extents(root);
if (ret)
-   printf("Errors found in extent allocation tree or chunk allocation");
+   error("errors found in extent allocation tree or chunk allocation");
 
ret = repair_root_items(info);
if (ret < 0)
-- 
2.10.2



Re: Send/receive snapshot from/between backup

2016-11-16 Thread René Bühlmann
On 11/02/2016 05:13 PM, Piotr Pawłow wrote:
> On 02.11.2016 15:23, René Bühlmann wrote:
>> Origin: S2 S3
>>
>> USB: S1 S2
>>
>> SSH: S1
>>
>> Transferring S3 to USB is no problem as S2 is on both btrfs drives. But
>> how can I transfer S3 to SSH?
> If I understand correctly how send / receive works, for the incremental
> receive to work there must be a subvolume on the destination which has
> "received uuid" equal to the uuid of the parent chosen for the incremental
> send.
>
>> I tried to transfer...
>>
>> 1. S3 from Origin to SSH -> does not work as there is no common snapshot.
>>
>> 2. S2 from USB to SSH -> did not work.
> The "received uuid" of S1 on SSH is the uuid S1 had on Origin. The uuid
> of S1 on USB is different, so when chosen as parent for the incremental
> send it doesn't match.
>
>> 3. S1 from USB to Origin (such that there is a common snapshot with SSH)
>> -> did not work.
> There are no previously received subvolumes on Origin at all, so it
> isn't going to work.
>
>> Is it correct that 1. would work if a common snapshot is present on
>> Origin and SSH?
> If there was a snapshot received from Origin that still exists on
> Origin, then yes, you could use it as a clone source for incremental send.
>
>> Is it expected that 2. and 3. do not work?
>>
>> Is there some other way to achieve it?
> I doubt you can do it without some "hacking" to fool btrfs receive.
>
> You would need a tool that can issue BTRFS_IOC_SET_RECEIVED_SUBVOL ioctl
> to change the received uuid. Then you could:
>
> 1. Change received uuid of S1 on SSH to match S1 uuid on USB.
> 2. Send incremental S1-S2 from USB to SSH.
> 3. Change received uuid of S2 on SSH to match S2 on Origin.
> 4. Send incremental S2-S3 from Origin to SSH.
>
> Regards
>
Thanks for all the input,

I successfully tried this approach: I could change the "received uuid"
and then transfer a snapshot from a different source. So far so good.

But:

Due to a lot of errors during btrfs check on SSH, I decided to recreate
the BTRFS filesystem on SSH still with the goal to not transfer all the
data over the network.

These were the steps:

1. Create a new btrfs (calling it SSH')

2. Full transfer S1 from SSH to SSH'

3. Incremental transfer S2 from USB to SSH' (S1 as parent)

4. Incremental transfer S3 from Origin to SSH' (S2 as parent)

5. Btrfs check SSH'

6. Used rsync (with checksum-diff) to verify that S3 on Origin and SSH'
contain the same files.


Step 2 worked, and aside from a single checksum error on SSH the
transfer completed without errors.

Steps 3 and 4 worked as well, and surprisingly I did not even have to
update the "received uuid". They can't have been full transfers, as that
would have taken months with my bandwidth. How can this be?

Step 5 did not return any errors

Step 6 found a single differing file, which is due to the checksum
error in step 2.


So everything seems to be fine now; I just do not understand why this
worked without updating the UUID.
Do you have an explanation for that?

In any case, thanks for your help.
René




Btrfs Heatmap - v2 - block group internals!

2016-11-16 Thread Hans van Kranenburg
In the last two days I've added the --blockgroup option to btrfs heatmap
to let it create pictures of block group internals.

Examples and more instructions are to be found in the README at:
https://github.com/knorrie/btrfs-heatmap/blob/master/README.md

To use the new functionality, it needs a fairly recent python-btrfs, for
the 'skinny' METADATA_ITEM_KEY to be present. The latest python-btrfs
release is v0.3, created yesterday.

Yay,

-- 
Hans van Kranenburg


[PATCH v2] fstests: generic/098 update test for truncating a file into the middle of a hole

2016-11-16 Thread Liu Bo
This updates generic/098 by adding a sync variant, i.e. running 'sync' after
the second write; with btrfs's NO_HOLES feature enabled we could still get a
wrong isize after remount.

This gets fixed by the patch

'Btrfs: fix truncate down when no_holes feature is enabled'

Signed-off-by: Liu Bo 
---
v2: use 'local' for local variable and add comments for 'sync' option.

 tests/generic/098 | 60 +--
 tests/generic/098.out | 10 +
 2 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/tests/generic/098 b/tests/generic/098
index 838bb5d..8ab0ad4 100755
--- a/tests/generic/098
+++ b/tests/generic/098
@@ -64,27 +64,45 @@ rm -f $seqres.full
 _scratch_mkfs >>$seqres.full 2>&1
 _scratch_mount
 
-# Create our test file with some data and durably persist it.
-$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 128K" $SCRATCH_MNT/foo | _filter_xfs_io
-sync
-
-# Append some data to the file, increasing its size, and leave a hole between
-# the old size and the start offset if the following write. So our file gets
-# a hole in the range [128Kb, 256Kb[.
-$XFS_IO_PROG -c "pwrite -S 0xbb 256K 32K" $SCRATCH_MNT/foo | _filter_xfs_io
-
-# Now truncate our file to a smaller size that is in the middle of the hole we
-# previously created. On most truncate implementations the data we appended
-# before gets discarded from memory (with truncate_setsize()) and never ends
-# up being written to disk.
-$XFS_IO_PROG -c "truncate 160K" $SCRATCH_MNT/foo
-
-_scratch_cycle_mount
-
-# We expect to see a file with a size of 160Kb, with the first 128Kb of data all
-# having the value 0xaa and the remaining 32Kb of data all having the value 0x00
-echo "File content after remount:"
-od -t x1 $SCRATCH_MNT/foo
+workout()
+{
+   local need_sync=$1
+
+   # Create our test file with some data and durably persist it.
+   $XFS_IO_PROG -t -f -c "pwrite -S 0xaa 0 128K" $SCRATCH_MNT/foo | _filter_xfs_io
+   sync
+
+   # Append some data to the file, increasing its size, and leave a hole between
+   # the old size and the start offset of the following write. So our file gets
+   # a hole in the range [128Kb, 256Kb[.
+   $XFS_IO_PROG -c "pwrite -S 0xbb 256K 32K" $SCRATCH_MNT/foo | _filter_xfs_io
+
+   # This 'sync' is to flush the file extent to disk and update the on-disk
+   # inode size. This is required to trigger a bug in btrfs truncate where it
+   # updates the on-disk inode size incorrectly.
+   if [ $need_sync -eq 1 ]; then
+   sync
+   fi
+
+   # Now truncate our file to a smaller size that is in the middle of the hole we
+   # previously created.
+   # If we don't flush dirty page cache above, on most truncate
+   # implementations the data we appended before gets discarded from
+   # memory (with truncate_setsize()) and never ends up being written to
+   # disk.
+   $XFS_IO_PROG -c "truncate 160K" $SCRATCH_MNT/foo
+
+   _scratch_cycle_mount
+
+   # We expect to see a file with a size of 160Kb, with the first 128Kb of data all
+   # having the value 0xaa and the remaining 32Kb of data all having the value 0x00
+   echo "File content after remount:"
+   od -t x1 $SCRATCH_MNT/foo
+}
+
+workout 0
+# flush after each write
+workout 1
 
 status=0
 exit
diff --git a/tests/generic/098.out b/tests/generic/098.out
index 37415ee..f87f046 100644
--- a/tests/generic/098.out
+++ b/tests/generic/098.out
@@ -9,3 +9,13 @@ File content after remount:
 040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 *
 050
+wrote 131072/131072 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 32768/32768 bytes at offset 262144
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+File content after remount:
+000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
+*
+040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+*
+050
-- 
2.5.0
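The scenario the updated test exercises can be reproduced outside fstests with plain coreutils. The sketch below is illustrative and not part of the patch: it runs on whatever filesystem backs $TMPDIR instead of a freshly made scratch device, uses zeroes in place of the 0xaa/0xbb patterns, and checks only the file size rather than the contents:

```shell
set -e
f=$(mktemp)
# 128K of data, durably persisted
dd if=/dev/zero of="$f" bs=1K count=128 2>/dev/null
sync
# append 32K at offset 256K, leaving a hole in the range [128K, 256K[
dd if=/dev/zero of="$f" bs=1K count=32 seek=256 conv=notrunc 2>/dev/null
# the new 'sync' variant: flush so the on-disk inode size moves past the hole
sync
# truncate down into the middle of the hole
truncate -s 160K "$f"
size=$(stat -c %s "$f")   # the test expects 160K (163840), even after a remount
rm -f "$f"
```

generic/098 itself additionally cycles the mount and verifies the file contents with od.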



Client databases. Skype: prodawez390 Whatsapp: +79139230330 Viber: +79139230330 Telegram: +79139230330 Email: prodawez...@gmail.com

2016-11-16 Thread hobar...@gmx.com
Client databases. Skype: prodawez390 Whatsapp: +79139230330 Viber: +79139230330
Telegram: +79139230330 Email: prodawez...@gmail.com


Re: [PATCH] fstests: generic/098 update test for truncating a file into the middle of a hole

2016-11-16 Thread Liu Bo
On Tue, Nov 15, 2016 at 02:53:12PM +0800, Eryu Guan wrote:
> On Fri, Nov 11, 2016 at 02:30:04PM -0800, Liu Bo wrote:
> > This updates generic/098 by adding a sync option, i.e. 'sync' after the 
> > second
> > write, and with btrfs's NO_HOLES, we could still get wrong isize after 
> > remount.
> > 
> > This gets fixed by the patch
> > 
> > 'Btrfs: fix truncate down when no_holes feature is enabled'
> > 
> > Signed-off-by: Liu Bo 
> 
> Looks good to me, just some nitpicks inline :)
> 
> > ---
> >  tests/generic/098 | 57 ---
> >  tests/generic/098.out | 10 +
> >  2 files changed, 46 insertions(+), 21 deletions(-)
> > 
> > diff --git a/tests/generic/098 b/tests/generic/098
> > index 838bb5d..3b89939 100755
> > --- a/tests/generic/098
> > +++ b/tests/generic/098
> > @@ -64,27 +64,42 @@ rm -f $seqres.full
> >  _scratch_mkfs >>$seqres.full 2>&1
> >  _scratch_mount
> >  
> > -# Create our test file with some data and durably persist it.
> > -$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 128K" $SCRATCH_MNT/foo | _filter_xfs_io
> > -sync
> > -
> > -# Append some data to the file, increasing its size, and leave a hole between
> > -# the old size and the start offset if the following write. So our file gets
> > -# a hole in the range [128Kb, 256Kb[.
> > -$XFS_IO_PROG -c "pwrite -S 0xbb 256K 32K" $SCRATCH_MNT/foo | _filter_xfs_io
> > -
> > -# Now truncate our file to a smaller size that is in the middle of the hole we
> > -# previously created. On most truncate implementations the data we appended
> > -# before gets discarded from memory (with truncate_setsize()) and never ends
> > -# up being written to disk.
> > -$XFS_IO_PROG -c "truncate 160K" $SCRATCH_MNT/foo
> > -
> > -_scratch_cycle_mount
> > -
> > -# We expect to see a file with a size of 160Kb, with the first 128Kb of data all
> > -# having the value 0xaa and the remaining 32Kb of data all having the value 0x00
> > -echo "File content after remount:"
> > -od -t x1 $SCRATCH_MNT/foo
> > +workout()
> > +{
> > +   NEED_SYNC=$1
> 
> Use "local" to declare this var, and in lower case. Usually we use upper
> case for global variables.

OK.

> 
> > +
> > +   # Create our test file with some data and durably persist it.
> > +   $XFS_IO_PROG -t -f -c "pwrite -S 0xaa 0 128K" $SCRATCH_MNT/foo | _filter_xfs_io
> > +   sync
> > +
> > +   # Append some data to the file, increasing its size, and leave a hole between
> > +   # the old size and the start offset if the following write. So our file gets
> > +   # a hole in the range [128Kb, 256Kb[.
> > +   $XFS_IO_PROG -c "pwrite -S 0xbb 256K 32K" $SCRATCH_MNT/foo | _filter_xfs_io
> > +
> > +   if [ $NEED_SYNC -eq 1 ]; then
> > +   sync
> > +   fi
> 
> Good to see some comments added to explain why we need to test the
> with/without sync cases.

Sure, will fix in v2.

Thanks,

-liubo
> 
> Thanks,
> Eryu
> 
> > +
> > +   # Now truncate our file to a smaller size that is in the middle of the hole we
> > +   # previously created.
> > +   # If we don't flush dirty page cache above, on most truncate
> > +   # implementations the data we appended before gets discarded from
> > +   # memory (with truncate_setsize()) and never ends up being written to
> > +   # disk.
> > +   $XFS_IO_PROG -c "truncate 160K" $SCRATCH_MNT/foo
> > +
> > +   _scratch_cycle_mount
> > +
> > +   # We expect to see a file with a size of 160Kb, with the first 128Kb of data all
> > +   # having the value 0xaa and the remaining 32Kb of data all having the value 0x00
> > +   echo "File content after remount:"
> > +   od -t x1 $SCRATCH_MNT/foo
> > +}
> > +
> > +workout 0
> > +# flush after each write
> > +workout 1
> >  
> >  status=0
> >  exit
> > diff --git a/tests/generic/098.out b/tests/generic/098.out
> > index 37415ee..f87f046 100644
> > --- a/tests/generic/098.out
> > +++ b/tests/generic/098.out
> > @@ -9,3 +9,13 @@ File content after remount:
> >  040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >  *
> >  050
> > +wrote 131072/131072 bytes at offset 0
> > +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > +wrote 32768/32768 bytes at offset 262144
> > +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > +File content after remount:
> > +000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> > +*
> > +040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > +*
> > +050
> > -- 
> > 2.5.0
> > 


Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Martin Steigerwald
Am Mittwoch, 16. November 2016, 07:57:08 CET schrieb Austin S. Hemmelgarn:
> On 2016-11-16 06:04, Martin Steigerwald wrote:
> > Am Mittwoch, 16. November 2016, 16:00:31 CET schrieb Roman Mamedov:
> >> On Wed, 16 Nov 2016 11:55:32 +0100
> >> 
> >> Martin Steigerwald  wrote:
[…]
> > As there seems to be no force option to override the limitation and I
> > do not feel like compiling my own btrfs-tools right now, I will use rsync
> > instead.
> 
> In a case like this, I'd trust rsync more than send/receive.  The
> following rsync switches might also be of interest:
> -a: This turns on a bunch of things almost everyone wants when using
> rsync, similar to the same switch for cp, just with even more added in.
> -H: This recreates hardlinks on the receiving end.
> -S: This recreates sparse files.
> -A: This copies POSIX ACL's
> -X: This copies extended attributes (most of them at least, there are a
> few that can't be arbitrarily written to).
> Pre-creating the subvolumes by hand combined with using all of those
> will get you almost everything covered by send/receive except for
> sharing of extents and ctime.

I usually use rsync -aAHXSP already :).

I was able to rsync all relevant data off the disk, which is now being
erased with the shred command.

Thank you,
-- 
Martin


Re: [PATCH v3] btrfs: change btrfs_csum_final result param type to u8

2016-11-16 Thread David Sterba
On Mon, Oct 31, 2016 at 05:47:24PM +0100, David Sterba wrote:
> On Thu, Oct 27, 2016 at 08:52:33AM +0100, Domagoj Tršan wrote:
> > csum member of struct btrfs_super_block has array type of u8. It makes sense
> > that function btrfs_csum_final should be also declared to accept u8 *. I
> > changed the declaration of method void btrfs_csum_final(u32 crc, char 
> > *result);
> > to void btrfs_csum_final(u32 crc, u8 *result);
> 
> Sorry, I've noticed it just now: several callers of btrfs_csum_final
> cast the 2nd argument to (char *), which gets changed to u8. Can you
> please fix the callers? Thanks.

Done and committed.


Re: don't poke into bio internals

2016-11-16 Thread David Sterba
On Wed, Nov 16, 2016 at 01:52:07PM +0100, Christoph Hellwig wrote:
> this series has a few patches that switch btrfs to use the proper helpers for
> accessing bio internals.  This helps to prepare for supporting multi-page
> bio_vecs, which are currently under development.

Looks good to me, thanks. I'll let it pass through tests, expected
merge target is 4.10.


Re: [PATCH] Fs: Btrfs - Improvement in code readability when

2016-11-16 Thread David Sterba
On Thu, Nov 10, 2016 at 03:17:41PM +0530, Shailendra Verma wrote:
> From: "Shailendra Verma" 
> 
> There is no need to call kfree() if memdup_user() fails, as no memory
> was allocated and the error in the error-valued pointer should be returned.
> 
> Signed-off-by: Shailendra Verma 

Queued for 4.10. I've edited the subject line as it wasn't very descriptive.
("btrfs: return early from failed memory allocations in ioctl handlers")


Re: [RFC] btrfs: make max inline data can be equal to sectorsize

2016-11-16 Thread David Sterba
On Mon, Nov 14, 2016 at 09:55:34AM +0800, Qu Wenruo wrote:
> At 11/12/2016 04:22 AM, Liu Bo wrote:
> > On Tue, Oct 11, 2016 at 02:47:42PM +0800, Wang Xiaoguang wrote:
> >> If we use mount option "-o max_inline=sectorsize", say 4096, indeed
> >> even for a fresh fs, say nodesize is 16k, we can not make the first
> >> 4k of data completely inline. I found this condition causing the issue:
> >>   !compressed_size && (actual_end & (root->sectorsize - 1)) == 0
> >>
> >> If it returns true, we'll not make data inline. For a 4k sectorsize,
> >> a 0~4094 data range can be made inline, but 0~4095 can not.
> >> I don't think this limitation is useful, so here remove it, which will
> >> allow max inline data to be equal to the sectorsize.
> >
> > It's difficult to tell whether we need this, I'm not a big fan of using
> > max_inline size more than the default size 2048, given that most reports
> > about ENOSPC is due to metadata and inline may make it worse.
> 
> IMHO if we can use inline data extents to trigger ENOSPC more easily, 
> then we should allow it, to dig into the problem further.
> 
> Just ignoring it because it may cause more bug will not solve the real 
> problem anyway.

Not allowing the full 4k value as max_inline looks artificial to me.
We've removed other similar limitations in the past, so I'd tend to agree
to do the same here. There's no significant use for it as far as I can
tell; if you want to exhaust metadata, the difference to max_inline=4095
would be really tiny in the end. So I'm okay with merging it. If
anybody feels like adding his Reviewed-by, please do so.
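For reference, the limit being discussed is set at mount time; a usage sketch (the device and mount point are placeholders):

```shell
# allow inline file data up to a full sector (4096 bytes here); the default is 2048
mount -o max_inline=4096 /dev/sdX /mnt
```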


Re: [PATCH 1/2] Btrfs: fix file extent corruption

2016-11-16 Thread Josef Bacik

On 11/14/2016 06:11 PM, Liu Bo wrote:

On Mon, Nov 14, 2016 at 02:06:21PM -0500, Josef Bacik wrote:

In order to do hole punching we have a block reserve to hold the reservation we
need to drop the extents in our range.  Since we could end up dropping a lot of
extents we set rsv->failfast so we can just loop around again and drop the
remaining of the range.  Unfortunately we unconditionally fill the hole extents
in and start from the last extent we encountered, which we may or may not have
dropped.  So this can result in overlapping file extent entries, which can be
tripped over in a variety of ways, either by hitting BUG_ON(!ret) in
fill_holes() after the search, or in btrfs_set_item_key_safe() in
btrfs_drop_extent() at a later time by an unrelated task.  Fix this by only
setting drop_end to the last extent we did actually drop.  This way our holes
are filled in properly for the range that we did drop, and the rest of the range
that remains to be dropped is actually dropped.  Thanks,


Can you please share the reproducer?



Yup here you go

https://paste.fedoraproject.org/483195/30633414

Thanks,

Josef


[PATCH 1/2][V2] Btrfs: fix file extent corruption

2016-11-16 Thread Josef Bacik
In order to do hole punching we have a block reserve to hold the reservation we
need to drop the extents in our range.  Since we could end up dropping a lot of
extents we set rsv->failfast so we can just loop around again and drop the
remaining of the range.  Unfortunately we unconditionally fill the hole extents
in and start from the last extent we encountered, which we may or may not have
dropped.  So this can result in overlapping file extent entries, which can be
tripped over in a variety of ways, either by hitting BUG_ON(!ret) in
fill_holes() after the search, or in btrfs_set_item_key_safe() in
btrfs_drop_extent() at a later time by an unrelated task.  Fix this by only
setting drop_end to the last extent we did actually drop.  This way our holes
are filled in properly for the range that we did drop, and the rest of the range
that remains to be dropped is actually dropped.  Thanks,

Signed-off-by: Josef Bacik 
---
V1->V2:
- don't call fill_holes if our drop_end is == start.

 fs/btrfs/file.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index cbefdc8..23859e7 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -706,6 +706,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
u64 num_bytes = 0;
u64 extent_offset = 0;
u64 extent_end = 0;
+   u64 last_end = start;
int del_nr = 0;
int del_slot = 0;
int extent_type;
@@ -797,8 +798,10 @@ next_slot:
 * extent item in the call to setup_items_for_insert() later
 * in this function.
 */
-   if (extent_end == key.offset && extent_end >= search_start)
+   if (extent_end == key.offset && extent_end >= search_start) {
+   last_end = extent_end;
goto delete_extent_item;
+   }
 
if (extent_end <= search_start) {
path->slots[0]++;
@@ -861,6 +864,12 @@ next_slot:
key.offset = start;
}
/*
+* From here on out we will have actually dropped something, so
+* last_end can be updated.
+*/
+   last_end = extent_end;
+
+   /*
 *  |  range to drop - |
 *  |  extent  |
 */
@@ -1010,7 +1019,7 @@ delete_extent_item:
if (!replace_extent || !(*key_inserted))
btrfs_release_path(path);
if (drop_end)
-   *drop_end = found ? min(end, extent_end) : end;
+   *drop_end = found ? min(end, last_end) : end;
return ret;
 }
 
@@ -2526,7 +2535,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 
trans->block_rsv = &root->fs_info->trans_block_rsv;
 
-   if (cur_offset < ino_size) {
+   if (cur_offset < drop_end && cur_offset < ino_size) {
ret = fill_holes(trans, inode, path, cur_offset,
 drop_end);
if (ret) {
-- 
2.7.4
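The code path being fixed is reached from userspace via hole punching; a minimal sketch using util-linux's fallocate (sizes are illustrative, and any recent Linux filesystem with hole-punch support will do):

```shell
set -e
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1K count=256 2>/dev/null
sync
# FALLOC_FL_PUNCH_HOLE over [64K, 192K); on btrfs this goes through
# btrfs_punch_hole() and then __btrfs_drop_extents(), the function patched above
fallocate --punch-hole --offset $((64 * 1024)) --length $((128 * 1024)) "$f"
# punching a hole must not change the file size (256K = 262144 bytes)
size=$(stat -c %s "$f")
rm -f "$f"
```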



Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache

2016-11-16 Thread E V
System panicked overnight running 4.9-rc5 & rsync. I attached a photo of
the stack trace, and the 38 call traces in a 2-minute window shortly
before, to the bugzilla case for those not on its e-mail list:

https://bugzilla.kernel.org/show_bug.cgi?id=186671

On Mon, Nov 14, 2016 at 3:56 PM, E V  wrote:
> Pretty sure it was the system after the OOM; I just did a history search
> to check. Though it is 3 days afterwards, and several OOMs killed
> several processes in somewhat rapid succession; I just listed the 1st.
> I'll turn on CONFIG_DEBUG_VM and reboot again.
>
> On Mon, Nov 14, 2016 at 12:04 PM, Vlastimil Babka  wrote:
>> On 11/14/2016 02:27 PM, E V wrote:
>>> System is an intel dual socket Xeon E5620, 7500/5520/5500/X58 ICH10
>>> family according to lspci. Anyways 4.8.4 OOM'd while I was gone. I'll
>>> download the current 4.9rc and reboot, but in the mean time here's
>>> xxd, vmstat & kern.log output:
>>> 8532039 
>>
>> Hmm this would suggest that the memory is mostly free. But not according
>> to vmstat. Is it possible you mistakenly provided the xxd from a fresh
>> boot, but vmstat from after the OOM?
>>
>> But sure, a page_count() of zero is a reason why __isolate_lru_page()
>> would fail due to its get_page_unless_zero(). The question is then how
>> could it drop to zero without being freed at the same time, as
>> put_page() does.
>>
>> I was going to suspect commit 83929372f6 and a page_ref_sub() it adds to
>> delete_from_page_cache(), but that's since 4.8 and you mention problems
>> since 4.7.
>>
>> Anyway it might be worth enabling CONFIG_DEBUG_VM as the relevant code
>> usually has VM_BUG_ONs.
>>
>> Vlastimil
>>
>>>9324 0100
>>>2226 0200
>>> 405 0300
>>>  80 0400
>>>  34 0500
>>>  48 0600
>>>  17 0700
>>>  17 0800
>>>  32 0900
>>>  19 0a00
>>>   1 0c00
>>>   1 0d00
>>>   1 0e00
>>>  12 1000
>>>   8 1100
>>>  32 1200
>>>  10 1300
>>>   2 1400
>>>  11 1500
>>>  12 1600
>>>   7 1700
>>>   3 1800
>>>   5 1900
>>>   6 1a00
>>>  11 1b00
>>>  22 1c00
>>>   3 1d00
>>>  19 1e00
>>>  21 1f00
>>>  18 2000
>>>  28 2100
>>>  40 2200
>>>  38 2300
>>>  85 2400
>>>  59 2500
>>>   40520 81ff
>>>
>>> /proc/vmstat:
>>> nr_free_pages 60965
>>> nr_zone_inactive_anon 4646
>>> nr_zone_active_anon 3265
>>> nr_zone_inactive_file 633882
>>> nr_zone_active_file 7017458
>>> nr_zone_unevictable 0
>>> nr_zone_write_pending 0
>>> nr_mlock 0
>>> nr_slab_reclaimable 299205
>>> nr_slab_unreclaimable 195497
>>> nr_page_table_pages 935
>>> nr_kernel_stack 4976
>>> nr_bounce 0
>>> numa_hit 3577063288
>>> numa_miss 541393191
>>> numa_foreign 541393191
>>> numa_interleave 19415
>>> numa_local 3577063288
>>> numa_other 0
>>> nr_free_cma 0
>>> nr_inactive_anon 4646
>>> nr_active_anon 3265
>>> nr_inactive_file 633882
>>> nr_active_file 7017458
>>> nr_unevictable 0
>>> nr_isolated_anon 0
>>> nr_isolated_file 0
>>> nr_pages_scanned 0
>>> workingset_refault 42685891
>>> workingset_activate 15247281
>>> workingset_nodereclaim 26375216
>>> nr_anon_pages 5067
>>> nr_mapped 5630
>>> nr_file_pages 7654746
>>> nr_dirty 0
>>> nr_writeback 0
>>> nr_writeback_temp 0
>>> nr_shmem 2504
>>> nr_shmem_hugepages 0
>>> nr_shmem_pmdmapped 0
>>> nr_anon_transparent_hugepages 0
>>> nr_unstable 0
>>> nr_vmscan_write 5243750485
>>> nr_vmscan_immediate_reclaim 4207633857
>>> nr_dirtied 1839143430
>>> nr_written 1832626107
>>> nr_dirty_threshold 1147728
>>> nr_dirty_background_threshold 151410
>>> pgpgin 166731189
>>> pgpgout 7328142335
>>> pswpin 98608
>>> pswpout 117794
>>> pgalloc_dma 29504
>>> pgalloc_dma32 1006726216
>>> pgalloc_normal 5275218188
>>> pgalloc_movable 0
>>> allocstall_dma 0
>>> allocstall_dma32 0
>>> allocstall_normal 36461
>>> allocstall_movable 5867
>>> pgskip_dma 0
>>> pgskip_dma32 0
>>> pgskip_normal 6417890
>>> pgskip_movable 0
>>> pgfree 6309223401
>>> pgactivate 35076483
>>> pgdeactivate 63556974
>>> pgfault 35753842
>>> pgmajfault 69126
>>> pglazyfreed 0
>>> pgrefill 70008598
>>> pgsteal_kswapd 3567289713
>>> pgsteal_direct 5878057
>>> pgscan_kswapd 9059309872
>>> pgscan_direct 4239367903
>>> pgscan_direct_throttle 0
>>> zone_reclaim_failed 0
>>> pginodesteal 102916
>>> slabs_scanned 460790262
>>> kswapd_inodesteal 9130243
>>> kswapd_low_wmark_hit_quickly 10634373
>>> kswapd_high_wmark_hit_quickly 7348173
>>> pageoutrun 18349115
>>> pgrotated 16291322
>>> drop_pagecache 0
>>> drop_slab 0
>>> pgmigrate_success 18912908
>>> 

Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Austin S. Hemmelgarn

On 2016-11-16 06:04, Martin Steigerwald wrote:

Am Mittwoch, 16. November 2016, 16:00:31 CET schrieb Roman Mamedov:

On Wed, 16 Nov 2016 11:55:32 +0100

Martin Steigerwald  wrote:

I do think that the above kernel messages invite such an interpretation,
though. I took the "BTRFS: open_ctree failed" message as indicating some
structural issue with the filesystem.


For the reason as to why the writable mount didn't work, check "btrfs fi df"
for the filesystem to see if you have any "single" profile chunks on it:
quite likely you did already mount it "degraded,rw" in the past *once*,
after which those "single" chunks get created, and consequently it won't
mount r/w anymore (without lifting the restriction on the number of missing
devices as proposed).


That exactly explains it. I very likely did a degraded mount without ro on
this disk already.

Funnily enough this creates another complication:

merkaba:/mnt/zeit#1> btrfs send somesubvolume | btrfs receive /mnt/someotherbtrfs
ERROR: subvolume /mnt/zeit/somesubvolume is not read-only

Yet:

merkaba:/mnt/zeit> btrfs property get somesubvolume
ro=false
merkaba:/mnt/zeit> btrfs property set somesubvolume ro true
ERROR: failed to set flags for somesubvolume: Read-only file system

To me it seems the right logic would be to allow the send to proceed in case
the whole filesystem is read-only.

It should, but doesn't currently.  There was a thread about this a while 
back, but I don't think it ever resulted in anything changing.


As there seems to be no force option to override the limitation and I
do not feel like compiling my own btrfs-tools right now, I will use rsync
instead.
In a case like this, I'd trust rsync more than send/receive.  The 
following rsync switches might also be of interest:
-a: This turns on a bunch of things almost everyone wants when using 
rsync, similar to the same switch for cp, just with even more added in.

-H: This recreates hardlinks on the receiving end.
-S: This recreates sparse files.
-A: This copies POSIX ACL's
-X: This copies extended attributes (most of them at least, there are a 
few that can't be arbitrarily written to).
Pre-creating the subvolumes by hand combined with using all of those 
will get you almost everything covered by send/receive except for 
sharing of extents and ctime.



[PATCH 5/9] btrfs: use bi_size

2016-11-16 Thread Christoph Hellwig
Instead of using bi_vcnt to calculate it.

Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/compression.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 12a631d..8618ac3 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -562,7 +562,6 @@ static noinline int add_ra_bio_pages(struct inode *inode,
  *
  * bio->bi_iter.bi_sector points to the compressed extent on disk
  * bio->bi_io_vec points to all of the inode pages
- * bio->bi_vcnt is a count of pages
  *
  * After the compressed pages are read, we copy the bytes into the
  * bio we were passed and then call the bio end_io calls
@@ -574,7 +573,6 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
struct extent_map_tree *em_tree;
struct compressed_bio *cb;
struct btrfs_root *root = BTRFS_I(inode)->root;
-   unsigned long uncompressed_len = bio->bi_vcnt * PAGE_SIZE;
unsigned long compressed_len;
unsigned long nr_pages;
unsigned long pg_index;
@@ -619,7 +617,7 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
free_extent_map(em);
em = NULL;
 
-   cb->len = uncompressed_len;
+   cb->len = bio->bi_iter.bi_size;
cb->compressed_len = compressed_len;
cb->compress_type = extent_compress_type(bio_flags);
cb->orig_bio = bio;
@@ -647,8 +645,7 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
add_ra_bio_pages(inode, em_start + em_len, cb);
 
/* include any pages we added in add_ra-bio_pages */
-   uncompressed_len = bio->bi_vcnt * PAGE_SIZE;
-   cb->len = uncompressed_len;
+   cb->len = bio->bi_iter.bi_size;
 
comp_bio = compressed_bio_alloc(bdev, cur_disk_byte, GFP_NOFS);
if (!comp_bio)
-- 
2.1.4



[PATCH 6/9] btrfs: calculate end of bio offset properly

2016-11-16 Thread Christoph Hellwig
Use the bvec offset and len members to prepare for multipage bvecs.

Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/compression.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 8618ac3..27e9feb 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -445,6 +445,13 @@ int btrfs_submit_compressed_write(struct inode *inode, u64 start,
return 0;
 }
 
+static u64 bio_end_offset(struct bio *bio)
+{
+   struct bio_vec *last = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+   return page_offset(last->bv_page) + last->bv_len + last->bv_offset;
+}
+
 static noinline int add_ra_bio_pages(struct inode *inode,
 u64 compressed_end,
 struct compressed_bio *cb)
@@ -463,8 +470,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
u64 end;
int misses = 0;
 
-   page = cb->orig_bio->bi_io_vec[cb->orig_bio->bi_vcnt - 1].bv_page;
-   last_offset = (page_offset(page) + PAGE_SIZE);
+   last_offset = bio_end_offset(cb->orig_bio);
em_tree = &BTRFS_I(inode)->extent_tree;
tree = &BTRFS_I(inode)->io_tree;
 
-- 
2.1.4



[PATCH 9/9] btrfs: only check bio size to see if a repair bio should have the failfast flag

2016-11-16 Thread Christoph Hellwig
The number of pages in a bio is a bad indicator of the number of
splits lower levels could do, and with the multipage bio_vec work even
that measure goes away and will become a number of segments of physically
contiguous areas instead.  Check the total bio size vs the sector size
instead, which gives us an indication without any false negatives,
although the false positive rate might increase a bit.

Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/extent_io.c | 4 ++--
 fs/btrfs/inode.c | 4 +---
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ea9ade7..a05fc41 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2296,7 +2296,7 @@ int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 *  a) deliver good data to the caller
 *  b) correct the bad sectors on disk
 */
-   if (failed_bio->bi_vcnt > 1) {
+   if (failed_bio->bi_iter.bi_size > BTRFS_I(inode)->root->sectorsize) {
/*
 * to fulfill b), we need to know the exact failing sectors, as
 * we don't want to rewrite any more than the failed ones. thus,
@@ -2403,7 +2403,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
return -EIO;
}
 
-   if (failed_bio->bi_vcnt > 1)
+   if (failed_bio->bi_iter.bi_size > BTRFS_I(inode)->root->sectorsize)
read_mode = READ_SYNC | REQ_FAILFAST_DEV;
else
read_mode = READ_SYNC;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3f09cb6..54afe41 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7933,9 +7933,7 @@ static int dio_read_error(struct inode *inode, struct bio 
*failed_bio,
return -EIO;
}
 
-   if ((failed_bio->bi_vcnt > 1)
-   || (failed_bio->bi_io_vec->bv_len
-   > BTRFS_I(inode)->root->sectorsize))
+   if (failed_bio->bi_iter.bi_size > BTRFS_I(inode)->root->sectorsize)
read_mode = READ_SYNC | REQ_FAILFAST_DEV;
else
read_mode = READ_SYNC;
-- 
2.1.4



[PATCH 7/9] btrfs: refactor __btrfs_lookup_bio_sums to use bio_for_each_segment_all

2016-11-16 Thread Christoph Hellwig
Rework the loop a little bit to use the generic bio_for_each_segment_all
helper for iterating over the bio.

Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/file-item.c | 31 +++
 1 file changed, 11 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index fa8aa53..54ccb91 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -163,7 +163,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
   struct inode *inode, struct bio *bio,
   u64 logical_offset, u32 *dst, int dio)
 {
-   struct bio_vec *bvec = bio->bi_io_vec;
+   struct bio_vec *bvec;
struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
struct btrfs_csum_item *item = NULL;
	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
@@ -177,7 +177,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
u32 diff;
int nblocks;
int bio_index = 0;
-   int count;
+   int count = 0;
u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
 
path = btrfs_alloc_path();
@@ -223,8 +223,11 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
if (dio)
offset = logical_offset;
 
-   page_bytes_left = bvec->bv_len;
-   while (bio_index < bio->bi_vcnt) {
+   bio_for_each_segment_all(bvec, bio, bio_index) {
+   page_bytes_left = bvec->bv_len;
+   if (count)
+   goto next;
+
if (!dio)
offset = page_offset(bvec->bv_page) + bvec->bv_offset;
count = btrfs_find_ordered_sum(inode, offset, disk_bytenr,
@@ -285,29 +288,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root 
*root,
 found:
csum += count * csum_size;
nblocks -= count;
-
+next:
while (count--) {
disk_bytenr += root->sectorsize;
offset += root->sectorsize;
page_bytes_left -= root->sectorsize;
-   if (!page_bytes_left) {
-   bio_index++;
-   /*
-* make sure we're still inside the
-* bio before we update page_bytes_left
-*/
-   if (bio_index >= bio->bi_vcnt) {
-   WARN_ON_ONCE(count);
-   goto done;
-   }
-   bvec++;
-   page_bytes_left = bvec->bv_len;
-   }
-
+   if (!page_bytes_left)
+   break; /* move to next bio */
}
}
 
-done:
+   WARN_ON_ONCE(count);
btrfs_free_path(path);
return 0;
 }
-- 
2.1.4



[PATCH 8/9] btrfs: use bio_for_each_segment_all in __btrfsic_submit_bio

2016-11-16 Thread Christoph Hellwig
And remove the bogus check for a NULL return value from kmap, which
can't happen.  While we're at it: I don't think that kmapping up to 256
pages will work without deadlocks on highmem machines; a better idea would
be to use vm_map_ram to map all of them into a single virtual address
range.  Incidentally that would also simplify the code a lot.

Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/check-integrity.c | 30 +++---
 1 file changed, 11 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index a6f657f..86f681f 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -2819,10 +2819,11 @@ static void __btrfsic_submit_bio(struct bio *bio)
 * btrfsic_mount(), this might return NULL */
dev_state = btrfsic_dev_state_lookup(bio->bi_bdev);
if (NULL != dev_state &&
-   (bio_op(bio) == REQ_OP_WRITE) && NULL != bio->bi_io_vec) {
+   (bio_op(bio) == REQ_OP_WRITE) && bio_has_data(bio)) {
unsigned int i;
u64 dev_bytenr;
u64 cur_bytenr;
+   struct bio_vec *bvec;
int bio_is_patched;
char **mapped_datav;
 
@@ -2840,32 +2841,23 @@ static void __btrfsic_submit_bio(struct bio *bio)
if (!mapped_datav)
goto leave;
cur_bytenr = dev_bytenr;
-   for (i = 0; i < bio->bi_vcnt; i++) {
-   BUG_ON(bio->bi_io_vec[i].bv_len != PAGE_SIZE);
-   mapped_datav[i] = kmap(bio->bi_io_vec[i].bv_page);
-   if (!mapped_datav[i]) {
-   while (i > 0) {
-   i--;
-   kunmap(bio->bi_io_vec[i].bv_page);
-   }
-   kfree(mapped_datav);
-   goto leave;
-   }
+
+   bio_for_each_segment_all(bvec, bio, i) {
+   BUG_ON(bvec->bv_len != PAGE_SIZE);
+   mapped_datav[i] = kmap(bvec->bv_page);
+
if (dev_state->state->print_mask &
BTRFSIC_PRINT_MASK_SUBMIT_BIO_BH_VERBOSE)
pr_info("#%u: bytenr=%llu, len=%u, offset=%u\n",
-  i, cur_bytenr, bio->bi_io_vec[i].bv_len,
-  bio->bi_io_vec[i].bv_offset);
-   cur_bytenr += bio->bi_io_vec[i].bv_len;
+  i, cur_bytenr, bvec->bv_len, bvec->bv_offset);
+   cur_bytenr += bvec->bv_len;
}
btrfsic_process_written_block(dev_state, dev_bytenr,
  mapped_datav, bio->bi_vcnt,
  bio, &bio_is_patched,
  NULL, bio->bi_opf);
-   while (i > 0) {
-   i--;
-   kunmap(bio->bi_io_vec[i].bv_page);
-   }
+   bio_for_each_segment_all(bvec, bio, i)
+   kunmap(bvec->bv_page);
kfree(mapped_datav);
} else if (NULL != dev_state && (bio->bi_opf & REQ_PREFLUSH)) {
if (dev_state->state->print_mask &
-- 
2.1.4



[PATCH 2/9] btrfs: don't access the bio directly in the raid5/6 code

2016-11-16 Thread Christoph Hellwig
Just use bio_for_each_segment_all to iterate over all segments.

Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/raid56.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index d016d4a..da941fb 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1144,10 +1144,10 @@ static void validate_rbio_for_rmw(struct btrfs_raid_bio 
*rbio)
 static void index_rbio_pages(struct btrfs_raid_bio *rbio)
 {
struct bio *bio;
+   struct bio_vec *bvec;
u64 start;
unsigned long stripe_offset;
unsigned long page_index;
-   struct page *p;
int i;
 
	spin_lock_irq(&rbio->bio_list_lock);
@@ -1156,10 +1156,8 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
stripe_offset = start - rbio->bbio->raid_map[0];
page_index = stripe_offset >> PAGE_SHIFT;
 
-   for (i = 0; i < bio->bi_vcnt; i++) {
-   p = bio->bi_io_vec[i].bv_page;
-   rbio->bio_pages[page_index + i] = p;
-   }
+   bio_for_each_segment_all(bvec, bio, i)
+   rbio->bio_pages[page_index + i] = bvec->bv_page;
}
	spin_unlock_irq(&rbio->bio_list_lock);
 }
@@ -1433,13 +1431,11 @@ static int fail_bio_stripe(struct btrfs_raid_bio *rbio,
  */
 static void set_bio_pages_uptodate(struct bio *bio)
 {
+   struct bio_vec *bvec;
int i;
-   struct page *p;
 
-   for (i = 0; i < bio->bi_vcnt; i++) {
-   p = bio->bi_io_vec[i].bv_page;
-   SetPageUptodate(p);
-   }
+   bio_for_each_segment_all(bvec, bio, i)
+   SetPageUptodate(bvec->bv_page);
 }
 
 /*
-- 
2.1.4



[PATCH 4/9] btrfs: don't access the bio directly in btrfs_csum_one_bio

2016-11-16 Thread Christoph Hellwig
Use bio_for_each_segment_all to iterate over the segments instead.
This requires a bit of reshuffling so that we only look up the ordered
item once inside the bio_for_each_segment_all loop.

Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/file-item.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index d0d571c..fa8aa53 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -447,13 +447,12 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct 
inode *inode,
   struct bio *bio, u64 file_start, int contig)
 {
struct btrfs_ordered_sum *sums;
-   struct btrfs_ordered_extent *ordered;
+   struct btrfs_ordered_extent *ordered = NULL;
char *data;
-   struct bio_vec *bvec = bio->bi_io_vec;
-   int bio_index = 0;
+   struct bio_vec *bvec;
int index;
int nr_sectors;
-   int i;
+   int i, j;
unsigned long total_bytes = 0;
unsigned long this_sum_bytes = 0;
u64 offset;
@@ -470,17 +469,20 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct 
inode *inode,
if (contig)
offset = file_start;
else
-   offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+   offset = 0; /* shut up gcc */
 
-   ordered = btrfs_lookup_ordered_extent(inode, offset);
-   BUG_ON(!ordered); /* Logic error */
sums->bytenr = (u64)bio->bi_iter.bi_sector << 9;
index = 0;
 
-   while (bio_index < bio->bi_vcnt) {
+   bio_for_each_segment_all(bvec, bio, j) {
if (!contig)
offset = page_offset(bvec->bv_page) + bvec->bv_offset;
 
+   if (!ordered) {
+   ordered = btrfs_lookup_ordered_extent(inode, offset);
+   BUG_ON(!ordered); /* Logic error */
+   }
+
data = kmap_atomic(bvec->bv_page);
 
nr_sectors = BTRFS_BYTES_TO_BLKS(root->fs_info,
@@ -529,9 +531,6 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct 
inode *inode,
}
 
kunmap_atomic(data);
-
-   bio_index++;
-   bvec++;
}
this_sum_bytes = 0;
btrfs_add_ordered_sum(inode, ordered, sums);
-- 
2.1.4



[PATCH 3/9] btrfs: don't access the bio directly in btrfs_submit_direct_hook

2016-11-16 Thread Christoph Hellwig
Just use bio_for_each_segment_all to iterate over all segments.

Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/inode.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 147df4c..3f09cb6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8394,7 +8394,7 @@ static int btrfs_submit_direct_hook(struct 
btrfs_dio_private *dip,
struct btrfs_root *root = BTRFS_I(inode)->root;
struct bio *bio;
struct bio *orig_bio = dip->orig_bio;
-   struct bio_vec *bvec = orig_bio->bi_io_vec;
+   struct bio_vec *bvec;
u64 start_sector = orig_bio->bi_iter.bi_sector;
u64 file_offset = dip->logical_offset;
u64 submit_len = 0;
@@ -8403,7 +8403,7 @@ static int btrfs_submit_direct_hook(struct 
btrfs_dio_private *dip,
int async_submit = 0;
int nr_sectors;
int ret;
-   int i;
+   int i, j;
 
map_length = orig_bio->bi_iter.bi_size;
ret = btrfs_map_block(root->fs_info, btrfs_op(orig_bio),
@@ -8433,7 +8433,7 @@ static int btrfs_submit_direct_hook(struct 
btrfs_dio_private *dip,
btrfs_io_bio(bio)->logical = file_offset;
	atomic_inc(&dip->pending_bios);
 
-   while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
+   bio_for_each_segment_all(bvec, orig_bio, j) {
nr_sectors = BTRFS_BYTES_TO_BLKS(root->fs_info, bvec->bv_len);
i = 0;
 next_block:
@@ -8487,7 +8487,6 @@ static int btrfs_submit_direct_hook(struct 
btrfs_dio_private *dip,
i++;
goto next_block;
}
-   bvec++;
}
}
 
-- 
2.1.4



[PATCH 1/9] btrfs: use bio iterators for the decompression handlers

2016-11-16 Thread Christoph Hellwig
Pass the full bio to the decompression routines and use bio iterators
to iterate over the data in the bio.

Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/compression.c | 122 +
 fs/btrfs/compression.h |  12 ++---
 fs/btrfs/lzo.c |  17 ++-
 fs/btrfs/zlib.c|  15 ++
 4 files changed, 54 insertions(+), 112 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index d4d8b7e..12a631d 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -81,9 +81,9 @@ struct compressed_bio {
u32 sums;
 };
 
-static int btrfs_decompress_biovec(int type, struct page **pages_in,
-  u64 disk_start, struct bio_vec *bvec,
-  int vcnt, size_t srclen);
+static int btrfs_decompress_bio(int type, struct page **pages_in,
+  u64 disk_start, struct bio *orig_bio,
+  size_t srclen);
 
 static inline int compressed_bio_size(struct btrfs_root *root,
  unsigned long disk_size)
@@ -175,11 +175,10 @@ static void end_compressed_bio_read(struct bio *bio)
/* ok, we're the last bio for this extent, lets start
 * the decompression.
 */
-   ret = btrfs_decompress_biovec(cb->compress_type,
+   ret = btrfs_decompress_bio(cb->compress_type,
  cb->compressed_pages,
  cb->start,
- cb->orig_bio->bi_io_vec,
- cb->orig_bio->bi_vcnt,
+ cb->orig_bio,
  cb->compressed_len);
 csum_failed:
if (ret)
@@ -959,9 +958,7 @@ int btrfs_compress_pages(int type, struct address_space 
*mapping,
  *
  * disk_start is the starting logical offset of this array in the file
  *
- * bvec is a bio_vec of pages from the file that we want to decompress into
- *
- * vcnt is the count of pages in the biovec
+ * orig_bio contains the pages from the file that we want to decompress into
  *
  * srclen is the number of bytes in pages_in
  *
@@ -970,18 +967,18 @@ int btrfs_compress_pages(int type, struct address_space 
*mapping,
  * be contiguous.  They all correspond to the range of bytes covered by
  * the compressed extent.
  */
-static int btrfs_decompress_biovec(int type, struct page **pages_in,
-  u64 disk_start, struct bio_vec *bvec,
-  int vcnt, size_t srclen)
+static int btrfs_decompress_bio(int type, struct page **pages_in,
+  u64 disk_start, struct bio *orig_bio,
+  size_t srclen)
 {
struct list_head *workspace;
int ret;
 
workspace = find_workspace(type);
 
-   ret = btrfs_compress_op[type-1]->decompress_biovec(workspace, pages_in,
-disk_start,
-bvec, vcnt, srclen);
+   ret = btrfs_compress_op[type-1]->decompress_bio(workspace, pages_in,
+disk_start, orig_bio,
+srclen);
free_workspace(type, workspace);
return ret;
 }
@@ -1021,9 +1018,7 @@ void btrfs_exit_compress(void)
  */
 int btrfs_decompress_buf2page(char *buf, unsigned long buf_start,
  unsigned long total_out, u64 disk_start,
- struct bio_vec *bvec, int vcnt,
- unsigned long *pg_index,
- unsigned long *pg_offset)
+ struct bio *bio)
 {
unsigned long buf_offset;
unsigned long current_buf_start;
@@ -1031,13 +1026,13 @@ int btrfs_decompress_buf2page(char *buf, unsigned long 
buf_start,
unsigned long working_bytes = total_out - buf_start;
unsigned long bytes;
char *kaddr;
-   struct page *page_out = bvec[*pg_index].bv_page;
+   struct bio_vec bvec = bio_iter_iovec(bio, bio->bi_iter);
 
/*
 * start byte is the first byte of the page we're currently
 * copying into relative to the start of the compressed data.
 */
-   start_byte = page_offset(page_out) - disk_start;
+   start_byte = page_offset(bvec.bv_page) - disk_start;
 
/* we haven't yet hit data corresponding to this page */
if (total_out <= start_byte)
@@ -1057,80 +1052,45 @@ int btrfs_decompress_buf2page(char *buf, unsigned long 
buf_start,
 
/* copy bytes from the working buffer into the pages */
while (working_bytes > 0) {
-   bytes = min(PAGE_SIZE - *pg_offset,
-   PAGE_SIZE - buf_offset);
+   bytes = min_t(unsigned long, bvec.bv_len,
+   

don't poke into bio internals

2016-11-16 Thread Christoph Hellwig
Hi all,

this series has a few patches that switch btrfs to use the proper helpers for
accessing bio internals.  This helps to prepare for supporting multi-page
bio_vecs, which are currently under development.


Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Austin S. Hemmelgarn

On 2016-11-16 05:55, Martin Steigerwald wrote:

On Wednesday, 16 November 2016, 15:43:36 CET, Roman Mamedov wrote:

On Wed, 16 Nov 2016 11:25:00 +0100

Martin Steigerwald  wrote:

merkaba:~> mount -o degraded,clear_cache /dev/satafp1/backup /mnt/zeit
mount: wrong filesystem type, invalid options, the
superblock of /dev/mapper/satafp1-backup is damaged, missing
codepage or another error

  Sometimes the system log provides useful information –
  try  dmesg | tail  or similar

merkaba:~#32> dmesg | tail -6
[ 3080.120687] BTRFS info (device dm-13): allowing degraded mounts
[ 3080.120699] BTRFS info (device dm-13): force clearing of disk cache
[ 3080.120703] BTRFS info (device dm-13): disk space caching is
enabled
[ 3080.120706] BTRFS info (device dm-13): has skinny extents
[ 3080.150957] BTRFS warning (device dm-13): missing devices (1)
exceeds the limit (0), writeable mount is not allowed
[ 3080.195941] BTRFS: open_ctree failed


I have to wonder did you read the above message? What you need at this point
is simply "-o degraded,ro". But I don't see that tried anywhere down the
line.

See also (or try): https://patchwork.kernel.org/patch/9419189/


Actually I read that one, but I read more into it than what it was saying:

I read into it that BTRFS would automatically use a read only mount.


merkaba:~> mount -o degraded,ro /dev/satafp1/daten /mnt/zeit

actually really works. *Thank you*, Roman.


I do think that the above kernel messages invite such a kind of interpretation,
though. I took the "BTRFS: open_ctree failed" message as indicative of some
structural issue with the filesystem.
Technically, the fact that a device is missing is a structural issue 
with the FS.  Whether or not that falls under what any arbitrary person 
considers a structural issue or not is a different story.


General background though:
open_ctree is one of the core functions in the BTRFS code used during 
mounting the filesystem.  Everything that calls it checks the return 
code and spits out 'BTRFS: open_ctree failed' if it failed.  The problem 
is, just about everything internal (and many external things as well) to 
the BTRFS code that could prevent the FS from mounting happens either in 
open_ctree, or in a function it calls, so all that that line tells us is 
that the mount failed, which is less than useful in most cases.  Given 
both the confusion you've experienced regarding this (which has happened 
to other people too), combined with the amount of effort I've had to put 
in to get the rest of the SysOps people where I work to understand that 
that message just means 'mount failed', I would really love to see that 
just be replaced with 'mount failed' in non-debug builds, preferably 
with better info about _why_ things failed (the case of a degraded 
filesystem is pretty well covered, but most other cases other than
incompatible feature bits are not).


So mounting works although for some reason scrubbing is aborted (I had this
issue a long time ago on my laptop as well). After removing /var/lib/btrfs
scrub status file for the filesystem:
Last I knew, scrub doesn't work on degraded filesystems (in fact, by 
definition, it _can't_ work on a degraded array).  It absolutely won't 
work though without the read-only flag on filesystems which are mounted 
read-only.


merkaba:~> btrfs scrub start /mnt/zeit
scrub started on /mnt/zeit, fsid […] (pid=9054)
merkaba:~> btrfs scrub status /mnt/zeit
scrub status for […]
scrub started at Wed Nov 16 11:52:56 2016 and was aborted after
00:00:00
total bytes scrubbed: 0.00B with 0 errors

Anyway, I will now just rsync off the files.

Interestingly enough btrfs restore complained about looping over certain
files… let's see whether the rsync or btrfs send/receive proceeds through.
I'd expect rsync to be more likely to work than send/receive.  In 
general, if you can read the files, rsync will work, whereas 
send/receive needs to read some low-level data from the FS which may not 
be touched when just reading files, so there are cases where rsync will 
work but send/receive won't.



Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Martin Steigerwald
On Wednesday, 16 November 2016, 11:55:32 CET, you wrote:
> So mounting works although for some reason scrubbing is aborted (I had this
> issue a long time ago on my laptop as well). After removing /var/lib/btrfs 
> scrub status file for the filesystem:
> 
> merkaba:~> btrfs scrub start /mnt/zeit
> scrub started on /mnt/zeit, fsid […] (pid=9054)
> merkaba:~> btrfs scrub status /mnt/zeit
> scrub status for […]
> scrub started at Wed Nov 16 11:52:56 2016 and was aborted after 
> 00:00:00
> total bytes scrubbed: 0.00B with 0 errors
> 
> Anyway, I will now just rsync off the files.
> 
> Interestingly enough btrfs restore complained about looping over certain
> files… let's see whether the rsync or btrfs send/receive proceeds through.

I have an idea on why scrubbing may not work:

The filesystem is mounted read only and on checksum errors on one disk scrub 
would try to repair it with the good copy from another disk.

Yes, this is it:

merkaba:~>  btrfs scrub start -r /dev/satafp1/daten
scrub started on /dev/satafp1/daten, fsid […] (pid=9375)
merkaba:~>  btrfs scrub status /dev/satafp1/daten 
scrub status for […]
scrub started at Wed Nov 16 12:13:27 2016, running for 00:00:10
total bytes scrubbed: 45.53MiB with 0 errors

It would be helpful to receive a proper error message on this one.

Okay, seems today I learned quite something about BTRFS.

Thanks,

-- 
Martin Steigerwald  | Trainer

teamix GmbH
Südwestpark 43
90449 Nürnberg

Tel.:  +49 911 30999 55 | Fax: +49 911 30999 99
mail: martin.steigerw...@teamix.de | web:  http://www.teamix.de | blog: 
http://blog.teamix.de

Amtsgericht Nürnberg, HRB 18320 | Geschäftsführer: Oliver Kügow, Richard Müller

teamix Support Hotline: +49 911 30999-112
 
 *** Please like us on Facebook: facebook.com/teamix ***



Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Martin Steigerwald
On Wednesday, 16 November 2016, 16:00:31 CET, Roman Mamedov wrote:
> On Wed, 16 Nov 2016 11:55:32 +0100
> 
> Martin Steigerwald  wrote:
> > I do think that the above kernel messages invite such a kind of interpretation,
> > though. I took the "BTRFS: open_ctree failed" message as indicative of some
> > structural issue with the filesystem.
> 
> For the reason as to why the writable mount didn't work, check "btrfs fi df"
> for the filesystem to see if you have any "single" profile chunks on it:
> quite likely you did already mount it "degraded,rw" in the past *once*,
> after which those "single" chunks get created, and consequently it won't
> mount r/w anymore (without lifting the restriction on the number of missing
> devices as proposed).

That exactly explains it. I very likely did a degraded mount without ro on 
this disk already.

Funnily enough this creates another complication:

merkaba:/mnt/zeit#1> btrfs send somesubvolume | btrfs receive /mnt/someotherbtrfs
ERROR: subvolume /mnt/zeit/somesubvolume is not read-only

Yet:

merkaba:/mnt/zeit> btrfs property get somesubvolume
ro=false
merkaba:/mnt/zeit> btrfs property set somesubvolume ro true 
 
ERROR: failed to set flags for somesubvolume: Read-only file system

To me it seems the right logic would be to allow the send to proceed in case
the whole filesystem is read-only.

As there seems to be no force option to override the limitation and I
do not feel like compiling my own btrfs-tools right now, I will use rsync
instead.

Thanks,

-- 
Martin Steigerwald  | Trainer

teamix GmbH
Südwestpark 43
90449 Nürnberg

Tel.:  +49 911 30999 55 | Fax: +49 911 30999 99
mail: martin.steigerw...@teamix.de | web:  http://www.teamix.de | blog: 
http://blog.teamix.de

Amtsgericht Nürnberg, HRB 18320 | Geschäftsführer: Oliver Kügow, Richard Müller

teamix Support Hotline: +49 911 30999-112
 
 *** Please like us on Facebook: facebook.com/teamix ***



Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Roman Mamedov
On Wed, 16 Nov 2016 11:55:32 +0100
Martin Steigerwald  wrote:

> I do think that the above kernel messages invite such a kind of interpretation,
> though. I took the "BTRFS: open_ctree failed" message as indicative of some
> structural issue with the filesystem.

For the reason as to why the writable mount didn't work, check "btrfs fi df"
for the filesystem to see if you have any "single" profile chunks on it: quite
likely you did already mount it "degraded,rw" in the past *once*, after which
those "single" chunks get created, and consequently it won't mount r/w anymore
(without lifting the restriction on the number of missing devices as proposed).

-- 
With respect,
Roman


Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Martin Steigerwald
On Wednesday, 16 November 2016, 15:43:36 CET, Roman Mamedov wrote:
> On Wed, 16 Nov 2016 11:25:00 +0100
> 
> Martin Steigerwald  wrote:
> > merkaba:~> mount -o degraded,clear_cache /dev/satafp1/backup /mnt/zeit
> > mount: wrong filesystem type, invalid options, the
> > superblock of /dev/mapper/satafp1-backup is damaged, missing
> > codepage or another error
> > 
> >   Sometimes the system log provides useful information –
> >   try  dmesg | tail  or similar
> > 
> > merkaba:~#32> dmesg | tail -6
> > [ 3080.120687] BTRFS info (device dm-13): allowing degraded mounts
> > [ 3080.120699] BTRFS info (device dm-13): force clearing of disk cache
> > [ 3080.120703] BTRFS info (device dm-13): disk space caching is
> > enabled
> > [ 3080.120706] BTRFS info (device dm-13): has skinny extents
> > [ 3080.150957] BTRFS warning (device dm-13): missing devices (1)
> > exceeds the limit (0), writeable mount is not allowed
> > [ 3080.195941] BTRFS: open_ctree failed
> 
> I have to wonder did you read the above message? What you need at this point
> is simply "-o degraded,ro". But I don't see that tried anywhere down the
> line.
> 
> See also (or try): https://patchwork.kernel.org/patch/9419189/

Actually I read that one, but I read more into it than what it was saying:

I read into it that BTRFS would automatically use a read only mount.


merkaba:~> mount -o degraded,ro /dev/satafp1/daten /mnt/zeit

actually really works. *Thank you*, Roman.


I do think that the above kernel messages invite such a kind of interpretation,
though. I took the "BTRFS: open_ctree failed" message as indicative of some
structural issue with the filesystem.


So mounting works although for some reason scrubbing is aborted (I had this
issue a long time ago on my laptop as well). After removing /var/lib/btrfs 
scrub status file for the filesystem:

merkaba:~> btrfs scrub start /mnt/zeit
scrub started on /mnt/zeit, fsid […] (pid=9054)
merkaba:~> btrfs scrub status /mnt/zeit
scrub status for […]
scrub started at Wed Nov 16 11:52:56 2016 and was aborted after 
00:00:00
total bytes scrubbed: 0.00B with 0 errors

Anyway, I will now just rsync off the files.

Interestingly enough btrfs restore complained about looping over certain
files… let's see whether the rsync or btrfs send/receive proceeds through.

Ciao,

-- 
Martin Steigerwald  | Trainer

teamix GmbH
Südwestpark 43
90449 Nürnberg

Tel.:  +49 911 30999 55 | Fax: +49 911 30999 99
mail: martin.steigerw...@teamix.de | web:  http://www.teamix.de | blog: 
http://blog.teamix.de

Amtsgericht Nürnberg, HRB 18320 | Geschäftsführer: Oliver Kügow, Richard Müller

teamix Support Hotline: +49 911 30999-112
 
 *** Please like us on Facebook: facebook.com/teamix ***



Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Roman Mamedov
On Wed, 16 Nov 2016 11:25:00 +0100
Martin Steigerwald  wrote:

> merkaba:~> mount -o degraded,clear_cache /dev/satafp1/backup /mnt/zeit
> mount: wrong filesystem type, invalid options, the
> superblock of /dev/mapper/satafp1-backup is damaged, missing
> codepage or another error
> 
>   Sometimes the system log provides useful information –
>   try  dmesg | tail  or similar
> merkaba:~#32> dmesg | tail -6
> [ 3080.120687] BTRFS info (device dm-13): allowing degraded mounts
> [ 3080.120699] BTRFS info (device dm-13): force clearing of disk cache
> [ 3080.120703] BTRFS info (device dm-13): disk space caching is enabled
> [ 3080.120706] BTRFS info (device dm-13): has skinny extents
> [ 3080.150957] BTRFS warning (device dm-13): missing devices (1) exceeds 
> the limit (0), writeable mount is not allowed
> [ 3080.195941] BTRFS: open_ctree failed

I have to wonder did you read the above message? What you need at this point
is simply "-o degraded,ro". But I don't see that tried anywhere down the line.

See also (or try): https://patchwork.kernel.org/patch/9419189/

-- 
With respect,
Roman


degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Martin Steigerwald
Hello!

A degraded BTRFS RAID 1 from one 3TB SATA HDD of my former workstation is not 
mountable.

Debian 4.8 kernel + btrfs-tools 4.7.3.

A btrfs restore seems to work well enough, so on one hand there is no
urgency. But on the other hand I want to repurpose the harddisk and I
think I want to do it next weekend. So if you want me to gather some
debug data, please speak up quickly. Thank you.

AFAIR I have been able to mount the filesystems in degraded mode, but
this may have been on the other SATA HDD that I already wiped with the shred
command.


I have this:

merkaba:~> btrfs fi sh
[…]
warning, device 2 is missing
warning, device 2 is missing
warning, device 2 is missing
Label: 'debian'  uuid: […]
Total devices 2 FS bytes used 20.10GiB
devid1 size 50.00GiB used 29.03GiB path 
/dev/mapper/satafp1-debian
*** Some devices missing

Label: 'daten'  uuid: […]
Total devices 2 FS bytes used 135.02GiB
devid1 size 1.00TiB used 142.06GiB path 
/dev/mapper/satafp1-daten
*** Some devices missing

Label: 'backup'  uuid: […]
Total devices 2 FS bytes used 88.38GiB
devid1 size 1.00TiB used 93.06GiB path 
/dev/mapper/satafp1-backup
*** Some devices missing

But none of these filesystems seem to be mountable. Here some attempts:

merkaba:~#130> LANG=C mount -o degraded /dev/satafp1/backup /mnt/zeit
mount: wrong fs type, bad option, bad superblock on 
/dev/mapper/satafp1-daten,
  missing codepage or helper program, or other error

  In some cases useful info is found in syslog - try
  dmesg | tail or so.
merkaba:~> dmesg | tail -5
[ 2945.155943] BTRFS info (device dm-13): allowing degraded mounts
[ 2945.155953] BTRFS info (device dm-13): disk space caching is enabled
[ 2945.155957] BTRFS info (device dm-13): has skinny extents
[ 2945.611236] BTRFS warning (device dm-13): missing devices (1) exceeds the limit (0), writeable mount is not allowed
[ 2945.646719] BTRFS: open_ctree failed


merkaba:~> LANG=C mount -o usebackuproot /dev/satafp1/daten /mnt/zeit
mount: wrong fs type, bad option, bad superblock on /dev/mapper/satafp1-daten,
  missing codepage or helper program, or other error

  In some cases useful info is found in syslog - try
  dmesg | tail or so.
merkaba:~#32> dmesg | tail -5   
[ 5739.051433] BTRFS info (device dm-12): trying to use backup root at mount time
[ 5739.051441] BTRFS info (device dm-12): disk space caching is enabled
[ 5739.051444] BTRFS info (device dm-12): has skinny extents
[ 5739.103153] BTRFS error (device dm-12): failed to read chunk tree: -5
[ 5739.130304] BTRFS: open_ctree failed


merkaba:~> LANG=C mount -o degraded,usebackuproot /dev/satafp1/daten /mnt/zeit
mount: wrong fs type, bad option, bad superblock on /dev/mapper/satafp1-daten,
  missing codepage or helper program, or other error

  In some cases useful info is found in syslog - try
  dmesg | tail or so.
merkaba:~#32> dmesg | tail -5
[ 5801.704202] BTRFS info (device dm-12): trying to use backup root at mount time
[ 5801.704206] BTRFS info (device dm-12): disk space caching is enabled
[ 5801.704208] BTRFS info (device dm-12): has skinny extents
[ 5803.928059] BTRFS warning (device dm-12): missing devices (1) exceeds the limit (0), writeable mount is not allowed
[ 5804.064638] BTRFS: open_ctree failed
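The three failed attempts above can be collapsed into one loop, this time forced read-only, since a writeable degraded mount is exactly what the kernel refuses ("writeable mount is not allowed"). Echo-only sketch; device and mountpoint are the ones from this report, so remove "echo" and run as root to try it.

```shell
DEV=/dev/satafp1/daten   # device path from this report (assumption)
MNT=/mnt/zeit            # mountpoint from this report (assumption)
# Escalating option combinations, all read-only:
for opts in degraded,ro usebackuproot,ro degraded,usebackuproot,ro; do
    echo mount -o "$opts" "$DEV" "$MNT"
done
```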


`btrfs check` reports:

merkaba:~#32> btrfs check /dev/satafp1/backup 
warning, device 2 is missing
Checking filesystem on /dev/satafp1/backup
UUID: 01cf0493-476f-42e8-8905-61ef205313db
checking extents
checking free space cache
failed to load free space cache for block group 58003030016
failed to load free space cache for block group 60150513664
failed to load free space cache for block group 62297997312
[…]
checking fs roots
^C

I aborted it at this time as I wanted to try clear_cache mount option
after seeing this. I can redo this thing after btrfs restore completed.

merkaba:~> mount -o degraded,clear_cache /dev/satafp1/backup /mnt/zeit
mount: wrong fs type, bad option, bad superblock on /dev/mapper/satafp1-backup,
missing codepage or helper program, or other error

  In some cases useful info is found in syslog - try
  dmesg | tail or so.
merkaba:~#32> dmesg | tail -6
[ 3080.120687] BTRFS info (device dm-13): allowing degraded mounts
[ 3080.120699] BTRFS info (device dm-13): force clearing of disk cache
[ 3080.120703] BTRFS info (device dm-13): disk space caching is enabled
[ 3080.120706] BTRFS info (device dm-13): has skinny extents
[