Re: [RFC][PATCH] btrfs-progs: inspect: new subcommand to dump chunks

2016-06-22 Thread Liu Bo
On Wed, Jun 22, 2016 at 07:26:46PM +0200, David Sterba wrote:
> From: David Sterba 
> 
> Hi,
> 
> the chunk dump is a useful thing, for debugging or balance filters.
> 
> Example output:
> 
> Chunks on device id: 1
> PNumber  Type            PStart     Length     PEnd      Age     LStart  Usage
> --------------------------------------------------------------------------------
>       0  System/RAID1     1.00MiB   32.00MiB   33.00MiB   47    1.40TiB   0.06
>       1  Metadata/RAID1  33.00MiB    1.00GiB    1.03GiB   31    1.36TiB  64.03
>       2  Metadata/RAID1   1.03GiB   32.00MiB    1.06GiB   36    1.36TiB  77.28
>       3  Data/single      1.06GiB    1.00GiB    2.06GiB   12  422.30GiB  78.90
>       4  Data/single      2.06GiB    1.00GiB    3.06GiB   11  420.30GiB  78.47
> ...
> 
> (full output is at http://susepaste.org/view/raw/089a9877)
> 
> This patch does the basic output, not filtering or sorting besides physical and
> logical offset. There are possibilities to do more or enhance the output (e.g.
> starting with the logical chunk and listing the related physical chunks together).
> Or filter by type/profile, or understand the balance filters format.
> 
> As it is now, it's per-device dump of physical layout, which was the original
> idea.
> 
> Printing 'usage' is not default as it's quite slow, it uses the search ioctl
> and probably not in the best way, or there's some other issue in the
> implementation.
> 
> I'll add the patch to devel branch but will not add it for any particular
> release yet, I'd like some feedback first. Thanks.
> 
> -
> New command 'btrfs inspect-internal dump-chunks' will dump layout of
> chunks as stored on the devices. This corresponds to the physical
> layout, sorted by the physical offset. The block group usage can be
> shown as well, but the search is too slow so it's off by default.
> 
> If the physical offset sorting is selected, the empty space between
> chunks is also shown.
> 
> Signed-off-by: David Sterba 
> ---
>  cmds-inspect.c | 364 
> +
>  1 file changed, 364 insertions(+)
> 
> diff --git a/cmds-inspect.c b/cmds-inspect.c
> index dd7b9dd278f2..4eace63d8517 100644
> --- a/cmds-inspect.c
> +++ b/cmds-inspect.c
> @@ -623,6 +623,368 @@ static int cmd_inspect_min_dev_size(int argc, char 
> **argv)
>   return !!ret;
>  }
>  
> +static const char * const cmd_dump_chunks_usage[] = {
> + "btrfs inspect-internal chunk-stats [options] ",
> + "Show chunks (block groups) layout",
> + "Show chunks (block groups) layout for all devices",
> + "",
> + HELPINFO_UNITS_LONG,
> + "--sort=MODEsort by the physical or logical chunk start",
> + "   MODE is one of pstart or lstart (default: pstart)",
> + "--usageshow usage per block group, note this can be slow",
> + NULL
> +};
> +
> +enum {
> + CHUNK_SORT_PSTART,
> + CHUNK_SORT_LSTART,
> + CHUNK_SORT_DEFAULT = CHUNK_SORT_PSTART
> +};
> +
> +struct dump_chunks_entry {
> + u64 devid;
> + u64 start;
> + u64 lstart;
> + u64 length;
> + u64 flags;
> + u64 age;
> + u64 used;
> + u32 pnumber;
> +};
> +
> +struct dump_chunks_ctx {
> + unsigned length;
> + unsigned size;
> + struct dump_chunks_entry *stats;
> +};
> +
> +int cmp_cse_devid_start(const void *va, const void *vb)
> +{
> + const struct dump_chunks_entry *a = va;
> + const struct dump_chunks_entry *b = vb;
> +
> + if (a->devid < b->devid)
> + return -1;
> + if (a->devid > b->devid)
> + return 1;
> +
> + if (a->start < b->start)
> + return -1;
> + if (a->start == b->start) {
> + error(
> + "chunks start on same offset in the same device: devid %llu start %llu",
> + (unsigned long long)a->devid, (unsigned long long)a->start);
> + return 0;
> + }
> + return 1;
> +}
> +
> +int cmp_cse_devid_lstart(const void *va, const void *vb)
> +{
> + const struct dump_chunks_entry *a = va;
> + const struct dump_chunks_entry *b = vb;
> +
> + if (a->devid < b->devid)
> + return -1;
> + if (a->devid > b->devid)
> + return 1;
> +
> + if (a->lstart < b->lstart)
> + return -1;
> + if (a->lstart == b->lstart) {
> + error(
> +"chunks logically start on same offset in the same device: devid %llu start 
> %llu",
> + (unsigned long long)a->devid, (unsigned long 
> long)a->lstart);
> + return 0;
> + }
> + return 1;
> +}
> +
> +void print_dump_chunks(struct dump_chunks_ctx *ctx, unsigned sort_mode,
> + unsigned unit_mode, int with_usage)
> +{
> + u64 devid;
> + struct dump_chunks_entry e;
> + int i;
> + int 

[PATCH] Btrfs: cleanup BUG_ON in merge_bio

2016-06-22 Thread Liu Bo
One can use btrfs-corrupt-block to hit the BUG_ON() in merge_bio(), so
this aims to stop anyone from being able to panic the whole system
simply by using their btrfs.

Since the error in merge_bio can only come from __btrfs_map_block(),
when the chunk tree mapping has something insane, and
__btrfs_map_block() has already printed the reason, we can just return
errors from merge_bio.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent_io.c | 1 -
 fs/btrfs/inode.c | 8 ++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index aa44d3e..a8661fb 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2756,7 +2756,6 @@ static int merge_bio(int rw, struct extent_io_tree *tree, 
struct page *page,
if (tree->ops && tree->ops->merge_bio_hook)
ret = tree->ops->merge_bio_hook(rw, page, offset, size, bio,
bio_flags);
-   BUG_ON(ret < 0);
return ret;
 
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8b1212e..d203c06 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1822,6 +1822,10 @@ static void btrfs_clear_bit_hook(struct inode *inode,
 /*
  * extent_io.c merge_bio_hook, this must check the chunk tree to make sure
  * we don't create bios that span stripes or chunks
+ *
+ * return 1 if page cannot be merged to bio
+ * return 0 if page can be merged to bio
+ * return error otherwise
  */
 int btrfs_merge_bio_hook(int rw, struct page *page, unsigned long offset,
 size_t size, struct bio *bio,
@@ -1840,8 +1844,8 @@ int btrfs_merge_bio_hook(int rw, struct page *page, 
unsigned long offset,
map_length = length;
ret = btrfs_map_block(root->fs_info, rw, logical,
  &map_length, NULL, 0);
-   /* Will always return 0 with map_multi == NULL */
-   BUG_ON(ret < 0);
+   if (ret < 0)
+   return ret;
if (map_length < length + size)
return 1;
return 0;
-- 
2.5.5



[PATCH] Btrfs: error out if generic_bin_search get invalid arguments

2016-06-22 Thread Liu Bo
With btrfs-corrupt-block, one can set a btree node/leaf's fields. If
we assign a negative value to a node/leaf, we can get various hangs,
e.g. if extent_root's nritems is -2ULL, then we get stuck in
btrfs_read_block_groups() because it has a while loop and
btrfs_search_slot() on extent_root will always return the first
child.

This lets us know what's happening and returns -EINVAL to callers
instead of returning the first item.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index c49a500..915d224 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1770,6 +1770,14 @@ static noinline int generic_bin_search(struct 
extent_buffer *eb,
unsigned long map_len = 0;
int err;
 
+   if (low > high) {
+   btrfs_err(eb->fs_info,
+"%s: low (%d) < high (%d) eb %llu owner %llu level %d",
+ __func__, low, high, eb->start,
+ btrfs_header_owner(eb), btrfs_header_level(eb));
+   return -EINVAL;
+   }
+
while (low < high) {
mid = (low + high) / 2;
offset = p + mid * item_size;
-- 
2.5.5



[PATCH] Btrfs: check inconsistence between chunk and block group

2016-06-22 Thread Liu Bo
With btrfs-corrupt-block, one can drop one chunk item and mounting
will end up with a panic in btrfs_full_stripe_len().

This doesn't remove the BUG_ON, but instead adds the check a bit
earlier, when we find the block group item.

Signed-off-by: Liu Bo 
---
 fs/btrfs/extent-tree.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 689d25a..d860291 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9662,7 +9662,22 @@ static int find_first_block_group(struct btrfs_root 
*root,
 
if (found_key.objectid >= key->objectid &&
found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
-   ret = 0;
+   struct extent_map_tree *em_tree;
+   struct extent_map *em;
+
+   em_tree = &root->fs_info->mapping_tree.map_tree;
+   read_lock(&em_tree->lock);
+   em = lookup_extent_mapping(em_tree, found_key.objectid,
+  found_key.offset);
+   read_unlock(&em_tree->lock);
+   if (!em) {
+   btrfs_err(root->fs_info,
+   "logical %llu len %llu found bg but no related chunk",
+ found_key.objectid, found_key.offset);
+   ret = -ENOENT;
+   } else {
+   ret = 0;
+   }
goto out;
}
path->slots[0]++;
-- 
2.5.5



[PATCH] Btrfs: fix BUG_ON in btrfs_submit_compressed_write

2016-06-22 Thread Liu Bo
This is similar to btrfs_submit_compressed_read(): if we fail after the
bio is allocated, then we can use bio_endio() and errors are saved in
bio->bi_error.  But please note that we don't return errors to the
caller, because the caller assumes it won't need to call endio to clean
up on error.

Signed-off-by: Liu Bo 
---
 fs/btrfs/compression.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 658c39b..7a4d9c8 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -402,7 +402,10 @@ int btrfs_submit_compressed_write(struct inode *inode, u64 
start,
}
 
ret = btrfs_map_bio(root, WRITE, bio, 0, 1);
-   BUG_ON(ret); /* -ENOMEM */
+   if (ret) {
+   bio->bi_error = ret;
+   bio_endio(bio);
+   }
 
bio_put(bio);
 
@@ -432,7 +435,10 @@ int btrfs_submit_compressed_write(struct inode *inode, u64 
start,
}
 
ret = btrfs_map_bio(root, WRITE, bio, 0, 1);
-   BUG_ON(ret); /* -ENOMEM */
+   if (ret) {
+   bio->bi_error = ret;
+   bio_endio(bio);
+   }
 
bio_put(bio);
return 0;
-- 
2.5.5



Re: [RFC][PATCH] btrfs-progs: inspect: new subcommand to dump chunks

2016-06-22 Thread Qu Wenruo



At 06/23/2016 01:26 AM, David Sterba wrote:

From: David Sterba 

Hi,

the chunk dump is a useful thing, for debugging or balance filters.


Yes, quite useful.
Now I no longer need to manually check btrfs-debug-tree output to
figure out the dev-extent layout.


But according to the output, it's more like a dev-extent dump, not a chunk dump.

It's better to make things clearer for objects at different logical
levels.




Example output:

Chunks on device id: 1
PNumber  Type            PStart     Length     PEnd      Age     LStart  Usage
--------------------------------------------------------------------------------
      0  System/RAID1     1.00MiB   32.00MiB   33.00MiB   47    1.40TiB   0.06


[Mixed chunk/dev-extent output]
For "PStart/Length/Pend/Lstart" that's OK.
(Although I prefer "Plength" to "Length", as it's a little confusing for 
chunk logical length and dev_extent physical length)


Since "pstart" = key.offset, "length" = dev_extent_length, "Pend" = 
pstart + length and "Lstart" = dev_extent_chunk_offset, these values are 
all straightforward.


But "Usage" is completely a chunk/block group level thing, so
personally speaking it's better not to output per-chunk usage in a
dev-extent dump.


If there is a real "chunk dump", then that's the best place to output
per-chunk "usage".


For dev-extent level usage, it should be the usage of a device, not per
chunk.


["Age" calculation]
Another concern is about "age", since there is no generation in 
chunk_item nor dev_extent, so you're using logical bytenr as a hint.


This method implies the fact that btrfs will always allocates new chunk 
with increasing logical bytenr, never reuse lower logical bytenr.


Such behavior provides the basis for a lot of other btrfs functions, 
like balance/qgroup rescan.
IMHO it won't be changed in any time soon. But I'm still a little 
concerned about using such assumption.



Since "age" here is more like a sorting index, not the generation when 
the chunk is created.

It may not really reflect how long one chunk really lived.

It maybe completely OK for most people, but at least it's not as 
straightforward as generation.



[Support for offline btrfs]
BTW, it's also a good idea to support not only a mounted fs, but also
an offline one.


For an offline one, there's no need to bother with the tree search
ioctl, just a straightforward search_slots().

(That's why I like offline btrfs tools)
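
For illustration, an offline walk over the dev extents could look roughly
like the sketch below. It leans on btrfs-progs internals (open_ctree(),
btrfs_search_slot()); the wrapper function name is made up, and the exact
includes and error handling are assumptions, not a finished implementation:

/* Sketch only: list DEV_EXTENT items of an unmounted filesystem.
 * Relies on btrfs-progs internal headers; details differ between
 * progs versions. */
#include <stdio.h>
#include "ctree.h"
#include "disk-io.h"

static int dump_dev_extents_offline(const char *device)
{
	struct btrfs_root *root = open_ctree(device, 0, 0);
	struct btrfs_root *dev_root;
	struct btrfs_path path;
	struct btrfs_key key;
	int ret;

	if (!root)
		return -EIO;
	dev_root = root->fs_info->dev_root;

	btrfs_init_path(&path);
	key.objectid = 0;			/* start at the first devid */
	key.type = BTRFS_DEV_EXTENT_KEY;
	key.offset = 0;

	ret = btrfs_search_slot(NULL, dev_root, &key, &path, 0, 0);
	if (ret < 0)
		goto out;

	while (1) {
		struct extent_buffer *leaf = path.nodes[0];

		if (path.slots[0] >= btrfs_header_nritems(leaf)) {
			ret = btrfs_next_leaf(dev_root, &path);
			if (ret)		/* > 0 means no more leaves */
				break;
			continue;
		}
		btrfs_item_key_to_cpu(leaf, &key, path.slots[0]);
		if (key.type == BTRFS_DEV_EXTENT_KEY) {
			struct btrfs_dev_extent *de;

			de = btrfs_item_ptr(leaf, path.slots[0],
					    struct btrfs_dev_extent);
			printf("devid %llu pstart %llu plen %llu lstart %llu\n",
			       (unsigned long long)key.objectid,
			       (unsigned long long)key.offset,
			       (unsigned long long)btrfs_dev_extent_length(leaf, de),
			       (unsigned long long)btrfs_dev_extent_chunk_offset(leaf, de));
		}
		path.slots[0]++;
	}
	if (ret > 0)
		ret = 0;
out:
	btrfs_release_path(&path);
	close_ctree(root);
	return ret;
}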

Thanks,
Qu


      1  Metadata/RAID1  33.00MiB    1.00GiB    1.03GiB   31    1.36TiB  64.03
      2  Metadata/RAID1   1.03GiB   32.00MiB    1.06GiB   36    1.36TiB  77.28
      3  Data/single      1.06GiB    1.00GiB    2.06GiB   12  422.30GiB  78.90
      4  Data/single      2.06GiB    1.00GiB    3.06GiB   11  420.30GiB  78.47
...

(full output is at http://susepaste.org/view/raw/089a9877)

This patch does the basic output, not filtering or sorting besides physical and
logical offset. There are possibilities to do more or enhance the output (e.g.
starting with the logical chunk and listing the related physical chunks together).
Or filter by type/profile, or understand the balance filters format.

As it is now, it's per-device dump of physical layout, which was the original
idea.

Printing 'usage' is not default as it's quite slow, it uses the search ioctl
and probably not in the best way, or there's some other issue in the
implementation.

I'll add the patch to devel branch but will not add it for any particular
release yet, I'd like some feedback first. Thanks.

-
New command 'btrfs inspect-internal dump-chunks' will dump layout of
chunks as stored on the devices. This corresponds to the physical
layout, sorted by the physical offset. The block group usage can be
shown as well, but the search is too slow so it's off by default.

If the physical offset sorting is selected, the empty space between
chunks is also shown.

Signed-off-by: David Sterba 
---
 cmds-inspect.c | 364 +
 1 file changed, 364 insertions(+)

diff --git a/cmds-inspect.c b/cmds-inspect.c
index dd7b9dd278f2..4eace63d8517 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -623,6 +623,368 @@ static int cmd_inspect_min_dev_size(int argc, char **argv)
return !!ret;
 }

+static const char * const cmd_dump_chunks_usage[] = {
+   "btrfs inspect-internal chunk-stats [options] ",
+   "Show chunks (block groups) layout",
+   "Show chunks (block groups) layout for all devices",
+   "",
+   HELPINFO_UNITS_LONG,
+   "--sort=MODEsort by the physical or logical chunk start",
+   "   MODE is one of pstart or lstart (default: pstart)",
+   "--usageshow usage per block group, note this can be slow",
+   NULL
+};
+
+enum {
+   CHUNK_SORT_PSTART,
+   CHUNK_SORT_LSTART,
+   

Re: [RFC][PATCH] btrfs-progs: inspect: new subcommand to dump chunks

2016-06-22 Thread Hans van Kranenburg

On 06/22/2016 07:26 PM, David Sterba wrote:

From: David Sterba 

Hi,

the chunk dump is a useful thing, for debugging or balance filters.

Example output:

Chunks on device id: 1
PNumber  Type            PStart     Length     PEnd      Age     LStart  Usage
--------------------------------------------------------------------------------
      0  System/RAID1     1.00MiB   32.00MiB   33.00MiB   47    1.40TiB   0.06
      1  Metadata/RAID1  33.00MiB    1.00GiB    1.03GiB   31    1.36TiB  64.03
      2  Metadata/RAID1   1.03GiB   32.00MiB    1.06GiB   36    1.36TiB  77.28
      3  Data/single      1.06GiB    1.00GiB    2.06GiB   12  422.30GiB  78.90
      4  Data/single      2.06GiB    1.00GiB    3.06GiB   11  420.30GiB  78.47
...


On RAID0, it looks funny :)

# ./btrfs inspect-internal dump-chunks /mnt/extents/
Chunks on device id: 1
PNumber  Type            PStart      Length     PEnd      Age    LStart
------------------------------------------------------------------------
      0  Metadata/RAID0    1.00MiB  256.00MiB  257.00MiB    1  13.36GiB
      .  empty                   .   16.00EiB          .    .         .
      1  System/RAID0    129.00MiB   64.00MiB  193.00MiB    2  13.61GiB
      .  empty                   .   16.00EiB          .    .         .
      2  Data/RAID0      161.00MiB    2.00GiB    2.16GiB    3  15.68GiB
      .  empty                   .  115.00MiB          .    .         .
      3  Data/RAID0        2.27GiB    2.00GiB    4.27GiB    0  11.36GiB


Chunks on device id: 2
PNumber  Type            PStart      Length     PEnd      Age    LStart
------------------------------------------------------------------------
      0  Data/RAID0        1.00MiB    2.00GiB    2.00GiB    0  11.36GiB
      .  empty                   .   16.00EiB          .    .         .
      1  Metadata/RAID0    1.00GiB  256.00MiB    1.25GiB    1  13.36GiB
      .  empty                   .   16.00EiB          .    .         .
      2  System/RAID0      1.13GiB   64.00MiB    1.19GiB    2  13.61GiB
      .  empty                   .   16.00EiB          .    .         .
      3  Data/RAID0        1.16GiB    2.00GiB    3.16GiB    3  15.68GiB


For the correct sizes, you would either need to do some math, like
dividing by 2 for RAID0, but I guess this gets horrible really soon
with RAID5 etc... Or, if you want to print everything in physical
order, why not just dump the device tree with its DEV_EXTENT items,
which have the exact length on disk, and fill in the missing
information using the reference to the chunk they belong to?


https://github.com/knorrie/python-btrfs/commit/be85c2a0e2774909b532326e99e0c23ab8467263

Sorry, couldn't resist /:D\
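
For what it's worth, the same dev-extent walk on a mounted filesystem via
the search ioctl could look roughly like this (a sketch, not tested against
the patch: the helper name is made up, the tree/key constants and the item
layout are mirrored from the on-disk format, and a real tool would continue
the search until fewer than nr_items come back):

/* Sketch: list DEV_EXTENT items of a mounted fs through BTRFS_IOC_TREE_SEARCH. */
#include <endian.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

#define DEV_TREE_OBJECTID	4ULL	/* BTRFS_DEV_TREE_OBJECTID */
#define DEV_EXTENT_KEY		204	/* BTRFS_DEV_EXTENT_KEY */

struct dev_extent_disk {		/* mirrors struct btrfs_dev_extent */
	__u64 chunk_tree;
	__u64 chunk_objectid;
	__u64 chunk_offset;		/* logical start of the chunk */
	__u64 length;			/* physical length on this device */
	__u8  chunk_tree_uuid[16];
} __attribute__((packed));

static int list_dev_extents(int fd)
{
	struct btrfs_ioctl_search_args args;
	struct btrfs_ioctl_search_key *sk = &args.key;
	unsigned long off = 0;
	unsigned int i;

	memset(&args, 0, sizeof(args));
	sk->tree_id = DEV_TREE_OBJECTID;
	sk->min_type = DEV_EXTENT_KEY;
	sk->max_type = DEV_EXTENT_KEY;
	sk->max_objectid = (__u64)-1;	/* all devids */
	sk->max_offset = (__u64)-1;
	sk->max_transid = (__u64)-1;
	sk->nr_items = 4096;

	if (ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args) < 0)
		return -1;

	/* The header's objectid is the devid, its offset the physical start. */
	for (i = 0; i < sk->nr_items; i++) {
		struct btrfs_ioctl_search_header sh;
		struct dev_extent_disk de;

		memcpy(&sh, args.buf + off, sizeof(sh));
		off += sizeof(sh);
		memcpy(&de, args.buf + off, sizeof(de));
		off += sh.len;

		printf("devid %llu pstart %llu plen %llu lstart %llu\n",
		       (unsigned long long)sh.objectid,
		       (unsigned long long)sh.offset,
		       (unsigned long long)le64toh(de.length),
		       (unsigned long long)le64toh(de.chunk_offset));
	}
	return 0;
}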

--
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van.kranenb...@mendix.com | www.mendix.com


Re: btrfs: page allocation failure: order:1, mode:0x2204020

2016-06-22 Thread Hans van Kranenburg

On 06/20/2016 02:49 PM, David Sterba wrote:

On Sat, Jun 18, 2016 at 08:47:55PM +0200, Hans van Kranenburg wrote:

Last night, one of my btrfs filesystems went read-only after a memory
allocation failure (logging attached).


According to the logs, the allocation itself happens out of btrfs so we
can't do much here.

More specifically, when creating a new subvolume and requesting an
anonymous block device (via get_anon_bdev), there's a call to request a
free id for it. This could ask for "order-1" ie 8kb of contiguous
memory. And it failed.

This depends on memory fragmentation, ie. how the pages are allocated
and freed over time. Tweaking vm.min_free_kbytes could help but it's not
designed to prevent memory allocations in such scenario.

The id range structures themselves do not need more than a 4k page, but
the slab cache tries to provide more objects per slab, I see this on my
box right now:

Excerpt from /proc/slabinfo:

idr_layer_cache      474    486   2096    3    2

where 3 == objperslab and 2 == pages per slab, which corresponds to the
8kb. This seems to depend on internal slab cache logic, and nothing I'd
like to go changing right now.


Aha, interesting. The output of "sort -nk 6 < /proc/slabinfo" shows me 
that most of them use the 4k.



[...]

I've seen this happen once before somewhere else, also during snapshot
creation, also with a 4.5.x kernel.


Speaking of the devil... Yesterday another one crashed, and this one is 
using Debian 3.16.7-ckt25-2. Errors attached.



There's a bug report at Debian, in which is suggested to increase the
value of vm.min_free_kbytes:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=666021


[...]


[2363000.815554] Node 0 Normal: 2424*4kB (U) 0*8kB 0*16kB 0*32kB 0*64kB
0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 9696kB


Just to confirm the fragmentation, there are no page orders higher than
0 (ie.  8k and up).

So, technically not a btrfs bug but we could get affected by it badly.


Ah, so the list is what's still free, and there's no 8kB block available.

Now this also makes perfect sense to me:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=666021#90

So while increasing the min_free_kbytes will not fix the fragmentation 
issue, it is at least a workaround by just throwing some more dedicated 
space at it and decreasing the chance of this happening.


This one was even more spectacular, since it ends in oopsing and the 
whole box going down a bit later. I see no line with "Normal" in here by 
the way...


btrfs: page allocation failure: order:1, mode:0x204020
CPU: 0 PID: 9475 Comm: btrfs Not tainted 3.16.0-4-amd64 #1 Debian 
3.16.7-ckt25-2

  8150e835 00204020 8800052f7838
 81142d3f   8802
 0041 8800ef8a5c00  
Call Trace:
 [] ? dump_stack+0x5d/0x78
 [] ? warn_alloc_failed+0xdf/0x130
 [] ? __alloc_pages_nodemask+0x8ca/0xb30
 [] ? kmem_getpages+0x5b/0x110
 [] ? fallback_alloc+0x1cf/0x210
 [] ? kmem_cache_alloc+0x1f0/0x450
 [] ? ida_pre_get+0x60/0xe0
 [] ? get_anon_bdev+0x62/0xe0
 [] ? btrfs_init_fs_root+0x101/0x190 [btrfs]
 [] ? btrfs_get_fs_root+0xb6/0x270 [btrfs]
 [] ? create_pending_snapshot+0x6f6/0x960 [btrfs]
 [] ? create_pending_snapshots+0x89/0xb0 [btrfs]
 [] ? btrfs_commit_transaction+0x40c/0xa10 [btrfs]
 [] ? start_transaction+0x90/0x570 [btrfs]
 [] ? btrfs_mksubvol.isra.29+0x54a/0x5a0 [btrfs]
 [] ? prepare_to_wait_event+0xf0/0xf0
 [] ? btrfs_ioctl_snap_create_transid+0x191/0x1a0 [btrfs]
 [] ? btrfs_ioctl_snap_create_v2+0xee/0x140 [btrfs]
 [] ? btrfs_ioctl+0xcc4/0x2b50 [btrfs]
 [] ? handle_mm_fault+0x63c/0x11c0
 [] ? mmap_region+0x15f/0x650
 [] ? __do_page_fault+0x1d1/0x4f0
 [] ? do_mmap_pgoff+0x2e9/0x3b0
 [] ? do_vfs_ioctl+0x2cf/0x4b0
 [] ? SyS_ioctl+0x81/0xa0
 [] ? page_fault+0x28/0x30
 [] ? system_call_fast_compare_end+0x10/0x15
Mem-Info:
Node 0 DMA per-cpu:
CPU0: hi:0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU0: hi:  186, btch:  31 usd: 135
active_anon:58342 inactive_anon:20518 isolated_anon:0
 active_file:371028 inactive_file:364279 isolated_file:0
 unevictable:0 dirty:25166 writeback:0 unstable:0
 free:85089 slab_reclaimable:46135 slab_unreclaimable:9219
 mapped:5560 shmem:20579 pagetables:1492 bounce:0
 free_cma:0
Node 0 DMA free:14984kB min:32kB low:40kB high:48kB active_anon:344kB 
inactive_anon:160kB active_file:40kB inactive_file:36kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:15996kB managed:15912kB 
mlocked:0kB dirty:16kB writeback:0kB mapped:32kB shmem:160kB 
slab_reclaimable:84kB slab_unreclaimable:60kB kernel_stack:64kB 
pagetables:16kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:11 all_unreclaimable? no

lowmem_reserve[]: 0 3743 3743 3743
Node 0 DMA32 free:325372kB min:7808kB low:9760kB high:11712kB 
active_anon:233024kB inactive_anon:81912kB active_file:1484072kB 
inactive_file:1457080kB unevictable:0kB isolated(anon):0kB 

Re: [RFC][PATCH] btrfs-progs: inspect: new subcommand to dump chunks

2016-06-22 Thread Hans van Kranenburg

On 06/22/2016 07:26 PM, David Sterba wrote:

From: David Sterba 

Hi,

the chunk dump is a useful thing, for debugging or balance filters.


Nice!


Example output:

Chunks on device id: 1
PNumber  Type            PStart     Length     PEnd      Age     LStart  Usage
--------------------------------------------------------------------------------
      0  System/RAID1     1.00MiB   32.00MiB   33.00MiB   47    1.40TiB   0.06
      1  Metadata/RAID1  33.00MiB    1.00GiB    1.03GiB   31    1.36TiB  64.03
      2  Metadata/RAID1   1.03GiB   32.00MiB    1.06GiB   36    1.36TiB  77.28
      3  Data/single      1.06GiB    1.00GiB    2.06GiB   12  422.30GiB  78.90
      4  Data/single      2.06GiB    1.00GiB    3.06GiB   11  420.30GiB  78.47
...

(full output is at http://susepaste.org/view/raw/089a9877)

This patch does the basic output, not filtering or sorting besides physical and
logical offset. There are possibilities to do more or enhance the output (e.g.
starting with the logical chunk and listing the related physical chunks together).
Or filter by type/profile, or understand the balance filters format.

As it is now, it's per-device dump of physical layout, which was the original
idea.

Printing 'usage' is not default as it's quite slow, it uses the search ioctl
and probably not in the best way, or there's some other issue in the
implementation.


Interesting.

So after reading this, I wrote a little test to test some scenarios:

https://github.com/knorrie/python-btrfs/commit/1ca99880dfa0e14b148f3d9e2b6b381b781eb52d

It's very clear that the most optimal way of doing this search is to 
have nr_items=1 and if possible, specify the length in offset.


I have no idea yet what happens inside the kernel when nr_items is not 
1, in combination with a large extent tree. I'll try to follow the code 
and see if I can find what's making this big difference.


Also, the filesystem with the 1200 second search result is Debian 
3.16.7-ckt25-2, I haven't looked yet if there's different behaviour 
inside handling searches between that kernel and e.g. 4.5.


After loading a huge amount of data in memory, the search with nr_items 
> 1 only takes 4.6 seconds to come up with the block group item itself, 
and another 4.6 seconds to come up with 0 additional results, only using 
cpu time.


It seems that searching in the empty space between
 (vaddr BLOCK_GROUP_ITEM length+1) and
 (vaddr BLOCK_GROUP_ITEM ULLONG_MAX)
is really expensive, while there's absolutely nothing to find.

I ran into this two days ago, caused by doing an unnecessary extra 
search after finding the block item already:


https://github.com/knorrie/python-btrfs/commit/5a69d9cf0477515ce2d6e50f1741276a49b33af8


I'll add the patch to devel branch but will not add it for any particular
release yet, I'd like some feedback first. Thanks.

-
New command 'btrfs inspect-internal dump-chunks' will dump layout of
chunks as stored on the devices. This corresponds to the physical
layout, sorted by the physical offset. The block group usage can be
shown as well, but the search is too slow so it's off by default.

If the physical offset sorting is selected, the empty space between
chunks is also shown.

Signed-off-by: David Sterba 
---
 cmds-inspect.c | 364 +
 1 file changed, 364 insertions(+)
[...]
+}
+
+static u64 fill_usage(int fd, u64 lstart)
+{
+   struct btrfs_ioctl_search_args args;
+   struct btrfs_ioctl_search_key *sk = &args.key;
+   struct btrfs_ioctl_search_header sh;
+   struct btrfs_block_group_item *item;
+   int ret;
+
+   memset(&args, 0, sizeof(args));
+   sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
+   sk->min_objectid = lstart;
+   sk->max_objectid = lstart;
+   sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+   sk->max_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+   sk->min_offset = 0;
+   sk->max_offset = (u64)-1;


The chunk length is already known here, so it can go in offset.


+   sk->max_transid = (u64)-1;
+
+   sk->nr_items = 4096;


Or set to 1 because we already know there can only be one result.
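
Put together, the two suggestions would make the search look roughly like
this (a sketch against the quoted fill_usage(), assuming the caller passes
in the already-known chunk length through an extra parameter; not the
actual patch):

/* Sketch: pin the key down with the known chunk length and ask for a
 * single item, since only one block group item can match. */
static u64 fill_usage(int fd, u64 lstart, u64 length)
{
	struct btrfs_ioctl_search_args args;
	struct btrfs_ioctl_search_key *sk = &args.key;
	struct btrfs_block_group_item item;
	int ret;

	memset(&args, 0, sizeof(args));
	sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
	sk->min_objectid = lstart;
	sk->max_objectid = lstart;
	sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
	sk->max_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
	sk->min_offset = length;	/* the key offset is the chunk length */
	sk->max_offset = length;
	sk->max_transid = (u64)-1;
	sk->nr_items = 1;		/* at most one item can match */

	ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
	if (ret < 0 || sk->nr_items == 0)
		return 0;

	/* Single result: a search header followed by the block group item. */
	memcpy(&item, args.buf + sizeof(struct btrfs_ioctl_search_header),
	       sizeof(item));
	return le64_to_cpu(item.used);
}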

--
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van.kranenb...@mendix.com | www.mendix.com


Re: [PATCH v2 2/2] btrfs: wait for bdev put

2016-06-22 Thread Chris Mason



On Wed, Jun 22, 2016 at 6:18 AM, Anand Jain  
wrote:



 Thanks for the review Chris.

On 06/21/2016 09:00 PM, Chris Mason wrote:

On 06/21/2016 06:24 AM, Anand Jain wrote:

From: Anand Jain 

Further to the commit
  bc178622d40d87e75abc131007342429c9b03351
  btrfs: use rcu_barrier() to wait for bdev puts at unmount

This patch implements a method to time wait on the __free_device()
which actually does the bdev put. This is needed as the user space
running 'btrfs fi show -d' immediately after the replace and
unmount, is still reading older information from the device.


Thanks for working on this Anand.  Since it looks like blkdev_put can
deadlock against us, can we please switch to making sure we fully 
flush
the outstanding IO?  It's probably enough to do a sync_blockdev() 
call

before we allow the unmount to finish, but we can toss in an
invalidate_bdev for good measure.




# git diff
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 604daf315669..e0ad29d6fe9a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -870,6 +870,11 @@ static void btrfs_close_one_device(struct 
btrfs_device *device)

if (device->missing)
fs_devices->missing_devices--;

+   if (device->bdev && device->writeable) {
+   sync_blockdev(device->bdev);
+   invalidate_bdev(device->bdev);
+   }
+
new_device = btrfs_alloc_device(NULL, &device->devid,
device->uuid);
BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
---


 However, theoretically still there might be a problem - at the end of
 unmount, if the device exclusive open is not actually closed, then
 there might be a race with another program which is trying to open
 the device in exclusive mode. Like for eg:
  unmount /btrfs; fsck /dev/X
 and here fsck might fail to open the device if it wins the race.


This is true, but at least we know he'll have up-to-date buffers if he
does manage to open the device.


With the generic code, the blkdev_put happens after the super is gone, 
so I'm not sure we can completely fix this from inside our callback.


-chris





Re: Adventures in btrfs raid5 disk recovery

2016-06-22 Thread Zygo Blaxell
On Wed, Jun 22, 2016 at 11:14:30AM -0600, Chris Murphy wrote:
> > Before deploying raid5, I tested these by intentionally corrupting
> > one disk in an otherwise healthy raid5 array and watching the result.
> 
> It's difficult to reproduce if no one understands how you
> intentionally corrupted that disk. Literal reading, you corrupted the
> entire disk, but that's impractical. The fs is expected to behave
> differently depending on what's been corrupted and how much.

The first round of testing I did (a year ago, when deciding whether
btrfs raid5 was mature enough to start using) was:

Create a 5-disk RAID5

Put some known data on it until it's full (i.e. random test
patterns).  At the time I didn't do any tests involving
compressible data, which I now realize was a serious gap in
my test coverage.

Pick 1000 random blocks (excluding superblocks) on one of the
disks and write random data to them

Read and verify the data through the filesystem, do scrub, etc.
Exercise all the btrfs features related to error reporting
and recovery.
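
(For concreteness, the corruption step above amounts to something like the
sketch below, a stand-alone approximation rather than the exact tool used.
The device path and seed are placeholders, and the three btrfs superblock
copies at 64KiB, 64MiB and 256GiB are skipped.)

/* Sketch: overwrite 1000 random 4KiB blocks on one device of the array. */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#define BLK 4096ULL

static const unsigned long long sb_offs[] = {
	64ULL << 10, 64ULL << 20, 256ULL << 30	/* btrfs superblock copies */
};

static int hits_superblock(unsigned long long off)
{
	unsigned int i;

	for (i = 0; i < sizeof(sb_offs) / sizeof(sb_offs[0]); i++)
		if (off == sb_offs[i])
			return 1;
	return 0;
}

int main(void)
{
	const char *dev = "/dev/sdX";		/* placeholder device */
	unsigned long long dev_size, off;
	char buf[BLK];
	int fd = open(dev, O_WRONLY);
	int urnd = open("/dev/urandom", O_RDONLY);
	int i;

	if (fd < 0 || urnd < 0 || ioctl(fd, BLKGETSIZE64, &dev_size) < 0)
		return 1;

	srand(1);				/* arbitrary, reproducible */
	for (i = 0; i < 1000; i++) {
		do {
			off = ((unsigned long long)rand() * BLK)
						% (dev_size - BLK);
			off &= ~(BLK - 1);	/* keep it 4KiB aligned */
		} while (hits_superblock(off));

		if (read(urnd, buf, BLK) != (ssize_t)BLK ||
		    pwrite(fd, buf, BLK, off) != (ssize_t)BLK)
			return 1;
	}
	close(urnd);
	close(fd);
	return 0;
}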

I expected scrub and dev stat to report accurate corruption counts (except
for the 1 in 4 billion case where a bad csum matches by random chance),
and I expect all the data to be reconstructed since only one drive was
corrupted (assuming there are no unplanned disk failures during the
test, obviously) and the corruption occurred while the filesystem was
offline so there was no possibility of RAID write hole.

My results from that testing were that everything worked except for the
mostly-harmless quirk where scrub counts errors on random disks instead
of the disk where the errors occur.

> I don't often use the -Bd options, so I haven't tested it thoroughly,
> but what you're describing sounds like a bug in user space tools. I've
> found it reflects the same information as btrfs dev stats, and dev
> stats have been reliable in my testing.

Don't the user space tools just read what the kernel tells them?
I don't know how *not* to produce this behavior on btrfs raid5 or raid6.
It should show up on any btrfs raid56 system.

> > A different thing happens if there is a crash.  In that case, scrub cannot
> > repair the errors.  Every btrfs raid5 filesystem I've deployed so far
> > behaves this way when disks turn bad.  I had assumed it was a software bug
> > in the comparatively new raid5 support that would get fixed eventually.
> 
> This is really annoyingly vague. You don't give a complete recipe for
> reproducing this sequence. Here's what I'm understanding and what I'm
> missing:
> 
> 1. The intentional corruption, extent of which is undefined, is still present.

No intentional corruption here (quote:  "A different thing happens if
there is a crash...").  Now we are talking about the baseline behavior
when there is a crash on a btrfs raid5 array, especially crashes
triggered by a disk-level failure (e.g. watchdog timeout because a disk
or controller has hung) but also ordinary power failures or other crashes
triggered by external causes.

> 2. A drive is bad, but that doesn't tell us if it's totally dead, or
> only intermittently spitting out spurious information.

The most common drive-initiated reboot case is that one drive temporarily
locks up and triggers the host to perform a watchdog reset.  The reset
is successful and the filesystem can be mounted again with all drives
present; however, a small amount of raid5 data appears to be corrupted
each time.  The raid1 metadata passes all the integrity checks I can
throw at it:  btrfs check, scrub, balance, walk the filesystem with find
-type f -exec cat ..., compare with the last backup, etc.

Usually when I detect this case, I delete any corrupted data, delete
the disk that triggers the lockups and have no further problems with
that array.

> 3. Is the volume remounted degraded or is the bad drive still being
> used by Btrfs? Because Btrfs has no concept (patches pending) of drive
> faulty state like md, let alone an automatic change to that faulty
> state. It just keeps on trying to read or write to bad drives, even if
> they're physically removed.

In the baseline case the filesystem has all drives present after remount.
It could be as simple as power-cycling the host while writes are active.

> 4. You've initiated a scrub, and the corruption in 1 is not fixed.

In this pattern, btrfs may find both correctable and uncorrectable
corrupted data, usually on one of the drives.  scrub fixes the correctable
corruption, but fails on the uncorrectable.

> OK so what am I missing?

Nothing yet.  The above is the "normal" btrfs raid5 crash experience with
a non-degraded raid5 array.  A few megabytes of corrupt extents can easily
be restored from backups or deleted and everything's fine after that.

In my *current* failure case, I'm experiencing *additional* issues.
In total:

all of the above, plus:

BUG_ON()s as described in the first mail on 

Re: Confusing output from fi us/df

2016-06-22 Thread Chris Murphy
On Mon, Jun 20, 2016 at 5:30 PM, Marc Grondin  wrote:

> Metadata,single: Size:3.00GiB, Used:1.53GiB

Read this as:

3GiB of space on the device is reserved for metadata block group, and
1.53GiB of that is being used. The reservation means that this space
can't be used for other block group types, such as system or data. So
in some sense it is "used" but strictly speaking it's just a
reservation.




-- 
Chris Murphy


[RFC][PATCH] btrfs-progs: inspect: new subcommand to dump chunks

2016-06-22 Thread David Sterba
From: David Sterba 

Hi,

the chunk dump is a useful thing, for debugging or balance filters.

Example output:

Chunks on device id: 1
PNumber  Type            PStart     Length     PEnd      Age     LStart  Usage
--------------------------------------------------------------------------------
      0  System/RAID1     1.00MiB   32.00MiB   33.00MiB   47    1.40TiB   0.06
      1  Metadata/RAID1  33.00MiB    1.00GiB    1.03GiB   31    1.36TiB  64.03
      2  Metadata/RAID1   1.03GiB   32.00MiB    1.06GiB   36    1.36TiB  77.28
      3  Data/single      1.06GiB    1.00GiB    2.06GiB   12  422.30GiB  78.90
      4  Data/single      2.06GiB    1.00GiB    3.06GiB   11  420.30GiB  78.47
...

(full output is at http://susepaste.org/view/raw/089a9877)

This patch does the basic output, not filtering or sorting besides physical and
logical offset. There are possibilities to do more or enhance the output (e.g.
starting with the logical chunk and listing the related physical chunks together).
Or filter by type/profile, or understand the balance filters format.

As it is now, it's per-device dump of physical layout, which was the original
idea.

Printing 'usage' is not default as it's quite slow, it uses the search ioctl
and probably not in the best way, or there's some other issue in the
implementation.

I'll add the patch to devel branch but will not add it for any particular
release yet, I'd like some feedback first. Thanks.

-
New command 'btrfs inspect-internal dump-chunks' will dump layout of
chunks as stored on the devices. This corresponds to the physical
layout, sorted by the physical offset. The block group usage can be
shown as well, but the search is too slow so it's off by default.

If the physical offset sorting is selected, the empty space between
chunks is also shown.

Signed-off-by: David Sterba 
---
 cmds-inspect.c | 364 +
 1 file changed, 364 insertions(+)

diff --git a/cmds-inspect.c b/cmds-inspect.c
index dd7b9dd278f2..4eace63d8517 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -623,6 +623,368 @@ static int cmd_inspect_min_dev_size(int argc, char **argv)
return !!ret;
 }
 
+static const char * const cmd_dump_chunks_usage[] = {
+   "btrfs inspect-internal chunk-stats [options] ",
+   "Show chunks (block groups) layout",
+   "Show chunks (block groups) layout for all devices",
+   "",
+   HELPINFO_UNITS_LONG,
+   "--sort=MODEsort by the physical or logical chunk start",
+   "   MODE is one of pstart or lstart (default: pstart)",
+   "--usageshow usage per block group, note this can be slow",
+   NULL
+};
+
+enum {
+   CHUNK_SORT_PSTART,
+   CHUNK_SORT_LSTART,
+   CHUNK_SORT_DEFAULT = CHUNK_SORT_PSTART
+};
+
+struct dump_chunks_entry {
+   u64 devid;
+   u64 start;
+   u64 lstart;
+   u64 length;
+   u64 flags;
+   u64 age;
+   u64 used;
+   u32 pnumber;
+};
+
+struct dump_chunks_ctx {
+   unsigned length;
+   unsigned size;
+   struct dump_chunks_entry *stats;
+};
+
+int cmp_cse_devid_start(const void *va, const void *vb)
+{
+   const struct dump_chunks_entry *a = va;
+   const struct dump_chunks_entry *b = vb;
+
+   if (a->devid < b->devid)
+   return -1;
+   if (a->devid > b->devid)
+   return 1;
+
+   if (a->start < b->start)
+   return -1;
+   if (a->start == b->start) {
+   error(
+   "chunks start on same offset in the same device: devid %llu start %llu",
+   (unsigned long long)a->devid, (unsigned long long)a->start);
+   return 0;
+   }
+   return 1;
+}
+
+int cmp_cse_devid_lstart(const void *va, const void *vb)
+{
+   const struct dump_chunks_entry *a = va;
+   const struct dump_chunks_entry *b = vb;
+
+   if (a->devid < b->devid)
+   return -1;
+   if (a->devid > b->devid)
+   return 1;
+
+   if (a->lstart < b->lstart)
+   return -1;
+   if (a->lstart == b->lstart) {
+   error(
+"chunks logically start on same offset in the same device: devid %llu start 
%llu",
+   (unsigned long long)a->devid, (unsigned long 
long)a->lstart);
+   return 0;
+   }
+   return 1;
+}
+
+void print_dump_chunks(struct dump_chunks_ctx *ctx, unsigned sort_mode,
+   unsigned unit_mode, int with_usage)
+{
+   u64 devid;
+   struct dump_chunks_entry e;
+   int i;
+   int chidx;
+   u64 lastend = 0;
+   u64 age;
+
+   /*
+* Chunks are sorted logically as found by the ioctl, we need to sort
+* them once to find the physical ordering. This is the default mode.
+*/
+   qsort(ctx->stats, 

Re: Adventures in btrfs raid5 disk recovery

2016-06-22 Thread Chris Murphy
On Mon, Jun 20, 2016 at 7:55 PM, Zygo Blaxell
 wrote:
> On Mon, Jun 20, 2016 at 03:27:03PM -0600, Chris Murphy wrote:
>> On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell
>>  wrote:
>> > On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
>>
>> >> For me the critical question is what does "some corrupted sectors" mean?
>> >
>> > On other raid5 arrays, I would observe a small amount of corruption every
>> > time there was a system crash (some of which were triggered by disk
>> > failures, some not).
>>
>> What test are you using to determine there is corruption, and how much
>> data is corrupted? Is this on every disk? Non-deterministically fewer
>> than all disks? Have you identified this as a torn write or
>> misdirected write or is it just garbage at some sectors? And what's
>> the size? Partial sector? Partial md chunk (or fs block?)
>
> In earlier cases, scrub, read(), and btrfs dev stat all reported the
> incidents differently.  Scrub would attribute errors randomly to disks
> (error counts spread randomly across all the disks in the 'btrfs scrub
> status -d' output).  'dev stat' would correctly increment counts on only
> those disks which had individually had an event (e.g. media error or
> SATA bus reset).
>
> Before deploying raid5, I tested these by intentionally corrupting
> one disk in an otherwise healthy raid5 array and watching the result.

It's difficult to reproduce if no one understands how you
intentionally corrupted that disk. Literal reading, you corrupted the
entire disk, but that's impractical. The fs is expected to behave
differently depending on what's been corrupted and how much.

> When scrub identified an inode and offset in the kernel log, the csum
> failure log message matched the offsets producing EIO on read(), but
> the statistics reported by scrub about which disk had been corrupted
> were mostly wrong.  In such cases a scrub could repair the data.

I don't often use the -Bd options, so I haven't tested it thoroughly,
but what you're describing sounds like a bug in user space tools. I've
found it reflects the same information as btrfs dev stats, and dev
stats have been reliable in my testing.


> A different thing happens if there is a crash.  In that case, scrub cannot
> repair the errors.  Every btrfs raid5 filesystem I've deployed so far
> behaves this way when disks turn bad.  I had assumed it was a software bug
> in the comparatively new raid5 support that would get fixed eventually.

This is really annoyingly vague. You don't give a complete recipe for
reproducing this sequence. Here's what I'm understanding and what I'm
missing:

1. The intentional corruption, extent of which is undefined, is still present.
2. A drive is bad, but that doesn't tell us if it's totally dead, or
only intermittently spitting out spurious information.
3. Is the volume remounted degraded or is the bad drive still being
used by Btrfs? Because Btrfs has no concept (patches pending) of drive
faulty state like md, let alone an automatic change to that faulty
state. It just keeps on trying to read or write to bad drives, even if
they're physically removed.
4. You've initiated a scrub, and the corruption in 1 is not fixed.

OK so what am I missing?

Because it sounds to me like you have two copies of data that are
gone. For raid 5 that's data loss, scrub can't fix things. Corruption
is missing data. The bad drive is missing data.

What values do you get for

smartctl -l scterc /dev/sdX
cat /sys/block/sdX/device/timeout



>> This is on Btrfs? This isn't supposed to be possible. Even a literal
>> overwrite of a file is not an overwrite on Btrfs unless the file is
>> nodatacow. Data extents get written, then the metadata is updated to
>> point to those new blocks. There should be flush or fua requests to
>> make sure the order is such that the fs points to either the old or
>> new file, in either case uncorrupted. That's why I'm curious about the
>> nature of this corruption. It sounds like your hardware is not exactly
>> honoring flush requests.
>
> That's true when all the writes are ordered within a single device, but
> possibly not when writes must be synchronized across multiple devices.

I think that's a big problem, the fs cannot be consistent if the super
block points to any tree whose metadata or data isn't on stable media.

But if you think it's happening you might benefit from integrity
checking, maybe try just the metadata one for starters which is the
check_int mount option (it must be compiled in first for that mount
option to work).

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/check-integrity.c?id=refs/tags/v4.6.2



> It could be possible to corrupt existing data if:
>
> - btrfs puts multiple extents in the same RAID5 stripe
>
> - writes two of the extents in different transactions
>
> - the array is degraded so the RAID5 write hole applies
>
>

Re: btrfs send/receive error

2016-06-22 Thread Michael Duell
Hello Lubos!

I have got the same problems with send/receive. Had those problems
for several weeks or months now. Not sure when those started.

These send/receive problems are known and there are several bug 
reports all over the internet and also on this list.

Not sure if or when this gets fixed. It's broken atm.

Michael


On 2016-06-22 14:01, Lubos Kolouch wrote:
> Hello,
> 
> I am having strange issue with btrfs send/receive:
> 
> #btrfs send -p /mnt/a/_btrbk_snap/debian.20160621_1 
> /mnt/a/_btrbk_snap/debian.20160622 | btrfs receive /mnt/ext/backups/vaclfw/
> 
> At subvol /mnt/a/_btrbk_snap/debian.20160622
> At snapshot debian.20160622
> ERROR: unlink var/lib/samba/private/msg.sock/7992 failed. No such file or 
> directory
> 
> 
> But :
> 
> ls -la /mnt/a/_btrbk_snap/debian.20160621_1/var/lib/samba/private/msg.sock/
> 2330   2388   2394   29375  7992
> 
> ls -la 
> /mnt/ext/backups/vaclfw/debian.20160621_1/var/lib/samba/private/msg.sock/
> 2330   2388   2394   29375  7992
> 
> uname -a
> Linux vaclfw 4.6.0-1-amd64 #1 SMP Debian 4.6.2-1 (2016-06-15) x86_64 GNU/Linux
> 
> # btrfs version
> # btrfs-progs v4.5.2
> 
> What could be wrong?
> 
> Thanks
> 
> Lubos





Re: [Y2038] [PATCH v2 00/24] Delete CURRENT_TIME and CURRENT_TIME_SEC macros

2016-06-22 Thread Arnd Bergmann
On Sunday, June 19, 2016 5:26:59 PM CEST Deepa Dinamani wrote:
> The series is aimed at getting rid of CURRENT_TIME and CURRENT_TIME_SEC 
> macros.
> The macros are not y2038 safe. There is no plan to transition them into being
> y2038 safe.
> ktime_get_* api's can be used in their place. And, these are y2038 safe.
> 
> Thanks to Arnd Bergmann for all the guidance and discussions.
> 
> Patches 2-4 were mostly generated using coccinelle scripts.
> 
> All filesystem timestamps use current_fs_time() for right granularity as
> mentioned in the respective commit texts of patches. This has a changed
> signature, renamed to current_time() and moved to the fs/inode.c.
> 
> This series also serves as a preparatory series to transition vfs to 64 bit
> timestamps as outlined here: https://lkml.org/lkml/2016/2/12/104 .
> 
> As per Linus's suggestion in https://lkml.org/lkml/2016/5/24/663 , all the
> inode timestamp changes have been squashed into a single patch. Also,
> current_time() now is used as a single generic vfs filesystem timestamp api.
> It also takes struct inode* as argument instead of struct super_block*.
> Posting all patches together in a bigger series so that the big picture is
> clear.
> 
> As per the suggestion in https://lwn.net/Articles/672598/, CURRENT_TIME macro
> bug fixes are being handled in a series separate from transitioning vfs to 
> use.
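
(To illustrate the pattern being applied to inode timestamps: a generic
sketch with made-up example_ function names, not an excerpt from the
actual patches.)

#include <linux/fs.h>

/* Before: CURRENT_TIME is a struct timespec and is not y2038 safe. */
static void example_update_times_old(struct inode *inode)
{
	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
}

/* After: current_time() takes the inode and returns a timestamp already
 * truncated to the filesystem's timestamp granularity, ready for the
 * later switch to 64-bit times. */
static void example_update_times_new(struct inode *inode)
{
	inode->i_mtime = inode->i_ctime = current_time(inode);
}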

I've looked in detail at all the patches in this version now, and while
overall everything is fine, I found that two patches cannot be part of the
series because of the dependency on the patch that John already took (adding
time64_to_tm), but I think that's ok because we just need to change over
all the users of CURRENT_TIME and CURRENT_TIME_SEC that assign to inode
timestamps in order to prepare for the type change, the other ones
can be changed later.

I also found a few things that could be done differently to make the
later conversion slightly easier, but it's also possible that I missed
part of your bigger plan for those files, and none of them seem important.

Arnd


btrfs send/receive error

2016-06-22 Thread Lubos Kolouch

Hello,

I am having strange issue with btrfs send/receive:

#btrfs send -p /mnt/a/_btrbk_snap/debian.20160621_1 
/mnt/a/_btrbk_snap/debian.20160622 | btrfs receive 
/mnt/ext/backups/vaclfw/


At subvol /mnt/a/_btrbk_snap/debian.20160622
At snapshot debian.20160622
ERROR: unlink var/lib/samba/private/msg.sock/7992 failed. No such file 
or directory



But :

ls -la 
/mnt/a/_btrbk_snap/debian.20160621_1/var/lib/samba/private/msg.sock/

2330   2388   2394   29375  7992

ls -la 
/mnt/ext/backups/vaclfw/debian.20160621_1/var/lib/samba/private/msg.sock/

2330   2388   2394   29375  7992

uname -a
Linux vaclfw 4.6.0-1-amd64 #1 SMP Debian 4.6.2-1 (2016-06-15) x86_64 
GNU/Linux


# btrfs version
# btrfs-progs v4.5.2

What could be wrong?

Thanks

Lubos


Re: [PATCH v2 5/6] fstests: btrfs: test RAID1 device reappear and balanced

2016-06-22 Thread Anand Jain



On 06/21/2016 09:31 PM, Eryu Guan wrote:

On Wed, Jun 15, 2016 at 04:48:47PM +0800, Anand Jain wrote:

From: Anand Jain 

The test does the following:
  Initialize a RAID1 with some data

  Re-mount RAID1 degraded with _dev1_ and write up to
  half of the FS capacity


If test devices are big enough, this test consumes much longer test
time. I tested with 15G scratch dev pool and this test ran ~200s on my
4vcpu 8G memory test vm.


 Right. Isn't that a good design, so that it gets tested differently
 on different HW configs?
 However, the test time can be reduced by using a smaller vdisk.

Thanks, Anand



Is it possible to limit the file size or the device size used? So it
won't grow with device size. I'm thinking about something like
_scratch_mkfs_sized, but that doesn't work for dev pool.


  Save md5sum checkpoint1

  Re-mount healthy RAID1

  Let balance re-silver.
  Save md5sum checkpoint2

  Re-mount RAID1 degraded with _dev2_
  Save md5sum checkpoint3

  Verify if all three md5sum match

Signed-off-by: Anand Jain 
---
v2:
  add tmp= and its rm
  add comments to why _reload_btrfs_ko is used
  add missing put and test_mount at notrun exit
  use echo instead of _fail when checkpoints are checked
  .out updated to remove Silence..

 tests/btrfs/123 | 169 
 tests/btrfs/123.out |   7 +++
 tests/btrfs/group   |   1 +
 3 files changed, 177 insertions(+)
 create mode 100755 tests/btrfs/123
 create mode 100644 tests/btrfs/123.out

diff --git a/tests/btrfs/123 b/tests/btrfs/123
new file mode 100755
index ..33decfd1c434
--- /dev/null
+++ b/tests/btrfs/123
@@ -0,0 +1,169 @@
+#! /bin/bash
+# FS QA Test 123
+#
+# This test verify the RAID1 reconstruction on the reappeared
+# device. By using the following steps:
+# Initialize a RAID1 with some data
+#
+# Re-mount RAID1 degraded with dev2 missing and write up to
+# half of the FS capacity.
+# Save md5sum checkpoint1
+#
+# Re-mount healthy RAID1
+#
+# Let balance re-silver.
+# Save md5sum checkpoint2
+#
+# Re-mount RAID1 degraded with dev1 missing
+# Save md5sum checkpoint3
+#
+# Verify if all three checkpoints match
+#
+#-
+# Copyright (c) 2016 Oracle.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_nocheck


Why not check the filesystem after the test? A comment would be good if
there's a good reason. Patch 6 needs it as well :)

Thanks,
Eryu


+_require_scratch_dev_pool 2
+
+# the mounted test dir prevent btrfs unload, we need to unmount
+_test_unmount
+_require_btrfs_loadable
+
+_scratch_dev_pool_get 2
+
+dev1=`echo $SCRATCH_DEV_POOL | awk '{print $1}'`
+dev2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'`
+
+dev1_sz=`blockdev --getsize64 $dev1`
+dev2_sz=`blockdev --getsize64 $dev2`
+# get min of both
+max_fs_sz=`echo -e "$dev1_sz\n$dev2_sz" | sort | head -1`
+max_fs_sz=$(( max_fs_sz/2 ))
+if [ $max_fs_sz -gt 100 ]; then
+   bs="1M"
+   count=$(( max_fs_sz/100 ))
+else
+   max_fs_sz=$(( max_fs_sz*2 ))
+   _scratch_dev_pool_put
+   _test_mount
+   _notrun "Smallest dev size $max_fs_sz, Need at least 2M"
+fi
+
+echo >> $seqres.full
+echo "max_fs_sz=$max_fs_sz count=$count" >> $seqres.full
+echo "-Initialize -" >> $seqres.full
+_scratch_pool_mkfs "-mraid1 -draid1" >> $seqres.full 2>&1
+_scratch_mount >> $seqres.full 2>&1
+_run_btrfs_util_prog filesystem show
+dd if=/dev/zero of="$SCRATCH_MNT"/tf1 bs=$bs count=1 \
+   >>$seqres.full 2>&1
+count=$(( count-- ))
+echo "unmount" >> $seqres.full
+echo "clean btrfs ko" >> $seqres.full
+_scratch_unmount
+
+# un-scan the btrfs devices
+_reload_btrfs_ko
+
+
+echo >> $seqres.full
+echo "-Write degraded mount fill upto 

Re: [PATCH v2 1/6] fstests: btrfs: add functions to set and reset required number of SCRATCH_DEV_POOL

2016-06-22 Thread Anand Jain



On 06/21/2016 09:20 PM, Eryu Guan wrote:

On Wed, Jun 15, 2016 at 04:46:03PM +0800, Anand Jain wrote:

From: Anand Jain 

This patch provides functions
 _scratch_dev_pool_get()
 _scratch_dev_pool_put()

Which will help to set/reset SCRATCH_DEV_POOL with the required
number of devices. SCRATCH_DEV_POOL_SAVED will hold all the devices.

Usage:
  _scratch_dev_pool_get() 
  :: do stuff

  _scratch_dev_pool_put()

Signed-off-by: Anand Jain 


I think the helpers should be introduced when they are first used, so
that we know why they're introduced, and know how they're used with
clear examples.

So patch 1, 2 and 4 can be merged as one patch, patch 3 updates
btrfs/027, patch 4 and patch 5 can be merged as one patch, then comes
patch 6.

What do you think?


 Hm. I won't bother much with that; will do if needed.
 (Just a note: on the other hand, a new helper function gets noticed
 more easily if it is in a separate patch.)


---
 v2: Error message and comment section updates.

 common/rc | 55 +++
 1 file changed, 55 insertions(+)

diff --git a/common/rc b/common/rc
index 51092a0644f0..31c46ba1226e 100644
--- a/common/rc
+++ b/common/rc
@@ -802,6 +802,61 @@ _scratch_mkfs()
 esac
 }

+#
+# Generally test cases will have..
+#   _require_scratch_dev_pool X
+# to make sure it has the enough scratch devices including
+# replace-target and spare device. Now arg1 here is the
+# required number of scratch devices by a-test-case excluding
+# the replace-target and spare device. So this function will
+# set SCRATCH_DEV_POOL to the specified number of devices.
+#
+# Usage:
+#  _scratch_dev_pool_get() 
+# :: do stuff
+#
+#  _scratch_dev_pool_put()
+#
+_scratch_dev_pool_get()
+{
+   if [ $# != 1 ]; then


"-ne" "-eq" etc. are used for comparing integers, "=!" "==" are for
strings. And I think $#, $? are integers here, "-ne" would be better.


 Thanks for the catch. fixed in next rev.

- Anand



Thanks,
Eryu


+   _fail "Usage: _scratch_dev_pool_get ndevs"
+   fi
+
+   local test_ndevs=$1
+   local config_ndevs=`echo $SCRATCH_DEV_POOL| wc -w`
+   local devs[]="( $SCRATCH_DEV_POOL )"
+
+   typeset -p config_ndevs >/dev/null 2>&1
+   if [ $? != 0 ]; then
+   _fail "Bug: cant find SCRATCH_DEV_POOL ndevs"
+   fi
+
+   if [ $config_ndevs -lt $test_ndevs ]; then
+   _notrun "Need at least test requested number of ndevs 
$test_ndevs"
+   fi
+
+   SCRATCH_DEV_POOL_SAVED=${SCRATCH_DEV_POOL}
+   export SCRATCH_DEV_POOL_SAVED
+   SCRATCH_DEV_POOL=${devs[@]:0:$test_ndevs}
+   export SCRATCH_DEV_POOL
+}
+
+_scratch_dev_pool_put()
+{
+   typeset -p SCRATCH_DEV_POOL_SAVED >/dev/null 2>&1
+   if [ $? != 0 ]; then
+   _fail "Bug: unset val, must call _scratch_dev_pool_get before 
_scratch_dev_pool_put"
+   fi
+
+   if [ -z "$SCRATCH_DEV_POOL_SAVED" ]; then
+   _fail "Bug: str empty, must call _scratch_dev_pool_get before 
_scratch_dev_pool_put"
+   fi
+
+   export SCRATCH_DEV_POOL=$SCRATCH_DEV_POOL_SAVED
+   export SCRATCH_DEV_POOL_SAVED=""
+}
+
 _scratch_pool_mkfs()
 {
 case $FSTYP in
--
2.7.0



Re: [PATCH v2 2/2] btrfs: wait for bdev put

2016-06-22 Thread Anand Jain



 Thanks for the review Chris.

On 06/21/2016 09:00 PM, Chris Mason wrote:

On 06/21/2016 06:24 AM, Anand Jain wrote:

From: Anand Jain 

Further to the commit
  bc178622d40d87e75abc131007342429c9b03351
  btrfs: use rcu_barrier() to wait for bdev puts at unmount

This patch implements a method to time wait on the __free_device()
which actually does the bdev put. This is needed as the user space
running 'btrfs fi show -d' immediately after the replace and
unmount, is still reading older information from the device.


Thanks for working on this Anand.  Since it looks like blkdev_put can
deadlock against us, can we please switch to making sure we fully flush
the outstanding IO?  It's probably enough to do a sync_blockdev() call
before we allow the unmount to finish, but we can toss in an
invalidate_bdev for good measure.




# git diff
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 604daf315669..e0ad29d6fe9a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -870,6 +870,11 @@ static void btrfs_close_one_device(struct 
btrfs_device *device)

if (device->missing)
fs_devices->missing_devices--;

+   if (device->bdev && device->writeable) {
+   sync_blockdev(device->bdev);
+   invalidate_bdev(device->bdev);
+   }
+
new_device = btrfs_alloc_device(NULL, &device->devid,
device->uuid);
BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
---


 However, theoretically still there might be a problem - at the end of
 unmount, if the device exclusive open is not actually closed, then
 there might be a race with another program which is trying to open
 the device in exclusive mode. Like for eg:
  unmount /btrfs; fsck /dev/X
 and here fsck might fail to open the device if it wins the race.



Then we can get rid of the mdelay loop completely, which seems pretty
error prone to me.


 Yes, the code got a little complex here (and also when the sysfs fixes
 were written) as struct btrfs_device is getting migrated to a new
 struct btrfs_device at unmount, which I don't think was necessary?


Thanks, Anand



Thanks!

-chris



Re: btrfs multi device handling

2016-06-22 Thread sri
Hi,
any inputs on this would be appreciated.





Re: [PATCH] btrfs: fix disk_i_size update bug when fallocate() fails

2016-06-22 Thread Wang Xiaoguang

hello,

On 06/22/2016 02:24 PM, Christoph Hellwig wrote:

On Wed, Jun 22, 2016 at 09:57:01AM +0800, Wang Xiaoguang wrote:

And below test case can reveal this bug:

Please wire this up for xfstests, thanks!

OK.

Regards,
Xiaoguang Wang









Re: [PATCH] btrfs: fix disk_i_size update bug when fallocate() fails

2016-06-22 Thread Christoph Hellwig
On Wed, Jun 22, 2016 at 09:57:01AM +0800, Wang Xiaoguang wrote:
> And below test case can reveal this bug:

Please wire this up for xfstests, thanks!