Re: btrfs fi defrag does not defrag files >256kB?

2016-07-27 Thread Duncan
Nicholas D Steeves posted on Wed, 27 Jul 2016 13:19:01 -0400 as excerpted:

> Is there any reason why defrag without -t cannot detect and default to
> the data chunk size, or why it does not default to 1 GiB?

I don't know the answer, but have wondered that myself.  256 KiB seems a 
rather small default, to me.  I'd expect something in the MiB range, at 
least, maybe the same 2 MiB that modern partitioners tend to use for 
alignment, for the same reason, that tends to be a reasonable whole 
multiple of most erase-block sizes, and if the partition is aligned, 
should prevent unnecessary read-modify-write cycles on ssd, and help with 
tiled zones on SMR drives as well.

As to the question in the subject line, AFAIK btrfs fi defrag works on 
extents, not filesize per-se, so using the default 256 KiB target, yes 
it'll defrag files larger than that, but only for extents that are 
smaller than that.  If all the extents are 256 KiB plus, defrag won't do 
anything with it without a larger target option or unless the compress 
option is also used, in which case it rewrites everything it is pointed 
at, in ordered to recompress it.

Talking about compression, it's worth mentioning that filefrag doesn't 
understand btrfs compression either, and will count each 128 KiB 
(uncompressed size) compression block as a separate extent.  To get the 
true picture using filefrag, you need to use verbose and either eyeball 
the results manually or feed them into a script that processes the 
numbers and combines "extents" if they are reported as immediately 
consecutive on the filesystem.

As such, filefrag, given the regular opportunistic compression, turns out 
to be a good method of determining whether a file (of over 128 KiB in 
size) is actually compressed or not, since if it is filefrag will report 
multiple 128 KiB extents, while if it's not, extent size should be much 
less regular, likely with larger extents unless the file is often 
modified and rewritten in-place.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: Make RAID stripesize configurable

2016-07-27 Thread Sanidhya Solanki
On Wed, 27 Jul 2016 18:25:48 +0200
Goffredo Baroncelli  wrote:
> I am not able to understand this sentence: on the best of my knowledge,
> in btrfs the RAID5/RAID6 stripe is composed by several sub-stripes (I am
> not sure about the terminology to adopt); the number of sub-stripe is equal 
> to the number of the disk.
> 
> Until now, in btrfs the size of sub-stripe is fixed to 64k, so the size
> of a stripe is equal to 64k * . So for raid5 the minimum
> stripe size is 192k, for raid6 is 256k.
> Why you are writing that the real stripe size is 4kb (may be that you are
> referring to the be the page size ?).

No problem with going over the details one more time. 
What I called and what was agreed to be called stripe size in the email sent
originally sent by Chris Murphy (link below) is actually how a single block 
of data is laid out on disk.

This number is a component of the stripe element size (which you called the
sub-stripe). This has nothing to do with how DIFFERENT but concurrent stripes
(which you defined as 64k * ) of data are laid out on disk.
Their relation is such that they follow the order when they are read, but are
otherwise unrelated to each other. The correct way to look at a stripe is as
follows (with its current value before this patch in brackets):

Stripe Element Size (64 KiB) = 
Stripe size (1024B) * Number of elements per stripe element (64)

For the stripe code, the stripe element size matters, and for the metadata,
the stripe size matters.

The (64k * ) is how concurrent stripes of data are distributed
across the RAID disks. The order is only important when performing an I/O 
operation.

Reference email: https://www.spinics.net/lists/linux-btrfs/msg57471.html

Hope this makes it simpler.

Sanidhya
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] btrfs: fix fsfreeze hang caused by delayed iputs deal

2016-07-27 Thread Wang Xiaoguang
When running fstests generic/068, sometimes we got below deadlock:
  xfs_io  D 8800331dbb20 0  6697   6693 0x0080
  8800331dbb20 88007acfc140 880034d895c0 8800331dc000
  880032d243e8 fffe 880032d24400 0001
  8800331dbb38 816a9045 880034d895c0 8800331dbba8
  Call Trace:
  [] schedule+0x35/0x80
  [] rwsem_down_read_failed+0xf2/0x140
  [] ? __filemap_fdatawrite_range+0xd1/0x100
  [] call_rwsem_down_read_failed+0x18/0x30
  [] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs]
  [] percpu_down_read+0x35/0x50
  [] __sb_start_write+0x2c/0x40
  [] start_transaction+0x2a5/0x4d0 [btrfs]
  [] btrfs_join_transaction+0x17/0x20 [btrfs]
  [] btrfs_evict_inode+0x3c4/0x5d0 [btrfs]
  [] evict+0xba/0x1a0
  [] iput+0x196/0x200
  [] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs]
  [] btrfs_commit_transaction+0x928/0xa80 [btrfs]
  [] btrfs_freeze+0x30/0x40 [btrfs]
  [] freeze_super+0xf0/0x190
  [] do_vfs_ioctl+0x4a5/0x5c0
  [] ? do_audit_syscall_entry+0x66/0x70
  [] ? syscall_trace_enter_phase1+0x11f/0x140
  [] SyS_ioctl+0x79/0x90
  [] do_syscall_64+0x62/0x110
  [] entry_SYSCALL64_slow_path+0x25/0x25

>From this warning, freeze_super() already holds SB_FREEZE_FS, but
btrfs_freeze() will call btrfs_commit_transaction() again, if
btrfs_commit_transaction() finds that it has delayed iputs to handle,
it'll start_transaction(), which will try to get SB_FREEZE_FS lock
again, then deadlock occurs.

The root cause is that in btrfs, sync_filesystem(sb) does not make
sure all metadata is updated. There still maybe some codes adding
delayed iputs, see below sample race window:

 CPU1  | CPU2
|-> freeze_super() |
|-> sync_filesystem(sb);   |
|  |-> cleaner_kthread()
|  |   |-> btrfs_delete_unused_bgs()
|  |   |-> btrfs_remove_chunk()
|  |   |-> 
btrfs_remove_block_group()
|  |   |-> 
btrfs_add_delayed_iput()
|  |
|-> sb->s_writers.frozen = SB_FREEZE_FS;   |
|-> sb_wait_write(sb, SB_FREEZE_FS);   |
|   acquire SB_FREEZE_FS lock. |
|  |
|-> btrfs_freeze() |
|-> btrfs_commit_transaction() |
|-> btrfs_run_delayed_iputs()  |
|   will handle delayed iputs, |
|   that means start_transaction() |
|   will be called, which will try |
|   to get SB_FREEZE_FS lock.  |

To fix this issue, introduce a atomic_t fs_frozen to record internally whether
fs has been frozen. If fs has been frozen, we can not handle delayed iputs.

Signed-off-by: Wang Xiaoguang 
---
v3: we introduce a atomic_t fs_frozen, and if fs_frozen is 1, we can not
handle delayed iputs.
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/super.c   | 10 ++
 fs/btrfs/transaction.c |  7 ++-
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 90041a2..9dddaa0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1091,6 +1091,8 @@ struct btrfs_fs_info {
struct list_head pinned_chunks;
 
int creating_free_space_tree;
+   /* Use this atomic to record internally whether fs has been frozen */
+   atomic_t fs_frozen;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 86cad9a..1b85f55 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2621,6 +2621,7 @@ int open_ctree(struct super_block *sb,
atomic_set(_info->qgroup_op_seq, 0);
atomic_set(_info->reada_works_cnt, 0);
atomic64_set(_info->tree_mod_seq, 0);
+   atomic_set(_info->fs_frozen, 0);
fs_info->sb = sb;
fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
fs_info->metadata_ratio = 0;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 60e7179..c417008 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2216,6 +2216,7 @@ static int btrfs_freeze(struct super_block *sb)
struct btrfs_trans_handle *trans;
struct btrfs_root *root = btrfs_sb(sb)->tree_root;
 
+   atomic_set(>fs_info->fs_frozen, 1);
trans = btrfs_attach_transaction_barrier(root);
if (IS_ERR(trans)) {
/* no transaction, don't bother */
@@ -2226,6 +2227,14 @@ static int btrfs_freeze(struct super_block *sb)
return btrfs_commit_transaction(trans, root);
 }
 
+static int btrfs_unfreeze(struct super_block *sb)
+{
+   struct btrfs_root *root = btrfs_sb(sb)->tree_root;
+
+   atomic_set(>fs_info->fs_frozen, 0);
+   return 0;
+}

Re: [PATCH 2/5] New btrfs command: "btrfs inspect physical-find"

2016-07-27 Thread Qu Wenruo

At 07/28/2016 01:43 AM, Goffredo Baroncelli wrote:

From: Goffredo Baroncelli 

The aim of this new command is to show the physical placement on the disk
of a file.
Currently it handles all the profiles (single, dup, raid1/10/5/6).

The syntax is simple:


Uh...
Where is the synatx?

I guess the synatx is
physical-find  []



where:
   is the file to inspect
   is the offset of the file to inspect (default 0)


Normally  is paired with .
What about add a new optional parameter ?
Its default value would be the length of the file.

And for the optional , would you mind to make it as an option?
like -o|--offset  and -s|--size ?

For resolve logical directly, then -l|--logical .



Below some examples:

** Single

$ sudo mkfs.btrfs -f -d single -m single /dev/loop0
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt 
>/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0


So 0 is the offset inside the file.
And how long that file extent is?


devid: 1 dev_name: /dev/loop0 offset: 12582912 type: LINEAR


LINEAR seems a little different, as normally we just call it SINGLE in 
btrfs.


And what about changing the output format to the following?
(This combines both fiemap style and map-logical style)
--
EXT: FILE-OFFSET LOGICAL RANGE DEVICE   DEVICE RANG   TYPE
0:   0-128K  X-X   1:/dev/loop0 X-X   RAID1
   2:/dev/loop1 X-X   RAID1
1:   128K-256K   X-X   1:/dev/loop2 X-X   RAID5D1
   1:/dev/loop3 X-X   RAID5D2
   1:/dev/loop4 X-X   RAID5P
 X-X   1:/dev/loop2 X-X   RAID5D1
   1:/dev/loop3 X-X   RAID5D2
   1:/dev/loop4 X-X   RAID5P
--
Extent 0 and 1 are in different raid profile, it's only possible during 
convert, just used as an exmple
And Extent 1 are crossing 2 RAID5 stripes, so needs 2 logical range to 
show them all.



Although it's quite hard to put the above output into 80 characters per 
line, it provides almost every info we need:

1) File offset and its length
2) Logical bytenr and its length
3) Device bytenr and its length (since its length can differ from 
logical length)

4) RAID type and its role.



$ dd 2>/dev/null if=/dev/loop0 skip=12582912 bs=1 count=5; echo
adaaa

** Dup

The command shows both the copies

$ sudo mkfs.btrfs -f -d single -m single /dev/loop0
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt 
>/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
devid: 1 dev_name: /dev/loop0 offset: 71303168 type: DUP
devid: 1 dev_name: /dev/loop0 offset: 104857600 type: DUP
$ dd 2>/dev/null if=/dev/loop0 skip=104857600 bs=1 count=5 ; echo
adaaa

** Raid1

The command shows both the copies

$ sudo mkfs.btrfs -f -d raid1 -m raid1 /dev/loop0 /dev/loop1
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt 
>/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0
devid: 2 dev_name: /dev/loop1 offset: 61865984 type: RAID1
devid: 1 dev_name: /dev/loop0 offset: 81788928 type: RAID1
$ dd 2>/dev/null if=/dev/loop0 skip=81788928 bs=1 count=5; echo
adaaa

** Raid10

The command show both the copies; if you set an offset to the next disk-stripe, 
you can see the next pair of disk-stripe

$ sudo mkfs.btrfs -f -d raid10 -m raid10 /dev/loop[0123]
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt 
>/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0
devid: 4 dev_name: /dev/loop3 offset: 61931520 type: RAID10
devid: 3 dev_name: /dev/loop2 offset: 61931520 type: RAID10
$ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5; echo
adaaa
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536
mnt/out.txt: 65536
devid: 2 dev_name: /dev/loop1 offset: 61931520 type: RAID10
devid: 1 dev_name: /dev/loop0 offset: 81854464 type: RAID10
$ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
bdbbb

** Raid5

Depending by the offset, you can see which disk-stripe is used.

$ sudo mkfs.btrfs -f -d raid5 -m raid5 /dev/loop[012]
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt 
>/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
devid: 2 dev_name: /dev/loop1 offset: 61931520 type: DATA
devid: 1 dev_name: /dev/loop0 offset: 81854464 type: OTHER
devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY


Here DATA/OTHER is a little confusing.
For 4 disks 

[PATCH] Btrfs: deal with unexpected return value in flush_space

2016-07-27 Thread Liu Bo
Function start_transaction() can return ERR_PTR(1) when flush is
BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is

start_transaction (return ERR_PTR(1))
  -> btrfs_block_rsv_add (return 1)
 -> reserve_metadata_bytes (return 1)
-> flush_space (return 1)
   -> do_chunk_alloc  (return 1)

With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the
flush_state of ALLOC_CHUNK and it successfully allocates a new
chunk, then instead of trying to reserve space again,
reserve_metadata_bytes returns 1 immediately.

Eventually the callers who call start_transaction() usually just
do the IS_ERR() check which ERR_PTR(1) can pass, then it'll get
a panic when dereferencing a pointer which is ERR_PTR(1).

This makes flush_space() translate 'ret = 1' to 'ret = 0'.

Signed-off-by: Liu Bo 
---
We found this 'NULL pointer dereference' on an old 3.8 kernel but
it's not going to happen on the upstream since there is no caller
of btrfs_start_transaction_lflush().

 fs/btrfs/extent-tree.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7a35c9d..a00fb67 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4457,6 +4457,15 @@ void check_system_chunk(struct btrfs_trans_handle *trans,
}
 }
 
+/*
+ * If force is CHUNK_ALLOC_FORCE:
+ *- return 1 if it successfully allocates a chunk,
+ *- return errors including -ENOSPC otherwise.
+ * If force is NOT CHUNK_ALLOC_FORCE:
+ *- return 0 if it doesn't need to allocate a new chunk,
+ *- return 1 if it successfully allocates a chunk,
+ *- return errors including -ENOSPC otherwise.
+ */
 static int do_chunk_alloc(struct btrfs_trans_handle *trans,
  struct btrfs_root *extent_root, u64 flags, int force)
 {
@@ -4857,7 +4866,7 @@ static int flush_space(struct btrfs_root *root,
 btrfs_get_alloc_profile(root, 0),
 CHUNK_ALLOC_NO_FORCE);
btrfs_end_transaction(trans, root);
-   if (ret == -ENOSPC)
+   if (ret == -ENOSPC || ret == 1)
ret = 0;
break;
case COMMIT_TRANS:
-- 
2.5.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] Add some helper functions

2016-07-27 Thread Qu Wenruo



At 07/28/2016 01:43 AM, Goffredo Baroncelli wrote:

From: Goffredo Baroncelli 

Add the following functions:
- int is_btrfs_fs(const char *path) -> returns 0 if path is a btrfs filesystem
- void check_root_or_exit() -> checks if the user has the root capability or
   it exits writing an error message
- void check_btrfs_or_exit(const char *path)
checks if path is a valid btrfs filesystem,
otherwise it exits

Signed-off-by: Goffredo baroncelli 
---
 utils.c | 41 +
 utils.h | 14 ++
 2 files changed, 55 insertions(+)

diff --git a/utils.c b/utils.c
index 578fdb0..b99706c 100644
--- a/utils.c
+++ b/utils.c
@@ -4131,3 +4131,44 @@ unsigned int rand_range(unsigned int upper)
 */
return (unsigned int)(jrand48(rand_seed) % upper);
 }
+
+/*
+ * check if path is a btrfs filesystem
+ */
+int is_btrfs_fs(const char *path)
+{
+   struct statfs stfs;
+
+   if (statfs(path, ) != 0) {
+   /* cannot access */
+   return -1;
+   }
+
+   if (stfs.f_type != BTRFS_SUPER_MAGIC) {
+   /* not a btrfs filesystem */
+   return -2;
+   }
+
+   return 0;
+}
+
+/*
+ * check if the user is root
+ */
+void check_root_or_exit()
+{
+if (geteuid() == 0)
+return;
+
+error("You need to be root to execute this command");
+exit(100);

No immediate exit value, especially such like 100.

Normally we only use 1 and 0 as exit value.

Another concern about the function is, we don't really do such early 
check on root privilege.


Under most case, we just call privilege function, like tree search 
ioctl, and when it fails, it will return -EPERM to info user that they 
lacks the privilege.


Such behavior makes code more extendable, for case like the ioctl 
becomes non-privilege, btrfs-progs don't need any modification.


So I think it's better to let ioctl itself to do the privilege check 
other than in btrfs-progs.

+}
+
+void check_btrfs_or_exit(const char *path)
+{
+if (!is_btrfs_fs(path))
+return;
+
+error("'%s' must be a valid btrfs filesystem", path);
+exit(100);
+}


Same exit value problem.

This btrfs check seems quite good.

What about merge it into functions like open_file_or_dir?
As most caller uses such function to open file/dir inside a btrfs mount 
point.


Thanks,
Qu

diff --git a/utils.h b/utils.h
index 98bfb34..0bd6ecb 100644
--- a/utils.h
+++ b/utils.h
@@ -399,4 +399,18 @@ unsigned int rand_range(unsigned int upper);
 /* Also allow setting the seed manually */
 void init_rand_seed(u64 seed);

+/* return 0 if path is a valid btrfs filesystem */
+int is_btrfs_fs(const char *path);
+
+/*
+ * check if the user has the root capability, otherwise it exits printing an
+ * error message
+ */
+void check_root_or_exit();
+/*
+ * check if path is a valid btrfs filesystem, otherwise it exits printing an
+ * error message
+ */
+void check_btrfs_or_exit(const char *path);
+
 #endif




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [sparc64] mkfs.btrfs bus error / align issue?

2016-07-27 Thread Patrick Baggett
On Wed, Jul 27, 2016 at 3:40 PM, John Paul Adrian Glaubitz
 wrote:
> On 07/27/2016 03:59 PM, Anatoly Pugachev wrote:
>> Program received signal SIGBUS, Bus error.
>> 0x0015e160 in write_raid56_with_parity (info=0x2b17b0,
>> eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) at
>> volumes.c:2156
>> 2156*(unsigned long *)(p_eb->data + i) ^=
>
> Well, that pretty much looks some creative pointer arithmetics that will 
> provoke
> unaligned access. Just check what the declaration of p_eb->data is. If it's
> not "unsigned long", then you know why the code breaks here.
>

Yeah, so basically, the best solution (assuming you can't change the
alignment of `data` somehow) would look something like:

Compare lower 'n' bits of `data` pointers (n=2 for ILP32, n=3 for
LP64), something like (data1 & sizeof(long)-1) == (data2 &
sizeof(long)-1). If they are equal, fast loop possible. Not equal ->
slow loop (all byte-by-byte copy).

fast loop path:
do byte-by-byte XOR until lower n bits of (data+i) are zero.

do word-by-word XOR until < 1 word

do byte-by-bye XOR until last bytes done.

This sort of processing is pretty standard for "chunking". If this was
done with some cool SIMD instruction set, it would have similar sort
of approach, I'd imagine.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [sparc64] mkfs.btrfs bus error / align issue?

2016-07-27 Thread Patrick Baggett
On Wed, Jul 27, 2016 at 3:39 PM, Patrick Baggett
 wrote:
> On Wed, Jul 27, 2016 at 8:59 AM, Anatoly Pugachev  wrote:
>>
>> Hello!
>>
>> Running xfstests suite, got in logs mkfs.btrfs bus error, debugging it
>> shows the following :
>>
>> mator@nvg5120:~/btrfs-progs$ git log -1 --oneline
>> 40650bf Btrfs progs v4.6.1
>>
>> root@nvg5120:/home/mator/xfstests# gdb
>> GNU gdb (Debian 7.11.1-2) 7.11.1
>> (gdb) file /opt/btrfs/bin/mkfs.btrfs
>> Reading symbols from /opt/btrfs/bin/mkfs.btrfs...done.
>> (gdb) set args -f -draid5 -mraid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
>> (gdb) run
>> Starting program: /opt/btrfs/bin/mkfs.btrfs -f -draid5 -mraid5
>> /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib/sparc64-linux-gnu/libthread_db.so.1".
>> btrfs-progs v4.6.1
>> See http://btrfs.wiki.kernel.org for more information.
>>
>> ERROR: superblock checksum mismatch
>> ERROR: superblock checksum mismatch
>> ERROR: superblock checksum mismatch
>> Performing full device TRIM (2.00GiB) ...
>> Performing full device TRIM (2.00GiB) ...
>> Performing full device TRIM (2.00GiB) ...
>> Performing full device TRIM (2.00GiB) ...
>>
>> Program received signal SIGBUS, Bus error.
>> 0x0015e160 in write_raid56_with_parity (info=0x2b17b0,
>> eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) at
>> volumes.c:2156
>> 2156*(unsigned long *)(p_eb->data + i) ^=
>> (gdb) bt
>> #0  0x0015e160 in write_raid56_with_parity (info=0x2b17b0,
>> eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570)
>> at volumes.c:2156
>> #1  0x00119b30 in write_and_map_eb (trans=0x2cc250,
>> root=0x2c7d80, eb=0x2c7fe0) at disk-io.c:426
>> #2  0x00119e74 in write_tree_block (trans=0x2cc250,
>> root=0x2c7d80, eb=0x2c7fe0) at disk-io.c:459
>> #3  0x0011a4ac in __commit_transaction (trans=0x2cc250,
>> root=0x2c7d80) at disk-io.c:562
>> #4  0x0011a7b8 in btrfs_commit_transaction (trans=0x2cc250,
>> root=0x2c7d80) at disk-io.c:598
>> #5  0x001a2b04 in main (argc=8, argv=0x7fef698) at mkfs.c:1786
>> (gdb)
>>
>> Can someone help please? Thanks.
>>
>
> The code that faults:
>
> (unsigned long *)(p_eb->data + i) ^= *(unsigned long *)(ebs[j]->data + i);
>
> Because struct extent_buffer has 'data' as a char[], this will always
> fault on sparc64 and probably a number of other RISC architectures. It
> increments the address by 1, then reads an 8-byte chunk, XORs, and
> writes the 8-byte chunk, repeat. In other words, 7 out of 8 reads
> would fault, even if both `data` pointers were 8-byte aligned.

Actually, I misread that: the code increments i by sizeof(unsigned
long), not by 1 so only if either data point is misaligned, then this
will fault.

Disregard that.

>
> This would probably fix it, though it looks ugly.
> unsigned long a, b;
> memcpy(, p_eb->data + i, sizeof(a));  /* Read 8 bytes from p_eb->data+i */
> memcpy(, ebs[j]->data + i, sizeof(b)); /* Read 8 bytes from ebs[j]->data+i 
> */
> a ^= b; /* XOR */
> memcpy(p_eb->data + i, , sizeof(a)); /* Write back to p_eb->data+i */
>
> I'm not familiar with btrfs, but the results seems like they depend on
> the sizeof(unsigned long). Given that they used parentheses, I assume
> it was intentional. However, if this was supposed to do an XOR
> operation 8 bytes at a time, then it would need to be something like:
>
> *(((unsigned long *)p_eb->data)+i) ^= *(((unsigned long *)ebs[j]->data) + i);
>
> i.e. cast pointer to unsigned long*, then add i (which would index
> array of unsigned long, not char).
>
> --Patrick
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [sparc64] mkfs.btrfs bus error / align issue?

2016-07-27 Thread John Paul Adrian Glaubitz
On 07/27/2016 03:59 PM, Anatoly Pugachev wrote:
> Program received signal SIGBUS, Bus error.
> 0x0015e160 in write_raid56_with_parity (info=0x2b17b0,
> eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) at
> volumes.c:2156
> 2156*(unsigned long *)(p_eb->data + i) ^=

Well, that pretty much looks some creative pointer arithmetics that will provoke
unaligned access. Just check what the declaration of p_eb->data is. If it's
not "unsigned long", then you know why the code breaks here.

You should be able to fix this issue by replacing the code line with:

unsigned long tmp;
memcpy(, &(ebs[j]->data + i), sizeof(unsigned long));
tmp ^= tmp;
memcpy(&(p_eb->data + i), , sizeof(unsigned long));

I'm currently not sure whether this can be done more elegantly without
provoking unaligned access due to the bitwise XOR (^). In either case,
your problem is the casting and using memcpy() should fix the problem.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [sparc64] mkfs.btrfs bus error / align issue?

2016-07-27 Thread Patrick Baggett
On Wed, Jul 27, 2016 at 8:59 AM, Anatoly Pugachev  wrote:
>
> Hello!
>
> Running xfstests suite, got in logs mkfs.btrfs bus error, debugging it
> shows the following :
>
> mator@nvg5120:~/btrfs-progs$ git log -1 --oneline
> 40650bf Btrfs progs v4.6.1
>
> root@nvg5120:/home/mator/xfstests# gdb
> GNU gdb (Debian 7.11.1-2) 7.11.1
> (gdb) file /opt/btrfs/bin/mkfs.btrfs
> Reading symbols from /opt/btrfs/bin/mkfs.btrfs...done.
> (gdb) set args -f -draid5 -mraid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
> (gdb) run
> Starting program: /opt/btrfs/bin/mkfs.btrfs -f -draid5 -mraid5
> /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/sparc64-linux-gnu/libthread_db.so.1".
> btrfs-progs v4.6.1
> See http://btrfs.wiki.kernel.org for more information.
>
> ERROR: superblock checksum mismatch
> ERROR: superblock checksum mismatch
> ERROR: superblock checksum mismatch
> Performing full device TRIM (2.00GiB) ...
> Performing full device TRIM (2.00GiB) ...
> Performing full device TRIM (2.00GiB) ...
> Performing full device TRIM (2.00GiB) ...
>
> Program received signal SIGBUS, Bus error.
> 0x0015e160 in write_raid56_with_parity (info=0x2b17b0,
> eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) at
> volumes.c:2156
> 2156*(unsigned long *)(p_eb->data + i) ^=
> (gdb) bt
> #0  0x0015e160 in write_raid56_with_parity (info=0x2b17b0,
> eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570)
> at volumes.c:2156
> #1  0x00119b30 in write_and_map_eb (trans=0x2cc250,
> root=0x2c7d80, eb=0x2c7fe0) at disk-io.c:426
> #2  0x00119e74 in write_tree_block (trans=0x2cc250,
> root=0x2c7d80, eb=0x2c7fe0) at disk-io.c:459
> #3  0x0011a4ac in __commit_transaction (trans=0x2cc250,
> root=0x2c7d80) at disk-io.c:562
> #4  0x0011a7b8 in btrfs_commit_transaction (trans=0x2cc250,
> root=0x2c7d80) at disk-io.c:598
> #5  0x001a2b04 in main (argc=8, argv=0x7fef698) at mkfs.c:1786
> (gdb)
>
> Can someone help please? Thanks.
>

The code that faults:

(unsigned long *)(p_eb->data + i) ^= *(unsigned long *)(ebs[j]->data + i);

Because struct extent_buffer has 'data' as a char[], this will always
fault on sparc64 and probably a number of other RISC architectures. It
increments the address by 1, then reads an 8-byte chunk, XORs, and
writes the 8-byte chunk, repeat. In other words, 7 out of 8 reads
would fault, even if both `data` pointers were 8-byte aligned.

This would probably fix it, though it looks ugly.
unsigned long a, b;
memcpy(, p_eb->data + i, sizeof(a));  /* Read 8 bytes from p_eb->data+i */
memcpy(, ebs[j]->data + i, sizeof(b)); /* Read 8 bytes from ebs[j]->data+i */
a ^= b; /* XOR */
memcpy(p_eb->data + i, , sizeof(a)); /* Write back to p_eb->data+i */

I'm not familiar with btrfs, but the results seems like they depend on
the sizeof(unsigned long). Given that they used parentheses, I assume
it was intentional. However, if this was supposed to do an XOR
operation 8 bytes at a time, then it would need to be something like:

*(((unsigned long *)p_eb->data)+i) ^= *(((unsigned long *)ebs[j]->data) + i);

i.e. cast pointer to unsigned long*, then add i (which would index
array of unsigned long, not char).

--Patrick
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [sparc64] mkfs.btrfs bus error / align issue?

2016-07-27 Thread David Sterba
On Wed, Jul 27, 2016 at 04:59:27PM +0300, Anatoly Pugachev wrote:
> Hello!
> 
> Running xfstests suite, got in logs mkfs.btrfs bus error, debugging it
> shows the following :
> 
> Program received signal SIGBUS, Bus error.
> 0x0015e160 in write_raid56_with_parity (info=0x2b17b0,
> eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) at
> volumes.c:2156
> 2156*(unsigned long *)(p_eb->data + i) ^=

Yeah, clear unaligned access. We have helpers for so I'll fix it. I was
looking for a way to simulate and catch that on x86 or at least let gcc
warn but no such thing seems to exist. Which means we might accidentally
introduce that in the future.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] Btrfs: remove BUG() in raid56

2016-07-27 Thread Liu Bo
This BUG() has been triggered by a fuzz testing image, which contains
an invalid chunk type, ie. a single stripe chunk has the raid6 type.

Btrfs can handle this gracefully by returning -EIO, so besides using
btrfs_warn to give us more debugging information rather than a single
BUG(), we can return error properly.

Signed-off-by: Liu Bo 
---
v2: use btrfs_warn with more debugging information instead of WARN_ONCE.
v3: - give a short summary about what happens when the error occurs.
- add a ASSERT() which only takes effect for developers.

 fs/btrfs/raid56.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index f8b6d41..6f8addf 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2139,7 +2139,11 @@ int raid56_parity_recover(struct btrfs_root *root, 
struct bio *bio,
 
rbio->faila = find_logical_bio_stripe(rbio, bio);
if (rbio->faila == -1) {
-   BUG();
+   btrfs_warn(root->fs_info,
+   "%s could not find the bad stripe in raid56 so that we cannot recover 
any more (bio has logical %llu len %llu, bbio has map_type %llu)",
+  __func__, (u64)bio->bi_iter.bi_sector << 9,
+  (u64)bio->bi_iter.bi_size, bbio->map_type);
+   ASSERT(0);
if (generic_io)
btrfs_put_bbio(bbio);
kfree(rbio);
-- 
2.5.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump'

2016-07-27 Thread Goffredo Baroncelli

Hi all,

the following patches add two new commands: 
1) btrfs inspect-internal physical-find
2) btrfs inspect-internal physical-dump

The aim of these two new commands is to locate (1) and dump (2) the stripe 
elements
stored on the disks. I developed these two new command to simplify the
debugging of some RAID5 bugs (but this is another discussion).

An example of 'btrfs inspect-internal physical-find' is the following:

# btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA
devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY
devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY

In the output above, DATA is the stripe elemnt conaining data. OTHER
is the sibling stripe elemnt: it contains data related to or other files
or to the same file but different position. The two stripe elements contain
the RAID6 parity (P and Q).

It is possible to pass the offset of the file to inspect.

An example of 'btrfs inspect-internal physical-dump' is the following

# btrfs insp physical-find mnt/out.txt 
mnt/out.txt: 0
devid: 5 dev_name: /dev/loop4 offset: 56819712 type: OTHER
devid: 4 dev_name: /dev/loop3 offset: 56819712 type: OTHER
devid: 3 dev_name: /dev/loop2 offset: 56819712 type: DATA
devid: 2 dev_name: /dev/loop1 offset: 56819712 type: PARITY
devid: 1 dev_name: /dev/loop0 offset: 76742656 type: PARITY

# btrfs insp physical-dump mnt/out.txt | xxd 
mnt/out.txt: 0
file: /dev/loop2 off=56819712
: 6164 6161 6161 6161 6161 6161 6161 6161  adaa
0010: 6161 6161 6161 6161 6161 6161 6161 6161  
0020: 6161 6161 6161 6161 6161 6161 6161 6161  
0030: 6161 6161 6161 6161 6161 6161 6161 6161  
0040: 6161 6161 6161 6161 6161 6161 6161 6161  
[...]

In this case it is dumped the content of the first 4k of the file. It 
is possible to pass also an offset (at step of 4k). Moreover
it is possible to select to dump: which copy has to be dumped (switch -c,
only for RAID1/RAID10/DUP); which parity has to be dumped (switch -p,
only for RAID5/RAID6); which stripe element other than data (switch -s,
only for RAID5/RAID6).

ChangeLog:

v1: 2016-07-24 First issue
v2: 2016-07-27 After Qu suggestion, it is added the switch '-l' to dump
the info from a "logical" address

BR
G.Baroncelli




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/5] New btrfs command: "btrfs inspect physical-find"

2016-07-27 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

The aim of this new command is to show the physical placement on the disk
of a file.
Currently it handles all the profiles (single, dup, raid1/10/5/6).

The syntax is simple:

where:
   is the file to inspect
   is the offset of the file to inspect (default 0)

Below some examples:

** Single

$ sudo mkfs.btrfs -f -d single -m single /dev/loop0
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt 
>/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
devid: 1 dev_name: /dev/loop0 offset: 12582912 type: LINEAR
$ dd 2>/dev/null if=/dev/loop0 skip=12582912 bs=1 count=5; echo
adaaa

** Dup

The command shows both the copies

$ sudo mkfs.btrfs -f -d single -m single /dev/loop0
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt 
>/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
devid: 1 dev_name: /dev/loop0 offset: 71303168 type: DUP
devid: 1 dev_name: /dev/loop0 offset: 104857600 type: DUP
$ dd 2>/dev/null if=/dev/loop0 skip=104857600 bs=1 count=5 ; echo
adaaa

** Raid1

The command shows both the copies

$ sudo mkfs.btrfs -f -d raid1 -m raid1 /dev/loop0 /dev/loop1
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt 
>/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0
devid: 2 dev_name: /dev/loop1 offset: 61865984 type: RAID1
devid: 1 dev_name: /dev/loop0 offset: 81788928 type: RAID1
$ dd 2>/dev/null if=/dev/loop0 skip=81788928 bs=1 count=5; echo
adaaa

** Raid10

The command show both the copies; if you set an offset to the next disk-stripe, 
you can see the next pair of disk-stripe

$ sudo mkfs.btrfs -f -d raid10 -m raid10 /dev/loop[0123]
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt 
>/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0
devid: 4 dev_name: /dev/loop3 offset: 61931520 type: RAID10
devid: 3 dev_name: /dev/loop2 offset: 61931520 type: RAID10
$ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5; echo
adaaa
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536
mnt/out.txt: 65536
devid: 2 dev_name: /dev/loop1 offset: 61931520 type: RAID10
devid: 1 dev_name: /dev/loop0 offset: 81854464 type: RAID10
$ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
bdbbb

** Raid5

Depending by the offset, you can see which disk-stripe is used.

$ sudo mkfs.btrfs -f -d raid5 -m raid5 /dev/loop[012]
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt 
>/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
devid: 2 dev_name: /dev/loop1 offset: 61931520 type: DATA
devid: 1 dev_name: /dev/loop0 offset: 81854464 type: OTHER
devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536mnt/out.txt: 
65536
devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
devid: 1 dev_name: /dev/loop0 offset: 81854464 type: DATA
devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY
$ dd 2>/dev/null if=/dev/loop1 skip=61931520 bs=1 count=5; echo
adaaa
$ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo
bdbbb
$ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 | xxd
: 0300 0303 03 .

The parity is computed as: parity=disk1^disk2. So "adaa" ^ "bdbb" == 
"\x03\x00\x03\x03

** Raid6
$ sudo mkfs.btrfs -f -mraid6 -draid6 /dev/loop[0-4]^C
$ sudo mount /dev/loop0 mnt/
$ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt 
>/dev/null
$ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA
devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY
devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY

$ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 ; echo
adaaa


Signed-off-by: Goffredo Baroncelli 
---
 cmds-inspect.c | 587 +
 1 file changed, 587 insertions(+)

diff --git a/cmds-inspect.c b/cmds-inspect.c
index dd7b9dd..fc2e7c3 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -22,6 +22,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #include "kerncompat.h"
 #include "ioctl.h"
@@ -623,6 +628,586 @@ out:
return !!ret;
 }
 
+
+static const char * const cmd_inspect_physical_find_usage[] = {
+   

[PATCH 1/5] Add some helper functions

2016-07-27 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

Add the following functions:
- int is_btrfs_fs(const char *path) -> returns 0 if path is a btrfs filesystem
- void check_root_or_exit() -> checks if the user has the root capability or
   it exits writing an error message
- void check_btrfs_or_exit(const char *path)
checks if path is a valid btrfs filesystem,
otherwise it exits

Signed-off-by: Goffredo baroncelli 
---
 utils.c | 41 +
 utils.h | 14 ++
 2 files changed, 55 insertions(+)

diff --git a/utils.c b/utils.c
index 578fdb0..b99706c 100644
--- a/utils.c
+++ b/utils.c
@@ -4131,3 +4131,44 @@ unsigned int rand_range(unsigned int upper)
 */
return (unsigned int)(jrand48(rand_seed) % upper);
 }
+
+/*
+ * check if path is a btrfs filesystem
+ */
+int is_btrfs_fs(const char *path)
+{
+   struct statfs stfs;
+
+   if (statfs(path, ) != 0) {
+   /* cannot access */
+   return -1;
+   }
+
+   if (stfs.f_type != BTRFS_SUPER_MAGIC) {
+   /* not a btrfs filesystem */
+   return -2;
+   }
+
+   return 0;
+}
+
+/*
+ * check if the user is root
+ */
+void check_root_or_exit()
+{
+if (geteuid() == 0)
+return;
+
+error("You need to be root to execute this command");
+exit(100);
+}
+
+void check_btrfs_or_exit(const char *path)
+{
+if (!is_btrfs_fs(path))
+return;
+
+error("'%s' must be a valid btrfs filesystem", path);
+exit(100);
+}
diff --git a/utils.h b/utils.h
index 98bfb34..0bd6ecb 100644
--- a/utils.h
+++ b/utils.h
@@ -399,4 +399,18 @@ unsigned int rand_range(unsigned int upper);
 /* Also allow setting the seed manually */
 void init_rand_seed(u64 seed);
 
+/* return 0 if path is a valid btrfs filesystem */
+int is_btrfs_fs(const char *path);
+
+/*
+ * check if the user has the root capability, otherwise it exits printing an
+ * error message
+ */
+void check_root_or_exit();
+/*
+ * check if path is a valid btrfs filesystem, otherwise it exits printing an
+ * error message
+ */
+void check_btrfs_or_exit(const char *path);
+
 #endif
-- 
2.8.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/5] new command btrfs inspect physical-dump

2016-07-27 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

The aim of this command, is to dump the disk content of a file bypassing the
btrfs filesystem. This could help to test the btrfs filesystem.
The dump size is a page (4k) (even if the file is shorter). It is possible
to set an offset for the file portion to read, but even this offset must be
multiple of 4k.

With the switch -c , it is possible to select whch copy will be
dumped (RAID1/RAID10/DUP).
With the switch -p, it is possible to select which parity will
be dumped (RAID5/RAID6)
With the switch -s, it is possible to dump the other elemnt of the
stripe (RAID5, RAID6)

# btrfs insp physical-dump /bin/ls 8192 | xxd
/bin/ls: 8192
file: /dev/sda3 off=16600629248
: b0e2 6100   0700  5200   ..a.R...
0010:     b8e2 6100    ..a.
0020: 0700  5300       S...
0030: c0e2 6100   0700  5400   ..a.T...
[...]


Signed-off-by: Goffredo Baroncelli 
---
 cmds-inspect.c | 320 +
 1 file changed, 320 insertions(+)

diff --git a/cmds-inspect.c b/cmds-inspect.c
index fc2e7c3..0e7d725 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -1208,6 +1208,324 @@ out:
return ret;
 }
 
+static const char * const cmd_inspect_physical_dump_usage[] = {
+   "btrfs inspect-internal physical-dump [-c |-s |-p 
]  [-l |]",
+   "Dump the physical content of a file offset",
+   "  file to dump",
+   "   file offset to dump; 0 if not specified",
+   "   dump logical address of filesystem, ",
+   "number of copy to dump (for raid1,dup/raid10)",
+   "  number of parity to dump (for raid5/raid6)",
+   "  number of stripe elemnt to dump (for raid5/raid6)",
+   "This command requires root privileges",
+   NULL
+};
+
+static int dumpfile(const char *fname, u64 off)
+{
+   int fd = -1;
+   int size = 4096;
+   char buf[size];
+   int r;
+   int e = 0;
+   off_t r1;
+
+   fprintf(stderr, "dev: %s off=%llu\n", fname, off);
+
+   fd = open(fname, O_RDONLY|O_APPEND);
+   if (fd < 0) {
+   int e = errno;
+
+   error("cannot open file: '%s'\n", strerror(e));
+   return -e;
+   }
+
+   r1 = lseek(fd, off, SEEK_SET);
+   if (r1 == (off_t)-1) {
+   e = -errno;
+   error("cannot seek file: '%s'\n", strerror(-e));
+   goto out;
+   }
+
+   while (size) {
+   r = read(fd, buf, size);
+   if (r < 0) {
+   e = -errno;
+   error("cannot read file: '%s'\n", strerror(-e));
+   goto out;
+   }
+
+   size -= r;
+   r = fwrite(buf, r, 1, stdout);
+   if (r < 0) {
+   e = -errno;
+   error("cannot write: '%s'\n", strerror(-e));
+   goto out;
+   }
+
+   }
+
+out:
+   if (fd != -1)
+   close(fd);
+   return e;
+}
+
+static int cmd_inspect_physical_dump(int argc, char **argv)
+{
+   int ret = 0;
+   int fd;
+   char *fname;
+   u64 profile_type;
+   struct btrfs_ioctl_dev_info_args *disks = NULL;
+   struct btrfs_ioctl_fs_info_args fi_args = {0};
+   char btrfs_chunk_data[4096];
+   struct btrfs_chunk *chunk_item = (struct btrfs_chunk 
*)_chunk_data;
+   u64 chunk_offset = 0;
+   struct stripe_info *stripes = NULL;
+   int stripes_count = 0;
+   int rc;
+   int copynr = 0;
+   int paritynr = -1;
+   int stripenr = -1;
+   const char *logical_arg = NULL;
+   u64 logical = 0ull;
+
+   optind = 1;
+   while (1) {
+   int c = getopt(argc, argv, "c:p:s:l:");
+
+   if (c < 0)
+   break;
+
+   switch (c) {
+   case 'c':
+   copynr = atoi(optarg);
+   break;
+   case 'p':
+   paritynr = atoi(optarg);
+   break;
+   case 's':
+   stripenr = atoi(optarg);
+   break;
+   case 'l':
+   logical_arg = optarg;
+   break;
+   default:
+   usage(cmd_inspect_physical_dump_usage);
+   }
+   }
+
+   if ((logical_arg != NULL && check_argc_exact(argc - optind, 1)) ||
+   (check_argc_min(argc - optind, 1) || check_argc_max(argc - optind, 
2)))
+   usage(cmd_inspect_physical_dump_usage);
+
+   fname = argv[optind];
+
+   check_root_or_exit();
+   check_btrfs_or_exit(fname);
+
+   fprintf(stderr, "%s: %llu\n", fname, logical);
+
+   fd = open(fname, O_RDONLY|O_DIRECT);
+   if (fd < 0) {
+   

[PATCH 4/5] Add man page for command btrfs insp physical-find

2016-07-27 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

Signed-off-by: Goffredo Baroncelli 
---
 Documentation/btrfs-inspect-internal.asciidoc | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/Documentation/btrfs-inspect-internal.asciidoc 
b/Documentation/btrfs-inspect-internal.asciidoc
index 74f6dea..c132a0e 100644
--- a/Documentation/btrfs-inspect-internal.asciidoc
+++ b/Documentation/btrfs-inspect-internal.asciidoc
@@ -146,6 +146,13 @@ Print sizes and statistics of trees.
 -b
 Print raw numbers in bytes.
 
+*physical-find*  [|-l ]::
+(needs root privileges)
++
+Show the placement of a given file (at offset 'off', default 0) on the disks.
+If 'logical' is given, this command shows the palcement of a logical address
+on the disks.
+
 EXIT STATUS
 ---
 *btrfs inspect-internal* returns a zero exit status if it succeeds. Non zero is
-- 
2.8.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/5] Add new command to man pages: btrfs insp physical-dump

2016-07-27 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

Signed-off-by: Goffredo Baroncelli 
---
 Documentation/btrfs-inspect-internal.asciidoc | 12 
 1 file changed, 12 insertions(+)

diff --git a/Documentation/btrfs-inspect-internal.asciidoc 
b/Documentation/btrfs-inspect-internal.asciidoc
index c132a0e..569681d 100644
--- a/Documentation/btrfs-inspect-internal.asciidoc
+++ b/Documentation/btrfs-inspect-internal.asciidoc
@@ -153,6 +153,18 @@ Show the placement of a given file (at offset 'off', 
default 0) on the disks.
 If 'logical' is given, this command shows the palcement of a logical address
 on the disks.
 
+*physical-dump* [-c |-s |-p ]  [|-l 
]::
+(needs root privileges)
++
+Dump the disk content of a given file (at offset 'off', default 0). If 
'logical'
+is passed, it is dumped the content of the given logical address.
+For RAID1/RAID10/DUP 'copynr', select which copy will be dumped. For
+RAID5/RAID6, 'paritynr' specifies which parity will be dumped. For
+RAID5/RAID6, 'stripenr' specifies which stripe elemnt will be dumped.
++
+'off' and 'logical' must be a multiple of 4096. 4096 bytes are dumped, even if
+the file is shorter.
+
 EXIT STATUS
 ---
 *btrfs inspect-internal* returns a zero exit status if it succeeds. Non zero is
-- 
2.8.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs fi defrag does not defrag files >256kB?

2016-07-27 Thread Nicholas D Steeves
On 26 July 2016 at 21:10, Duncan <1i5t5.dun...@cox.net> wrote:
> Nicholas D Steeves posted on Tue, 26 Jul 2016 19:03:53 -0400 as excerpted:
>
>> Hi,
>>
>> I've been using btrfs fi defrag with out the "-r -t 32M" option for
>> regular maintenance.  I just learned, in
>> Documentation/btrfs-convert.asciidoc, that there is a recommendation
>> to run with "-t 32M" after a conversion from ext2/3/4.  I then
>> cross-referenced this with btrfs-filesystem(8), and found that:
>>
>> Extents bigger than value given by -t will be skipped, otherwise
>> this value is used as a target extent size, but is only advisory
>> and may not be reached if the free space is too fragmented. Use 0
>> to take the kernel default, which is 256kB but may change in the
>> future.
>>
>> I understand the default behaviour of target extent size of 256kB to
>> mean only defragment small files and metadata.  Or does this mean that
>> the default behaviour is to defragment extent tree metadata >256kB,
>> and then defragment the (larger than 256kB) data from many extents
>> into a single extent?  I was surprised to read this!
>>
>> What's really happening with this default behaviour?  Should everyone
>> be using -t with a much larger value to actually defragment their
>> databases?
>
> Something about defrag's -t option should really be in the FAQ, as it is
> known to be somewhat confusing and to come up from time to time, tho this
> is the first time I've seen it in the context of convert.
>
> In general, you are correct in that the larger the value given to -t, the
> more defragging you should ultimately get.  There's a practical upper
> limit, however, the data chunk size, which is nominally 1 GiB (tho on
> tiny btrfs it's smaller and on TB-scale it can be larger, to 8 or 10 GiB
> IIRC).  32-bit btrfs-progs defrag also had a bug at one point that would
> (IIRC) kill the parameter if it was set to 2+ GiB -- that has been fixed
> by hard-coding the 32-bit max to 1 GiB, I believe.  The bug didn't affect
> 64-bit.  In any case, 1 GiB is fine, and often the largest btrfs can do
> anyway, due as I said to that being the normal data chunk size.
>
> And btrfs defrag only deals with data.  There's no metadata defrag, tho
> balance -m (or whole filesystem) will normally consolidate the metadata
> into the fewest (nominally 256 MiB) metadata chunks possible as it
> rewrites them.

Thank you for this metadata consolidation tip!

> In that regard a defrag -t 32M recommendation is reasonable for a
> converted filesystem, tho you can certainly go larger... to 1 GiB as I
> said.
>

I only mentioned btrfs-convert.asciidoc, because that's what led me to
the discrepancy between the default target extent size value, and a
recommended value.  I was searching for everything I could find on
defrag, because I had begun to suspect that it wasn't functioning as
expected.

Is there any reason why defrag without -t cannot detect and default to
the data chunk size, or why it does not default to 1 GiB?  In the same
way that balance's default behaviour is a full balance, shouldn't
defrag's default behaviour defrag whole chunks?  Does it not default
to 1 GiB because that would increase the number of cases where defrag
unreflinks and duplicates files--leading to an ENOSPC?

https://github.com/kdave/btrfsmaintenance/blob/master/btrfs-defrag.sh
uses -t 32M ; if a default target extent size of 1GiB is too radical,
why not set it it 32M?  If SLED ships btrfsmaintenance, then defrag -t
32M should be well-tested, no?

Thank you,
Nicholas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: Make RAID stripesize configurable

2016-07-27 Thread Goffredo Baroncelli
Hi Sanidhya,

On 2016-07-27 08:12, Sanidhya Solanki wrote:
> The reason for this limit is the fact that, as I noted above the real 
> stripe size is currently 4KiB, with an element size of 64KiB. 


I am not able to understand this sentence: on the best of my knowledge,
in btrfs the RAID5/RAID6 stripe is composed by several sub-stripes (I am
not sure about the terminology to adopt); the number of sub-stripe is equal 
to the number of the disk.

Until now, in btrfs the size of sub-stripe is fixed to 64k, so the size
of a stripe is equal to 64k * . So for raid5 the minimum
stripe size is 192k, for raid6 is 256k.

Why you are writing that the real stripe size is 4kb (may be that you are
referring to the be the page size ?).

I am quite sure that the problem is the terminology. Could you be so kindly 
to explain what you are meaning ?

Thanks in advance.
BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V20 04/19] Btrfs: subpage-blocksize: Define extent_buffer_head

2016-07-27 Thread Chandan Rajendra
On Tuesday, July 26, 2016 01:42:08 PM Josef Bacik wrote:
> On 07/04/2016 12:34 AM, Chandan Rajendra wrote:
> > In order to handle multiple extent buffers per page, first we need to 
> > create a
> > way to handle all the extent buffers that are attached to a page.
> >
> > This patch creates a new data structure 'struct extent_buffer_head', and 
> > moves
> > fields that are common to all extent buffers from 'struct extent_buffer' to
> > 'struct extent_buffer_head'
> >
> > Also, this patch moves EXTENT_BUFFER_TREE_REF, EXTENT_BUFFER_DUMMY and
> > EXTENT_BUFFER_IN_TREE flags from extent_buffer->ebflags  to
> > extent_buffer_head->bflags.
> >
> > Reviewed-by: Liu Bo 
> > Signed-off-by: Chandan Rajendra 
> 
> I'm sorry Chandan I'm still having problems with this one.  XFS kmalloc()'s 
> its 
> sub pagesize ranges for it's metadata buffers, how about we do that instead 
> of 
> doing the extent_buffer_head.  Look at xfs_buf_allocate_memory() for what I'm 
> thinking.  Thanks,
> 

Ok. I will look into the xfs metadata buffer allocation code. Thanks for the
guidance.

-- 
chandan

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[sparc64] mkfs.btrfs bus error / align issue?

2016-07-27 Thread Anatoly Pugachev
Hello!

Running xfstests suite, got in logs mkfs.btrfs bus error, debugging it
shows the following :

mator@nvg5120:~/btrfs-progs$ git log -1 --oneline
40650bf Btrfs progs v4.6.1

root@nvg5120:/home/mator/xfstests# gdb
GNU gdb (Debian 7.11.1-2) 7.11.1
(gdb) file /opt/btrfs/bin/mkfs.btrfs
Reading symbols from /opt/btrfs/bin/mkfs.btrfs...done.
(gdb) set args -f -draid5 -mraid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
(gdb) run
Starting program: /opt/btrfs/bin/mkfs.btrfs -f -draid5 -mraid5
/dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/sparc64-linux-gnu/libthread_db.so.1".
btrfs-progs v4.6.1
See http://btrfs.wiki.kernel.org for more information.

ERROR: superblock checksum mismatch
ERROR: superblock checksum mismatch
ERROR: superblock checksum mismatch
Performing full device TRIM (2.00GiB) ...
Performing full device TRIM (2.00GiB) ...
Performing full device TRIM (2.00GiB) ...
Performing full device TRIM (2.00GiB) ...

Program received signal SIGBUS, Bus error.
0x0015e160 in write_raid56_with_parity (info=0x2b17b0,
eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) at
volumes.c:2156
2156*(unsigned long *)(p_eb->data + i) ^=
(gdb) bt
#0  0x0015e160 in write_raid56_with_parity (info=0x2b17b0,
eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570)
at volumes.c:2156
#1  0x00119b30 in write_and_map_eb (trans=0x2cc250,
root=0x2c7d80, eb=0x2c7fe0) at disk-io.c:426
#2  0x00119e74 in write_tree_block (trans=0x2cc250,
root=0x2c7d80, eb=0x2c7fe0) at disk-io.c:459
#3  0x0011a4ac in __commit_transaction (trans=0x2cc250,
root=0x2c7d80) at disk-io.c:562
#4  0x0011a7b8 in btrfs_commit_transaction (trans=0x2cc250,
root=0x2c7d80) at disk-io.c:598
#5  0x001a2b04 in main (argc=8, argv=0x7fef698) at mkfs.c:1786
(gdb)

Can someone help please? Thanks.

PS: /dev/loop is ramdisk devices:

# mount tmpfs -t tmpfs -o size=12g /ramdisk
# fallocate -l 3.9G /ramdisk/testvol
# for i in 1 2 3 4; do fallocate -l 2G /ramdisk/scratch${i} ; done
# ls -lh /ramdisk/
total 12G
-rw-r--r-- 1 root root 2.0G Jul 27 16:16 scratch1
-rw-r--r-- 1 root root 2.0G Jul 27 16:16 scratch2
-rw-r--r-- 1 root root 2.0G Jul 27 16:16 scratch3
-rw-r--r-- 1 root root 2.0G Jul 27 16:16 scratch4
-rw-r--r-- 1 root root 3.9G Jul 27 16:15 testvol

# for i in /ramdisk/*; do echo -n "$i : "; losetup -f --show $i; done
/ramdisk/scratch1 : /dev/loop0
/ramdisk/scratch2 : /dev/loop1
/ramdisk/scratch3 : /dev/loop2
/ramdisk/scratch4 : /dev/loop3
/ramdisk/testvol : /dev/loop4
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V20 01/19] Btrfs: subpage-blocksize: Fix whole page read.

2016-07-27 Thread Chandan Rajendra
On Tuesday, July 26, 2016 12:11:49 PM Josef Bacik wrote:
> On 07/04/2016 12:34 AM, Chandan Rajendra wrote:
> > For the subpage-blocksize scenario, a page can contain multiple
> > blocks. In such cases, this patch handles reading data from files.
> >
> > To track the status of individual blocks of a page, this patch makes use
> > of a bitmap pointed to by the newly introduced per-page 'struct
> > btrfs_page_private'.
> >
> > The per-page btrfs_page_private->io_lock plays the same role as
> > BH_Uptodate_Lock (see end_buffer_async_read()) i.e. without the io_lock
> > we may end up in the following situation,
> >
> > NOTE: Assume 64k page size and 4k block size. Also assume that the first
> > 12 blocks of the page are contiguous while the next 4 blocks are
> > contiguous. When reading the page we end up submitting two "logical
> > address space" bios. So end_bio_extent_readpage function is invoked
> > twice, once for each bio.
> >
> > |-+-+-|
> > | Task A  | Task B  | Task C  |
> > |-+-+-|
> > | end_bio_extent_readpage | | |
> > | process block 0 | | |
> > | - clear BLK_STATE_IO| | |
> > | - page_read_complete| | |
> > | process block 1 | | |
> > | | | |
> > | | | |
> > | | end_bio_extent_readpage | |
> > | | process block 0 | |
> > | | - clear BLK_STATE_IO| |
> > | | - page_read_complete| |
> > | | process block 1 | |
> > | | | |
> > | process block 11| process block 3 | |
> > | - clear BLK_STATE_IO| - clear BLK_STATE_IO| |
> > | - page_read_complete| - page_read_complete| |
> > |   - returns true|   - returns true| |
> > |   - unlock_page()   | | |
> > | | | lock_page() |
> > | |   - unlock_page()   | |
> > |-+-+-|
> >
> > We end up incorrectly unlocking the page twice and "Task C" ends up
> > working on an unlocked page. So private->io_lock makes sure that only
> > one of the tasks gets "true" as the return value when page_io_complete()
> > is invoked. As an optimization the patch gets the io_lock only when the
> > last block of the bio_vec is being processed.
> >
> > Signed-off-by: Chandan Rajendra 
> > ---
> >  fs/btrfs/extent_io.c | 371 
> > ---
> >  fs/btrfs/extent_io.h |  74 +-
> >  fs/btrfs/inode.c |  16 +--
> >  3 files changed, 338 insertions(+), 123 deletions(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index e197d47..a349f99 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -24,6 +24,7 @@
> >
> >  static struct kmem_cache *extent_state_cache;
> >  static struct kmem_cache *extent_buffer_cache;
> > +static struct kmem_cache *page_private_cache;
> >  static struct bio_set *btrfs_bioset;
> >
> >  static inline bool extent_state_in_tree(const struct extent_state *state)
> > @@ -174,10 +175,16 @@ int __init extent_io_init(void)
> > if (!extent_buffer_cache)
> > goto free_state_cache;
> >
> > +   page_private_cache = kmem_cache_create("btrfs_page_private",
> > +   sizeof(struct btrfs_page_private), 0,
> > +   SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
> > +   if (!page_private_cache)
> > +   goto free_buffer_cache;
> > +
> > btrfs_bioset = bioset_create(BIO_POOL_SIZE,
> >  offsetof(struct btrfs_io_bio, bio));
> > if (!btrfs_bioset)
> > -   goto free_buffer_cache;
> > +   goto free_page_private_cache;
> >
> > if (bioset_integrity_create(btrfs_bioset, BIO_POOL_SIZE))
> > goto free_bioset;
> > @@ -188,6 +195,10 @@ free_bioset:
> > bioset_free(btrfs_bioset);
> > btrfs_bioset = NULL;
> >
> > +free_page_private_cache:
> > +   kmem_cache_destroy(page_private_cache);
> > +   page_private_cache = NULL;
> > +
> >  free_buffer_cache:
> > kmem_cache_destroy(extent_buffer_cache);
> > extent_buffer_cache = NULL;
> > @@ -1323,6 +1334,95 @@ int clear_record_extent_bits(struct extent_io_tree 
> > *tree, u64 start, u64 end,
> > 

Re: [PATCH v2] btrfs: fix fsfreeze hang caused by delayed iputs deal

2016-07-27 Thread Wang Xiaoguang

hello,

On 07/27/2016 04:10 PM, Wang Xiaoguang wrote:

When running fstests generic/068, sometimes we got below deadlock:
   xfs_io  D 8800331dbb20 0  6697   6693 0x0080
   8800331dbb20 88007acfc140 880034d895c0 8800331dc000
   880032d243e8 fffe 880032d24400 0001
   8800331dbb38 816a9045 880034d895c0 8800331dbba8
   Call Trace:
   [] schedule+0x35/0x80
   [] rwsem_down_read_failed+0xf2/0x140
   [] ? __filemap_fdatawrite_range+0xd1/0x100
   [] call_rwsem_down_read_failed+0x18/0x30
   [] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs]
   [] percpu_down_read+0x35/0x50
   [] __sb_start_write+0x2c/0x40
   [] start_transaction+0x2a5/0x4d0 [btrfs]
   [] btrfs_join_transaction+0x17/0x20 [btrfs]
   [] btrfs_evict_inode+0x3c4/0x5d0 [btrfs]
   [] evict+0xba/0x1a0
   [] iput+0x196/0x200
   [] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs]
   [] btrfs_commit_transaction+0x928/0xa80 [btrfs]
   [] btrfs_freeze+0x30/0x40 [btrfs]
   [] freeze_super+0xf0/0x190
   [] do_vfs_ioctl+0x4a5/0x5c0
   [] ? do_audit_syscall_entry+0x66/0x70
   [] ? syscall_trace_enter_phase1+0x11f/0x140
   [] SyS_ioctl+0x79/0x90
   [] do_syscall_64+0x62/0x110
   [] entry_SYSCALL64_slow_path+0x25/0x25

 From this warning, freeze_super() already holds SB_FREEZE_FS, but
btrfs_freeze() will call btrfs_commit_transaction() again, if
btrfs_commit_transaction() finds that it has delayed iputs to handle,
it'll start_transaction(), which will try to get SB_FREEZE_FS lock
again, then deadlock occurs.

The root cause is that in btrfs, sync_filesystem(sb) does not make
sure all metadata is updated. There still maybe some codes adding
delayed iputs, see below sample race window:

  CPU1  | CPU2
|-> freeze_super() |
 |-> sync_filesystem(sb);   |
 |  |-> cleaner_kthread()
 |  |   |-> 
btrfs_delete_unused_bgs()
 |  |   |-> btrfs_remove_chunk()
 |  |   |-> 
btrfs_remove_block_group()
 |  |   |-> 
btrfs_add_delayed_iput()
 |  |
 |-> sb->s_writers.frozen = SB_FREEZE_FS;   |
 |-> sb_wait_write(sb, SB_FREEZE_FS);   |
 |   acquire SB_FREEZE_FS lock. |
 |  |
 |-> btrfs_freeze() |
 |-> btrfs_commit_transaction() |
 |-> btrfs_run_delayed_iputs()  |
 |   will handle delayed iputs, |
 |   that means start_transaction() |
 |   will be called, which will try |
 |   to get SB_FREEZE_FS lock.  |

To fix this issue, in btrfs_commit_transaction(), if fs already is in 
SB_FREEZE_FS,
we do not handle delayed iputs.

Signed-off-by: Wang Xiaoguang 
---
  fs/btrfs/transaction.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 948aa18..e6931d9 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2277,8 +2277,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
  
  	kmem_cache_free(btrfs_trans_handle_cachep, trans);
  
+	/*

+* If s_writers.frozen >= SB_FREEZE_FS, we can not handle delayed
+* iputs, otherwise it'll result in deallock about SB_FREEZE_FS.
+*/
if (current != root->fs_info->transaction_kthread &&
-   current != root->fs_info->cleaner_kthread)
+   current != root->fs_info->cleaner_kthread &&
+   root->fs_info->sb->s_writers.frozen >= SB_FREEZE_FS)
btrfs_run_delayed_iputs(root);

Sorry, I sent this too quickly...
Above deadlock still occurs, please wait a while.

Regards,
Xiaoguang Wang

  
  	return ret;




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs: fix fsfreeze hang caused by delayed iputs deal

2016-07-27 Thread Wang Xiaoguang
When running fstests generic/068, sometimes we got below deadlock:
  xfs_io  D 8800331dbb20 0  6697   6693 0x0080
  8800331dbb20 88007acfc140 880034d895c0 8800331dc000
  880032d243e8 fffe 880032d24400 0001
  8800331dbb38 816a9045 880034d895c0 8800331dbba8
  Call Trace:
  [] schedule+0x35/0x80
  [] rwsem_down_read_failed+0xf2/0x140
  [] ? __filemap_fdatawrite_range+0xd1/0x100
  [] call_rwsem_down_read_failed+0x18/0x30
  [] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs]
  [] percpu_down_read+0x35/0x50
  [] __sb_start_write+0x2c/0x40
  [] start_transaction+0x2a5/0x4d0 [btrfs]
  [] btrfs_join_transaction+0x17/0x20 [btrfs]
  [] btrfs_evict_inode+0x3c4/0x5d0 [btrfs]
  [] evict+0xba/0x1a0
  [] iput+0x196/0x200
  [] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs]
  [] btrfs_commit_transaction+0x928/0xa80 [btrfs]
  [] btrfs_freeze+0x30/0x40 [btrfs]
  [] freeze_super+0xf0/0x190
  [] do_vfs_ioctl+0x4a5/0x5c0
  [] ? do_audit_syscall_entry+0x66/0x70
  [] ? syscall_trace_enter_phase1+0x11f/0x140
  [] SyS_ioctl+0x79/0x90
  [] do_syscall_64+0x62/0x110
  [] entry_SYSCALL64_slow_path+0x25/0x25

>From this warning, freeze_super() already holds SB_FREEZE_FS, but
btrfs_freeze() will call btrfs_commit_transaction() again, if
btrfs_commit_transaction() finds that it has delayed iputs to handle,
it'll start_transaction(), which will try to get SB_FREEZE_FS lock
again, then deadlock occurs.

The root cause is that in btrfs, sync_filesystem(sb) does not make
sure all metadata is updated. There still maybe some codes adding
delayed iputs, see below sample race window:

 CPU1  | CPU2
|-> freeze_super() |
|-> sync_filesystem(sb);   |
|  |-> cleaner_kthread()
|  |   |-> btrfs_delete_unused_bgs()
|  |   |-> btrfs_remove_chunk()
|  |   |-> 
btrfs_remove_block_group()
|  |   |-> 
btrfs_add_delayed_iput()
|  |
|-> sb->s_writers.frozen = SB_FREEZE_FS;   |
|-> sb_wait_write(sb, SB_FREEZE_FS);   |
|   acquire SB_FREEZE_FS lock. |
|  |
|-> btrfs_freeze() |
|-> btrfs_commit_transaction() |
|-> btrfs_run_delayed_iputs()  |
|   will handle delayed iputs, |
|   that means start_transaction() |
|   will be called, which will try |
|   to get SB_FREEZE_FS lock.  |

To fix this issue, in btrfs_commit_transaction(), if fs already is in 
SB_FREEZE_FS,
we do not handle delayed iputs.

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/transaction.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 948aa18..e6931d9 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2277,8 +2277,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
 
kmem_cache_free(btrfs_trans_handle_cachep, trans);
 
+   /*
+* If s_writers.frozen >= SB_FREEZE_FS, we can not handle delayed
+* iputs, otherwise it'll result in deallock about SB_FREEZE_FS.
+*/
if (current != root->fs_info->transaction_kthread &&
-   current != root->fs_info->cleaner_kthread)
+   current != root->fs_info->cleaner_kthread &&
+   root->fs_info->sb->s_writers.frozen >= SB_FREEZE_FS)
btrfs_run_delayed_iputs(root);
 
return ret;
-- 
2.9.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: Make RAID stripesize configurable

2016-07-27 Thread Sanidhya Solanki
On Tue, 26 Jul 2016 11:14:37 -0600
Chris Murphy  wrote:

> On Fri, Jul 22, 2016 at 8:58 AM, Austin S. Hemmelgarn
>  wrote:
> > On 2016-07-22 09:42, Sanidhya Solanki wrote:  
> 
> >> +*stripesize=*;;
> >> +Specifies the new stripe size  
> 
> It'd be nice to stop conflating stripe size and stripe element size as
> if they're the same thing. I realize that LVM gets this wrong also,
> and uses stripes to mean "data strips", and stripesize for stripe
> element size. From a user perspective I find the inconsistency
> annoying, users are always confused about these terms.
> 
> So I think we need to pay the piper now, and use either strip size or
> stripe element size for this. Stripe size is the data portion of a
> full stripe read or write across all devices in the array. So right
> now with a 64KiB stripe element size on Btrfs, the stripe size for a 4
> disk raid0 is 256KiB, and the stripe size for a 4 disk raid 5 is
> 192KiB.
 
I absolutely agree with the statement regarding the difference between
those two separate settings. THis difference was more clearly visible
pre-Dec 2015, when it was removed for code appearance reasons by commit 
ee22184b53c823f6956314c2815d4068e3820737 (at the end of the commit).I
will update the documentation in the next patch to make it clear that
the balance option affects stripe size directly and the stripe element
size indirectly.


> It's 64KiB right now. Why go so much smaller?
> 
> mdadm goes from 4KiB to GiB's, with a 512KiB default.
> 
> lvm goes from 4KiB to the physical extent size, which can be GiB's.
> 
> I'm OK with an upper limit that's sane, maybe 16MiB? Hundreds of MiB's
> or even GiB's seems a bit far fetched but other RAID tools on Linux
> permit that.

The reason for this limit is the fact that, as I noted above the real 
stripe size is currently 4KiB, with an element size of 64KiB. 
Ostensibly, we can change the stripe size to any 512B multiple that is
less than 64KiB. Increasing it beyond 64KiB is risky because a lot of
calculations (only the basis of which I modified for this patch, and not
the dependencies of those algorithms and calculations) rely on the stripe
element size being 64KiB. I do not want to increase this limit as it may
lead to un-discovered bugs in the already buggy RAID 5/6 code. 

If this patch is accepted, I intend in the next few patches to do the
following:
-increase maximum stripe size to 64KiB, by reducing the number of blocks
 to 1 per stripe extent.
-Update the documentation to notify user of this change and the need for
 caution, as well as trial and error, to find an appropriate size upto
 64KiB, with a warning to only change it if users understand the
 consequences and reasons for the change, as suggested by ASH.
-Clean up the RAID 5/6 recovery code and stripe code over the coming
 months.
-Clean up the code that relies on calculations that depend on stripe size
 and their dependencies.
-Remove this stripe size and stripe element size limitation completely, as
 suggested by both ASH and CMu.

Just waiting on reviews and acceptance for this patch as the basis of the
above work. I started on the RAID recovery code yesterday.

It also appears that according to the commit that I stated above that the
stripe size used to be 1KiB, with 64 elements per stripe element, but was
changed in Dec 2015, so maybe as long as you do not change the stripe size
to be more than 64KiB, you do not need to balance after using this balance 
option (atleast the first time). I do not remember seeing any bug reports
on the mailing list since then that called out stripe size as the problem.

Interesting.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html