Re: btrfs fi defrag does not defrag files >256kB?
Nicholas D Steeves posted on Wed, 27 Jul 2016 13:19:01 -0400 as excerpted: > Is there any reason why defrag without -t cannot detect and default to > the data chunk size, or why it does not default to 1 GiB? I don't know the answer, but have wondered that myself. 256 KiB seems a rather small default, to me. I'd expect something in the MiB range, at least, maybe the same 2 MiB that modern partitioners tend to use for alignment, for the same reason, that tends to be a reasonable whole multiple of most erase-block sizes, and if the partition is aligned, should prevent unnecessary read-modify-write cycles on ssd, and help with tiled zones on SMR drives as well. As to the question in the subject line, AFAIK btrfs fi defrag works on extents, not filesize per-se, so using the default 256 KiB target, yes it'll defrag files larger than that, but only for extents that are smaller than that. If all the extents are 256 KiB plus, defrag won't do anything with it without a larger target option or unless the compress option is also used, in which case it rewrites everything it is pointed at, in ordered to recompress it. Talking about compression, it's worth mentioning that filefrag doesn't understand btrfs compression either, and will count each 128 KiB (uncompressed size) compression block as a separate extent. To get the true picture using filefrag, you need to use verbose and either eyeball the results manually or feed them into a script that processes the numbers and combines "extents" if they are reported as immediately consecutive on the filesystem. As such, filefrag, given the regular opportunistic compression, turns out to be a good method of determining whether a file (of over 128 KiB in size) is actually compressed or not, since if it is filefrag will report multiple 128 KiB extents, while if it's not, extent size should be much less regular, likely with larger extents unless the file is often modified and rewritten in-place. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: Make RAID stripesize configurable
On Wed, 27 Jul 2016 18:25:48 +0200 Goffredo Baroncelliwrote: > I am not able to understand this sentence: on the best of my knowledge, > in btrfs the RAID5/RAID6 stripe is composed by several sub-stripes (I am > not sure about the terminology to adopt); the number of sub-stripe is equal > to the number of the disk. > > Until now, in btrfs the size of sub-stripe is fixed to 64k, so the size > of a stripe is equal to 64k * . So for raid5 the minimum > stripe size is 192k, for raid6 is 256k. > Why you are writing that the real stripe size is 4kb (may be that you are > referring to the be the page size ?). No problem with going over the details one more time. What I called and what was agreed to be called stripe size in the email sent originally sent by Chris Murphy (link below) is actually how a single block of data is laid out on disk. This number is a component of the stripe element size (which you called the sub-stripe). This has nothing to do with how DIFFERENT but concurrent stripes (which you defined as 64k * ) of data are laid out on disk. Their relation is such that they follow the order when they are read, but are otherwise unrelated to each other. The correct way to look at a stripe is as follows (with its current value before this patch in brackets): Stripe Element Size (64 KiB) = Stripe size (1024B) * Number of elements per stripe element (64) For the stripe code, the stripe element size matters, and for the metadata, the stripe size matters. The (64k * ) is how concurrent stripes of data are distributed across the RAID disks. The order is only important when performing an I/O operation. Reference email: https://www.spinics.net/lists/linux-btrfs/msg57471.html Hope this makes it simpler. Sanidhya -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3] btrfs: fix fsfreeze hang caused by delayed iputs deal
When running fstests generic/068, sometimes we got below deadlock: xfs_io D 8800331dbb20 0 6697 6693 0x0080 8800331dbb20 88007acfc140 880034d895c0 8800331dc000 880032d243e8 fffe 880032d24400 0001 8800331dbb38 816a9045 880034d895c0 8800331dbba8 Call Trace: [] schedule+0x35/0x80 [] rwsem_down_read_failed+0xf2/0x140 [] ? __filemap_fdatawrite_range+0xd1/0x100 [] call_rwsem_down_read_failed+0x18/0x30 [] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs] [] percpu_down_read+0x35/0x50 [] __sb_start_write+0x2c/0x40 [] start_transaction+0x2a5/0x4d0 [btrfs] [] btrfs_join_transaction+0x17/0x20 [btrfs] [] btrfs_evict_inode+0x3c4/0x5d0 [btrfs] [] evict+0xba/0x1a0 [] iput+0x196/0x200 [] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs] [] btrfs_commit_transaction+0x928/0xa80 [btrfs] [] btrfs_freeze+0x30/0x40 [btrfs] [] freeze_super+0xf0/0x190 [] do_vfs_ioctl+0x4a5/0x5c0 [] ? do_audit_syscall_entry+0x66/0x70 [] ? syscall_trace_enter_phase1+0x11f/0x140 [] SyS_ioctl+0x79/0x90 [] do_syscall_64+0x62/0x110 [] entry_SYSCALL64_slow_path+0x25/0x25 >From this warning, freeze_super() already holds SB_FREEZE_FS, but btrfs_freeze() will call btrfs_commit_transaction() again, if btrfs_commit_transaction() finds that it has delayed iputs to handle, it'll start_transaction(), which will try to get SB_FREEZE_FS lock again, then deadlock occurs. The root cause is that in btrfs, sync_filesystem(sb) does not make sure all metadata is updated. There still maybe some codes adding delayed iputs, see below sample race window: CPU1 | CPU2 |-> freeze_super() | |-> sync_filesystem(sb); | | |-> cleaner_kthread() | | |-> btrfs_delete_unused_bgs() | | |-> btrfs_remove_chunk() | | |-> btrfs_remove_block_group() | | |-> btrfs_add_delayed_iput() | | |-> sb->s_writers.frozen = SB_FREEZE_FS; | |-> sb_wait_write(sb, SB_FREEZE_FS); | | acquire SB_FREEZE_FS lock. | | | |-> btrfs_freeze() | |-> btrfs_commit_transaction() | |-> btrfs_run_delayed_iputs() | | will handle delayed iputs, | | that means start_transaction() | | will be called, which will try | | to get SB_FREEZE_FS lock. | To fix this issue, introduce a atomic_t fs_frozen to record internally whether fs has been frozen. If fs has been frozen, we can not handle delayed iputs. Signed-off-by: Wang Xiaoguang--- v3: we introduce a atomic_t fs_frozen, and if fs_frozen is 1, we can not handle delayed iputs. --- fs/btrfs/ctree.h | 2 ++ fs/btrfs/disk-io.c | 1 + fs/btrfs/super.c | 10 ++ fs/btrfs/transaction.c | 7 ++- 4 files changed, 19 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 90041a2..9dddaa0 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1091,6 +1091,8 @@ struct btrfs_fs_info { struct list_head pinned_chunks; int creating_free_space_tree; + /* Use this atomic to record internally whether fs has been frozen */ + atomic_t fs_frozen; }; struct btrfs_subvolume_writers { diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 86cad9a..1b85f55 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2621,6 +2621,7 @@ int open_ctree(struct super_block *sb, atomic_set(_info->qgroup_op_seq, 0); atomic_set(_info->reada_works_cnt, 0); atomic64_set(_info->tree_mod_seq, 0); + atomic_set(_info->fs_frozen, 0); fs_info->sb = sb; fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE; fs_info->metadata_ratio = 0; diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 60e7179..c417008 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -2216,6 +2216,7 @@ static int btrfs_freeze(struct super_block *sb) struct btrfs_trans_handle *trans; struct btrfs_root *root = btrfs_sb(sb)->tree_root; + atomic_set(>fs_info->fs_frozen, 1); trans = btrfs_attach_transaction_barrier(root); if (IS_ERR(trans)) { /* no transaction, don't bother */ @@ -2226,6 +2227,14 @@ static int btrfs_freeze(struct super_block *sb) return btrfs_commit_transaction(trans, root); } +static int btrfs_unfreeze(struct super_block *sb) +{ + struct btrfs_root *root = btrfs_sb(sb)->tree_root; + + atomic_set(>fs_info->fs_frozen, 0); + return 0; +}
Re: [PATCH 2/5] New btrfs command: "btrfs inspect physical-find"
At 07/28/2016 01:43 AM, Goffredo Baroncelli wrote: From: Goffredo BaroncelliThe aim of this new command is to show the physical placement on the disk of a file. Currently it handles all the profiles (single, dup, raid1/10/5/6). The syntax is simple: Uh... Where is the synatx? I guess the synatx is physical-find [] where: is the file to inspect is the offset of the file to inspect (default 0) Normally is paired with . What about add a new optional parameter ? Its default value would be the length of the file. And for the optional , would you mind to make it as an option? like -o|--offset and -s|--size ? For resolve logical directly, then -l|--logical . Below some examples: ** Single $ sudo mkfs.btrfs -f -d single -m single /dev/loop0 $ sudo mount /dev/loop0 mnt/ $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0 So 0 is the offset inside the file. And how long that file extent is? devid: 1 dev_name: /dev/loop0 offset: 12582912 type: LINEAR LINEAR seems a little different, as normally we just call it SINGLE in btrfs. And what about changing the output format to the following? (This combines both fiemap style and map-logical style) -- EXT: FILE-OFFSET LOGICAL RANGE DEVICE DEVICE RANG TYPE 0: 0-128K X-X 1:/dev/loop0 X-X RAID1 2:/dev/loop1 X-X RAID1 1: 128K-256K X-X 1:/dev/loop2 X-X RAID5D1 1:/dev/loop3 X-X RAID5D2 1:/dev/loop4 X-X RAID5P X-X 1:/dev/loop2 X-X RAID5D1 1:/dev/loop3 X-X RAID5D2 1:/dev/loop4 X-X RAID5P -- Extent 0 and 1 are in different raid profile, it's only possible during convert, just used as an exmple And Extent 1 are crossing 2 RAID5 stripes, so needs 2 logical range to show them all. Although it's quite hard to put the above output into 80 characters per line, it provides almost every info we need: 1) File offset and its length 2) Logical bytenr and its length 3) Device bytenr and its length (since its length can differ from logical length) 4) RAID type and its role. $ dd 2>/dev/null if=/dev/loop0 skip=12582912 bs=1 count=5; echo adaaa ** Dup The command shows both the copies $ sudo mkfs.btrfs -f -d single -m single /dev/loop0 $ sudo mount /dev/loop0 mnt/ $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0 devid: 1 dev_name: /dev/loop0 offset: 71303168 type: DUP devid: 1 dev_name: /dev/loop0 offset: 104857600 type: DUP $ dd 2>/dev/null if=/dev/loop0 skip=104857600 bs=1 count=5 ; echo adaaa ** Raid1 The command shows both the copies $ sudo mkfs.btrfs -f -d raid1 -m raid1 /dev/loop0 /dev/loop1 $ sudo mount /dev/loop0 mnt/ $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0 devid: 2 dev_name: /dev/loop1 offset: 61865984 type: RAID1 devid: 1 dev_name: /dev/loop0 offset: 81788928 type: RAID1 $ dd 2>/dev/null if=/dev/loop0 skip=81788928 bs=1 count=5; echo adaaa ** Raid10 The command show both the copies; if you set an offset to the next disk-stripe, you can see the next pair of disk-stripe $ sudo mkfs.btrfs -f -d raid10 -m raid10 /dev/loop[0123] $ sudo mount /dev/loop0 mnt/ $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0 devid: 4 dev_name: /dev/loop3 offset: 61931520 type: RAID10 devid: 3 dev_name: /dev/loop2 offset: 61931520 type: RAID10 $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5; echo adaaa $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536 mnt/out.txt: 65536 devid: 2 dev_name: /dev/loop1 offset: 61931520 type: RAID10 devid: 1 dev_name: /dev/loop0 offset: 81854464 type: RAID10 $ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo bdbbb ** Raid5 Depending by the offset, you can see which disk-stripe is used. $ sudo mkfs.btrfs -f -d raid5 -m raid5 /dev/loop[012] $ sudo mount /dev/loop0 mnt/ $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0 devid: 2 dev_name: /dev/loop1 offset: 61931520 type: DATA devid: 1 dev_name: /dev/loop0 offset: 81854464 type: OTHER devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY Here DATA/OTHER is a little confusing. For 4 disks
[PATCH] Btrfs: deal with unexpected return value in flush_space
Function start_transaction() can return ERR_PTR(1) when flush is BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is start_transaction (return ERR_PTR(1)) -> btrfs_block_rsv_add (return 1) -> reserve_metadata_bytes (return 1) -> flush_space (return 1) -> do_chunk_alloc (return 1) With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the flush_state of ALLOC_CHUNK and it successfully allocates a new chunk, then instead of trying to reserve space again, reserve_metadata_bytes returns 1 immediately. Eventually the callers who call start_transaction() usually just do the IS_ERR() check which ERR_PTR(1) can pass, then it'll get a panic when dereferencing a pointer which is ERR_PTR(1). This makes flush_space() translate 'ret = 1' to 'ret = 0'. Signed-off-by: Liu Bo--- We found this 'NULL pointer dereference' on an old 3.8 kernel but it's not going to happen on the upstream since there is no caller of btrfs_start_transaction_lflush(). fs/btrfs/extent-tree.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 7a35c9d..a00fb67 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4457,6 +4457,15 @@ void check_system_chunk(struct btrfs_trans_handle *trans, } } +/* + * If force is CHUNK_ALLOC_FORCE: + *- return 1 if it successfully allocates a chunk, + *- return errors including -ENOSPC otherwise. + * If force is NOT CHUNK_ALLOC_FORCE: + *- return 0 if it doesn't need to allocate a new chunk, + *- return 1 if it successfully allocates a chunk, + *- return errors including -ENOSPC otherwise. + */ static int do_chunk_alloc(struct btrfs_trans_handle *trans, struct btrfs_root *extent_root, u64 flags, int force) { @@ -4857,7 +4866,7 @@ static int flush_space(struct btrfs_root *root, btrfs_get_alloc_profile(root, 0), CHUNK_ALLOC_NO_FORCE); btrfs_end_transaction(trans, root); - if (ret == -ENOSPC) + if (ret == -ENOSPC || ret == 1) ret = 0; break; case COMMIT_TRANS: -- 2.5.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] Add some helper functions
At 07/28/2016 01:43 AM, Goffredo Baroncelli wrote: From: Goffredo BaroncelliAdd the following functions: - int is_btrfs_fs(const char *path) -> returns 0 if path is a btrfs filesystem - void check_root_or_exit() -> checks if the user has the root capability or it exits writing an error message - void check_btrfs_or_exit(const char *path) checks if path is a valid btrfs filesystem, otherwise it exits Signed-off-by: Goffredo baroncelli --- utils.c | 41 + utils.h | 14 ++ 2 files changed, 55 insertions(+) diff --git a/utils.c b/utils.c index 578fdb0..b99706c 100644 --- a/utils.c +++ b/utils.c @@ -4131,3 +4131,44 @@ unsigned int rand_range(unsigned int upper) */ return (unsigned int)(jrand48(rand_seed) % upper); } + +/* + * check if path is a btrfs filesystem + */ +int is_btrfs_fs(const char *path) +{ + struct statfs stfs; + + if (statfs(path, ) != 0) { + /* cannot access */ + return -1; + } + + if (stfs.f_type != BTRFS_SUPER_MAGIC) { + /* not a btrfs filesystem */ + return -2; + } + + return 0; +} + +/* + * check if the user is root + */ +void check_root_or_exit() +{ +if (geteuid() == 0) +return; + +error("You need to be root to execute this command"); +exit(100); No immediate exit value, especially such like 100. Normally we only use 1 and 0 as exit value. Another concern about the function is, we don't really do such early check on root privilege. Under most case, we just call privilege function, like tree search ioctl, and when it fails, it will return -EPERM to info user that they lacks the privilege. Such behavior makes code more extendable, for case like the ioctl becomes non-privilege, btrfs-progs don't need any modification. So I think it's better to let ioctl itself to do the privilege check other than in btrfs-progs. +} + +void check_btrfs_or_exit(const char *path) +{ +if (!is_btrfs_fs(path)) +return; + +error("'%s' must be a valid btrfs filesystem", path); +exit(100); +} Same exit value problem. This btrfs check seems quite good. What about merge it into functions like open_file_or_dir? As most caller uses such function to open file/dir inside a btrfs mount point. Thanks, Qu diff --git a/utils.h b/utils.h index 98bfb34..0bd6ecb 100644 --- a/utils.h +++ b/utils.h @@ -399,4 +399,18 @@ unsigned int rand_range(unsigned int upper); /* Also allow setting the seed manually */ void init_rand_seed(u64 seed); +/* return 0 if path is a valid btrfs filesystem */ +int is_btrfs_fs(const char *path); + +/* + * check if the user has the root capability, otherwise it exits printing an + * error message + */ +void check_root_or_exit(); +/* + * check if path is a valid btrfs filesystem, otherwise it exits printing an + * error message + */ +void check_btrfs_or_exit(const char *path); + #endif -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [sparc64] mkfs.btrfs bus error / align issue?
On Wed, Jul 27, 2016 at 3:40 PM, John Paul Adrian Glaubitzwrote: > On 07/27/2016 03:59 PM, Anatoly Pugachev wrote: >> Program received signal SIGBUS, Bus error. >> 0x0015e160 in write_raid56_with_parity (info=0x2b17b0, >> eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) at >> volumes.c:2156 >> 2156*(unsigned long *)(p_eb->data + i) ^= > > Well, that pretty much looks some creative pointer arithmetics that will > provoke > unaligned access. Just check what the declaration of p_eb->data is. If it's > not "unsigned long", then you know why the code breaks here. > Yeah, so basically, the best solution (assuming you can't change the alignment of `data` somehow) would look something like: Compare lower 'n' bits of `data` pointers (n=2 for ILP32, n=3 for LP64), something like (data1 & sizeof(long)-1) == (data2 & sizeof(long)-1). If they are equal, fast loop possible. Not equal -> slow loop (all byte-by-byte copy). fast loop path: do byte-by-byte XOR until lower n bits of (data+i) are zero. do word-by-word XOR until < 1 word do byte-by-bye XOR until last bytes done. This sort of processing is pretty standard for "chunking". If this was done with some cool SIMD instruction set, it would have similar sort of approach, I'd imagine. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [sparc64] mkfs.btrfs bus error / align issue?
On Wed, Jul 27, 2016 at 3:39 PM, Patrick Baggettwrote: > On Wed, Jul 27, 2016 at 8:59 AM, Anatoly Pugachev wrote: >> >> Hello! >> >> Running xfstests suite, got in logs mkfs.btrfs bus error, debugging it >> shows the following : >> >> mator@nvg5120:~/btrfs-progs$ git log -1 --oneline >> 40650bf Btrfs progs v4.6.1 >> >> root@nvg5120:/home/mator/xfstests# gdb >> GNU gdb (Debian 7.11.1-2) 7.11.1 >> (gdb) file /opt/btrfs/bin/mkfs.btrfs >> Reading symbols from /opt/btrfs/bin/mkfs.btrfs...done. >> (gdb) set args -f -draid5 -mraid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 >> (gdb) run >> Starting program: /opt/btrfs/bin/mkfs.btrfs -f -draid5 -mraid5 >> /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 >> [Thread debugging using libthread_db enabled] >> Using host libthread_db library "/lib/sparc64-linux-gnu/libthread_db.so.1". >> btrfs-progs v4.6.1 >> See http://btrfs.wiki.kernel.org for more information. >> >> ERROR: superblock checksum mismatch >> ERROR: superblock checksum mismatch >> ERROR: superblock checksum mismatch >> Performing full device TRIM (2.00GiB) ... >> Performing full device TRIM (2.00GiB) ... >> Performing full device TRIM (2.00GiB) ... >> Performing full device TRIM (2.00GiB) ... >> >> Program received signal SIGBUS, Bus error. >> 0x0015e160 in write_raid56_with_parity (info=0x2b17b0, >> eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) at >> volumes.c:2156 >> 2156*(unsigned long *)(p_eb->data + i) ^= >> (gdb) bt >> #0 0x0015e160 in write_raid56_with_parity (info=0x2b17b0, >> eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) >> at volumes.c:2156 >> #1 0x00119b30 in write_and_map_eb (trans=0x2cc250, >> root=0x2c7d80, eb=0x2c7fe0) at disk-io.c:426 >> #2 0x00119e74 in write_tree_block (trans=0x2cc250, >> root=0x2c7d80, eb=0x2c7fe0) at disk-io.c:459 >> #3 0x0011a4ac in __commit_transaction (trans=0x2cc250, >> root=0x2c7d80) at disk-io.c:562 >> #4 0x0011a7b8 in btrfs_commit_transaction (trans=0x2cc250, >> root=0x2c7d80) at disk-io.c:598 >> #5 0x001a2b04 in main (argc=8, argv=0x7fef698) at mkfs.c:1786 >> (gdb) >> >> Can someone help please? Thanks. >> > > The code that faults: > > (unsigned long *)(p_eb->data + i) ^= *(unsigned long *)(ebs[j]->data + i); > > Because struct extent_buffer has 'data' as a char[], this will always > fault on sparc64 and probably a number of other RISC architectures. It > increments the address by 1, then reads an 8-byte chunk, XORs, and > writes the 8-byte chunk, repeat. In other words, 7 out of 8 reads > would fault, even if both `data` pointers were 8-byte aligned. Actually, I misread that: the code increments i by sizeof(unsigned long), not by 1 so only if either data point is misaligned, then this will fault. Disregard that. > > This would probably fix it, though it looks ugly. > unsigned long a, b; > memcpy(, p_eb->data + i, sizeof(a)); /* Read 8 bytes from p_eb->data+i */ > memcpy(, ebs[j]->data + i, sizeof(b)); /* Read 8 bytes from ebs[j]->data+i > */ > a ^= b; /* XOR */ > memcpy(p_eb->data + i, , sizeof(a)); /* Write back to p_eb->data+i */ > > I'm not familiar with btrfs, but the results seems like they depend on > the sizeof(unsigned long). Given that they used parentheses, I assume > it was intentional. However, if this was supposed to do an XOR > operation 8 bytes at a time, then it would need to be something like: > > *(((unsigned long *)p_eb->data)+i) ^= *(((unsigned long *)ebs[j]->data) + i); > > i.e. cast pointer to unsigned long*, then add i (which would index > array of unsigned long, not char). > > --Patrick -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [sparc64] mkfs.btrfs bus error / align issue?
On 07/27/2016 03:59 PM, Anatoly Pugachev wrote: > Program received signal SIGBUS, Bus error. > 0x0015e160 in write_raid56_with_parity (info=0x2b17b0, > eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) at > volumes.c:2156 > 2156*(unsigned long *)(p_eb->data + i) ^= Well, that pretty much looks some creative pointer arithmetics that will provoke unaligned access. Just check what the declaration of p_eb->data is. If it's not "unsigned long", then you know why the code breaks here. You should be able to fix this issue by replacing the code line with: unsigned long tmp; memcpy(, &(ebs[j]->data + i), sizeof(unsigned long)); tmp ^= tmp; memcpy(&(p_eb->data + i), , sizeof(unsigned long)); I'm currently not sure whether this can be done more elegantly without provoking unaligned access due to the bitwise XOR (^). In either case, your problem is the casting and using memcpy() should fix the problem. Adrian -- .''`. John Paul Adrian Glaubitz : :' : Debian Developer - glaub...@debian.org `. `' Freie Universitaet Berlin - glaub...@physik.fu-berlin.de `-GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [sparc64] mkfs.btrfs bus error / align issue?
On Wed, Jul 27, 2016 at 8:59 AM, Anatoly Pugachevwrote: > > Hello! > > Running xfstests suite, got in logs mkfs.btrfs bus error, debugging it > shows the following : > > mator@nvg5120:~/btrfs-progs$ git log -1 --oneline > 40650bf Btrfs progs v4.6.1 > > root@nvg5120:/home/mator/xfstests# gdb > GNU gdb (Debian 7.11.1-2) 7.11.1 > (gdb) file /opt/btrfs/bin/mkfs.btrfs > Reading symbols from /opt/btrfs/bin/mkfs.btrfs...done. > (gdb) set args -f -draid5 -mraid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 > (gdb) run > Starting program: /opt/btrfs/bin/mkfs.btrfs -f -draid5 -mraid5 > /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 > [Thread debugging using libthread_db enabled] > Using host libthread_db library "/lib/sparc64-linux-gnu/libthread_db.so.1". > btrfs-progs v4.6.1 > See http://btrfs.wiki.kernel.org for more information. > > ERROR: superblock checksum mismatch > ERROR: superblock checksum mismatch > ERROR: superblock checksum mismatch > Performing full device TRIM (2.00GiB) ... > Performing full device TRIM (2.00GiB) ... > Performing full device TRIM (2.00GiB) ... > Performing full device TRIM (2.00GiB) ... > > Program received signal SIGBUS, Bus error. > 0x0015e160 in write_raid56_with_parity (info=0x2b17b0, > eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) at > volumes.c:2156 > 2156*(unsigned long *)(p_eb->data + i) ^= > (gdb) bt > #0 0x0015e160 in write_raid56_with_parity (info=0x2b17b0, > eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) > at volumes.c:2156 > #1 0x00119b30 in write_and_map_eb (trans=0x2cc250, > root=0x2c7d80, eb=0x2c7fe0) at disk-io.c:426 > #2 0x00119e74 in write_tree_block (trans=0x2cc250, > root=0x2c7d80, eb=0x2c7fe0) at disk-io.c:459 > #3 0x0011a4ac in __commit_transaction (trans=0x2cc250, > root=0x2c7d80) at disk-io.c:562 > #4 0x0011a7b8 in btrfs_commit_transaction (trans=0x2cc250, > root=0x2c7d80) at disk-io.c:598 > #5 0x001a2b04 in main (argc=8, argv=0x7fef698) at mkfs.c:1786 > (gdb) > > Can someone help please? Thanks. > The code that faults: (unsigned long *)(p_eb->data + i) ^= *(unsigned long *)(ebs[j]->data + i); Because struct extent_buffer has 'data' as a char[], this will always fault on sparc64 and probably a number of other RISC architectures. It increments the address by 1, then reads an 8-byte chunk, XORs, and writes the 8-byte chunk, repeat. In other words, 7 out of 8 reads would fault, even if both `data` pointers were 8-byte aligned. This would probably fix it, though it looks ugly. unsigned long a, b; memcpy(, p_eb->data + i, sizeof(a)); /* Read 8 bytes from p_eb->data+i */ memcpy(, ebs[j]->data + i, sizeof(b)); /* Read 8 bytes from ebs[j]->data+i */ a ^= b; /* XOR */ memcpy(p_eb->data + i, , sizeof(a)); /* Write back to p_eb->data+i */ I'm not familiar with btrfs, but the results seems like they depend on the sizeof(unsigned long). Given that they used parentheses, I assume it was intentional. However, if this was supposed to do an XOR operation 8 bytes at a time, then it would need to be something like: *(((unsigned long *)p_eb->data)+i) ^= *(((unsigned long *)ebs[j]->data) + i); i.e. cast pointer to unsigned long*, then add i (which would index array of unsigned long, not char). --Patrick -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [sparc64] mkfs.btrfs bus error / align issue?
On Wed, Jul 27, 2016 at 04:59:27PM +0300, Anatoly Pugachev wrote: > Hello! > > Running xfstests suite, got in logs mkfs.btrfs bus error, debugging it > shows the following : > > Program received signal SIGBUS, Bus error. > 0x0015e160 in write_raid56_with_parity (info=0x2b17b0, > eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) at > volumes.c:2156 > 2156*(unsigned long *)(p_eb->data + i) ^= Yeah, clear unaligned access. We have helpers for so I'll fix it. I was looking for a way to simulate and catch that on x86 or at least let gcc warn but no such thing seems to exist. Which means we might accidentally introduce that in the future. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3] Btrfs: remove BUG() in raid56
This BUG() has been triggered by a fuzz testing image, which contains an invalid chunk type, ie. a single stripe chunk has the raid6 type. Btrfs can handle this gracefully by returning -EIO, so besides using btrfs_warn to give us more debugging information rather than a single BUG(), we can return error properly. Signed-off-by: Liu Bo--- v2: use btrfs_warn with more debugging information instead of WARN_ONCE. v3: - give a short summary about what happens when the error occurs. - add a ASSERT() which only takes effect for developers. fs/btrfs/raid56.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index f8b6d41..6f8addf 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -2139,7 +2139,11 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, rbio->faila = find_logical_bio_stripe(rbio, bio); if (rbio->faila == -1) { - BUG(); + btrfs_warn(root->fs_info, + "%s could not find the bad stripe in raid56 so that we cannot recover any more (bio has logical %llu len %llu, bbio has map_type %llu)", + __func__, (u64)bio->bi_iter.bi_sector << 9, + (u64)bio->bi_iter.bi_size, bbio->map_type); + ASSERT(0); if (generic_io) btrfs_put_bbio(bbio); kfree(rbio); -- 2.5.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump'
Hi all, the following patches add two new commands: 1) btrfs inspect-internal physical-find 2) btrfs inspect-internal physical-dump The aim of these two new commands is to locate (1) and dump (2) the stripe elements stored on the disks. I developed these two new command to simplify the debugging of some RAID5 bugs (but this is another discussion). An example of 'btrfs inspect-internal physical-find' is the following: # btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0 devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY In the output above, DATA is the stripe elemnt conaining data. OTHER is the sibling stripe elemnt: it contains data related to or other files or to the same file but different position. The two stripe elements contain the RAID6 parity (P and Q). It is possible to pass the offset of the file to inspect. An example of 'btrfs inspect-internal physical-dump' is the following # btrfs insp physical-find mnt/out.txt mnt/out.txt: 0 devid: 5 dev_name: /dev/loop4 offset: 56819712 type: OTHER devid: 4 dev_name: /dev/loop3 offset: 56819712 type: OTHER devid: 3 dev_name: /dev/loop2 offset: 56819712 type: DATA devid: 2 dev_name: /dev/loop1 offset: 56819712 type: PARITY devid: 1 dev_name: /dev/loop0 offset: 76742656 type: PARITY # btrfs insp physical-dump mnt/out.txt | xxd mnt/out.txt: 0 file: /dev/loop2 off=56819712 : 6164 6161 6161 6161 6161 6161 6161 6161 adaa 0010: 6161 6161 6161 6161 6161 6161 6161 6161 0020: 6161 6161 6161 6161 6161 6161 6161 6161 0030: 6161 6161 6161 6161 6161 6161 6161 6161 0040: 6161 6161 6161 6161 6161 6161 6161 6161 [...] In this case it is dumped the content of the first 4k of the file. It is possible to pass also an offset (at step of 4k). Moreover it is possible to select to dump: which copy has to be dumped (switch -c, only for RAID1/RAID10/DUP); which parity has to be dumped (switch -p, only for RAID5/RAID6); which stripe element other than data (switch -s, only for RAID5/RAID6). ChangeLog: v1: 2016-07-24 First issue v2: 2016-07-27 After Qu suggestion, it is added the switch '-l' to dump the info from a "logical" address BR G.Baroncelli -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/5] New btrfs command: "btrfs inspect physical-find"
From: Goffredo BaroncelliThe aim of this new command is to show the physical placement on the disk of a file. Currently it handles all the profiles (single, dup, raid1/10/5/6). The syntax is simple: where: is the file to inspect is the offset of the file to inspect (default 0) Below some examples: ** Single $ sudo mkfs.btrfs -f -d single -m single /dev/loop0 $ sudo mount /dev/loop0 mnt/ $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0 devid: 1 dev_name: /dev/loop0 offset: 12582912 type: LINEAR $ dd 2>/dev/null if=/dev/loop0 skip=12582912 bs=1 count=5; echo adaaa ** Dup The command shows both the copies $ sudo mkfs.btrfs -f -d single -m single /dev/loop0 $ sudo mount /dev/loop0 mnt/ $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0 devid: 1 dev_name: /dev/loop0 offset: 71303168 type: DUP devid: 1 dev_name: /dev/loop0 offset: 104857600 type: DUP $ dd 2>/dev/null if=/dev/loop0 skip=104857600 bs=1 count=5 ; echo adaaa ** Raid1 The command shows both the copies $ sudo mkfs.btrfs -f -d raid1 -m raid1 /dev/loop0 /dev/loop1 $ sudo mount /dev/loop0 mnt/ $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0 devid: 2 dev_name: /dev/loop1 offset: 61865984 type: RAID1 devid: 1 dev_name: /dev/loop0 offset: 81788928 type: RAID1 $ dd 2>/dev/null if=/dev/loop0 skip=81788928 bs=1 count=5; echo adaaa ** Raid10 The command show both the copies; if you set an offset to the next disk-stripe, you can see the next pair of disk-stripe $ sudo mkfs.btrfs -f -d raid10 -m raid10 /dev/loop[0123] $ sudo mount /dev/loop0 mnt/ $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0 devid: 4 dev_name: /dev/loop3 offset: 61931520 type: RAID10 devid: 3 dev_name: /dev/loop2 offset: 61931520 type: RAID10 $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5; echo adaaa $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536 mnt/out.txt: 65536 devid: 2 dev_name: /dev/loop1 offset: 61931520 type: RAID10 devid: 1 dev_name: /dev/loop0 offset: 81854464 type: RAID10 $ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo bdbbb ** Raid5 Depending by the offset, you can see which disk-stripe is used. $ sudo mkfs.btrfs -f -d raid5 -m raid5 /dev/loop[012] $ sudo mount /dev/loop0 mnt/ $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0 devid: 2 dev_name: /dev/loop1 offset: 61931520 type: DATA devid: 1 dev_name: /dev/loop0 offset: 81854464 type: OTHER devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt 65536mnt/out.txt: 65536 devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER devid: 1 dev_name: /dev/loop0 offset: 81854464 type: DATA devid: 3 dev_name: /dev/loop2 offset: 61931520 type: PARITY $ dd 2>/dev/null if=/dev/loop1 skip=61931520 bs=1 count=5; echo adaaa $ dd 2>/dev/null if=/dev/loop0 skip=81854464 bs=1 count=5; echo bdbbb $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 | xxd : 0300 0303 03 . The parity is computed as: parity=disk1^disk2. So "adaa" ^ "bdbb" == "\x03\x00\x03\x03 ** Raid6 $ sudo mkfs.btrfs -f -mraid6 -draid6 /dev/loop[0-4]^C $ sudo mount /dev/loop0 mnt/ $ python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null $ sudo ../btrfs-progs/btrfs inspect physical-find mnt/out.txt mnt/out.txt: 0 devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY $ dd 2>/dev/null if=/dev/loop2 skip=61931520 bs=1 count=5 ; echo adaaa Signed-off-by: Goffredo Baroncelli --- cmds-inspect.c | 587 + 1 file changed, 587 insertions(+) diff --git a/cmds-inspect.c b/cmds-inspect.c index dd7b9dd..fc2e7c3 100644 --- a/cmds-inspect.c +++ b/cmds-inspect.c @@ -22,6 +22,11 @@ #include #include #include +#include +#include +#include +#include +#include #include "kerncompat.h" #include "ioctl.h" @@ -623,6 +628,586 @@ out: return !!ret; } + +static const char * const cmd_inspect_physical_find_usage[] = { +
[PATCH 1/5] Add some helper functions
From: Goffredo BaroncelliAdd the following functions: - int is_btrfs_fs(const char *path) -> returns 0 if path is a btrfs filesystem - void check_root_or_exit() -> checks if the user has the root capability or it exits writing an error message - void check_btrfs_or_exit(const char *path) checks if path is a valid btrfs filesystem, otherwise it exits Signed-off-by: Goffredo baroncelli --- utils.c | 41 + utils.h | 14 ++ 2 files changed, 55 insertions(+) diff --git a/utils.c b/utils.c index 578fdb0..b99706c 100644 --- a/utils.c +++ b/utils.c @@ -4131,3 +4131,44 @@ unsigned int rand_range(unsigned int upper) */ return (unsigned int)(jrand48(rand_seed) % upper); } + +/* + * check if path is a btrfs filesystem + */ +int is_btrfs_fs(const char *path) +{ + struct statfs stfs; + + if (statfs(path, ) != 0) { + /* cannot access */ + return -1; + } + + if (stfs.f_type != BTRFS_SUPER_MAGIC) { + /* not a btrfs filesystem */ + return -2; + } + + return 0; +} + +/* + * check if the user is root + */ +void check_root_or_exit() +{ +if (geteuid() == 0) +return; + +error("You need to be root to execute this command"); +exit(100); +} + +void check_btrfs_or_exit(const char *path) +{ +if (!is_btrfs_fs(path)) +return; + +error("'%s' must be a valid btrfs filesystem", path); +exit(100); +} diff --git a/utils.h b/utils.h index 98bfb34..0bd6ecb 100644 --- a/utils.h +++ b/utils.h @@ -399,4 +399,18 @@ unsigned int rand_range(unsigned int upper); /* Also allow setting the seed manually */ void init_rand_seed(u64 seed); +/* return 0 if path is a valid btrfs filesystem */ +int is_btrfs_fs(const char *path); + +/* + * check if the user has the root capability, otherwise it exits printing an + * error message + */ +void check_root_or_exit(); +/* + * check if path is a valid btrfs filesystem, otherwise it exits printing an + * error message + */ +void check_btrfs_or_exit(const char *path); + #endif -- 2.8.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/5] new command btrfs inspect physical-dump
From: Goffredo BaroncelliThe aim of this command, is to dump the disk content of a file bypassing the btrfs filesystem. This could help to test the btrfs filesystem. The dump size is a page (4k) (even if the file is shorter). It is possible to set an offset for the file portion to read, but even this offset must be multiple of 4k. With the switch -c , it is possible to select whch copy will be dumped (RAID1/RAID10/DUP). With the switch -p, it is possible to select which parity will be dumped (RAID5/RAID6) With the switch -s, it is possible to dump the other elemnt of the stripe (RAID5, RAID6) # btrfs insp physical-dump /bin/ls 8192 | xxd /bin/ls: 8192 file: /dev/sda3 off=16600629248 : b0e2 6100 0700 5200 ..a.R... 0010: b8e2 6100 ..a. 0020: 0700 5300 S... 0030: c0e2 6100 0700 5400 ..a.T... [...] Signed-off-by: Goffredo Baroncelli --- cmds-inspect.c | 320 + 1 file changed, 320 insertions(+) diff --git a/cmds-inspect.c b/cmds-inspect.c index fc2e7c3..0e7d725 100644 --- a/cmds-inspect.c +++ b/cmds-inspect.c @@ -1208,6 +1208,324 @@ out: return ret; } +static const char * const cmd_inspect_physical_dump_usage[] = { + "btrfs inspect-internal physical-dump [-c |-s |-p ] [-l |]", + "Dump the physical content of a file offset", + " file to dump", + " file offset to dump; 0 if not specified", + " dump logical address of filesystem, ", + "number of copy to dump (for raid1,dup/raid10)", + " number of parity to dump (for raid5/raid6)", + " number of stripe elemnt to dump (for raid5/raid6)", + "This command requires root privileges", + NULL +}; + +static int dumpfile(const char *fname, u64 off) +{ + int fd = -1; + int size = 4096; + char buf[size]; + int r; + int e = 0; + off_t r1; + + fprintf(stderr, "dev: %s off=%llu\n", fname, off); + + fd = open(fname, O_RDONLY|O_APPEND); + if (fd < 0) { + int e = errno; + + error("cannot open file: '%s'\n", strerror(e)); + return -e; + } + + r1 = lseek(fd, off, SEEK_SET); + if (r1 == (off_t)-1) { + e = -errno; + error("cannot seek file: '%s'\n", strerror(-e)); + goto out; + } + + while (size) { + r = read(fd, buf, size); + if (r < 0) { + e = -errno; + error("cannot read file: '%s'\n", strerror(-e)); + goto out; + } + + size -= r; + r = fwrite(buf, r, 1, stdout); + if (r < 0) { + e = -errno; + error("cannot write: '%s'\n", strerror(-e)); + goto out; + } + + } + +out: + if (fd != -1) + close(fd); + return e; +} + +static int cmd_inspect_physical_dump(int argc, char **argv) +{ + int ret = 0; + int fd; + char *fname; + u64 profile_type; + struct btrfs_ioctl_dev_info_args *disks = NULL; + struct btrfs_ioctl_fs_info_args fi_args = {0}; + char btrfs_chunk_data[4096]; + struct btrfs_chunk *chunk_item = (struct btrfs_chunk *)_chunk_data; + u64 chunk_offset = 0; + struct stripe_info *stripes = NULL; + int stripes_count = 0; + int rc; + int copynr = 0; + int paritynr = -1; + int stripenr = -1; + const char *logical_arg = NULL; + u64 logical = 0ull; + + optind = 1; + while (1) { + int c = getopt(argc, argv, "c:p:s:l:"); + + if (c < 0) + break; + + switch (c) { + case 'c': + copynr = atoi(optarg); + break; + case 'p': + paritynr = atoi(optarg); + break; + case 's': + stripenr = atoi(optarg); + break; + case 'l': + logical_arg = optarg; + break; + default: + usage(cmd_inspect_physical_dump_usage); + } + } + + if ((logical_arg != NULL && check_argc_exact(argc - optind, 1)) || + (check_argc_min(argc - optind, 1) || check_argc_max(argc - optind, 2))) + usage(cmd_inspect_physical_dump_usage); + + fname = argv[optind]; + + check_root_or_exit(); + check_btrfs_or_exit(fname); + + fprintf(stderr, "%s: %llu\n", fname, logical); + + fd = open(fname, O_RDONLY|O_DIRECT); + if (fd < 0) { +
[PATCH 4/5] Add man page for command btrfs insp physical-find
From: Goffredo BaroncelliSigned-off-by: Goffredo Baroncelli --- Documentation/btrfs-inspect-internal.asciidoc | 7 +++ 1 file changed, 7 insertions(+) diff --git a/Documentation/btrfs-inspect-internal.asciidoc b/Documentation/btrfs-inspect-internal.asciidoc index 74f6dea..c132a0e 100644 --- a/Documentation/btrfs-inspect-internal.asciidoc +++ b/Documentation/btrfs-inspect-internal.asciidoc @@ -146,6 +146,13 @@ Print sizes and statistics of trees. -b Print raw numbers in bytes. +*physical-find* [|-l ]:: +(needs root privileges) ++ +Show the placement of a given file (at offset 'off', default 0) on the disks. +If 'logical' is given, this command shows the palcement of a logical address +on the disks. + EXIT STATUS --- *btrfs inspect-internal* returns a zero exit status if it succeeds. Non zero is -- 2.8.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/5] Add new command to man pages: btrfs insp physical-dump
From: Goffredo BaroncelliSigned-off-by: Goffredo Baroncelli --- Documentation/btrfs-inspect-internal.asciidoc | 12 1 file changed, 12 insertions(+) diff --git a/Documentation/btrfs-inspect-internal.asciidoc b/Documentation/btrfs-inspect-internal.asciidoc index c132a0e..569681d 100644 --- a/Documentation/btrfs-inspect-internal.asciidoc +++ b/Documentation/btrfs-inspect-internal.asciidoc @@ -153,6 +153,18 @@ Show the placement of a given file (at offset 'off', default 0) on the disks. If 'logical' is given, this command shows the palcement of a logical address on the disks. +*physical-dump* [-c |-s |-p ] [|-l ]:: +(needs root privileges) ++ +Dump the disk content of a given file (at offset 'off', default 0). If 'logical' +is passed, it is dumped the content of the given logical address. +For RAID1/RAID10/DUP 'copynr', select which copy will be dumped. For +RAID5/RAID6, 'paritynr' specifies which parity will be dumped. For +RAID5/RAID6, 'stripenr' specifies which stripe elemnt will be dumped. ++ +'off' and 'logical' must be a multiple of 4096. 4096 bytes are dumped, even if +the file is shorter. + EXIT STATUS --- *btrfs inspect-internal* returns a zero exit status if it succeeds. Non zero is -- 2.8.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs fi defrag does not defrag files >256kB?
On 26 July 2016 at 21:10, Duncan <1i5t5.dun...@cox.net> wrote: > Nicholas D Steeves posted on Tue, 26 Jul 2016 19:03:53 -0400 as excerpted: > >> Hi, >> >> I've been using btrfs fi defrag with out the "-r -t 32M" option for >> regular maintenance. I just learned, in >> Documentation/btrfs-convert.asciidoc, that there is a recommendation >> to run with "-t 32M" after a conversion from ext2/3/4. I then >> cross-referenced this with btrfs-filesystem(8), and found that: >> >> Extents bigger than value given by -t will be skipped, otherwise >> this value is used as a target extent size, but is only advisory >> and may not be reached if the free space is too fragmented. Use 0 >> to take the kernel default, which is 256kB but may change in the >> future. >> >> I understand the default behaviour of target extent size of 256kB to >> mean only defragment small files and metadata. Or does this mean that >> the default behaviour is to defragment extent tree metadata >256kB, >> and then defragment the (larger than 256kB) data from many extents >> into a single extent? I was surprised to read this! >> >> What's really happening with this default behaviour? Should everyone >> be using -t with a much larger value to actually defragment their >> databases? > > Something about defrag's -t option should really be in the FAQ, as it is > known to be somewhat confusing and to come up from time to time, tho this > is the first time I've seen it in the context of convert. > > In general, you are correct in that the larger the value given to -t, the > more defragging you should ultimately get. There's a practical upper > limit, however, the data chunk size, which is nominally 1 GiB (tho on > tiny btrfs it's smaller and on TB-scale it can be larger, to 8 or 10 GiB > IIRC). 32-bit btrfs-progs defrag also had a bug at one point that would > (IIRC) kill the parameter if it was set to 2+ GiB -- that has been fixed > by hard-coding the 32-bit max to 1 GiB, I believe. The bug didn't affect > 64-bit. In any case, 1 GiB is fine, and often the largest btrfs can do > anyway, due as I said to that being the normal data chunk size. > > And btrfs defrag only deals with data. There's no metadata defrag, tho > balance -m (or whole filesystem) will normally consolidate the metadata > into the fewest (nominally 256 MiB) metadata chunks possible as it > rewrites them. Thank you for this metadata consolidation tip! > In that regard a defrag -t 32M recommendation is reasonable for a > converted filesystem, tho you can certainly go larger... to 1 GiB as I > said. > I only mentioned btrfs-convert.asciidoc, because that's what led me to the discrepancy between the default target extent size value, and a recommended value. I was searching for everything I could find on defrag, because I had begun to suspect that it wasn't functioning as expected. Is there any reason why defrag without -t cannot detect and default to the data chunk size, or why it does not default to 1 GiB? In the same way that balance's default behaviour is a full balance, shouldn't defrag's default behaviour defrag whole chunks? Does it not default to 1 GiB because that would increase the number of cases where defrag unreflinks and duplicates files--leading to an ENOSPC? https://github.com/kdave/btrfsmaintenance/blob/master/btrfs-defrag.sh uses -t 32M ; if a default target extent size of 1GiB is too radical, why not set it it 32M? If SLED ships btrfsmaintenance, then defrag -t 32M should be well-tested, no? Thank you, Nicholas -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: Make RAID stripesize configurable
Hi Sanidhya, On 2016-07-27 08:12, Sanidhya Solanki wrote: > The reason for this limit is the fact that, as I noted above the real > stripe size is currently 4KiB, with an element size of 64KiB. I am not able to understand this sentence: on the best of my knowledge, in btrfs the RAID5/RAID6 stripe is composed by several sub-stripes (I am not sure about the terminology to adopt); the number of sub-stripe is equal to the number of the disk. Until now, in btrfs the size of sub-stripe is fixed to 64k, so the size of a stripe is equal to 64k * . So for raid5 the minimum stripe size is 192k, for raid6 is 256k. Why you are writing that the real stripe size is 4kb (may be that you are referring to the be the page size ?). I am quite sure that the problem is the terminology. Could you be so kindly to explain what you are meaning ? Thanks in advance. BR G.Baroncelli -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V20 04/19] Btrfs: subpage-blocksize: Define extent_buffer_head
On Tuesday, July 26, 2016 01:42:08 PM Josef Bacik wrote: > On 07/04/2016 12:34 AM, Chandan Rajendra wrote: > > In order to handle multiple extent buffers per page, first we need to > > create a > > way to handle all the extent buffers that are attached to a page. > > > > This patch creates a new data structure 'struct extent_buffer_head', and > > moves > > fields that are common to all extent buffers from 'struct extent_buffer' to > > 'struct extent_buffer_head' > > > > Also, this patch moves EXTENT_BUFFER_TREE_REF, EXTENT_BUFFER_DUMMY and > > EXTENT_BUFFER_IN_TREE flags from extent_buffer->ebflags to > > extent_buffer_head->bflags. > > > > Reviewed-by: Liu Bo> > Signed-off-by: Chandan Rajendra > > I'm sorry Chandan I'm still having problems with this one. XFS kmalloc()'s > its > sub pagesize ranges for it's metadata buffers, how about we do that instead > of > doing the extent_buffer_head. Look at xfs_buf_allocate_memory() for what I'm > thinking. Thanks, > Ok. I will look into the xfs metadata buffer allocation code. Thanks for the guidance. -- chandan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[sparc64] mkfs.btrfs bus error / align issue?
Hello! Running xfstests suite, got in logs mkfs.btrfs bus error, debugging it shows the following : mator@nvg5120:~/btrfs-progs$ git log -1 --oneline 40650bf Btrfs progs v4.6.1 root@nvg5120:/home/mator/xfstests# gdb GNU gdb (Debian 7.11.1-2) 7.11.1 (gdb) file /opt/btrfs/bin/mkfs.btrfs Reading symbols from /opt/btrfs/bin/mkfs.btrfs...done. (gdb) set args -f -draid5 -mraid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 (gdb) run Starting program: /opt/btrfs/bin/mkfs.btrfs -f -draid5 -mraid5 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/sparc64-linux-gnu/libthread_db.so.1". btrfs-progs v4.6.1 See http://btrfs.wiki.kernel.org for more information. ERROR: superblock checksum mismatch ERROR: superblock checksum mismatch ERROR: superblock checksum mismatch Performing full device TRIM (2.00GiB) ... Performing full device TRIM (2.00GiB) ... Performing full device TRIM (2.00GiB) ... Performing full device TRIM (2.00GiB) ... Program received signal SIGBUS, Bus error. 0x0015e160 in write_raid56_with_parity (info=0x2b17b0, eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) at volumes.c:2156 2156*(unsigned long *)(p_eb->data + i) ^= (gdb) bt #0 0x0015e160 in write_raid56_with_parity (info=0x2b17b0, eb=0x2c7fe0, multi=0x2c2870, stripe_len=65536, raid_map=0x2c2570) at volumes.c:2156 #1 0x00119b30 in write_and_map_eb (trans=0x2cc250, root=0x2c7d80, eb=0x2c7fe0) at disk-io.c:426 #2 0x00119e74 in write_tree_block (trans=0x2cc250, root=0x2c7d80, eb=0x2c7fe0) at disk-io.c:459 #3 0x0011a4ac in __commit_transaction (trans=0x2cc250, root=0x2c7d80) at disk-io.c:562 #4 0x0011a7b8 in btrfs_commit_transaction (trans=0x2cc250, root=0x2c7d80) at disk-io.c:598 #5 0x001a2b04 in main (argc=8, argv=0x7fef698) at mkfs.c:1786 (gdb) Can someone help please? Thanks. PS: /dev/loop is ramdisk devices: # mount tmpfs -t tmpfs -o size=12g /ramdisk # fallocate -l 3.9G /ramdisk/testvol # for i in 1 2 3 4; do fallocate -l 2G /ramdisk/scratch${i} ; done # ls -lh /ramdisk/ total 12G -rw-r--r-- 1 root root 2.0G Jul 27 16:16 scratch1 -rw-r--r-- 1 root root 2.0G Jul 27 16:16 scratch2 -rw-r--r-- 1 root root 2.0G Jul 27 16:16 scratch3 -rw-r--r-- 1 root root 2.0G Jul 27 16:16 scratch4 -rw-r--r-- 1 root root 3.9G Jul 27 16:15 testvol # for i in /ramdisk/*; do echo -n "$i : "; losetup -f --show $i; done /ramdisk/scratch1 : /dev/loop0 /ramdisk/scratch2 : /dev/loop1 /ramdisk/scratch3 : /dev/loop2 /ramdisk/scratch4 : /dev/loop3 /ramdisk/testvol : /dev/loop4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V20 01/19] Btrfs: subpage-blocksize: Fix whole page read.
On Tuesday, July 26, 2016 12:11:49 PM Josef Bacik wrote: > On 07/04/2016 12:34 AM, Chandan Rajendra wrote: > > For the subpage-blocksize scenario, a page can contain multiple > > blocks. In such cases, this patch handles reading data from files. > > > > To track the status of individual blocks of a page, this patch makes use > > of a bitmap pointed to by the newly introduced per-page 'struct > > btrfs_page_private'. > > > > The per-page btrfs_page_private->io_lock plays the same role as > > BH_Uptodate_Lock (see end_buffer_async_read()) i.e. without the io_lock > > we may end up in the following situation, > > > > NOTE: Assume 64k page size and 4k block size. Also assume that the first > > 12 blocks of the page are contiguous while the next 4 blocks are > > contiguous. When reading the page we end up submitting two "logical > > address space" bios. So end_bio_extent_readpage function is invoked > > twice, once for each bio. > > > > |-+-+-| > > | Task A | Task B | Task C | > > |-+-+-| > > | end_bio_extent_readpage | | | > > | process block 0 | | | > > | - clear BLK_STATE_IO| | | > > | - page_read_complete| | | > > | process block 1 | | | > > | | | | > > | | | | > > | | end_bio_extent_readpage | | > > | | process block 0 | | > > | | - clear BLK_STATE_IO| | > > | | - page_read_complete| | > > | | process block 1 | | > > | | | | > > | process block 11| process block 3 | | > > | - clear BLK_STATE_IO| - clear BLK_STATE_IO| | > > | - page_read_complete| - page_read_complete| | > > | - returns true| - returns true| | > > | - unlock_page() | | | > > | | | lock_page() | > > | | - unlock_page() | | > > |-+-+-| > > > > We end up incorrectly unlocking the page twice and "Task C" ends up > > working on an unlocked page. So private->io_lock makes sure that only > > one of the tasks gets "true" as the return value when page_io_complete() > > is invoked. As an optimization the patch gets the io_lock only when the > > last block of the bio_vec is being processed. > > > > Signed-off-by: Chandan Rajendra> > --- > > fs/btrfs/extent_io.c | 371 > > --- > > fs/btrfs/extent_io.h | 74 +- > > fs/btrfs/inode.c | 16 +-- > > 3 files changed, 338 insertions(+), 123 deletions(-) > > > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > > index e197d47..a349f99 100644 > > --- a/fs/btrfs/extent_io.c > > +++ b/fs/btrfs/extent_io.c > > @@ -24,6 +24,7 @@ > > > > static struct kmem_cache *extent_state_cache; > > static struct kmem_cache *extent_buffer_cache; > > +static struct kmem_cache *page_private_cache; > > static struct bio_set *btrfs_bioset; > > > > static inline bool extent_state_in_tree(const struct extent_state *state) > > @@ -174,10 +175,16 @@ int __init extent_io_init(void) > > if (!extent_buffer_cache) > > goto free_state_cache; > > > > + page_private_cache = kmem_cache_create("btrfs_page_private", > > + sizeof(struct btrfs_page_private), 0, > > + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL); > > + if (!page_private_cache) > > + goto free_buffer_cache; > > + > > btrfs_bioset = bioset_create(BIO_POOL_SIZE, > > offsetof(struct btrfs_io_bio, bio)); > > if (!btrfs_bioset) > > - goto free_buffer_cache; > > + goto free_page_private_cache; > > > > if (bioset_integrity_create(btrfs_bioset, BIO_POOL_SIZE)) > > goto free_bioset; > > @@ -188,6 +195,10 @@ free_bioset: > > bioset_free(btrfs_bioset); > > btrfs_bioset = NULL; > > > > +free_page_private_cache: > > + kmem_cache_destroy(page_private_cache); > > + page_private_cache = NULL; > > + > > free_buffer_cache: > > kmem_cache_destroy(extent_buffer_cache); > > extent_buffer_cache = NULL; > > @@ -1323,6 +1334,95 @@ int clear_record_extent_bits(struct extent_io_tree > > *tree, u64 start, u64 end, > >
Re: [PATCH v2] btrfs: fix fsfreeze hang caused by delayed iputs deal
hello, On 07/27/2016 04:10 PM, Wang Xiaoguang wrote: When running fstests generic/068, sometimes we got below deadlock: xfs_io D 8800331dbb20 0 6697 6693 0x0080 8800331dbb20 88007acfc140 880034d895c0 8800331dc000 880032d243e8 fffe 880032d24400 0001 8800331dbb38 816a9045 880034d895c0 8800331dbba8 Call Trace: [] schedule+0x35/0x80 [] rwsem_down_read_failed+0xf2/0x140 [] ? __filemap_fdatawrite_range+0xd1/0x100 [] call_rwsem_down_read_failed+0x18/0x30 [] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs] [] percpu_down_read+0x35/0x50 [] __sb_start_write+0x2c/0x40 [] start_transaction+0x2a5/0x4d0 [btrfs] [] btrfs_join_transaction+0x17/0x20 [btrfs] [] btrfs_evict_inode+0x3c4/0x5d0 [btrfs] [] evict+0xba/0x1a0 [] iput+0x196/0x200 [] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs] [] btrfs_commit_transaction+0x928/0xa80 [btrfs] [] btrfs_freeze+0x30/0x40 [btrfs] [] freeze_super+0xf0/0x190 [] do_vfs_ioctl+0x4a5/0x5c0 [] ? do_audit_syscall_entry+0x66/0x70 [] ? syscall_trace_enter_phase1+0x11f/0x140 [] SyS_ioctl+0x79/0x90 [] do_syscall_64+0x62/0x110 [] entry_SYSCALL64_slow_path+0x25/0x25 From this warning, freeze_super() already holds SB_FREEZE_FS, but btrfs_freeze() will call btrfs_commit_transaction() again, if btrfs_commit_transaction() finds that it has delayed iputs to handle, it'll start_transaction(), which will try to get SB_FREEZE_FS lock again, then deadlock occurs. The root cause is that in btrfs, sync_filesystem(sb) does not make sure all metadata is updated. There still maybe some codes adding delayed iputs, see below sample race window: CPU1 | CPU2 |-> freeze_super() | |-> sync_filesystem(sb); | | |-> cleaner_kthread() | | |-> btrfs_delete_unused_bgs() | | |-> btrfs_remove_chunk() | | |-> btrfs_remove_block_group() | | |-> btrfs_add_delayed_iput() | | |-> sb->s_writers.frozen = SB_FREEZE_FS; | |-> sb_wait_write(sb, SB_FREEZE_FS); | | acquire SB_FREEZE_FS lock. | | | |-> btrfs_freeze() | |-> btrfs_commit_transaction() | |-> btrfs_run_delayed_iputs() | | will handle delayed iputs, | | that means start_transaction() | | will be called, which will try | | to get SB_FREEZE_FS lock. | To fix this issue, in btrfs_commit_transaction(), if fs already is in SB_FREEZE_FS, we do not handle delayed iputs. Signed-off-by: Wang Xiaoguang--- fs/btrfs/transaction.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 948aa18..e6931d9 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -2277,8 +2277,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, kmem_cache_free(btrfs_trans_handle_cachep, trans); + /* +* If s_writers.frozen >= SB_FREEZE_FS, we can not handle delayed +* iputs, otherwise it'll result in deallock about SB_FREEZE_FS. +*/ if (current != root->fs_info->transaction_kthread && - current != root->fs_info->cleaner_kthread) + current != root->fs_info->cleaner_kthread && + root->fs_info->sb->s_writers.frozen >= SB_FREEZE_FS) btrfs_run_delayed_iputs(root); Sorry, I sent this too quickly... Above deadlock still occurs, please wait a while. Regards, Xiaoguang Wang return ret; -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] btrfs: fix fsfreeze hang caused by delayed iputs deal
When running fstests generic/068, sometimes we got below deadlock: xfs_io D 8800331dbb20 0 6697 6693 0x0080 8800331dbb20 88007acfc140 880034d895c0 8800331dc000 880032d243e8 fffe 880032d24400 0001 8800331dbb38 816a9045 880034d895c0 8800331dbba8 Call Trace: [] schedule+0x35/0x80 [] rwsem_down_read_failed+0xf2/0x140 [] ? __filemap_fdatawrite_range+0xd1/0x100 [] call_rwsem_down_read_failed+0x18/0x30 [] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs] [] percpu_down_read+0x35/0x50 [] __sb_start_write+0x2c/0x40 [] start_transaction+0x2a5/0x4d0 [btrfs] [] btrfs_join_transaction+0x17/0x20 [btrfs] [] btrfs_evict_inode+0x3c4/0x5d0 [btrfs] [] evict+0xba/0x1a0 [] iput+0x196/0x200 [] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs] [] btrfs_commit_transaction+0x928/0xa80 [btrfs] [] btrfs_freeze+0x30/0x40 [btrfs] [] freeze_super+0xf0/0x190 [] do_vfs_ioctl+0x4a5/0x5c0 [] ? do_audit_syscall_entry+0x66/0x70 [] ? syscall_trace_enter_phase1+0x11f/0x140 [] SyS_ioctl+0x79/0x90 [] do_syscall_64+0x62/0x110 [] entry_SYSCALL64_slow_path+0x25/0x25 >From this warning, freeze_super() already holds SB_FREEZE_FS, but btrfs_freeze() will call btrfs_commit_transaction() again, if btrfs_commit_transaction() finds that it has delayed iputs to handle, it'll start_transaction(), which will try to get SB_FREEZE_FS lock again, then deadlock occurs. The root cause is that in btrfs, sync_filesystem(sb) does not make sure all metadata is updated. There still maybe some codes adding delayed iputs, see below sample race window: CPU1 | CPU2 |-> freeze_super() | |-> sync_filesystem(sb); | | |-> cleaner_kthread() | | |-> btrfs_delete_unused_bgs() | | |-> btrfs_remove_chunk() | | |-> btrfs_remove_block_group() | | |-> btrfs_add_delayed_iput() | | |-> sb->s_writers.frozen = SB_FREEZE_FS; | |-> sb_wait_write(sb, SB_FREEZE_FS); | | acquire SB_FREEZE_FS lock. | | | |-> btrfs_freeze() | |-> btrfs_commit_transaction() | |-> btrfs_run_delayed_iputs() | | will handle delayed iputs, | | that means start_transaction() | | will be called, which will try | | to get SB_FREEZE_FS lock. | To fix this issue, in btrfs_commit_transaction(), if fs already is in SB_FREEZE_FS, we do not handle delayed iputs. Signed-off-by: Wang Xiaoguang--- fs/btrfs/transaction.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 948aa18..e6931d9 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -2277,8 +2277,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, kmem_cache_free(btrfs_trans_handle_cachep, trans); + /* +* If s_writers.frozen >= SB_FREEZE_FS, we can not handle delayed +* iputs, otherwise it'll result in deallock about SB_FREEZE_FS. +*/ if (current != root->fs_info->transaction_kthread && - current != root->fs_info->cleaner_kthread) + current != root->fs_info->cleaner_kthread && + root->fs_info->sb->s_writers.frozen >= SB_FREEZE_FS) btrfs_run_delayed_iputs(root); return ret; -- 2.9.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: Make RAID stripesize configurable
On Tue, 26 Jul 2016 11:14:37 -0600 Chris Murphywrote: > On Fri, Jul 22, 2016 at 8:58 AM, Austin S. Hemmelgarn > wrote: > > On 2016-07-22 09:42, Sanidhya Solanki wrote: > > >> +*stripesize=*;; > >> +Specifies the new stripe size > > It'd be nice to stop conflating stripe size and stripe element size as > if they're the same thing. I realize that LVM gets this wrong also, > and uses stripes to mean "data strips", and stripesize for stripe > element size. From a user perspective I find the inconsistency > annoying, users are always confused about these terms. > > So I think we need to pay the piper now, and use either strip size or > stripe element size for this. Stripe size is the data portion of a > full stripe read or write across all devices in the array. So right > now with a 64KiB stripe element size on Btrfs, the stripe size for a 4 > disk raid0 is 256KiB, and the stripe size for a 4 disk raid 5 is > 192KiB. I absolutely agree with the statement regarding the difference between those two separate settings. THis difference was more clearly visible pre-Dec 2015, when it was removed for code appearance reasons by commit ee22184b53c823f6956314c2815d4068e3820737 (at the end of the commit).I will update the documentation in the next patch to make it clear that the balance option affects stripe size directly and the stripe element size indirectly. > It's 64KiB right now. Why go so much smaller? > > mdadm goes from 4KiB to GiB's, with a 512KiB default. > > lvm goes from 4KiB to the physical extent size, which can be GiB's. > > I'm OK with an upper limit that's sane, maybe 16MiB? Hundreds of MiB's > or even GiB's seems a bit far fetched but other RAID tools on Linux > permit that. The reason for this limit is the fact that, as I noted above the real stripe size is currently 4KiB, with an element size of 64KiB. Ostensibly, we can change the stripe size to any 512B multiple that is less than 64KiB. Increasing it beyond 64KiB is risky because a lot of calculations (only the basis of which I modified for this patch, and not the dependencies of those algorithms and calculations) rely on the stripe element size being 64KiB. I do not want to increase this limit as it may lead to un-discovered bugs in the already buggy RAID 5/6 code. If this patch is accepted, I intend in the next few patches to do the following: -increase maximum stripe size to 64KiB, by reducing the number of blocks to 1 per stripe extent. -Update the documentation to notify user of this change and the need for caution, as well as trial and error, to find an appropriate size upto 64KiB, with a warning to only change it if users understand the consequences and reasons for the change, as suggested by ASH. -Clean up the RAID 5/6 recovery code and stripe code over the coming months. -Clean up the code that relies on calculations that depend on stripe size and their dependencies. -Remove this stripe size and stripe element size limitation completely, as suggested by both ASH and CMu. Just waiting on reviews and acceptance for this patch as the basis of the above work. I started on the RAID recovery code yesterday. It also appears that according to the commit that I stated above that the stripe size used to be 1KiB, with 64 elements per stripe element, but was changed in Dec 2015, so maybe as long as you do not change the stripe size to be more than 64KiB, you do not need to balance after using this balance option (atleast the first time). I do not remember seeing any bug reports on the mailing list since then that called out stripe size as the problem. Interesting. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html