Re: how long should "btrfs device delete missing ..." take?
On Sep 11, 2014, at 11:19 PM, Russell Coker wrote:

> It would be nice if a file system mounted ro counted as ro snapshots for
> btrfs send.
>
> When a file system is so messed up it can't be mounted rw it should be
> regarded as ro for all operations.

Yes, it's come up before, and there's a question whether mount -o ro is reliably ro enough for this. Maybe a force option?

But then another one is a recursive btrfs send to go along with the above. I might want them all, or I might want all of the ones in two particular subvolumes, etc. And even combine the recursive ro snapshot and recursive send as a btrfs rescue option that would work even if the volume is mounted read-only.

Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how long should "btrfs device delete missing ..." take?
Chris Murphy posted on Thu, 11 Sep 2014 20:10:26 -0600 as excerpted:

> Sure. But what's the next step? Given 260+ snapshots might mean well
> more than 350GB of data, depending on how deduplicated the fs is, it
> still probably would be faster to rsync this to a pile of drives in
> linear/concat+XFS than wait a month (?) for device delete to finish.

That was what I was getting at in my other just-finished short reply. It may be time to give up on the btrfs-specific solutions for the moment and go with tried and tested traditional solutions (tho I'd definitely *NOT* try rsync or the like with the delete still going; we know from other reports that rsync places its own stresses on btrfs, and one major stressor, the delete-triggered rebalance, at a time is bad enough).

> Alternatively, script some way to create 260+ ro snapshots to btrfs
> send/receive to a new btrfs volume; and turn it into a raid1 later.

No confirmation yet, but I strongly suspect most of those subvolumes are snapshots. Assuming that's the case, it's very likely most of them can simply be eliminated as I originally suggested, a process that /should/ be fast, simplifying the situation dramatically.

> I'm curious if a sysrq+s followed by sysrq+u might leave the filesystem
> in a state where it could still be rw mountable. But I'm skeptical of
> anything interrupting the device delete before being fully prepared for
> the fs to be toast for rw mount. If only ro mount is possible, any
> chance of creating ro snapshots is out.

In theory, that is, barring bugs, interrupting the delete with a normal shutdown to the extent possible, then sysrq+s, sysrq+u, should not be a problem. The delete is basically a balance, going chunk by chunk, and either a chunk has been duplicated to the new device or it hasn't. In either case, the existing chunk on the remaining old device shouldn't be affected. So rebooting in that way in order to stop the delete temporarily /should/ have no bad effects.
Of course, that's barring bugs. Btrfs is still not fully stabilized, and bugs do happen, so anything's possible. But I'd consider it safe enough to try here, certainly so if I had backups, as is still STRONGLY recommended for btrfs at this point, much more so than the routine sysadmin "if it's not backed up, by definition it's not valuable to you" rule.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: [PATCH v2] btrfs-progs: deal with conflict options for btrfs fi show
Hi Gui,

(2014/09/12 10:15), Gui Hecheng wrote:
> For btrfs fi show, -d|--all-devices & -m|--mounted will
> overwrite each other, so if specified both, let the user
> know that he should not use them at the same time.
>
> Signed-off-by: Gui Hecheng
> ---
> changelog:
> v1->v2: add option conflict descriptions to manpage and usage.
> ---
>  Documentation/btrfs-filesystem.txt |  9 ++---
>  cmds-filesystem.c                  | 12 ++--
>  2 files changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/btrfs-filesystem.txt b/Documentation/btrfs-filesystem.txt
> index c9c0b00..d3d2dcc 100644
> --- a/Documentation/btrfs-filesystem.txt
> +++ b/Documentation/btrfs-filesystem.txt
> @@ -20,15 +20,18 @@ SUBCOMMAND
>  *df* [...]::
>  Show space usage information for a mount point.
>
> -*show* [--mounted|--all-devices]::
> +*show* [-m|--mounted|-d|--all-devices]::

This line seems to be too long. Please see also the following thread.
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36270.html

Thanks,
Satoru

>  Show the btrfs filesystem with some additional info.
>  +
>  If no option nor ||| is passed, btrfs shows
>  information of all the btrfs filesystem both mounted and unmounted.
> -If '--mounted' is passed, it would probe btrfs kernel to list mounted btrfs
> +If '-m|--mounted' is passed, it would probe btrfs kernel to list mounted btrfs
>  filesystem(s);
> -If '--all-devices' is passed, all the devices under /dev are scanned;
> +If '-d|--all-devices' is passed, all the devices under /dev are scanned;
>  otherwise the devices list is extracted from the /proc/partitions file.
> +Don't combine -m|--mounted and -d|--all-devices, because these two options
> +will overwrite each other, and only one scan way will be adopted,
> +probe the kernel to scan or scan devices under /dev.
>
>  *sync* ::
>  Force a sync for the filesystem identified by .
> diff --git a/cmds-filesystem.c b/cmds-filesystem.c
> index 69c1ca5..51c4c55 100644
> --- a/cmds-filesystem.c
> +++ b/cmds-filesystem.c
> @@ -495,6 +495,7 @@ static const char * const cmd_show_usage[] = {
>  	"-d|--all-devices show only disks under /dev containing btrfs filesystem",
>  	"-m|--mounted show only mounted btrfs",
>  	"If no argument is given, structure of all present filesystems is shown.",
> +	"Don't combine -d|--all-devices and -m|--mounted, refer to manpage for details.",
>  	NULL
>  };
>
> @@ -526,16 +527,23 @@ static int cmd_show(int argc, char **argv)
>  			break;
>  		switch (c) {
>  		case 'd':
> -			where = BTRFS_SCAN_PROC;
> +			where &= ~BTRFS_SCAN_LBLKID;
> +			where |= BTRFS_SCAN_PROC;
>  			break;
>  		case 'm':
> -			where = BTRFS_SCAN_MOUNTED;
> +			where &= ~BTRFS_SCAN_LBLKID;
> +			where |= BTRFS_SCAN_MOUNTED;
>  			break;
>  		default:
>  			usage(cmd_show_usage);
>  		}
>  	}
>
> +	if ((where & BTRFS_SCAN_PROC) && (where & BTRFS_SCAN_MOUNTED)) {
> +		fprintf(stderr, "Don't use -d|--all-devices and -m|--mounted options at the same time.\n");
> +		usage(cmd_show_usage);
> +	}
> +
>  	if (check_argc_max(argc, optind + 1))
>  		usage(cmd_show_usage);
Re: how long should "btrfs device delete missing ..." take?
Chris Murphy posted on Thu, 11 Sep 2014 20:10:26 -0600 as excerpted:

> And then when I think about just creating a new fs, using btrfs
> send/receive, the snapshots need to be ro first.

FWIW, at this point I'd forget about send/receive and create the backup (assuming one doesn't exist already) using more normal methods. At least if the original send/receive hasn't yet been done, so it'd be copying off all the data anyway. Perhaps mount selected snapshots and back them up too (after the current case is backed up), but throw away most of the snapshots.

Of course, if there's an existing relatively current sent/received base to build on, and no indication that send/receive is broken, definitely try that first, as the amount of data to sync in that case should be MUCH lower. But if not...

--
Duncan - List replies preferred. No HTML msgs.
Re: how long should "btrfs device delete missing ..." take?
Russell Coker posted on Fri, 12 Sep 2014 15:19:04 +1000 as excerpted:

> It would be nice if a file system mounted ro counted as ro snapshots for
> btrfs send.
>
> When a file system is so messed up it can't be mounted rw it should be
> regarded as ro for all operations.

Indeed, and that has been suggested before, but unfortunately it's not current behavior.

--
Duncan - List replies preferred. No HTML msgs.
Re: how long should "btrfs device delete missing ..." take?
It would be nice if a file system mounted ro counted as ro snapshots for btrfs send.

When a file system is so messed up it can't be mounted rw it should be regarded as ro for all operations.

--
Sent from my Samsung Galaxy Note 2 with K-9 Mail.
Re: how long should "btrfs device delete missing ..." take?
On Sep 11, 2014, at 5:51 PM, Duncan <1i5t5.dun...@cox.net> wrote:

> Chris Murphy posted on Thu, 11 Sep 2014 15:25:51 -0600 as excerpted:
>
>> On Sep 11, 2014, at 1:31 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>>>
>>> I wouldn't try defragging now, but it might be worthwhile to stop the
>>> device delete (rebooting to do so since I don't think there's a cancel)
>>
>> 'btrfs replace cancel' does exist, although I haven't tried it.
>
> Btrfs replace cancel exists, yes, but does it work for btrfs device
> delete, which is what the OP used?

Oops, right! I'm not sure what can do this safely.

And then when I think about just creating a new fs, using btrfs send/receive, the snapshots need to be ro first. So if there's any uncertainty about safely canceling the 'device delete', those ro snapshots need to be taken first, in the event only an ro mount is possible subsequently. And then there's some uncertainty how long 260+ ro snapshots will take (should be fast, but) and how much worse that makes the current situation. But it's probably worth the risk to take the snapshots and just wait a while before trying something like umount or sysrq+s followed by sysrq+u.

>> Something isn't right though, because it's clearly neither reading nor
>> writing at anywhere close to 1/2 the drive read throughput. I'm curious
>> what 'iotop -d30 -o' shows (during the replace, before cancel), which
>> should be pretty consistent by averaging 30 seconds worth of io. And
>> then try 'iotop -d3 -o' and see if there are spikes. I'm willing to bet
>> there's a lot of nothing going on, with occasional spikes, rather than a
>> constant trickle.
>>
>> And then the question is to find out what btrfs is thinking about while
>> nothing is reading or writing. Even though it's not 5000+ snapshots, I
>> wonder if the balance code (and hence btrfs replace) makes extensive use
>> of fiemap that's causing this to go catatonic.
>> http://comments.gmane.org/gmane.comp.file-systems.btrfs/35724
>
> Not sure (some of that stuff's beyond me), but one thing we /do/ know is
> that btrfs has so far been focused mostly on features and debugging, not
> on optimization beyond the worst-cases, which themselves remain a big
> enough problem, tho it's slowly getting better.

Sure. But what's the next step? Given 260+ snapshots might mean well more than 350GB of data, depending on how deduplicated the fs is, it still probably would be faster to rsync this to a pile of drives in linear/concat+XFS than wait a month (?) for device delete to finish.

Alternatively, script some way to create 260+ ro snapshots to btrfs send/receive to a new btrfs volume, and turn it into a raid1 later.

I'm curious if a sysrq+s followed by sysrq+u might leave the filesystem in a state where it could still be rw mountable. But I'm skeptical of anything interrupting the device delete before being fully prepared for the fs to be toast for rw mount. If only ro mount is possible, any chance of creating ro snapshots is out.

Chris Murphy
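[Editor's sketch] The "script some way to create 260+ ro snapshots to btrfs send/receive" idea above can be sketched as a dry run: parse `btrfs sub list` output, then emit the `snapshot -r` / `send | receive` command pairs. This only prints the commands rather than executing them; the mount points, the two-line subvolume listing, and the `ro-` naming scheme are all hypothetical stand-ins, not anything from the thread.

```python
# Dry-run sketch: build (not run) the per-subvolume commands.
# SUB_LIST stands in for `btrfs sub list /mnt/old` output.
SUB_LIST = """\
ID 257 gen 101 top level 5 path home
ID 258 gen 102 top level 5 path home/daily-1
"""

MNT, DEST = "/mnt/old", "/mnt/new"   # hypothetical mount points

cmds = []
for line in SUB_LIST.splitlines():
    path = line.split("path ", 1)[1]          # subvolume path field
    safe = path.replace("/", "_")             # flatten nested paths
    # read-only snapshot first: btrfs send requires an ro source
    cmds.append(f"btrfs subvolume snapshot -r {MNT}/{path} {MNT}/ro-{safe}")
    cmds.append(f"btrfs send {MNT}/ro-{safe} | btrfs receive {DEST}")

for c in cmds:
    print(c)
```

A real run would also want error handling and to skip subvolumes that are already read-only snapshots; this just shows the loop shape.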
Re: [PATCH] btrfs: fix a overflowing boundary writing in csum_tree_block
Do you mean we can handle it like below? I don't think that is better: if we do that, the csum size can never be expanded, and the handling code in csum_tree_block becomes redundant. If you do not want to truncate in the first patch, I think we can try to avoid it:

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 22a3676..baed6c0 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -879,6 +879,7 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
 	u64 subvol_objectid = 0;
 	u64 subvol_rootid = 0;
 	int error = 0;
+	u16 csum_size;
 
 	if (!(flags & MS_RDONLY))
 		mode |= FMODE_WRITE;
@@ -974,6 +975,13 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
 		return root;
 	}
 
+	csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
+	if (csum_size > 4) {
+		printk(KERN_ERR "btrfs: csum size is larger than 4\n");
+		deactivate_locked_super(s);
+		return ERR_PTR(-EINVAL);
+	}
+
 	return root;
 
 error_close_devices:

On Thu, Sep 11, 2014 at 10:02 PM, Chris Mason wrote:
> On 09/09/2014 05:19 AM, rongqing...@windriver.com wrote:
>> From: Li RongQing
>>
>> It is impossible that csum_size is larger than sizeof(long), but the codes
>> still add the handler for this condition, like allocate new memory, for
>> extension. If it becomes true someday, copying csum_size size memory to local
>> 32bit variable found and val will overflow these two variables.
>>
>> Fix it by returning the max 4 byte checksum.
>
> Thanks for the patch. I'd rather not silently truncate the copy down
> though. How about a one time check at mount to make sure the value in
> the super is reasonable?
>
> -chris
[PATCH v2] btrfs-progs: deal with conflict options for btrfs fi show
For btrfs fi show, -d|--all-devices & -m|--mounted will overwrite each other, so if specified both, let the user know that he should not use them at the same time.

Signed-off-by: Gui Hecheng
---
changelog:
v1->v2: add option conflict descriptions to manpage and usage.
---
 Documentation/btrfs-filesystem.txt |  9 ++---
 cmds-filesystem.c                  | 12 ++--
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/Documentation/btrfs-filesystem.txt b/Documentation/btrfs-filesystem.txt
index c9c0b00..d3d2dcc 100644
--- a/Documentation/btrfs-filesystem.txt
+++ b/Documentation/btrfs-filesystem.txt
@@ -20,15 +20,18 @@ SUBCOMMAND
 *df* [...]::
 Show space usage information for a mount point.
 
-*show* [--mounted|--all-devices]::
+*show* [-m|--mounted|-d|--all-devices]::
 Show the btrfs filesystem with some additional info.
 +
 If no option nor ||| is passed, btrfs shows
 information of all the btrfs filesystem both mounted and unmounted.
-If '--mounted' is passed, it would probe btrfs kernel to list mounted btrfs
+If '-m|--mounted' is passed, it would probe btrfs kernel to list mounted btrfs
 filesystem(s);
-If '--all-devices' is passed, all the devices under /dev are scanned;
+If '-d|--all-devices' is passed, all the devices under /dev are scanned;
 otherwise the devices list is extracted from the /proc/partitions file.
+Don't combine -m|--mounted and -d|--all-devices, because these two options
+will overwrite each other, and only one scan way will be adopted,
+probe the kernel to scan or scan devices under /dev.
 
 *sync* ::
 Force a sync for the filesystem identified by .
diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 69c1ca5..51c4c55 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -495,6 +495,7 @@ static const char * const cmd_show_usage[] = {
 	"-d|--all-devices show only disks under /dev containing btrfs filesystem",
 	"-m|--mounted show only mounted btrfs",
 	"If no argument is given, structure of all present filesystems is shown.",
+	"Don't combine -d|--all-devices and -m|--mounted, refer to manpage for details.",
 	NULL
 };
 
@@ -526,16 +527,23 @@ static int cmd_show(int argc, char **argv)
 			break;
 		switch (c) {
 		case 'd':
-			where = BTRFS_SCAN_PROC;
+			where &= ~BTRFS_SCAN_LBLKID;
+			where |= BTRFS_SCAN_PROC;
 			break;
 		case 'm':
-			where = BTRFS_SCAN_MOUNTED;
+			where &= ~BTRFS_SCAN_LBLKID;
+			where |= BTRFS_SCAN_MOUNTED;
 			break;
 		default:
 			usage(cmd_show_usage);
 		}
 	}
 
+	if ((where & BTRFS_SCAN_PROC) && (where & BTRFS_SCAN_MOUNTED)) {
+		fprintf(stderr, "Don't use -d|--all-devices and -m|--mounted options at the same time.\n");
+		usage(cmd_show_usage);
+	}
+
 	if (check_argc_max(argc, optind + 1))
 		usage(cmd_show_usage);
-- 
1.8.1.4
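[Editor's sketch] The patched option loop reduces to simple bitmask logic, which a short Python model makes easy to see: each of -d and -m clears the default blkid-scan bit and sets its own, so if both end up set only one scan method could actually win. The flag names mirror the patch; the numeric values here are hypothetical, and this is a model of the logic, not btrfs-progs code.

```python
# Hypothetical flag values; only their distinctness matters.
BTRFS_SCAN_LBLKID  = 1 << 0   # default: scan via blkid
BTRFS_SCAN_PROC    = 1 << 1   # -d: scan devices under /dev
BTRFS_SCAN_MOUNTED = 1 << 2   # -m: ask the kernel for mounted fs

def parse_show_options(opts):
    """Mirror the patched cmd_show() option handling: return the scan
    mask, or bail out on the conflicting -d/-m combination."""
    where = BTRFS_SCAN_LBLKID
    for opt in opts:
        if opt in ("-d", "--all-devices"):
            where &= ~BTRFS_SCAN_LBLKID
            where |= BTRFS_SCAN_PROC
        elif opt in ("-m", "--mounted"):
            where &= ~BTRFS_SCAN_LBLKID
            where |= BTRFS_SCAN_MOUNTED
    if where & BTRFS_SCAN_PROC and where & BTRFS_SCAN_MOUNTED:
        raise SystemExit("Don't use -d|--all-devices and -m|--mounted "
                         "options at the same time.")
    return where

print(parse_show_options([]) == BTRFS_SCAN_LBLKID)
print(parse_show_options(["-d"]) == BTRFS_SCAN_PROC)
```

Before the patch, the plain assignments (`where = BTRFS_SCAN_PROC`) meant the last option silently won; the mask-and-check version surfaces the conflict instead.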
Re: how long should "btrfs device delete missing ..." take?
Chris Murphy posted on Thu, 11 Sep 2014 15:25:51 -0600 as excerpted:

> On Sep 11, 2014, at 1:31 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>>
>> I wouldn't try defragging now, but it might be worthwhile to stop the
>> device delete (rebooting to do so since I don't think there's a cancel)
>
> 'btrfs replace cancel' does exist, although I haven't tried it.

Btrfs replace cancel exists, yes, but does it work for btrfs device delete, which is what the OP used?

> Something isn't right though, because it's clearly neither reading nor
> writing at anywhere close to 1/2 the drive read throughput. I'm curious
> what 'iotop -d30 -o' shows (during the replace, before cancel), which
> should be pretty consistent by averaging 30 seconds worth of io. And
> then try 'iotop -d3 -o' and see if there are spikes. I'm willing to bet
> there's a lot of nothing going on, with occasional spikes, rather than a
> constant trickle.
>
> And then the question is to find out what btrfs is thinking about while
> nothing is reading or writing. Even though it's not 5000+ snapshots, I
> wonder if the balance code (and hence btrfs replace) makes extensive use
> of fiemap that's causing this to go catatonic.
> http://comments.gmane.org/gmane.comp.file-systems.btrfs/35724

Not sure (some of that stuff's beyond me), but one thing we /do/ know is that btrfs has so far been focused mostly on features and debugging, not on optimization beyond the worst-cases, which themselves remain a big enough problem, tho it's slowly getting better.

--
Duncan - List replies preferred. No HTML msgs.
Re: how long should "btrfs device delete missing ..." take?
> After a disk died and was replaced, "btrfs device delete missing" is
> taking more than 10 days on an otherwise idle server:

> Something isn't right though, because it's clearly neither reading nor
> writing at anywhere close to 1/2 the drive read throughput. I'm curious
> what 'iotop -d30 -o' shows (during the replace, before cancel), which
> should be pretty consistent by averaging 30 seconds worth of io. And
> then try 'iotop -d3 -o' and see if there are spikes. I'm willing to bet
> there's a lot of nothing going on, with occasional spikes, rather than
> a constant trickle.

That's more or less what I'm seeing with both. The numbers will go up or down slightly, but it's counted in kilobytes per second:

Total DISK READ: 0.00 B/s | Total DISK WRITE: 545.82 B/s
  TID  PRIO  USER  DISK READ  DISK WRITE  SWAPIN  IO>     COMMAND
  940  be/3  root   0.00 B/s  136.46 B/s  0.00 %  0.10 %  [jbd2/md2-8]
 4714  be/4  root   0.00 B/s  329.94 K/s  0.00 %  0.00 %  [btrfs-transacti]
25534  be/4  root   0.00 B/s  402.97 K/s  0.00 %  0.00 %  [kworker/u16:0]

The bottleneck may be here - one CPU core is mostly 100% busy (kworker). Not sure what it's really busy with though:

  PID  USER  PRI  NI  VIRT  RES  SHR  S  CPU%  MEM%  TIME+     Command
25546  root   20   0     0    0    0  R  93.0   0.0  18:22.94  kworker/u16:7
14473  root   20   0     0    0    0  S   5.0   0.0  25:00.14  kworker/0:0

[912979.063432] SysRq : Show Blocked State
[912979.063485]   task                        PC stack   pid father
[912979.063545] btrfs           D 88083fa515c0     0  4793   4622 0x
[912979.063601]  88061a29b878 0086 88003683e040
[912979.063701]  000115c0 4000 880813e3 88003683e040
[912979.063800]  88061a29b7e8 8105d8e9 88083fa4 88083fa115c0
[912979.063899] Call Trace:
[912979.063951]  [] ? enqueue_task_fair+0x3e5/0x44f
[912979.064006]  [] ? resched_curr+0x47/0x57
[912979.064058]  [] ? check_preempt_curr+0x3e/0x6d
[912979.064111]  [] ? ttwu_do_wakeup+0x12/0x7f
[912979.064164]  [] ? ttwu_do_activate.constprop.74+0x57/0x5c
[912979.064220]  [] schedule+0x65/0x67
[912979.064272]  [] schedule_timeout+0x26/0x198
[912979.064324]  [] ? wake_up_process+0x31/0x35
[912979.064378]  [] ? wake_up_worker+0x1f/0x21
[912979.064431]  [] ? insert_work+0x87/0x94
[912979.064493]  [] ? free_block_list+0x1f/0x34 [btrfs]
[912979.064548]  [] wait_for_common+0x10d/0x13e
[912979.064600]  [] ? try_to_wake_up+0x251/0x251
[912979.064653]  [] wait_for_completion+0x18/0x1a
[912979.064710]  [] btrfs_async_run_delayed_refs+0xc1/0xe4 [btrfs]
[912979.064814]  [] __btrfs_end_transaction+0x2bb/0x2e1 [btrfs]
[912979.064916]  [] btrfs_end_transaction_throttle+0xe/0x10 [btrfs]
[912979.065020]  [] relocate_block_group+0x2ad/0x4de [btrfs]
[912979.065079]  [] btrfs_relocate_block_group+0x158/0x278 [btrfs]
[912979.065184]  [] btrfs_relocate_chunk.isra.62+0x58/0x5f7 [btrfs]
[912979.065286]  [] ? btrfs_set_lock_blocking_rw+0x68/0x95 [btrfs]
[912979.065387]  [] ? btrfs_set_path_blocking+0x23/0x54 [btrfs]
[912979.065486]  [] ? btrfs_search_slot+0x7bc/0x816 [btrfs]
[912979.065546]  [] ? free_extent_buffer+0x6f/0x7c [btrfs]
[912979.065605]  [] btrfs_shrink_device+0x23c/0x3a5 [btrfs]
[912979.065679]  [] btrfs_rm_device+0x2a1/0x759 [btrfs]
[912979.065747]  [] btrfs_ioctl+0xa52/0x227f [btrfs]
[912979.065811]  [] ? putname+0x23/0x2c
[912979.065863]  [] ? user_path_at_empty+0x60/0x90
[912979.065918]  [] ? avc_has_perm+0x2e/0xf7
[912979.065978]  [] ? __vm_enough_memory+0x25/0x13c
[912979.066032]  [] do_vfs_ioctl+0x3f2/0x43c
[912979.066084]  [] ? vfs_stat+0x16/0x18
[912979.066136]  [] SyS_ioctl+0x4e/0x7d
[912979.066188]  [] ? do_page_fault+0xc/0xf
[912979.066240]  [] system_call_fastpath+0x16/0x1b
[912979.066296] Sched Debug Version: v0.11, 3.17.0-rc3 #1
[912979.066347] ktime : 913460840.666210
[912979.066401] sched_clk : 912979066.295474
[912979.066454] cpu_clk : 912979066.295485
[912979.066507] jiffies : 4386283381
[912979.066560] sched_clock_stable(): 1
[912979.066610]
[912979.066656] sysctl_sched
[912979.066703]  .sysctl_sched_latency            : 24.00
[912979.066756]  .sysctl_sched_min_granularity    : 3.00
[912979.066809]  .sysctl_sched_wakeup_granularity : 4.00
[912979.066861]  .sysctl_sched_child_runs_first   : 0
[912979.066913]  .sysctl_sched_features           : 11899
[912979.068196]  .sysctl_sched_tunable_scaling    : 1 (logaritmic)
[912979.068248]
[912979.068294] cpu#0, 3411.379 MHz
[912979.068342]  .nr_running      : 0
[912979.068393]  .load            : 0
[912979.068443]  .nr_switches     : 374480347
[912979.068495]  .nr_load_updates : 562
Re: [PATCH v2] btrfs: LLVMLinux: Remove VLAIS
On 09/11/14 09:34, Chris Mason wrote:
> On 09/05/2014 06:58 PM, beh...@converseincode.com wrote:
>> From: Vinícius Tinti
>>
>> Replaced the use of a Variable Length Array In Struct (VLAIS) with a
>> C99 compliant equivalent. This is the original VLAIS struct.
>>
>>   struct {
>>     struct shash_desc shash;
>>     char ctx[crypto_shash_descsize(tfm)];
>>   } desc;
>>
>> This patch instead allocates the appropriate amount of memory using a
>> char array. The new code can be compiled with both gcc and clang.
>>
>> struct shash_desc contains a flexible array member ctx declared with
>> CRYPTO_MINALIGN_ATTR, so sizeof(struct shash_desc) aligns the beginning
>> of the array declared after struct shash_desc with long long. No
>> trailing padding is required because it is not a struct type that can
>> be used in an array. The CRYPTO_MINALIGN_ATTR is required so that desc
>> is aligned with long long, as would be the case for a struct containing
>> a member with CRYPTO_MINALIGN_ATTR.
>
> We copied this from one of the other crypto api users, and I'm sure
> cooking up all these patches was not a great way to spend your
> afternoon. But, can I talk you into making a helper macro of some kind
> for this and putting it into the crypto api headers? Honestly this
> setup seems really error prone.

You're not the only person to ask for this. I've got one already coded up based on suggestions from tglx. I've regenerated all the patches. I will resubmit soon.

Will fix.

Behan

--
Behan Webster
beh...@converseincode.com
Re: how long should "btrfs device delete missing ..." take?
On Sep 11, 2014, at 1:31 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>
> I wouldn't try defragging now, but it might be worthwhile to stop the
> device delete (rebooting to do so since I don't think there's a cancel)

'btrfs replace cancel' does exist, although I haven't tried it.

Something isn't right though, because it's clearly neither reading nor writing at anywhere close to 1/2 the drive read throughput. I'm curious what 'iotop -d30 -o' shows (during the replace, before cancel), which should be pretty consistent by averaging 30 seconds worth of io. And then try 'iotop -d3 -o' and see if there are spikes. I'm willing to bet there's a lot of nothing going on, with occasional spikes, rather than a constant trickle.

And then the question is to find out what btrfs is thinking about while nothing is reading or writing. Even though it's not 5000+ snapshots, I wonder if the balance code (and hence btrfs replace) makes extensive use of fiemap that's causing this to go catatonic.
http://comments.gmane.org/gmane.comp.file-systems.btrfs/35724

Chris Murphy
Re: how long should "btrfs device delete missing ..." take?
Tomasz Chmielewski posted on Thu, 11 Sep 2014 17:22:15 +0200 as excerpted:

> After a disk died and was replaced, "btrfs device delete missing" is
> taking more than 10 days on an otherwise idle server:
>
> # btrfs fi show /home
> Label: none  uuid: 84d087aa-3a32-46da-844f-a233237cf04f
>         Total devices 3 FS bytes used 362.44GiB
>         devid    2 size 1.71TiB used 365.03GiB path /dev/sdb4
>         devid    3 size 1.71TiB used 58.00GiB path /dev/sda4
>         *** Some devices missing
>
> Btrfs v3.16
>
> So far, it has copied 58 GB out of 365 GB - and it took 10 days. At this
> speed, the whole operation will take 2-3 months (assuming that the only
> healthy disk doesn't die in the meantime).
> Is this expected time for btrfs RAID-1?

Device delete definitely takes time. For the sub-TiB usage shown above, 10 days for 58 GiB out of 350+ does seem excessive, but there are extreme cases where it isn't entirely out of line. See below.

> There are no errors in dmesg/smart, performance of both disks is fine:
>
> # btrfs sub list /home | wc -l
> 260
>
> I've tried running this on the latest 3.16.x kernel earlier, but since
> the progress was so slow, rebooted after about a week to see if the
> latest RC will be any faster.

The good thing is that once a block group is copied over, it should be fine and won't need re-copying if the process is stopped over a reboot and restarted on a new kernel, etc. The bad thing is that, if I'm interpreting your report correctly, that likely means 7+10=17 days for that 58 gig. =:^(

Questions:

* Presumably most of those 260 subvolumes are snapshots, correct? What was your snapshotting schedule, and did you have old-snapshot cleanup-deletion scheduled as well?

* Do you run with autodefrag, or was the system otherwise regularly defragged?

* Do you have large (GiB-plus) database or virtual machine image files on that filesystem? If so, had you properly set the NOCOW file attribute (chattr +C) on them, and were they on dedicated subvolumes?
200+ snapshots is somewhat high and could be part of the issue, tho it's nothing like the extremes (thousands) we've seen posted in the past. Were it me, I'd have tried deleting as many as possible before the device delete missing, in order to simplify the process and eliminate as much "extra" data as possible.

The real issue is going to be fragmentation, on spinning-rust drives. Run filefrag on some of your gig-plus files that get written to frequently (vm-images and database files are the classic cases) and see how many extents are reported. (Tho note that filefrag doesn't understand btrfs compression and won't be accurate in that case, and also that due to the btrfs data chunk size of 1 GiB, that's the maximum extent size, so multi-gig files will typically be two extents more than the number of gigs: filling up the current chunk, N whole-gig chunks, the file tail.)

The nocow file attribute (which must be set while the file is zero-sized to be effective, see discussion elsewhere) can help with that, but snapshotting an actively-being-rewritten nocow file more or less defeats the purpose of nocow, since the snapshot locks in place the existing data and the first rewrite to a block must then be cowed anyway. But putting those files on dedicated subvolumes and then not snapshotting those subvolumes is a workaround.

I wouldn't try defragging now, but it might be worthwhile to stop the device delete (rebooting to do so, since I don't think there's a cancel) and delete as many snapshots as possible. That should help matters. Additionally, if you have recent backups of highly fragmented files such as the VM images and DBs I mentioned, you might consider simply deleting them, thus eliminating that fragment processing from the device delete. I don't know that making a backup now would go much faster than the device delete, however, so I don't know whether to recommend that or not.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: set inode's logged_trans/last_log_commit after ranged fsync
When a ranged fsync finishes, if there are still extent maps in the modified list, still set the inode's logged_trans and last_log_commit. This is important in case an inode is fsync'ed and unlinked in the same transaction, to ensure its inode ref gets deleted from the log and the respective dentries in its parent are deleted too from the log (if the parent directory was fsync'ed in the same transaction).

Instead make btrfs_inode_in_log() return false if the list of modified extent maps isn't empty.

This is an incremental on top of the v4 version of the patch "Btrfs: fix fsync data loss after a ranged fsync", which was added to its v5 but didn't make it on time.

Signed-off-by: Filipe Manana
---
 fs/btrfs/btrfs_inode.h | 13 +++++++++++--
 fs/btrfs/tree-log.c    | 14 ++------------
 2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 74ff403..3511031 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -246,8 +246,17 @@ static inline int btrfs_inode_in_log(struct inode *inode, u64 generation)
 	    BTRFS_I(inode)->last_sub_trans <=
 	    BTRFS_I(inode)->last_log_commit &&
 	    BTRFS_I(inode)->last_sub_trans <=
-	    BTRFS_I(inode)->root->last_log_commit)
-		return 1;
+	    BTRFS_I(inode)->root->last_log_commit) {
+		/*
+		 * After a ranged fsync we might have left some extent maps
+		 * (that fall outside the fsync's range). So return false
+		 * here if the list isn't empty, to make sure btrfs_log_inode()
+		 * will be called and process those extent maps.
+		 */
+		smp_mb();
+		if (list_empty(&BTRFS_I(inode)->extent_tree.modified_extents))
+			return 1;
+	}
 	return 0;
 }
 
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 82db14f..d7c1459 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4090,18 +4090,8 @@ log_extents:
 		}
 	}
 
-	write_lock(&em_tree->lock);
-	/*
-	 * If we're doing a ranged fsync and there are still modified extents
-	 * in the list, we must run on the next fsync call as it might cover
-	 * those extents (a full fsync or an fsync for other range).
-	 */
-	if (list_empty(&em_tree->modified_extents)) {
-		BTRFS_I(inode)->logged_trans = trans->transid;
-		BTRFS_I(inode)->last_log_commit =
-			BTRFS_I(inode)->last_sub_trans;
-	}
-	write_unlock(&em_tree->lock);
+	BTRFS_I(inode)->logged_trans = trans->transid;
+	BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->last_sub_trans;
 out_unlock:
 	if (unlikely(err))
 		btrfs_put_logged_extents(&logged_list);
-- 
1.9.1
Re: btrfs listing is wrong
Mark Murawski posted on Thu, 11 Sep 2014 10:45:58 -0400 as excerpted:

> Label: 'Root' uuid: d71404d4-468e-47d5-8f06-3b65fa7776aa
> 	Total devices 2 FS bytes used 7.46GiB
> 	devid 1 size 9.31GiB used 8.06GiB path /dev/sdh6
> 	devid 3 size 9.31GiB used 8.06GiB path /dev/disk/by-uuid/d71404d4-468e-47d5-8f06-3b65fa7776aa
>
> # ls -al /dev/disk/by-uuid/d71404d4-468e-47d5-8f06-3b65fa7776aa
> lrwxrwxrwx 1 root root 10 Sep 11 10:43 /dev/disk/by-uuid/d71404d4-468e-47d5-8f06-3b65fa7776aa -> ../../sdh6
>
> devid 3 should really be showing /dev/sde6

I believe that's an instance of a known bug (or, depending on how you look at it, possibly two) with a recently posted patch, but I doubt it has made it to mainline just yet. The fix makes the tools resolve device names to their canonical paths. They don't do that at this point, and the paths can change, sometimes showing the wrong one for certain devices.

I'm not sure the fix will make it into this kernel cycle, however. It should make it into 3.18 in any case, and will likely be stable-backported after that.

--
Duncan - List replies preferred. No HTML msgs.
Re: Deadlock with 3.15.10
Hi Liu,

>> I've recently run into a deadlock on 3.15.10, don't know if the kernel
>> stack-trace is useful: http://pastebin.com/guQNDAMX
>
> FYI, this has been fixed by https://patchwork.kernel.org/patch/4727711/

Thanks for letting me know.

- Clemens
Re: [PATCH 12/18] Btrfs: Fix misuse of chunk mutex
On 09/03/2014 09:35 AM, Miao Xie wrote:
> There were several problems with chunk mutex usage:
>
> - Locking the chunk mutex when updating metadata. This could cause a nested
>   deadlock, because updating metadata might need to allocate new chunks,
>   which requires acquiring the chunk mutex. We remove the chunk mutex in
>   this case, because the b-tree lock and other lock mechanisms can help us.
> - An ABBA deadlock occurred between device_list_mutex and chunk_mutex.
>   When we update device status, we must acquire device_list_mutex at the
>   beginning, and then we might take chunk_mutex during the device status
>   update because we need to allocate new chunks for metadata COW. But in
>   most places we acquire chunk_mutex first and then the device list mutex.
>   We need to change the lock order.
> - In some places we needn't acquire chunk_mutex at all. For example, we
>   needn't take chunk_mutex when we free an empty seed fs_devices structure.
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9f22398d..357f911 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
>
> @@ -2619,10 +2615,23 @@ static int btrfs_relocate_chunk(struct btrfs_root *root,
> 	map = (struct map_lookup *)em->bdev;
>
> 	for (i = 0; i < map->num_stripes; i++) {
> -		ret = btrfs_free_dev_extent(trans, map->stripes[i].dev,
> -					    map->stripes[i].physical);
> +		device = map->stripes[i].dev;
> +		ret = btrfs_free_dev_extent(trans, device,
> +					    map->stripes[i].physical,
> +					    &dev_extent_len);
> 		BUG_ON(ret);

gcc is worried that dev_extent_len may be used uninitialized here. The BUG_ON makes it unlikely we'll notice, but I set it to zero in my version here.

-chris
btrfs listing is wrong
Label: 'Root' uuid: d71404d4-468e-47d5-8f06-3b65fa7776aa
	Total devices 2 FS bytes used 7.46GiB
	devid 1 size 9.31GiB used 8.06GiB path /dev/sdh6
	devid 3 size 9.31GiB used 8.06GiB path /dev/disk/by-uuid/d71404d4-468e-47d5-8f06-3b65fa7776aa

# ls -al /dev/disk/by-uuid/d71404d4-468e-47d5-8f06-3b65fa7776aa
lrwxrwxrwx 1 root root 10 Sep 11 10:43 /dev/disk/by-uuid/d71404d4-468e-47d5-8f06-3b65fa7776aa -> ../../sdh6

devid 3 should really be showing /dev/sde6
how long should "btrfs device delete missing ..." take?
After a disk died and was replaced, "btrfs device delete missing" is taking more than 10 days on an otherwise idle server:

# btrfs fi show /home
Label: none  uuid: 84d087aa-3a32-46da-844f-a233237cf04f
	Total devices 3 FS bytes used 362.44GiB
	devid 2 size 1.71TiB used 365.03GiB path /dev/sdb4
	devid 3 size 1.71TiB used 58.00GiB path /dev/sda4
	*** Some devices missing

Btrfs v3.16

So far, it has copied 58 GB out of 365 GB - and it took 10 days. At this speed, the whole operation will take 2-3 months (assuming that the only healthy disk doesn't die in the meantime). Is this the expected time for btrfs RAID-1?

There are no errors in dmesg/smart, and the performance of both disks is fine:

# hdparm -t /dev/sda /dev/sdb
/dev/sda:
 Timing buffered disk reads: 442 MB in 3.01 seconds = 146.99 MB/sec
/dev/sdb:
 Timing buffered disk reads: 402 MB in 3.39 seconds = 118.47 MB/sec

# btrfs fi df /home
Data, RAID1: total=352.00GiB, used=351.02GiB
System, RAID1: total=32.00MiB, used=96.00KiB
Metadata, RAID1: total=13.00GiB, used=11.38GiB
unknown, single: total=512.00MiB, used=67.05MiB

# btrfs sub list /home | wc -l
260

# uptime
 17:21:53 up 10 days, 6:01, 2 users, load average: 3.22, 3.53, 3.55

I tried running this on the latest 3.16.x kernel earlier, but since the progress was so slow, rebooted after about a week to see if the latest RC would be any faster.

--
Tomasz Chmielewski
http://www.sslrack.com
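For what it's worth, the figures in the report above (58 GiB copied in 10 days, 365 GiB allocated) are enough for a rough completion estimate with integer shell arithmetic:

```shell
copied_gib=58        # progress so far, per the btrfs fi show output
elapsed_days=10
total_gib=365        # space used on the remaining device
# ETA scaled linearly: total / (copied / elapsed), kept in integer math
eta_days=$(( total_gib * elapsed_days / copied_gib ))
echo "estimated total runtime: ${eta_days} days"
```

That works out to roughly two months, consistent with the 2-3 month estimate in the message.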
Re: [PATCH v2] btrfs: LLVMLinux: Remove VLAIS
On 09/05/2014 06:58 PM, beh...@converseincode.com wrote:
> From: Vinícius Tinti
>
> Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
> compliant equivalent. This is the original VLAIS struct:
>
> struct {
> 	struct shash_desc shash;
> 	char ctx[crypto_shash_descsize(tfm)];
> } desc;
>
> This patch instead allocates the appropriate amount of memory using a char
> array.
>
> The new code can be compiled with both gcc and clang.
>
> struct shash_desc contains a flexible array member ctx declared with
> CRYPTO_MINALIGN_ATTR, so sizeof(struct shash_desc) aligns the beginning
> of the array declared after struct shash_desc with long long.
>
> No trailing padding is required because it is not a struct type that can
> be used in an array.
>
> The CRYPTO_MINALIGN_ATTR is required so that desc is aligned with long long,
> as would be the case for a struct containing a member with
> CRYPTO_MINALIGN_ATTR.

We copied this from one of the other crypto api users, and I'm sure cooking up all these patches was not a great way to spend your afternoon. But, can I talk you into making a helper macro of some kind for this and putting it into the crypto api headers?

Honestly this setup seems really error prone.
-chris > > Signed-off-by: Jan-Simon Möller > Signed-off-by: Behan Webster > Signed-off-by: Vinícius Tinti > Signed-off-by: Mark Charlebois > --- > fs/btrfs/hash.c | 18 +- > 1 file changed, 9 insertions(+), 9 deletions(-) > > diff --git a/fs/btrfs/hash.c b/fs/btrfs/hash.c > index 85889aa..a62743f 100644 > --- a/fs/btrfs/hash.c > +++ b/fs/btrfs/hash.c > @@ -33,18 +33,18 @@ void btrfs_hash_exit(void) > > u32 btrfs_crc32c(u32 crc, const void *address, unsigned int length) > { > - struct { > - struct shash_desc shash; > - char ctx[crypto_shash_descsize(tfm)]; > - } desc; > + char desc[sizeof(struct shash_desc) + > + crypto_shash_descsize(tfm)] CRYPTO_MINALIGN_ATTR; > + struct shash_desc *shash = (struct shash_desc *)desc; > + u32 *ctx = (u32 *)shash_desc_ctx(shash); > int err; > > - desc.shash.tfm = tfm; > - desc.shash.flags = 0; > - *(u32 *)desc.ctx = crc; > + shash->tfm = tfm; > + shash->flags = 0; > + *ctx = crc; > > - err = crypto_shash_update(&desc.shash, address, length); > + err = crypto_shash_update(shash, address, length); > BUG_ON(err); > > - return *(u32 *)desc.ctx; > + return *ctx; > } > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: fix a overflowing boundary writing in csum_tree_block
On 09/09/2014 05:19 AM, rongqing...@windriver.com wrote:
> From: Li RongQing
>
> It is impossible for csum_size to be larger than sizeof(long), but the code
> still adds a handler for this condition (such as allocating new memory) for
> future extension. If that condition ever becomes true, copying csum_size
> bytes into the local 32-bit variables found and val will overflow them.
>
> Fix it by returning the max 4 byte checksum.

Thanks for the patch. I'd rather not silently truncate the copy down though. How about a one-time check at mount to make sure the value in the super is reasonable?

-chris
Re: mkdir and fsync
On 09/10/2014 04:55 PM, Samer Al-Kiswany wrote:
> Hi,
>
> Thank you for the help.
>
> I am seeing a strange behavior when fsync()ing a directory.
>
> Here is what I do:
>
> for (i = 0; i < 100000; i++) {
> 	...
> 	mkdir(p/child_i)
> 	fsync(p)
> }
>
> Btrfs seems to achieve around 100k fsyncs/second, which makes me believe it
> is not touching the disk during these fsyncs.
> After looking at the code, it seems indeed that fsync adds the inode to the
> current transaction but does not sync the transaction to disk.
>
> Is this the intended behavior for metadata fsync or is this a bug?
> Is this POSIX compliant?

Which kernel and hardware? We had some dir fsync handling bugs in the past which may have been related. I just did a test here, and we're definitely doing the IO.

Christoph is right about the requirements for fsync being sloppy. For btrfs, we do put directory changes into the log during an fsync, but we may end up logging only what you fsync. So this will get child_i:

mkdir(p/child_i)
fsync(p)

This will not:

mkdir(p/child_i)
fsync(some_other_directory_that_isn't_p)

(This is different from ext3/4.)

-chris
Re: No space on empty, degraded raid10
On Thu, Sep 11, 2014 at 08:06:21AM -0400, Austin S Hemmelgarn wrote: > On 2014-09-11 07:38, Hugo Mills wrote: > > On Thu, Sep 11, 2014 at 07:19:00AM -0400, Austin S Hemmelgarn wrote: > >> On 2014-09-11 02:40, Russell Coker wrote: > >>> Also it would be nice if there was a N-way mirror option for system data. > >>> As > >>> such data is tiny (32MB on the 120G filesystem in my workstation) the > >>> space > >>> used by having a copy on every disk in the array shouldn't matter. > >> > >> N-way mirroring is in the queue for after RAID5/6 work; ideally, once it > >> is ready, mkfs should default to one copy per disk in the filesystem. > > > >Why change the default from 2-copies, which it's been for years? > > Sorry about the ambiguity in my statement, I meant that the default for > system chunks should be one copy per disk in the filesystem. If you > don't have a copy of the system chunks, then you essentially don't have > a filesystem, and that means that BTRFS RAID6 can't provide true > resilience against 2 disks failing catastrophically unless there are at > least 3 copies of the system chunks. Aah, OK. That makes perfect sense, then. Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Some days, it's just not worth gnawing through the straps --- signature.asc Description: Digital signature
Re: No space on empty, degraded raid10
On 2014-09-11 07:38, Hugo Mills wrote: > On Thu, Sep 11, 2014 at 07:19:00AM -0400, Austin S Hemmelgarn wrote: >> On 2014-09-11 02:40, Russell Coker wrote: >>> Also it would be nice if there was a N-way mirror option for system data. >>> As >>> such data is tiny (32MB on the 120G filesystem in my workstation) the space >>> used by having a copy on every disk in the array shouldn't matter. >> >> N-way mirroring is in the queue for after RAID5/6 work; ideally, once it >> is ready, mkfs should default to one copy per disk in the filesystem. > >Why change the default from 2-copies, which it's been for years? Sorry about the ambiguity in my statement, I meant that the default for system chunks should be one copy per disk in the filesystem. If you don't have a copy of the system chunks, then you essentially don't have a filesystem, and that means that BTRFS RAID6 can't provide true resilience against 2 disks failing catastrophically unless there are at least 3 copies of the system chunks. smime.p7s Description: S/MIME Cryptographic Signature
Re: No space on empty, degraded raid10
On Thu, Sep 11, 2014 at 07:19:00AM -0400, Austin S Hemmelgarn wrote:
> On 2014-09-11 02:40, Russell Coker wrote:
> > Also it would be nice if there was a N-way mirror option for system data. As
> > such data is tiny (32MB on the 120G filesystem in my workstation) the space
> > used by having a copy on every disk in the array shouldn't matter.
>
> N-way mirroring is in the queue for after RAID5/6 work; ideally, once it
> is ready, mkfs should default to one copy per disk in the filesystem.

Why change the default from 2-copies, which it's been for years?

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- This is a job for the Australian. ---
Re: No space on empty, degraded raid10
On 2014-09-11 02:40, Russell Coker wrote:
> On Mon, 8 Sep 2014, Austin S Hemmelgarn wrote:
>> Also, I've found out the hard way that system chunks really should be
>> RAID1, NOT RAID10, otherwise it's very likely that the filesystem
>> won't mount at all if you lose 2 disks.
>
> Why would that be different?
>
> In a RAID-1 you expect system problems if 2 disks fail, why would RAID-10 be
> different?

That's still the case, but in a RAID1 with four disks, of the six different pairs of two disks you could lose, only one will make the filesystem un-mountable, whereas for a four-disk RAID10, there are two different pairs of two disks whose loss makes the filesystem un-mountable. I haven't run the numbers for higher numbers of disks, but things are likely no better, because if you lose both copies of the same stripe, things will fail.

> Also it would be nice if there was a N-way mirror option for system data. As
> such data is tiny (32MB on the 120G filesystem in my workstation) the space
> used by having a copy on every disk in the array shouldn't matter.

N-way mirroring is in the queue for after the RAID5/6 work; ideally, once it is ready, mkfs should default to one copy per disk in the filesystem.
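The four-disk arithmetic in the reply above can be sanity-checked quickly; counting the distinct two-disk failure combinations is just n choose 2:

```shell
n=4                              # disks in the array
pairs=$(( n * (n - 1) / 2 ))     # n choose 2: distinct two-disk failures
echo "${n} disks: ${pairs} possible two-disk failures"
# Per the message: of those 6, one pair kills a 4-disk RAID1 while two
# pairs kill a 4-disk RAID10 (both copies of the same stripe lost).
```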
Re: Is it necessary to balance a btrfs raid1 array?
Bob Williams posted on Thu, 11 Sep 2014 10:56:14 +0100 as excerpted:

> So if a RAID1/two disk system uses the disks symmetrically, why did my
> balance command take 22 hours? That's what puzzles me, as my
> understanding of RAID1 is that the disk use *is* symmetrical.

What you're missing is what balance actually /does/.

Balance will take every chunk it sees, data or metadata (with metadata including system as well), and rewrite it to a new location. In the simplest conception that's /all/ it does. So your 22 hours was the time it took to rewrite-shift, effectively, the entire filesystem, one chunk at a time, from one location to another.

Now it so happens that in the process balance does a bunch of other stuff too, like combining partially empty chunks of the same type during the rewrite, filling them up such that the rewritten version likely takes fewer chunks than the original, thus having the effect of freeing the extra chunks back to unallocated space that's now again free to be used for data or metadata, instead of being tied up in chunks that are one or the other but can't be switched.

And after adding/deleting devices, that rewrite process balances out usage between devices.

And with the convert option (used with -d or -m, below) that rewrite can be used to convert the rewritten chunks to some other raid layout than the original.

And with the -d and -m options (along with -s), you can limit the chunks balance looks at to data or metadata (the latter including system as well, -s) instead of all chunks.

And with the usage option (along with -d or -m, above), you can limit the chunks looked at to those under a particular percentage fill, thus allowing you to do the chunk consolidation more efficiently without taking the time to do ALL chunks of that type as it would otherwise do.

But bottom line, balance is a chunk rewriter, and you told it to rewrite everything on the filesystem, so that's exactly what it did.
And with nearly a TB of data on spinning rust, that took a while: about 22 hours of "a while"!

--
Duncan - List replies preferred. No HTML msgs.
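As a concrete illustration of the -dusage/-musage filtering described above, here is a sketch that only prints the commands instead of running them (the mount point and percentage thresholds are hypothetical):

```shell
mnt=/mnt/data   # hypothetical mount point
# Start with nearly-empty chunks and widen the filter only if more
# consolidation is needed; each pass rewrites only chunks filled below
# the given percentage, so the early passes are cheap.
for pct in 10 25 50; do
    echo "btrfs balance start -dusage=${pct} -musage=${pct} ${mnt}"
done
```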
Re: Is it necessary to balance a btrfs raid1 array?
On 10/09/14 19:43, Goffredo Baroncelli wrote:
> On 09/10/2014 02:27 PM, Bob Williams wrote:
>> I have two 2TB disks formatted as a btrfs raid1 array, mirroring both
>> data and metadata. Last night I started
>>
>> # btrfs filesystem balance
>
> Maybe I am missing something obvious, however I have to ask what
> would be the purpose of balancing a two-disk RAID1 system.
> The balance command should move the data between the disks in order to
> avoid some disks being full and others empty; but this assumes a
> non-symmetrical use of the disks, which is not the case for a RAID1/two-
> disk system.
>
> If there were more than two disks the situation would be completely
> different. But Bob reports that the system is composed of two disks only.
>
>> and it is still running 18 hours later. This suggests that most stuff
>> only gets written to one physical device, which in turn suggests that
>> there is a risk of lost data if one physical device fails. Or is
>> there something clever about btrfs raid that I've missed? I've used
>> linux software raid (mdraid) before, and it appeared to write to both
>> devices simultaneously.

So if a RAID1/two-disk system uses the disks symmetrically, why did my balance command take 22 hours? That's what puzzles me, as my understanding of RAID1 is that the disk use *is* symmetrical.

Bob
Re: [PATCH] Btrfs: add missing compression property remove in btrfs_ioctl_setflags
On Thu, Sep 11, 2014 at 5:41 AM, Satoru Takeuchi wrote:
> Hi Filipe,
>
> (2014/09/11 0:10), Filipe Manana wrote:
>> The behaviour of a 'chattr -c' consists of getting the current flags,
>> clearing the FS_COMPR_FL bit and then sending the result to the set
>> flags ioctl - this means the bit FS_NOCOMP_FL isn't set in the flags
>> passed to the ioctl. This results in the compression property not being
>> cleared from the inode - it was cleared only if the bit FS_NOCOMP_FL
>> was set in the received flags.
>>
>> Reproducer:
>>
>> $ mkfs.btrfs -f /dev/sdd
>> $ mount /dev/sdd /mnt && cd /mnt
>> $ mkdir a
>> $ chattr +c a
>> $ touch a/file
>> $ lsattr a/file
>> c--- a/file
>> $ chattr -c a
>> $ touch a/file2
>> $ lsattr a/file2
>> c--- a/file2
>> $ lsattr -d a
>>  a
>>
>> Reported-by: Andreas Schneider
>> Signed-off-by: Filipe Manana
>> ---
>>  fs/btrfs/ioctl.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index a010c44..8e6950c 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -333,6 +333,9 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
>>
>>  	} else {
>>  		ip->flags &= ~(BTRFS_INODE_COMPRESS | BTRFS_INODE_NOCOMPRESS);
>> +		ret = btrfs_set_prop(inode, "btrfs.compression", NULL, 0, 0);
>
> IMHO, this patch is going in the wrong direction, since it
> changes the meaning of chattr. Here is the correct behavior:
>
>  flag | BTRFS_INODE_COMPRESS | BTRFS_INODE_NOCOMPRESS
>  =====+======================+=======================
>   +c  | set                  | unset
>   -c  | unset                | unset
>
> However, with your change, chattr -c results in unsetting
> BTRFS_INODE_COMPRESS and *setting* BTRFS_INODE_NOCOMPRESS.

Right, for the reason mentioned below it's not a big deal, but nevertheless better not to break the expectations and behaviour from the pre-properties era.

> It's because btrfs_set_prop() will finally lead to
> prop_compression_apply() with its value set to "".
> That's the behavior of calling ioctl() with FS_NOCOMP_FL.
> Please note that an inode with neither BTRFS_INODE_COMPRESS
> nor BTRFS_INODE_NOCOMPRESS set has its own meaning: whether the file
> is compressed or not is decided by the "compress" mount option.

Right, which will set NOCOMPRESS if compression of the first write is worthless (compressed size >= uncompressed size). The fs has the right to set NOCOMPRESS/FS_NOCOMP_FL if it wishes to (due to the lack of any specification that forbids it).

> My suggestion is as follows, and I'll write a patch based
> on it soon.
>
> - Change the meaning of btrfs.compression == "" to mean
>   unset both BTRFS_INODE_COMPRESS and BTRFS_INODE_NOCOMPRESS,
>   for chattr -c.

Do that and you're breaking existing semantics - existing apps or users might already depend on the current semantics/APIs (i.e. not a good idea). However you can add a new value that implements those semantics.

> - Add a new value of the "btrfs.compression" property, for example
>   "no" or "off", to mean unset BTRFS_INODE_COMPRESS and set
>   BTRFS_INODE_NOCOMPRESS. It's for ioctl() with FS_NOCOMP_FL.
> - Unify the duplicated code for changing inode flags into props.c.
>
> Thanks,
> Satoru
>
>> +		if (ret && ret != -ENODATA)
>> +			goto out_drop;
>>  	}
>>
>>  	trans = btrfs_start_transaction(root, 1);

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
[PATCH] xfstests: btrfs: add test regarding clearing compression flag/property
Regression test for btrfs where removing the flag FS_COMPR_FL (chattr -c) from an inode wouldn't clear its compression property. This was fixed in the following linux kernel patch: Btrfs: add missing compression property remove in btrfs_ioctl_setflags Signed-off-by: Filipe Manana --- tests/btrfs/059 | 85 + tests/btrfs/059.out | 11 +++ tests/btrfs/group | 1 + 3 files changed, 97 insertions(+) create mode 100755 tests/btrfs/059 create mode 100644 tests/btrfs/059.out diff --git a/tests/btrfs/059 b/tests/btrfs/059 new file mode 100755 index 000..3379ead --- /dev/null +++ b/tests/btrfs/059 @@ -0,0 +1,85 @@ +#! /bin/bash +# FS QA Test No. btrfs/059 +# +# Regression test for btrfs where removing the flag FS_COMPR_FL (chattr -c) +# from an inode wouldn't clear its compression property. +# This was fixed in the following linux kernel patch: +# +# Btrfs: add missing compression property remove in btrfs_ioctl_setflags +# +#--- +# Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved. +# Author: Filipe Manana +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ + +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + rm -fr $tmp +} + +# get standard environment, filters and checks +. ./common/rc +. 
./common/filter + +# real QA test starts here +_supported_fs btrfs +_supported_os Linux +_require_test +_require_scratch +_require_btrfs "property" +_need_to_be_root + +rm -f $seqres.full + +_scratch_mkfs >> $seqres.full 2>&1 +_scratch_mount + +mkdir $SCRATCH_MNT/testdir +echo "Setting compression flag in the directory..." +chattr +c $SCRATCH_MNT/testdir +echo "Directory compression property value:" +$BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir compression + +touch $SCRATCH_MNT/testdir/file1 +echo "file1 compression property value:" +$BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/file1 compression + +echo "Clearing compression flag from directory..." +chattr -c $SCRATCH_MNT/testdir +echo "Directory compression property value:" +$BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir compression + +touch $SCRATCH_MNT/testdir/file2 +echo "file2 compression property value:" +$BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/file2 compression + +touch $SCRATCH_MNT/testdir/file1 +echo "file1 compression property value:" +$BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/file1 compression + +status=0 +exit diff --git a/tests/btrfs/059.out b/tests/btrfs/059.out new file mode 100644 index 000..9ec9a53 --- /dev/null +++ b/tests/btrfs/059.out @@ -0,0 +1,11 @@ +QA output created by 059 +Setting compression flag in the directory... +Directory compression property value: +compression=zlib +file1 compression property value: +compression=zlib +Clearing compression flag from directory... 
+Directory compression property value: +file2 compression property value: +file1 compression property value: +compression=zlib diff --git a/tests/btrfs/group b/tests/btrfs/group index 3fa9778..68b5c79 100644 --- a/tests/btrfs/group +++ b/tests/btrfs/group @@ -61,3 +61,4 @@ 056 auto quick 057 auto quick 058 auto quick +059 auto quick -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] Btrfs: add missing compression property remove in btrfs_ioctl_setflags
The behaviour of a 'chattr -c' consists of getting the current flags, clearing the FS_COMPR_FL bit and then sending the result to the set flags ioctl - this means the bit FS_NOCOMP_FL isn't set in the flags passed to the ioctl. This results in the compression property not being cleared from the inode - it was cleared only if the bit FS_NOCOMP_FL was set in the received flags.

Reproducer:

$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt && cd /mnt
$ mkdir a
$ chattr +c a
$ touch a/file
$ lsattr a/file
c--- a/file
$ chattr -c a
$ touch a/file2
$ lsattr a/file2
c--- a/file2
$ lsattr -d a
 a

Reported-by: Andreas Schneider
Signed-off-by: Filipe Manana
---

V2: Ensure BTRFS_INODE_NOCOMPRESS isn't set (unless the bit FS_NOCOMP_FL is set).

 fs/btrfs/ioctl.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a010c44..a46c169 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -332,6 +332,9 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 			goto out_drop;
 	} else {
+		ret = btrfs_set_prop(inode, "btrfs.compression", NULL, 0, 0);
+		if (ret && ret != -ENODATA)
+			goto out_drop;
 		ip->flags &= ~(BTRFS_INODE_COMPRESS | BTRFS_INODE_NOCOMPRESS);
 	}
--
1.9.1
Re: Is it necessary to balance a btrfs raid1 array?
On 11/09/14 05:30, Zygo Blaxell wrote:
> On Wed, Sep 10, 2014 at 01:27:36PM +0100, Bob Williams wrote:
>> I have two 2TB disks formatted as a btrfs raid1 array, mirroring
>> both data and metadata. Last night I started
>>
>> # btrfs filesystem balance
>> [...]
>
>> As a rough guide, how often should one perform
>>
>> a) balance
>
> I have a cron job that runs 'btrfs balance resume' or 'btrfs
> balance start' (depending on whether a balance is already in
> progress) nightly at 1AM. Another cron job comes along at 6AM to
> run 'btrfs balance pause' on my headless servers. On my desktops
> and laptops I have a daemon that detects keyboard/mouse input and
> does 'btrfs balance pause' when some is detected (the balance
> remains paused until the next day at 1AM, as it is really heavy and
> takes a long time to come to a stop).
> [...]

Many thanks to everyone who has contributed to this thread. I have learnt a lot, and now have weekly cron jobs to balance and scrub.

Bob

--
Bob Williams
System:  Linux 3.11.10-21-desktop
Distro:  openSUSE 13.1 (x86_64) with KDE Development Platform 4.14.0
Uptime:  06:00am up 6 days 15:04, 4 users, load average: 0.02, 0.02, 0.05
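Bob's weekly balance and scrub cron jobs might look something like the sketch below; the schedule, mount point, and usage threshold are assumptions, not details from the thread. The snippet just prints the would-be /etc/cron.d entries rather than installing them:

```shell
# Print hypothetical /etc/cron.d/btrfs-maint entries for a weekly
# filtered balance (Saturday night) and a full scrub (Sunday night).
cat <<'EOF'
# m h dom mon dow user command
0 1 * * 6 root /usr/bin/btrfs balance start -dusage=50 -musage=50 /mnt/data
0 2 * * 0 root /usr/bin/btrfs scrub start -B /mnt/data
EOF
```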