Re: how long should btrfs device delete missing ... take?
Chris Murphy posted on Thu, 11 Sep 2014 20:10:26 -0600 as excerpted:

> Sure. But what's the next step? Given 260+ snapshots might mean well
> more than 350GB of data, depending on how deduplicated the fs is, it
> still probably would be faster to rsync this to a pile of drives in
> linear/concat+XFS than wait a month (?) for device delete to finish.

That was what I was getting at in my other just-finished short reply. It may be time to give up on the btrfs-specific solutions for the moment and go with tried and tested traditional solutions (tho I'd definitely *NOT* try rsync or the like with the delete still going; we know from other reports that rsync places its own stresses on btrfs, and one major stressor, the delete-triggered rebalance, at a time is bad enough).

> Alternatively, script some way to create 260+ ro snapshots to btrfs
> send/receive to a new btrfs volume; and turn it into a raid1 later.

No confirmation yet, but I strongly suspect most of those subvolumes are snapshots. Assuming that's the case, it's very likely most of them can simply be eliminated as I originally suggested, a process that /should/ be fast, simplifying the situation dramatically.

> I'm curious if a sysrq+s followed by sysrq+u might leave the
> filesystem in a state where it could still be rw mountable. But I'm
> skeptical of anything interrupting the device delete before being
> fully prepared for the fs to be toast for rw mount. If only ro mount
> is possible, any chance of creating ro snapshots is out.

In theory, that is, barring bugs, interrupting the delete with a normal shutdown to the extent possible, then sysrq+s, sysrq+u, should not be a problem. The delete is basically a balance, going chunk by chunk, and either a given chunk has been duplicated to the new device or it hasn't. In either case, the existing chunk on the remaining old device shouldn't be affected. So rebooting that way in order to stop the delete temporarily /should/ have no bad effects.

Of course, that's barring bugs. Btrfs is still not fully stabilized, and bugs do happen, so anything's possible. But I'd consider it safe enough to try here, certainly so if I had backups, as is still STRONGLY recommended for btrfs at this point, much more so than under the routine sysadmin "if it's not backed up, by definition it's not valuable to you" rule.

-- 
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
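[For the scripted-snapshot route mentioned above, a minimal sketch could look like the following. This is a hedged illustration, not a tested recipe: /home and /mnt/newbtrfs are placeholder mount points, it assumes /home is mounted at the top-level subvolume so the paths reported by "btrfs subvolume list" resolve directly under it, and it assumes the filesystem can still be mounted rw so that snapshot creation works at all.]

#!/bin/sh
# Make a read-only snapshot of every subvolume and send it to a new volume.
set -e
mkdir -p /home/.ro-snaps
btrfs subvolume list /home | awk '{print $NF}' | while read -r sub; do
    name=$(printf '%s' "$sub" | tr '/' '_')
    # -r makes the snapshot read-only, which btrfs send requires
    btrfs subvolume snapshot -r "/home/$sub" "/home/.ro-snaps/$name"
    btrfs send "/home/.ro-snaps/$name" | btrfs receive /mnt/newbtrfs/
done

[Incremental sends (btrfs send -p <parent>) would preserve far more of the sharing between related snapshots, but even a naive loop like this gets the data off the ailing volume.]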
Re: how long should btrfs device delete missing ... take?
On Sep 11, 2014, at 11:19 PM, Russell Coker russ...@coker.com.au wrote:

> It would be nice if a file system mounted ro counted as ro snapshots
> for btrfs send. When a file system is so messed up it can't be mounted
> rw, it should be regarded as ro for all operations.

Yes, it's come up before, and there's a question whether mount -o ro is reliably ro enough for this. Maybe a force option?

But then another one is a recursive btrfs send to go along with the above. I might want them all, or I might want all of the ones in two particular subvolumes, etc. And even combine the recursive ro snapshot and recursive send as a btrfs rescue option that would work even if the volume is mounted read-only.

Chris Murphy
how long should btrfs device delete missing ... take?
After a disk died and was replaced, btrfs device delete missing is taking more than 10 days on an otherwise idle server:

# btrfs fi show /home
Label: none  uuid: 84d087aa-3a32-46da-844f-a233237cf04f
        Total devices 3 FS bytes used 362.44GiB
        devid    2 size 1.71TiB used 365.03GiB path /dev/sdb4
        devid    3 size 1.71TiB used 58.00GiB path /dev/sda4
        *** Some devices missing

Btrfs v3.16

So far, it has copied 58 GB out of 365 GB - and it took 10 days. At this speed, the whole operation will take 2-3 months (assuming that the only healthy disk doesn't die in the meantime).

Is this the expected time for btrfs RAID-1?

There are no errors in dmesg/smart, and the performance of both disks is fine:

# hdparm -t /dev/sda /dev/sdb
/dev/sda:
 Timing buffered disk reads: 442 MB in 3.01 seconds = 146.99 MB/sec
/dev/sdb:
 Timing buffered disk reads: 402 MB in 3.39 seconds = 118.47 MB/sec

# btrfs fi df /home
Data, RAID1: total=352.00GiB, used=351.02GiB
System, RAID1: total=32.00MiB, used=96.00KiB
Metadata, RAID1: total=13.00GiB, used=11.38GiB
unknown, single: total=512.00MiB, used=67.05MiB

# btrfs sub list /home | wc -l
260

# uptime
 17:21:53 up 10 days, 6:01, 2 users, load average: 3.22, 3.53, 3.55

I've tried running this on the latest 3.16.x kernel earlier, but since the progress was so slow, I rebooted after about a week to see if the latest RC would be any faster.

-- 
Tomasz Chmielewski
http://www.sslrack.com
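[For reference, a rough back-of-the-envelope ETA from the figures above; a hedged estimate assuming the copy rate stays roughly constant, which it may well not:]

# 58 GiB copied in 10 days; 365 - 58 = 307 GiB still to go
echo "scale=1; (365 - 58) * 10 / 58" | bc
# -> 52.9, i.e. roughly another 53 days, close to two months in total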
Re: how long should btrfs device delete missing ... take?
Tomasz Chmielewski posted on Thu, 11 Sep 2014 17:22:15 +0200 as excerpted:

> After a disk died and was replaced, btrfs device delete missing is
> taking more than 10 days on an otherwise idle server:
>
> # btrfs fi show /home
> Label: none  uuid: 84d087aa-3a32-46da-844f-a233237cf04f
>         Total devices 3 FS bytes used 362.44GiB
>         devid    2 size 1.71TiB used 365.03GiB path /dev/sdb4
>         devid    3 size 1.71TiB used 58.00GiB path /dev/sda4
>         *** Some devices missing
>
> Btrfs v3.16
>
> So far, it has copied 58 GB out of 365 GB - and it took 10 days. At
> this speed, the whole operation will take 2-3 months (assuming that
> the only healthy disk doesn't die in the meantime).
>
> Is this the expected time for btrfs RAID-1?

Device delete definitely takes time. For the sub-TiB usage shown above, 10 days for ~58 GiB out of 365 does seem excessive, but there are extreme cases where it isn't entirely out of line. See below.

> There are no errors in dmesg/smart, and the performance of both disks
> is fine:
>
> # btrfs sub list /home | wc -l
> 260
>
> I've tried running this on the latest 3.16.x kernel earlier, but since
> the progress was so slow, I rebooted after about a week to see if the
> latest RC would be any faster.

The good thing is that once a block group is copied over, it should be fine and won't need to be re-copied if the process is stopped over a reboot and restarted on a new kernel, etc. The bad thing is that, if I'm interpreting your report correctly, that likely means 7+10=17 days for that 58 gig. =:^(

Questions:

* Presumably most of those 260 subvolumes are snapshots, correct? What was your snapshotting schedule, and did you have old-snapshot cleanup/deletion scheduled as well?

* Do you run with autodefrag, or was the system otherwise regularly defragged?

* Do you have large (GiB-plus) database or virtual machine image files on that filesystem? If so, had you properly set the NOCOW file attribute (chattr +C) on them, and were they on dedicated subvolumes?

200+ snapshots is somewhat high and could be part of the issue, tho it's nothing like the extremes (thousands) we've seen posted in the past. Were it me, I'd have tried deleting as many as possible before the device delete missing, in order to simplify the process and eliminate as much extra data as possible.

The real issue is going to be fragmentation, on spinning-rust drives. Run filefrag on some of your gig-plus files that get written to frequently (VM images and database files are the classic cases) and see how many extents are reported. (Tho note that filefrag doesn't understand btrfs compression and won't be accurate in that case, and also that due to the btrfs data chunk size of 1 GiB, that's the maximum extent size, so multi-gig files will typically show two extents more than their number of whole gigs: one filling up the current chunk, N whole-gig chunks, and the file tail.)

The NOCOW file attribute (which must be set while the file is zero-sized to be effective, see discussion elsewhere) can help with that, but snapshotting an actively-rewritten NOCOW file more or less defeats the purpose of NOCOW, since the snapshot locks the existing data in place and the first rewrite to a block must then be cowed anyway. Putting those files on dedicated subvolumes and then not snapshotting those subvolumes is a workaround, however.

I wouldn't try defragging now, but it might be worthwhile to stop the device delete (rebooting to do so, since I don't think there's a cancel) and delete as many snapshots as possible. That should help matters.

Additionally, if you have recent backups of highly fragmented files such as the VM images and DBs I mentioned, you might consider simply deleting them, thus eliminating that fragment processing from the device delete. I don't know that making a backup now would go much faster than the device delete, however, so I don't know whether to recommend that or not.

-- 
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
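[For anyone who wants to check the fragmentation theory and the NOCOW handling Duncan describes, a hedged example; the file names are placeholders, and the key point is that chattr +C is only reliable on a file created empty:]

# How fragmented are the frequently-rewritten files? (extent counts)
filefrag /var/lib/mysql/ibdata1 /var/lib/libvirt/images/vm.qcow2

# Setting NOCOW correctly: create a new, empty file, set +C on it, then
# copy the data in. A plain chattr +C on an already-populated file has
# no effect on the data that's already there.
touch vm-nocow.qcow2
chattr +C vm-nocow.qcow2
cat vm-old.qcow2 > vm-nocow.qcow2
mv vm-nocow.qcow2 vm-old.qcow2

# Alternatively, chattr +C on a directory makes newly created files in
# it inherit the attribute, which is handy for VM image directories.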
Re: how long should btrfs device delete missing ... take?
On Sep 11, 2014, at 1:31 PM, Duncan 1i5t5.dun...@cox.net wrote:

> I wouldn't try defragging now, but it might be worthwhile to stop the
> device delete (rebooting to do so since I don't think there's a
> cancel)

'btrfs replace cancel' does exist, although I haven't tried it.

Something isn't right though, because it's clearly neither reading nor writing at anywhere close to 1/2 the drive read throughput. I'm curious what 'iotop -d30 -o' shows (during the replace, before cancel), which should be pretty consistent by averaging 30 seconds worth of io. And then try 'iotop -d3 -o' and see if there are spikes. I'm willing to bet there's a lot of nothing going on, with occasional spikes, rather than a constant trickle.

And then the question is to find out what btrfs is thinking about while nothing is reading or writing. Even though it's not 5000+ snapshots, I wonder if the balance code (and hence btrfs replace) makes extensive use of fiemap that's causing this to go catatonic.

http://comments.gmane.org/gmane.comp.file-systems.btrfs/35724

Chris Murphy
Re: how long should btrfs device delete missing ... take?
>> After a disk died and was replaced, btrfs device delete missing is
>> taking more than 10 days on an otherwise idle server:
>
> Something isn't right though, because it's clearly neither reading nor
> writing at anywhere close to 1/2 the drive read throughput. I'm
> curious what 'iotop -d30 -o' shows (during the replace, before
> cancel), which should be pretty consistent by averaging 30 seconds
> worth of io. And then try 'iotop -d3 -o' and see if there are spikes.
> I'm willing to bet there's a lot of nothing going on, with occasional
> spikes, rather than a constant trickle.

That's more or less what I'm seeing with both. The numbers will go up or down slightly, but it's counted in kilobytes per second:

Total DISK READ: 0.00 B/s | Total DISK WRITE: 545.82 B/s
  TID  PRIO  USER   DISK READ   DISK WRITE  SWAPIN     IO     COMMAND
  940  be/3  root   0.00 B/s    136.46 B/s  0.00 %   0.10 %  [jbd2/md2-8]
 4714  be/4  root   0.00 B/s    329.94 K/s  0.00 %   0.00 %  [btrfs-transacti]
25534  be/4  root   0.00 B/s    402.97 K/s  0.00 %   0.00 %  [kworker/u16:0]

The bottleneck may be here - one CPU core is mostly 100% busy (kworker). Not sure what it's really busy with though:

  PID  USER  PRI  NI  VIRT  RES  SHR  S  CPU%  MEM%  TIME+     Command
25546  root   20   0     0    0    0  R  93.0   0.0  18:22.94  kworker/u16:7
14473  root   20   0     0    0    0  S   5.0   0.0  25:00.14  kworker/0:0

[912979.063432] SysRq : Show Blocked State
[912979.063485]   task                        PC stack   pid father
[912979.063545] btrfs           D 88083fa515c0     0  4793   4622 0x
[912979.063601]  88061a29b878 0086 88003683e040
[912979.063701]  000115c0 4000 880813e3 88003683e040
[912979.063800]  88061a29b7e8 8105d8e9 88083fa4 88083fa115c0
[912979.063899] Call Trace:
[912979.063951]  [8105d8e9] ? enqueue_task_fair+0x3e5/0x44f
[912979.064006]  [81053484] ? resched_curr+0x47/0x57
[912979.064058]  [81053aed] ? check_preempt_curr+0x3e/0x6d
[912979.064111]  [81053b2e] ? ttwu_do_wakeup+0x12/0x7f
[912979.064164]  [81053c3c] ? ttwu_do_activate.constprop.74+0x57/0x5c
[912979.064220]  [813acc1e] schedule+0x65/0x67
[912979.064272]  [813aed0c] schedule_timeout+0x26/0x198
[912979.064324]  [8105639d] ? wake_up_process+0x31/0x35
[912979.064378]  [81049baf] ? wake_up_worker+0x1f/0x21
[912979.064431]  [81049df6] ? insert_work+0x87/0x94
[912979.064493]  [a02d524b] ? free_block_list+0x1f/0x34 [btrfs]
[912979.064548]  [813ad443] wait_for_common+0x10d/0x13e
[912979.064600]  [8105635d] ? try_to_wake_up+0x251/0x251
[912979.064653]  [813ad48c] wait_for_completion+0x18/0x1a
[912979.064710]  [a0283a01] btrfs_async_run_delayed_refs+0xc1/0xe4 [btrfs]
[912979.064814]  [a02983c5] __btrfs_end_transaction+0x2bb/0x2e1 [btrfs]
[912979.064916]  [a02983f9] btrfs_end_transaction_throttle+0xe/0x10 [btrfs]
[912979.065020]  [a02d973d] relocate_block_group+0x2ad/0x4de [btrfs]
[912979.065079]  [a02d9ac6] btrfs_relocate_block_group+0x158/0x278 [btrfs]
[912979.065184]  [a02b66f0] btrfs_relocate_chunk.isra.62+0x58/0x5f7 [btrfs]
[912979.065286]  [a02c58d7] ? btrfs_set_lock_blocking_rw+0x68/0x95 [btrfs]
[912979.065387]  [a0276b04] ? btrfs_set_path_blocking+0x23/0x54 [btrfs]
[912979.065486]  [a027b517] ? btrfs_search_slot+0x7bc/0x816 [btrfs]
[912979.065546]  [a02b2bd5] ? free_extent_buffer+0x6f/0x7c [btrfs]
[912979.065605]  [a02b89e9] btrfs_shrink_device+0x23c/0x3a5 [btrfs]
[912979.065679]  [a02bb2c7] btrfs_rm_device+0x2a1/0x759 [btrfs]
[912979.065747]  [a02c3ab3] btrfs_ioctl+0xa52/0x227f [btrfs]
[912979.065811]  [81107182] ? putname+0x23/0x2c
[912979.065863]  [8110b3cb] ? user_path_at_empty+0x60/0x90
[912979.065918]  [81173b1a] ? avc_has_perm+0x2e/0xf7
[912979.065978]  [810d7ad5] ? __vm_enough_memory+0x25/0x13c
[912979.066032]  [8110d3c1] do_vfs_ioctl+0x3f2/0x43c
[912979.066084]  [811026fd] ? vfs_stat+0x16/0x18
[912979.066136]  [8110d459] SyS_ioctl+0x4e/0x7d
[912979.066188]  [81030a71] ? do_page_fault+0xc/0xf
[912979.066240]  [813afd92] system_call_fastpath+0x16/0x1b
[912979.066296] Sched Debug Version: v0.11, 3.17.0-rc3 #1
[912979.066347] ktime                 : 913460840.666210
[912979.066401] sched_clk             : 912979066.295474
[912979.066454] cpu_clk               : 912979066.295485
[912979.066507] jiffies               : 4386283381
[912979.066560] sched_clock_stable()  : 1
[912979.066610]
[912979.066656] sysctl_sched
[912979.066703]   .sysctl_sched_latency: 24.00
[912979.066756]   .sysctl_sched_min_granularity
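[To see what that ~100%-busy kworker (PID 25546 in the process listing above) is actually spending its time on, a couple of standard tools can help. A hedged sketch: perf needs the perf package and matching kernel support, and /proc/PID/stack needs root.]

# Snapshot of the worker thread's kernel stack; run it a few times in a row
cat /proc/25546/stack

# Or sample where the CPU time is going, kernel symbols included
perf top -p 25546

# Or record system-wide for ~10 seconds and inspect afterwards
perf record -a -g -- sleep 10
perf report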
Re: how long should btrfs device delete missing ... take?
Chris Murphy posted on Thu, 11 Sep 2014 15:25:51 -0600 as excerpted:

> On Sep 11, 2014, at 1:31 PM, Duncan 1i5t5.dun...@cox.net wrote:
>> I wouldn't try defragging now, but it might be worthwhile to stop the
>> device delete (rebooting to do so since I don't think there's a
>> cancel)
>
> 'btrfs replace cancel' does exist, although I haven't tried it.

Btrfs replace cancel exists, yes, but does it work for btrfs device delete, which is what the OP used?

> Something isn't right though, because it's clearly neither reading nor
> writing at anywhere close to 1/2 the drive read throughput. I'm
> curious what 'iotop -d30 -o' shows (during the replace, before
> cancel), which should be pretty consistent by averaging 30 seconds
> worth of io. And then try 'iotop -d3 -o' and see if there are spikes.
> I'm willing to bet there's a lot of nothing going on, with occasional
> spikes, rather than a constant trickle. And then the question is to
> find out what btrfs is thinking about while nothing is reading or
> writing. Even though it's not 5000+ snapshots, I wonder if the balance
> code (and hence btrfs replace) makes extensive use of fiemap that's
> causing this to go catatonic.
>
> http://comments.gmane.org/gmane.comp.file-systems.btrfs/35724

Not sure (some of that stuff's beyond me), but one thing we /do/ know is that btrfs has so far been focused mostly on features and debugging, not on optimization beyond the worst cases, which themselves remain a big enough problem, tho it's slowly getting better.

-- 
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: how long should btrfs device delete missing ... take?
On Sep 11, 2014, at 5:51 PM, Duncan 1i5t5.dun...@cox.net wrote:

> Chris Murphy posted on Thu, 11 Sep 2014 15:25:51 -0600 as excerpted:
>
>> On Sep 11, 2014, at 1:31 PM, Duncan 1i5t5.dun...@cox.net wrote:
>>> I wouldn't try defragging now, but it might be worthwhile to stop
>>> the device delete (rebooting to do so since I don't think there's a
>>> cancel)
>>
>> 'btrfs replace cancel' does exist, although I haven't tried it.
>
> Btrfs replace cancel exists, yes, but does it work for btrfs device
> delete, which is what the OP used?

Oops, right! I'm not sure what can do this safely.

And then, when I think about just creating a new fs and using btrfs send/receive, the snapshots need to be ro first. So if there's any uncertainty about safely canceling the 'device delete', those ro snapshots need to be taken first, in the event only an ro mount is possible afterwards. And then there's some uncertainty how long creating 260+ ro snapshots will take (it should be fast, but) and how much worse that makes the current situation. But it's probably worth the risk to take the snapshots and just wait a while before trying something like umount or sysrq+s followed by sysrq+u.

>> Something isn't right though, because it's clearly neither reading
>> nor writing at anywhere close to 1/2 the drive read throughput. I'm
>> curious what 'iotop -d30 -o' shows (during the replace, before
>> cancel), which should be pretty consistent by averaging 30 seconds
>> worth of io. And then try 'iotop -d3 -o' and see if there are spikes.
>> I'm willing to bet there's a lot of nothing going on, with occasional
>> spikes, rather than a constant trickle. And then the question is to
>> find out what btrfs is thinking about while nothing is reading or
>> writing. Even though it's not 5000+ snapshots, I wonder if the
>> balance code (and hence btrfs replace) makes extensive use of fiemap
>> that's causing this to go catatonic.
>>
>> http://comments.gmane.org/gmane.comp.file-systems.btrfs/35724
>
> Not sure (some of that stuff's beyond me), but one thing we /do/ know
> is that btrfs has so far been focused mostly on features and
> debugging, not on optimization beyond the worst cases, which
> themselves remain a big enough problem, tho it's slowly getting
> better.

Sure. But what's the next step? Given 260+ snapshots might mean well more than 350GB of data, depending on how deduplicated the fs is, it still probably would be faster to rsync this to a pile of drives in linear/concat+XFS than wait a month (?) for device delete to finish.

Alternatively, script some way to create 260+ ro snapshots to btrfs send/receive to a new btrfs volume, and turn it into a raid1 later.

I'm curious if a sysrq+s followed by sysrq+u might leave the filesystem in a state where it could still be rw mountable. But I'm skeptical of anything interrupting the device delete before being fully prepared for the fs to be toast for rw mount. If only ro mount is possible, any chance of creating ro snapshots is out.

Chris Murphy
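[For reference, the sysrq+s / sysrq+u sequence being discussed can be issued from a console keyboard (Alt+SysRq+s, then Alt+SysRq+u) or via /proc/sysrq-trigger. A minimal sketch, only once you've accepted that the fs may not come back rw-mountable, and assuming sysrq is enabled in /proc/sys/kernel/sysrq; the final reboot step is an assumption beyond the s/u keys discussed above:]

# emergency sync, then remount everything read-only, then reboot
echo s > /proc/sysrq-trigger
sleep 5
echo u > /proc/sysrq-trigger
sleep 5
echo b > /proc/sysrq-trigger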
Re: how long should btrfs device delete missing ... take?
It would be nice if a file system mounted ro counted as ro snapshots for btrfs send. When a file system is so messed up it can't be mounted rw, it should be regarded as ro for all operations.

-- 
Sent from my Samsung Galaxy Note 2 with K-9 Mail.
Re: how long should btrfs device delete missing ... take?
Russell Coker posted on Fri, 12 Sep 2014 15:19:04 +1000 as excerpted:

> It would be nice if a file system mounted ro counted as ro snapshots
> for btrfs send. When a file system is so messed up it can't be mounted
> rw, it should be regarded as ro for all operations.

Indeed, and that has been suggested before, but unfortunately it's not current behavior.

-- 
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: how long should btrfs device delete missing ... take?
Chris Murphy posted on Thu, 11 Sep 2014 20:10:26 -0600 as excerpted:

> And then when I think about just creating a new fs, using btrfs
> send/receive, the snapshots need to be ro first.

FWIW, at this point I'd forget about send/receive and create the backup (assuming one doesn't exist already) using more normal methods. At least if the original send/receive hasn't yet been done, so it'd be copying off all the data anyway. Perhaps mount selected snapshots and back them up too (after the current case is backed up), but throw away most of the snapshots.

Of course, if there's an existing, relatively current sent/received base to build on, and no indication that send/receive is broken, definitely try that first, as the amount of data to sync in that case should be MUCH lower. But if not...

-- 
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
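[If it does come to "more normal methods", a hedged rsync sketch: the destination path is a placeholder, --info=progress2 needs rsync 3.1+, and per the caution earlier in the thread this should only be run once the device delete has been stopped:]

# preserve hard links, ACLs, xattrs and sparse files; show overall progress
rsync -aHAXS --info=progress2 /home/ /mnt/backup-home/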