Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
Nicholas D Steeves wrote on 2016/04/05 23:47 -0400:
> On 4 April 2016 at 12:55, David Sterba wrote:
>>>>> Not exactly. If we are using unsafe hash, e.g. MD5, we will use MD5
>>>>> only for both in-memory and on-disk backend. No SHA256 again.
>>>>
>>>> I'm proposing unsafe but fast, which MD5 is not. Look for xxhash or
>>>> murmur. As they're both orders of magnitude faster than sha1/md5, we
>>>> can actually hash both to reduce the collisions.
>>>
>>> Don't quite like the idea to use 2 hashes rather than 1.
>>> Yes, some programs like rsync use this method, but it also involves a
>>> lot of details, like the order to restore them on disk.
>>
>> I'm considering fast-but-unsafe hashes for the in-memory backend, where
>> the speed matters and we cannot hide the slow sha256 calculations behind
>> the IO (ie. no point to save microseconds if the IO is going to take
>> milliseconds).
>>
>>>>> In that case, for the MD5 hit case, we will do a full byte-to-byte
>>>>> comparison. It may be slow or fast, depending on the cache.
>>>>
>>>> If the probability of hash collision is low, the number of needed
>>>> byte-to-byte comparisons is also low.
>
> It is unlikely that I will use dedupe, but I imagine your work will
> apply to the following wishlist:
>
> 1. Allow disabling of memory-backend hash via a kernel argument, sysctl,
>    or mount option for those of us who have ECC RAM.
>    * page_cache never gets pushed to swap, so this should be safe, no?

Why not use the current ioctl to disable dedupe?

And why is it related to ECC RAM? To avoid memory corruption which will
finally lead to file corruption? If so, it makes sense.

Also I didn't get the point when you mention page_cache.
For the hash pool, we don't use the page cache. We just use kmalloc,
which won't be swapped out.
For file page cache, it's not affected at all.

> 2. Implementing an intelligent cache so that it's possible to offset the
>    cost of hashing the most actively read data. I'm guessing there's
>    already some sort of weighted cache eviction algorithm in place, but
>    I don't yet know how to look into it, let alone enough to leverage
>    it...

I'm not quite a fan of such an intelligent but complicated cache design.
The main problem is that we would be putting policy into kernel space.

Currently, either use the least-recently-used in-memory backend, or use
the all-in on-disk backend.

For users who want more precise control over which files/dirs shouldn't
go through dedupe, there is btrfs prop to set a per-file flag to avoid
dedupe.

> * on the topic of leaning on the cache, I've been thinking about ways to
>   optimize reads, while minimizing seeks on multi-spindle raid1 btrfs
>   volumes. I'm guessing that someone will commit a solution before I
>   manage to teach myself enough about filesystems to contribute
>   something useful.
>
> That's it, in terms of features I want ;-)
>
> It's probably a well-known fact, but sha512 is roughly 40 to 50% faster
> than sha256, and 40 to 50% slower than sha1 on my 1200-series Xeon v3
> (Haswell), for 8192 size blocks.

Sadly I didn't know that until recent days. :(
Or I would have implemented the SHA512 hash algorithm instead of SHA256.

Anyway, it's not that hard to add a new hash algorithm.

Thanks for your comments.
Qu

> Wish I could do more right now!
> Nicholas

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk
Hi Alex,

On 13 March 2016 at 05:51, Alex Lyakas wrote:
> Nicholas,
>
> On Sat, Mar 12, 2016 at 12:19 AM, Nicholas D Steeves wrote:
>> On 10 March 2016 at 06:10, Alex Lyakas wrote:
>> Does this mean there is a good chance that everyone has corrupted
>> metadata?
> No, this definitely does not.
>
> The code that I added prevents btrfs from writing a metadata block, if
> it somehow got corrupted before being sent to disk. If it happens, it
> indicates a bug somewhere in the kernel. For example, if some other
> kernel module erroneously uses a page-cache entry, which does not
> belong to it (and contains a btrfs metadata block or part of it).

Oh wow, I didn't know that was possible. If I understand correctly, this
patch makes using bcache a little bit safer? (I don't use it since I'm
too short on free time to test what is--I suspect--something that
radically increases the chances of having to restore from backup)

>> Is there any way to verify/rebuild it without wipefs+mkfs+restore from
>> backups?
> To verify btrfs metadata: unmount the filesystem and run "btrfs check
> ...". Do not specify the "repair" parameter. Another way to verify is
> to run "btrfs-debug-tree" and redirect its standard output to
> /dev/null. It should not print anything to standard error. But "btrfs
> check" is faster.

Ah, that's exactly what I was looking for! Thank you. It took forever,
and brought me back to what it was like to fsck large ext2 volumes. Is
btrfs check conceptually identical to a read-only fsck of an ext2
volume? If not, how does it differ?

Are the following sorts of errors still an issue?

Extent back ref already exists for 2148837945344 parent 0 root 257
leaf parent key incorrect 504993210368
bad block 504993210368

( https://btrfs.wiki.kernel.org/index.php/Btrfsck )

Cheers,
Nicholas
Re: dstat shows unexpected result for two disk RAID1
On 11 March 2016 at 20:20, Chris Murphy wrote:
> On Fri, Mar 11, 2016 at 5:10 PM, Nicholas D Steeves wrote:
>> P.S. Rather than parity, I mean instead of distributing into stripes,
>> do a copy!
>
> raid56 by definition are parity based, so I'd say no, that's confusing
> to turn it into something it's not.

I just found the Multiple Device Support diagram. I'm trying to figure
out how hard it's going to be for me to get up to speed, because I've
only ever casually and informally read about filesystems. I worry that
because I didn't study filesystem design in school, and because
everything I worked on was in C++... well, the level of sophistication
and design might be beyond what I can learn. What do you think? Can you
recommend any books on file system design that will provide what is
necessary to understand btrfs?

Cheers,
Nicholas
Re: btrfs-progs4.4 with linux-3.16.7 (with truncation of extends patch)
Dear Duncan,

Gmail seems to have mangled the draft of my reply :-/  It's at the bottom.

On 06/02/16 12:57 AM, Duncan wrote:
> Most of the rest of the userspace tools, in particular btrfs scrub,
> subvolume, balance, device, filesystem, send, receive, etc, work by
> making kernel calls to do the actual work in any case, and they will
> use the old calls if they need to.
>
> The compatibility discussion, meanwhile, is on making mkfs.btrfs (and
> btrfs-convert) check the running kernel and take its defaults from what
> that kernel supports, instead of choosing arbitrary defaults that may
> be better when supported, but that older kernels don't actually
> support.
>
> Of course there will still be options to set these as desired
> regardless of defaults, just as there are now, so people booted to an
> old recovery kernel for system maintenance, for instance, can still
> choose whatever options that version of mkfs.btrfs supports if they
> know they'll actually be mounting with a newer kernel. But the idea is
> simply to have mkfs.btrfs act more sanely /by/ /default/ when run on
> old kernels, so those same old kernels can actually mount a filesystem
> created with defaults.
>
> Along that line, as a distro maintainer of the btrfs-progs package, you
> may wish to patch the mkfs.btrfs defaults to what your kernel supports.
> Btrfs-progs will probably ship with kernel-sensitive defaults some time
> in the future (userspace 4.5 release, probably), but it doesn't do so
> yet...

Thank you very much for taking the time to write such a thorough reply.
I'm not the maintainer of Debian's btrfs-progs package, but I am
investigating the issues preventing the addition of btrfs-progs-4.4 to
the backports repository.

[ edit: sorry it took me so long to reply, I've been swamped with work. ]

In the meantime, it seems v4.4 has made it into the backports without
warnings or compatibility checks, so I want to get my facts straight and
patch the package with some kind of a notice/alert asap, if only through
debian/NEWS... since there isn't currently a way to depend on a
particular kernel series, or even a kernel version <= 4.4.0.

Cheers,
Nicholas
Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
On 4 April 2016 at 12:55, David Sterba wrote:
>>>> Not exactly. If we are using unsafe hash, e.g. MD5, we will use MD5
>>>> only for both in-memory and on-disk backend. No SHA256 again.
>>>
>>> I'm proposing unsafe but fast, which MD5 is not. Look for xxhash or
>>> murmur. As they're both orders of magnitude faster than sha1/md5, we
>>> can actually hash both to reduce the collisions.
>>
>> Don't quite like the idea to use 2 hashes rather than 1.
>> Yes, some programs like rsync use this method, but it also involves a
>> lot of details, like the order to restore them on disk.
>
> I'm considering fast-but-unsafe hashes for the in-memory backend, where
> the speed matters and we cannot hide the slow sha256 calculations
> behind the IO (ie. no point to save microseconds if the IO is going to
> take milliseconds).
>
>>>> In that case, for the MD5 hit case, we will do a full byte-to-byte
>>>> comparison. It may be slow or fast, depending on the cache.
>>>
>>> If the probability of hash collision is low, the number of needed
>>> byte-to-byte comparisons is also low.

It is unlikely that I will use dedupe, but I imagine your work will
apply to the following wishlist:

1. Allow disabling of memory-backend hash via a kernel argument, sysctl,
   or mount option for those of us who have ECC RAM.
   * page_cache never gets pushed to swap, so this should be safe, no?

2. Implementing an intelligent cache so that it's possible to offset the
   cost of hashing the most actively read data. I'm guessing there's
   already some sort of weighted cache eviction algorithm in place, but
   I don't yet know how to look into it, let alone enough to leverage
   it...

   * on the topic of leaning on the cache, I've been thinking about ways
     to optimize reads, while minimizing seeks on multi-spindle raid1
     btrfs volumes. I'm guessing that someone will commit a solution
     before I manage to teach myself enough about filesystems to
     contribute something useful.

That's it, in terms of features I want ;-)

It's probably a well-known fact, but sha512 is roughly 40 to 50% faster
than sha256, and 40 to 50% slower than sha1 on my 1200-series Xeon v3
(Haswell), for 8192 size blocks.

Wish I could do more right now!
Nicholas
Possible Double Freeing of dentry in check_parent_dirs_for_sync
Greetings All,

After some tracing I am not certain if this is correct, due to being
newer to the btrfs codebase. However, if someone more experienced can
show me whether I am missing something in my traces, please let me
know :)

Firstly, here is the bug trace, or the part that matters:

[ 7195.792492] [ cut here ]
[ 7195.792532] WARNING: CPU: 0 PID: 5352 at /home/kernel/COD/linux/fs/btrfs/inode.c:9261 btrfs_destroy_inode+0x247/0x2c0 [btrfs]
[ 7195.792535] Modules linked in: bnep binfmt_misc intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel samsung_laptop kvm irqbypass crct10dif_pclmul crc32_pclmul btusb ghash_clmulni_intel btrtl btbcm btintel cryptd snd_hda_codec_hdmi uvcvideo bluetooth snd_hda_codec_realtek videobuf2_vmalloc snd_hda_codec_generic videobuf2_memops arc4 videobuf2_v4l2 snd_hda_intel input_leds videobuf2_core snd_hda_codec joydev snd_hda_core iwldvm serio_raw snd_hwdep videodev snd_pcm mac80211 media snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device iwlwifi snd_timer cfg80211 snd lpc_ich mei_me soundcore shpchp mei dell_smo8800 mac_hid parport_pc ppdev lp parport autofs4 btrfs xor raid6_pq hid_generic usbhid hid i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect psmouse sysimgblt fb_sys_fops
[ 7195.792593] drm r8169 ahci libahci mii wmi video fjes
[ 7195.792602] CPU: 0 PID: 5352 Comm: aptitude Not tainted 4.6.0-040600rc1-generic #201603261930
[ 7195.792604] Hardware name: SAMSUNG ELECTRONICS CO., LTD. 530U3C/530U4C/SAMSUNG_NP1234567890, BIOS P14AAJ 04/15/2013
[ 7195.792607] 0286 002cb880 8800c91f3d40 813ee833
[ 7195.792611] 8800c91f3d80 8108275b
[ 7195.792614] 242da71863c0 8800209bca58 8800209bca58 880034eda000
[ 7195.792618] Call Trace:
[ 7195.792626] [] dump_stack+0x63/0x90
[ 7195.792631] [] __warn+0xcb/0xf0
[ 7195.792635] [] warn_slowpath_null+0x1d/0x20
[ 7195.792658] [] btrfs_destroy_inode+0x247/0x2c0 [btrfs]
[ 7195.792663] [] destroy_inode+0x3b/0x60
[ 7195.792666] [] evict+0x136/0x1a0
[ 7195.792670] [] iput+0x1ba/0x240
[ 7195.792673] [] __dentry_kill+0x18d/0x1e0
[ 7195.792676] [] dput+0x12b/0x220
[ 7195.792680] [] SyS_rename+0x2f4/0x3c0
[ 7195.792686] [] entry_SYSCALL_64_fastpath+0x1e/0xa8
[ 7195.792689] ---[ end trace e42100b57fd49606 ]---
[ 7464.416637] perf: interrupt took too long (3157 > 3146), lowering kernel.perf_event_max_sample_rate to 63250
[ 9697.609514] perf: interrupt took too long (3950 > 3946), lowering kernel.perf_event_max_sample_rate to 50500

Firstly we start in the btrfs function for renaming, btrfs_rename(),
which in turn calls btrfs_log_new_name(), which in turn calls
check_parent_dirs_for_sync(). This is where I got confused, though, and
wanted a more experienced viewpoint on it:

		parent = dget_parent(parent);
		dput(old_parent);
		old_parent = parent;
		inode = d_inode(parent);
	}

Are we not double freeing and causing a NULL pointer dereference here?

	dput(old_parent);

Sorry for the stupid question :(,
Bastien
Re: [PATCH] [RFC] fix potential access after free: return value of blk_check_plugged() must be used schedule() safe
On Wed, Apr 06 2016, Shaohua Li wrote:
> On Tue, Apr 05, 2016 at 03:36:57PM +0200, Lars Ellenberg wrote:
>> blk_check_plugged() will return a pointer
>> to an object linked on current->plug->cb_list.
>>
>> That list may "at any time" be implicitly cleared by
>>   blk_flush_plug_list()
>>     flush_plug_callbacks()
>> either as a result of blk_finish_plug(),
>> or implicitly by schedule() [and maybe other implicit mechanisms?]
>>
>> If there is no protection against an implicit unplug
>> between the call to blk_check_plugged() and using its return value,
>> that implicit unplug may have already happened,
>> even before the plug is actually initialized or populated,
>> and we may be using a pointer to already free()d data.
>
> This isn't correct. The plug flush is never called in preemption; it is
> designed to be called only when the task is going to sleep. See
> sched_submit_work. Am I missing anything?

Ahh yes, thanks.

Only two places call blk_schedule_flush_plug(). One is
io_schedule_timeout(), which must be called explicitly. The other is, as
you say, sched_submit_work(). It starts:

static inline void sched_submit_work(struct task_struct *tsk)
{
	if (!tsk->state || tsk_is_pi_blocked(tsk))
		return;

so if the task is runnable, then as

include/linux/sched.h:#define TASK_RUNNING	0

it will never call blk_schedule_flush_plug(). So I don't think you are
missing anything; we were.

Lars: have your concerns been relieved, or do you still have reason to
think there is a problem?

Thanks,
NeilBrown
[PATCH] btrfs: qgroup: Fix qgroup accounting when creating snapshot
Current btrfs qgroup design implies a requirement that after calling
btrfs_qgroup_account_extents() there must be a commit root switch.

Normally this is OK, as btrfs_qgroup_account_extents() is only called
inside btrfs_commit_transaction(), just before commit_cowonly_roots().

However there is an exception at create_pending_snapshot(), which will
call btrfs_qgroup_account_extents() without any commit root switch.

In the case of creating a snapshot whose parent root is itself (creating
a snapshot of the fs tree), it will corrupt the qgroup numbers, as the
following trace shows:

(skipped unrelated data)
==
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
==

The problem here is that in the first btrfs_qgroup_account_extent()
call, the nr_new_roots of the extent is 1, which means its reference got
increased, and qgroup increased its rfer and excl.

But in the second btrfs_qgroup_account_extent() call, its reference got
decreased, yet between these two calls there was no switch of commit
roots. This leads to the same nr_old_roots, and this extent just gets
ignored by qgroup, which means the extent is wrongly accounted.

Fix it by calling commit_cowonly_roots() after
btrfs_qgroup_account_extents() in create_pending_snapshot(), with the
needed preparation.
Reported-by: Mark Fasheh
Signed-off-by: Qu Wenruo
---
 fs/btrfs/transaction.c | 66 +-
 1 file changed, 49 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 43885e5..a3eb8ac 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1516,6 +1516,55 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
+	/*
+	 * Account qgroups before insert the dir item
+	 * As such dir item insert will modify parent_root, which could be
+	 * src root. If we don't do it now, wrong accounting may be inherited
+	 * to snapshot qgroup.
+	 *
+	 * For reason locking tree_log_mutex, see btrfs_commit_transaction()
+	 * comment
+	 */
+	mutex_lock(&root->fs_info->tree_log_mutex);
+
+	ret = commit_fs_roots(trans, root);
+	if (ret) {
+		mutex_unlock(&root->fs_info->tree_log_mutex);
+		goto fail;
+	}
+
+	btrfs_apply_pending_changes(root->fs_info);
+
+	ret = btrfs_qgroup_prepare_account_extents(trans, root->fs_info);
+	if (ret < 0) {
+		mutex_unlock(&root->fs_info->tree_log_mutex);
+		goto fail;
+	}
+	ret = btrfs_qgroup_account_extents(trans, root->fs_info);
+	if (ret < 0) {
+		mutex_unlock(&root->fs_info->tree_log_mutex);
+		goto fail;
+	}
+	/*
+	 * Now qgroup are all updated, we can inherit it to new qgroups
+	 */
+	ret = btrfs_qgroup_inherit(trans, fs_info,
+				   root->root_key.objectid,
+				   objectid, pending->inherit);
+	if (ret < 0) {
+		mutex_unlock(&root->fs_info->tree_log_mutex);
+		goto fail;
+	}
+	/*
+	 * qgroup_account_extents() must be followed by a
+	 * switch_commit_roots(), or next qgroup_account_extents() will
+	 * be corrupted
+	 */
+	ret = commit_cowonly_roots(trans, root);
+	mutex_unlock(&root->fs_info->tree_log_mutex);
+	if (ret)
+		goto fail;
+
 	ret = btrfs_insert_dir_item(trans, parent_root,
				    dentry->d_name.name, dentry->d_name.len,
				    parent_inode, ,
@@ -1559,23 +1608,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
-	/*
-	 * account qgroup counters before qgroup_inherit()
-	 */
-	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
-	if (ret)
-		goto fail;
-	ret = btrfs_qgroup_account_extents(trans, fs_info);
-	if (ret)
-		goto fail;
-	ret = btrfs_qgroup_inherit(trans, fs_info,
-				   root->root_key.objectid,
-				   objectid, pending->inherit);
-	if (ret) {
-		btrfs_abort_transaction(trans, root, ret);
-		goto fail;
-	}
-
 fail:
 	pending->error = ret;
 dir_item_existed:
-- 
2.8.0
Re: [PATCH] [RFC] fix potential access after free: return value of blk_check_plugged() must be used schedule() safe
On Tue, Apr 05, 2016 at 03:36:57PM +0200, Lars Ellenberg wrote:
> blk_check_plugged() will return a pointer
> to an object linked on current->plug->cb_list.
>
> That list may "at any time" be implicitly cleared by
>   blk_flush_plug_list()
>     flush_plug_callbacks()
> either as a result of blk_finish_plug(),
> or implicitly by schedule() [and maybe other implicit mechanisms?]
>
> If there is no protection against an implicit unplug
> between the call to blk_check_plugged() and using its return value,
> that implicit unplug may have already happened,
> even before the plug is actually initialized or populated,
> and we may be using a pointer to already free()d data.

This isn't correct. The plug flush is never called in preemption; it is
designed to be called only when the task is going to sleep. See
sched_submit_work. Am I missing anything?

Thanks,
Shaohua
Re: [PATCH] [RFC] fix potential access after free: return value of blk_check_plugged() must be used schedule() safe
On Tue, Apr 05, 2016 at 03:36:57PM +0200, Lars Ellenberg wrote:
> blk_check_plugged() will return a pointer
> to an object linked on current->plug->cb_list.
>
> That list may "at any time" be implicitly cleared by
>   blk_flush_plug_list()
>     flush_plug_callbacks()
> either as a result of blk_finish_plug(),
> or implicitly by schedule() [and maybe other implicit mechanisms?]
>
> If there is no protection against an implicit unplug
> between the call to blk_check_plugged() and using its return value,
> that implicit unplug may have already happened,
> even before the plug is actually initialized or populated,
> and we may be using a pointer to already free()d data.
>
> I suggest that both raid1 and raid10 can easily be fixed
> by moving the call to blk_check_plugged() inside the spinlock.
>
> For md/raid5 and btrfs/raid56,
> I'm unsure how (if) this needs to be fixed.

I think you're right; digging in to see if there's something I missed.
But as Neil said, it looks like we just got saved by preemption being
off by default.

-chris
Re: Scrub priority, am I using it wrong?
Yeah, RAID5. I'm now doing pause and resume on it to let it take
multiple nights; the idle priority should let other processes complete
in reasonable time.

On Wed, Apr 6, 2016 at 3:34 AM, Henk Slager wrote:
> On Tue, Apr 5, 2016 at 4:37 AM, Duncan <1i5t5.dun...@cox.net> wrote:
>> Gareth Pye posted on Tue, 05 Apr 2016 09:36:48 +1000 as excerpted:
>>
>>> I've got a btrfs file system set up on 6 drbd disks running on 2Tb
>>> spinning disks. The server is moderately loaded with various regular
>>> tasks that use a fair bit of disk IO, but I've scheduled my weekly
>>> btrfs scrub for the best quiet time in the week.
>>>
>>> The command that is run is:
>>> /usr/local/bin/btrfs scrub start -Bd -c idle /data
>>>
>>> Which is my best attempt to try and get it to have a low impact on
>>> user operations.
>>>
>>> But iotop shows me:
>>>
>>>  1765 be/4 root  14.84 M/s  0.00 B/s  0.00 % 96.65 % btrfs scrub start -Bd -c idle /data
>>>  1767 be/4 root  14.70 M/s  0.00 B/s  0.00 % 95.35 % btrfs scrub start -Bd -c idle /data
>>>  1768 be/4 root  13.47 M/s  0.00 B/s  0.00 % 92.59 % btrfs scrub start -Bd -c idle /data
>>>  1764 be/4 root  12.61 M/s  0.00 B/s  0.00 % 88.77 % btrfs scrub start -Bd -c idle /data
>>>  1766 be/4 root  11.24 M/s  0.00 B/s  0.00 % 85.18 % btrfs scrub start -Bd -c idle /data
>>>  1763 be/4 root   7.79 M/s  0.00 B/s  0.00 % 63.30 % btrfs scrub start -Bd -c idle /data
>>> 28858 be/4 root   0.00 B/s 810.50 B/s 0.00 % 61.32 % [kworker/u16:25]
>>>
>>> Which doesn't look like an idle priority to me. And the system sure
>>> feels like a system with a lot of heavy io going on. Is there
>>> something I'm doing wrong?
>
> When I see the throughput numbers, it lets me think that the
> filesystem is raid5 or raid6. On single or raid1 or raid10 one easily
> gets around 100M/s without the notice/feeling of heavy IO ongoing,
> mostly independent of scrub options.
>
>> Two points:
>>
>> 1) It appears btrfs scrub start's -c option only takes a numeric
>> class, so try -c3 instead of -c idle.
>
> Thanks to Duncan for pointing this out. I don't remember exactly, but
> I think I also had issues with this in the past, but did not realize
> or have a further look at it.
>
>> Works for me with the numeric class (same results as you with the
>> spelled-out class), tho I'm on ssd with multiple independent btrfs on
>> partitions, the biggest of which is 24 GiB, 18.something GiB used,
>> which scrubs in all of 20 seconds, so I don't need and hadn't tried
>> the -c option at all until now.
>>
>> 2) What a difference an ssd makes!
>>
>> $$ sudo btrfs scrub start -c3 /p
>> scrub started on /p, [...]
>>
>> $$ sudo iotop -obn1
>> Total DISK READ :  626.53 M/s | Total DISK WRITE :  0.00 B/s
>> Actual DISK READ:  596.93 M/s | Actual DISK WRITE:  0.00 B/s
>>  TID  PRIO  USER  DISK READ    DISK WRITE  SWAPIN  IO      COMMAND
>>  872  idle  root  268.40 M/s   0.00 B/s    0.00 %  0.00 %  btrfs scrub start -c3 /p
>>  873  idle  root  358.13 M/s   0.00 B/s    0.00 %  0.00 %  btrfs scrub start -c3 /p
>>
>> CPU bound, 0% IOWait even at idle IO priority, in addition to the
>> hundreds of M/s values per thread/device, here. You OTOH are showing
>> under 20 M/s per thread/device on spinning rust, with an IOWait near
>> 90%, thus making it IO bound.
>
> This low M/s and high IOWait is the kind of behavior I noticed with 3x
> 2TB raid5 when scrubbing or balancing (no bcache activated, kernel
> 4.3.3).

-- 
Gareth Pye - blog.cerberos.id.au
Level 2 MTG Judge, Melbourne, Australia
Re: [PATCH] Btrfs: fix missing s_id setting
On 2016/04/05 17:52, Tsutomu Itoh wrote:
> On 2016/04/05 16:56, Anand Jain wrote:
>> On 04/05/2016 08:08 AM, Tsutomu Itoh wrote:
>>> When fs_devices->latest_bdev is deleted or replaced, sb->s_id is not
>>> updated. As a result, the deleted device name is displayed by
>>> btrfs_printk.
>>>
>>> [before fix]
>>> # btrfs dev del /dev/sdc4 /mnt2
>>> # btrfs dev add /dev/sdb6 /mnt2
>>> [ 217.458249] BTRFS info (device sdc4): found 1 extents
>>> [ 217.695798] BTRFS info (device sdc4): disk deleted /dev/sdc4
>>> [ 217.941284] BTRFS info (device sdc4): disk added /dev/sdb6
>>>
>>> [after fix]
>>> # btrfs dev del /dev/sdc4 /mnt2
>>> # btrfs dev add /dev/sdb6 /mnt2
>>> [ 83.835072] BTRFS info (device sdc4): found 1 extents
>>> [ 84.080617] BTRFS info (device sdc3): disk deleted /dev/sdc4
>>> [ 84.401951] BTRFS info (device sdc3): disk added /dev/sdb6
>>
>> [PATCH 05/13] Btrfs: fix fs logging for multi device
>> any comments ?
>>
>> We would want to maintain the logging prefix as constant, so that
>> troubleshooters with filters/scripts will find it helpful.
>
> I think it is good to make the identifier constant for
> troubleshooting. However, fsid(uuid) is a little long for the print
> purpose, I think. (But an appropriate value isn't found...)
>
> BTW, the state where the deleted device name is set to sb->s_id is not
> good.
Thanks,
Tsutomu

Thanks,
Anand

Signed-off-by: Tsutomu Itoh
---
 fs/btrfs/dev-replace.c |  5 +-
 fs/btrfs/volumes.c     | 11 +-
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index a1d6652..11c4198 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -560,8 +560,11 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	tgt_device->commit_bytes_used = src_device->bytes_used;
 	if (fs_info->sb->s_bdev == src_device->bdev)
 		fs_info->sb->s_bdev = tgt_device->bdev;
-	if (fs_info->fs_devices->latest_bdev == src_device->bdev)
+	if (fs_info->fs_devices->latest_bdev == src_device->bdev) {
 		fs_info->fs_devices->latest_bdev = tgt_device->bdev;
+		snprintf(fs_info->sb->s_id, sizeof(fs_info->sb->s_id), "%pg",
+			 tgt_device->bdev);
+	}
 	list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
 	fs_info->fs_devices->rw_devices++;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e2b54d5..a471385 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1846,8 +1846,12 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 			struct btrfs_device, dev_list);
 	if (device->bdev == root->fs_info->sb->s_bdev)
 		root->fs_info->sb->s_bdev = next_device->bdev;
-	if (device->bdev == root->fs_info->fs_devices->latest_bdev)
+	if (device->bdev == root->fs_info->fs_devices->latest_bdev) {
 		root->fs_info->fs_devices->latest_bdev = next_device->bdev;
+		snprintf(root->fs_info->sb->s_id,
+			 sizeof(root->fs_info->sb->s_id), "%pg",
+			 next_device->bdev);
+	}
 
 	if (device->bdev) {
 		device->fs_devices->open_devices--;
@@ -2034,8 +2038,11 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 			struct btrfs_device, dev_list);
 	if (tgtdev->bdev == fs_info->sb->s_bdev)
 		fs_info->sb->s_bdev = next_device->bdev;
-	if (tgtdev->bdev == fs_info->fs_devices->latest_bdev)
+	if (tgtdev->bdev == fs_info->fs_devices->latest_bdev) {
 		fs_info->fs_devices->latest_bdev = next_device->bdev;
+		snprintf(fs_info->sb->s_id, sizeof(fs_info->sb->s_id), "%pg",
+			 next_device->bdev);
+	}
 	list_del_rcu(&tgtdev->dev_list);
 
 	call_rcu(&tgtdev->rcu, free_device);
Re: [PATCH] [RFC] fix potential access after free: return value of blk_check_plugged() must be used schedule() safe
On Tue, Apr 05 2016, Lars Ellenberg wrote: > blk_check_plugged() will return a pointer > to an object linked on current->plug->cb_list. > > That list may "at any time" be implicitly cleared by > blk_flush_plug_list() > flush_plug_callbacks() > either as a result of blk_finish_plug(), > or implicitly by schedule() [and maybe other implicit mechanisms?] I think the only risk here is preemption, so preempt_disable() / preempt_enable() or as you say a spinlock, is sufficient protection. I would suggest preempt_{dis,en}able for the raid5 code. Maybe for raid1/raid10 too just for consistency. > > If there is no protection against an implicit unplug > between the call to blk_check_plug() and using its return value, > that implicit unplug may have already happened, > even before the plug is actually initialized or populated, > and we may be using a pointer to already free()d data. > > I suggest that both raid1 and raid10 can easily be fixed > by moving the call to blk_check_plugged() inside the spinlock. > > For md/raid5 and btrfs/raid56, > I'm unsure how (if) this needs to be fixed. > > The other current in-tree users of blk_check_plugged() > are mm_check_plugged(), and mddev_check_plugged(). > > mm_check_plugged() is already used safely inside a spinlock. > > with mddev_check_plugged() I'm unsure, at least on a preempt kernel. I think this is only an issue on a preempt kernel, and in that case: yes - mddev_check_plugged() needs protection. Maybe preempt enable/disable could be done in blk_check_plugged() so those calls which don't dereference the pointer don't need further protection. Or maybe blk_check_plugged should have WARN_ON_ONCE(!in_atomic()); > > Did I overlook any magic that protects against such implicit unplug? Just the fortunate lack of preemption probably. > > Also, why pretend that a custom plug struct (such as raid1_plug_cb) > may have its member "struct blk_plug_cb cb" at an arbitrary offset? 
> As it is, raid1_check_plugged() below is actually just a cast.

Fair point. I generally prefer container_of() to casts because it is more obviously correct, and type-checked. However, as blk_check_plugged() performs the allocation, the blk_plug_cb must be at the start of the containing structure, so the complex tests for handling NULL are just noise. I'd be happy for that to be changed.

Thanks,
NeilBrown
Re: Qgroups wrong after snapshot create
On Tue, Apr 05, 2016 at 03:16:54PM -0700, Mark Fasheh wrote:
> On Tue, Apr 05, 2016 at 09:27:01AM +0800, Qu Wenruo wrote:
> > Mark Fasheh wrote on 2016/04/04 16:06 -0700:
> > > Hi,
> > >
> > > Making a snapshot gets us the wrong qgroup numbers. This is very
> > > easy to reproduce. From a fresh btrfs filesystem, simply enable
> > > qgroups and create a snapshot. In this example we have mounted a
> > > newly created fresh filesystem and mounted it at /btrfs:
> > >
> > > # btrfs quota enable /btrfs
> > > # btrfs sub sna /btrfs/ /btrfs/snap1
> > > # btrfs qg show /btrfs
> > >
> > > qgroupid rfer     excl
> > > 0/5      32.00KiB 32.00KiB
> > > 0/257    16.00KiB 16.00KiB
> >
> > Also reproduced it.
> >
> > My first idea is, old snapshot qgroup hack is involved.
> >
> > Unlike btrfs_inc/dec_extent_ref(), snapshotting just use a dirty
> > hack to handle it:
> > Copy rfer from source subvolume, and directly set excl to nodesize.
> >
> > If such work is before adding snapshot inode into src subvolume, it
> > may be the reason causing the bug.
>
> Ok, thanks very much for looking into this Qu.
>
> > > In the example above, the default subvolume (0/5) should read 16KiB
> > > referenced and 16KiB exclusive.
> > >
> > > A rescan fixes things, so we know the rescan process is doing the
> > > math right:
> > >
> > > # btrfs quota rescan /btrfs
> > > # btrfs qgroup show /btrfs
> > > qgroupid rfer     excl
> > > 0/5      16.00KiB 16.00KiB
> > > 0/257    16.00KiB 16.00KiB
> >
> > So the base of qgroup code is not affected, or we may need another
> > painful rework.
>
> Yeah as far as I can tell the core algorithm is fine. We're just
> running the extents incorrectly somehow.

Btw, I should add - my biggest fear was an algorithm change which would have made older versions of btrfsck incompatible. It seems though we can still use it for checking qgroups.
	--Mark

--
Mark Fasheh
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Qgroups wrong after snapshot create
On Tue, Apr 05, 2016 at 09:27:01AM +0800, Qu Wenruo wrote:
> Mark Fasheh wrote on 2016/04/04 16:06 -0700:
> > Hi,
> >
> > Making a snapshot gets us the wrong qgroup numbers. This is very easy
> > to reproduce. From a fresh btrfs filesystem, simply enable qgroups and
> > create a snapshot. In this example we have mounted a newly created
> > fresh filesystem and mounted it at /btrfs:
> >
> > # btrfs quota enable /btrfs
> > # btrfs sub sna /btrfs/ /btrfs/snap1
> > # btrfs qg show /btrfs
> >
> > qgroupid rfer     excl
> > 0/5      32.00KiB 32.00KiB
> > 0/257    16.00KiB 16.00KiB
>
> Also reproduced it.
>
> My first idea is, old snapshot qgroup hack is involved.
>
> Unlike btrfs_inc/dec_extent_ref(), snapshotting just use a dirty hack
> to handle it:
> Copy rfer from source subvolume, and directly set excl to nodesize.
>
> If such work is before adding snapshot inode into src subvolume, it
> may be the reason causing the bug.

Ok, thanks very much for looking into this Qu.

> > In the example above, the default subvolume (0/5) should read 16KiB
> > referenced and 16KiB exclusive.
> >
> > A rescan fixes things, so we know the rescan process is doing the
> > math right:
> >
> > # btrfs quota rescan /btrfs
> > # btrfs qgroup show /btrfs
> > qgroupid rfer     excl
> > 0/5      16.00KiB 16.00KiB
> > 0/257    16.00KiB 16.00KiB
>
> So the base of qgroup code is not affected, or we may need another
> painful rework.

Yeah as far as I can tell the core algorithm is fine. We're just running the extents incorrectly somehow.
> > #              _-----=> irqs-off
> > #             / _----=> need-resched
> > #            | / _---=> hardirq/softirq
> > #            || / _--=> preempt-depth
> > #            ||| /     delay
> > #  TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> > #     | |       |   ||||       |         |
> > btrfs-10233 [001] 260298.823339: btrfs_qgroup_account_extent: bytenr = 29360128, num_bytes = 16384, nr_old_roots = 1, nr_new_roots = 0
> > btrfs-10233 [001] 260298.823342: qgroup_update_counters: qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 16384, excl = 16384
> > btrfs-10233 [001] 260298.823342: qgroup_update_counters: qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 0, excl = 0
> > btrfs-10233 [001] 260298.823343: btrfs_qgroup_account_extent: bytenr = 29720576, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> > btrfs-10233 [001] 260298.823345: btrfs_qgroup_account_extent: bytenr = 29736960, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> > btrfs-10233 [001] 260298.823347: btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
>
> Now, for extent 29786112, its nr_new_roots is 1.
>
> > btrfs-10233 [001] 260298.823347: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
> > btrfs-10233 [001] 260298.823348: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> > btrfs-10233 [001] 260298.823421: btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
>
> Now the problem is here, nr_old_roots should be 1, not 0.
> Just as previous trace shows, we increased extent ref on that extent,
> but now it dropped back to 0.
>
> Since its old_root == new_root == 0, qgroup code doesn't do anything
> on it. If its nr_old_roots is 1, qgroup will drop it's excl/rfer to 0,
> and then accounting may goes back to normal.

Ok, so we're fine with the numbers going to zero so long as it gets back to where it should be. That also explains the 'strange' behavior I saw.

> > btrfs-10233 [001] 260298.823422: btrfs_qgroup_account_extent: bytenr = 29835264, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> > btrfs-10233 [001] 260298.823425: btrfs_qgroup_account_extent: bytenr = 29851648, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
> > btrfs-10233 [001] 260298.823426: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> > btrfs-10233 [001] 260298.823426: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 32768, excl = 32768
> >
> > If you read through the whole log we do some ... interesting.. things
> > - at the start, we *subtract* from qgroup 5, making it's count go to
> > zero. I want to say that this is kind of unexpected for a snapshot
> > create but perhaps there's something I'm missing.
> >
> > Remember that I'm
Re: good documentation on btrfs internals and on disk layout
2016-04-05 11:56 GMT-07:00 Austin S. Hemmelgarn:
> On 2016-04-05 14:36, Yauhen Kharuzhy wrote:
>> 2016-04-05 11:15 GMT-07:00 Austin S. Hemmelgarn:
>>> On 2016-04-05 13:53, Yauhen Kharuzhy wrote:
>>>
>>> In general, it isn't allowed, but we don't explicitly disallow it
>>> either. The worst case here is that the devices both get written two
>>> separately, and you end up with data not matching for correlated
>>> generation ID's. The second scrub in this case shows no errors
>>> because the first one corrects them (even though they are reported as
>>> uncorrectable, which is a bug as far as I can tell), and from what I
>>> can tell from reading the code, it does this by just picking the
>>> highest generation ID and dropping the data from the lower
>>> generation.
>>
>> Hmm... Sounds reasonable but how to detect if filesystem should be
>> checked by scrub after mounting? There is one way as I understand — to
>> check kernel logs after mount for any btrfs errors and this is not a
>> good way for case of some kind of automatic management.
>
> There really isn't any way that I know of. Personally, I just scrub
> all my filesystems shortly after mount, but I also have pretty small
> filesystems (the biggest are 64G) on relatively fast storage. In
> theory, it might be possible to parse the filesystems before mounting
> to check the device generation numbers, but that may be just as
> expensive as just scrubbing the filesystem (and you really should be
> scrubbing somewhat regularly anyway).

Yes, size matters — we have a 96TB array, and scrubbing, rebalancing, replacing etc. can be expensive operations :)

In fact, I implemented some kind of filesystem status reporting in kernel & btrfs-progs already but still have no time to prepare this for sending as patches for commenting. But it will be done soon, I hope.
Re: good documentation on btrfs internals and on disk layout
On 2016-04-05 14:36, Yauhen Kharuzhy wrote:
> 2016-04-05 11:15 GMT-07:00 Austin S. Hemmelgarn:
>> On 2016-04-05 13:53, Yauhen Kharuzhy wrote:
>>> Hello,
>>>
>>> I try to understand btrfs logic in mounting of multi-device
>>> filesystem when device generations are different. All my questions
>>> are related to RAID5/6 for system, metadata, and data case.
>>>
>>> Kernel can mount FS with different device generations (if drive was
>>> physically removed before last unmount and returned back after, for
>>> example) now but scrub will report uncorrectable errors after this
>>> (but second run doesn't show any errors). Does any documentation
>>> about algorithm of multiple device handling in such case exist? Does
>>> the case with different device generations is allowed in general and
>>> what worst cases can be here?
>>
>> In general, it isn't allowed, but we don't explicitly disallow it
>> either. The worst case here is that the devices both get written two
>> separately, and you end up with data not matching for correlated
>> generation ID's. The second scrub in this case shows no errors because
>> the first one corrects them (even though they are reported as
>> uncorrectable, which is a bug as far as I can tell), and from what I
>> can tell from reading the code, it does this by just picking the
>> highest generation ID and dropping the data from the lower generation.
>
> Hmm... Sounds reasonable but how to detect if filesystem should be
> checked by scrub after mounting? There is one way as I understand — to
> check kernel logs after mount for any btrfs errors and this is not a
> good way for case of some kind of automatic management.

There really isn't any way that I know of. Personally, I just scrub all my filesystems shortly after mount, but I also have pretty small filesystems (the biggest are 64G) on relatively fast storage. In theory, it might be possible to parse the filesystems before mounting to check the device generation numbers, but that may be just as expensive as just scrubbing the filesystem (and you really should be scrubbing somewhat regularly anyway).

>>> What should happen if device was removed and returned back after
>>> some time when filesystem is online? Should some kind of device
>>> reopening be possible or one possible way to guarantee FS consistensy
>>> is to mark such device as missing and to replace it?
>>
>> In this case, the device being removed (or some component between the
>> device and the processor failing, or the device itself erroneously
>> reporting failure) will force the FS read-only. If the device
>> reappears while the FS is still online, it may just start working
>> again (this is _really_ rare, and requires that the device appear with
>> the same device node as it had previously, and this usually only
>> happens when the device disappears for only a very short period of
>> time), or it may not work until the FS gets remounted (this is usually
>> the case), or the system may crash (thankfully this almost never
>> happens, and it's usually not because of BTRFS when it does).
>> Regardless of what happens, you may still have to run a scrub to make
>> sure everything is consistent.
>
> So, one right way if we see device reconnected as new block device — is
> to reject it and don't include it in device list again, am I right?
> Existing code tries to 'reconnect' it with new device name but this
> works completely wrong for mounted FS (because btrfs device is renamed
> only, no real device reopening is performed) and I intend to propose
> patch based on Anand's 'global spare' patch series to handle this
> properly.

In an ideal situation, you have nothing using the FS and can unmount, run a device scan, and then remount. In most cases this won't work, and being able to re-add the device via a hot-spare type setup (or even just use device replace on it, which I've done before myself when dealing with filesystems on USB devices, and it works well) would be useful. Ideally, we should have the option to auto-detect such a situation and handle it, but that _really_ needs to be optional (there are just too many things that could go wrong).
Re: good documentation on btrfs internals and on disk layout
2016-04-05 11:15 GMT-07:00 Austin S. Hemmelgarn:
> On 2016-04-05 13:53, Yauhen Kharuzhy wrote:
>> Hello,
>>
>> I try to understand btrfs logic in mounting of multi-device filesystem
>> when device generations are different. All my questions are related to
>> RAID5/6 for system, metadata, and data case.
>>
>> Kernel can mount FS with different device generations (if drive was
>> physically removed before last unmount and returned back after, for
>> example) now but scrub will report uncorrectable errors after this
>> (but second run doesn't show any errors). Does any documentation about
>> algorithm of multiple device handling in such case exist? Does the
>> case with different device generations is allowed in general and what
>> worst cases can be here?
>
> In general, it isn't allowed, but we don't explicitly disallow it
> either. The worst case here is that the devices both get written two
> separately, and you end up with data not matching for correlated
> generation ID's. The second scrub in this case shows no errors because
> the first one corrects them (even though they are reported as
> uncorrectable, which is a bug as far as I can tell), and from what I
> can tell from reading the code, it does this by just picking the
> highest generation ID and dropping the data from the lower generation.

Hmm... Sounds reasonable but how to detect if filesystem should be checked by scrub after mounting? There is one way as I understand — to check kernel logs after mount for any btrfs errors and this is not a good way for case of some kind of automatic management.

>> What should happen if device was removed and returned back after some
>> time when filesystem is online? Should some kind of device reopening
>> be possible or one possible way to guarantee FS consistensy is to mark
>> such device as missing and to replace it?
> In this case, the device being removed (or some component between the
> device and the processor failing, or the device itself erroneously
> reporting failure) will force the FS read-only. If the device reappears
> while the FS is still online, it may just start working again (this is
> _really_ rare, and requires that the device appear with the same device
> node as it had previously, and this usually only happens when the
> device disappears for only a very short period of time), or it may not
> work until the FS gets remounted (this is usually the case), or the
> system may crash (thankfully this almost never happens, and it's
> usually not because of BTRFS when it does). Regardless of what happens,
> you may still have to run a scrub to make sure everything is
> consistent.

So, one right way if we see device reconnected as new block device — is to reject it and don't include it in device list again, am I right? Existing code tries to 'reconnect' it with new device name but this works completely wrong for mounted FS (because btrfs device is renamed only, no real device reopening is performed) and I intend to propose patch based on Anand's 'global spare' patch series to handle this properly.
Re: good documentation on btrfs internals and on disk layout
On 2016-04-05 13:53, Yauhen Kharuzhy wrote:
> Hello,
>
> I try to understand btrfs logic in mounting of multi-device filesystem
> when device generations are different. All my questions are related to
> RAID5/6 for system, metadata, and data case.
>
> Kernel can mount FS with different device generations (if drive was
> physically removed before last unmount and returned back after, for
> example) now but scrub will report uncorrectable errors after this (but
> second run doesn't show any errors). Does any documentation about
> algorithm of multiple device handling in such case exist? Does the case
> with different device generations is allowed in general and what worst
> cases can be here?

In general, it isn't allowed, but we don't explicitly disallow it either. The worst case here is that the devices both get written to separately, and you end up with data not matching for correlated generation ID's. The second scrub in this case shows no errors because the first one corrects them (even though they are reported as uncorrectable, which is a bug as far as I can tell), and from what I can tell from reading the code, it does this by just picking the highest generation ID and dropping the data from the lower generation.

> What should happen if device was removed and returned back after some
> time when filesystem is online? Should some kind of device reopening be
> possible or one possible way to guarantee FS consistensy is to mark
> such device as missing and to replace it?

In this case, the device being removed (or some component between the device and the processor failing, or the device itself erroneously reporting failure) will force the FS read-only. If the device reappears while the FS is still online, it may just start working again (this is _really_ rare, and requires that the device appear with the same device node as it had previously, and this usually only happens when the device disappears for only a very short period of time), or it may not work until the FS gets remounted (this is usually the case), or the system may crash (thankfully this almost never happens, and it's usually not because of BTRFS when it does). Regardless of what happens, you may still have to run a scrub to make sure everything is consistent.
Re: good documentation on btrfs internals and on disk layout
Hello,

I am trying to understand the btrfs logic for mounting a multi-device filesystem when the device generations differ. All my questions relate to the case of RAID5/6 for system, metadata, and data.

The kernel can currently mount an FS with different device generations (if a drive was physically removed before the last unmount and returned afterwards, for example), but scrub will report uncorrectable errors after this (while a second run doesn't show any errors). Does any documentation exist about the algorithm for handling multiple devices in such a case? Is the case of different device generations allowed at all, and what are the worst cases here?

What should happen if a device was removed and returned after some time while the filesystem is online? Should some kind of device reopening be possible, or is the only way to guarantee FS consistency to mark such a device as missing and to replace it?

If any sources of such information exist (except the btrfs code), please point me to them.
Re: Scrub priority, am I using it wrong?
On Tue, Apr 5, 2016 at 4:37 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Gareth Pye posted on Tue, 05 Apr 2016 09:36:48 +1000 as excerpted:
>
>> I've got a btrfs file system set up on 6 drbd disks running on 2Tb
>> spinning disks. The server is moderately loaded with various regular
>> tasks that use a fair bit of disk IO, but I've scheduled my weekly
>> btrfs scrub for the best quiet time in the week.
>>
>> The command that is run is:
>> /usr/local/bin/btrfs scrub start -Bd -c idle /data
>>
>> Which is my best attempt to try and get it to have a low impact on
>> user operations
>>
>> But iotop shows me:
>>
>>  1765 be/4 root 14.84 M/s   0.00 B/s 0.00 % 96.65 % btrfs scrub start -Bd -c idle /data
>>  1767 be/4 root 14.70 M/s   0.00 B/s 0.00 % 95.35 % btrfs scrub start -Bd -c idle /data
>>  1768 be/4 root 13.47 M/s   0.00 B/s 0.00 % 92.59 % btrfs scrub start -Bd -c idle /data
>>  1764 be/4 root 12.61 M/s   0.00 B/s 0.00 % 88.77 % btrfs scrub start -Bd -c idle /data
>>  1766 be/4 root 11.24 M/s   0.00 B/s 0.00 % 85.18 % btrfs scrub start -Bd -c idle /data
>>  1763 be/4 root  7.79 M/s   0.00 B/s 0.00 % 63.30 % btrfs scrub start -Bd -c idle /data
>> 28858 be/4 root  0.00 B/s 810.50 B/s 0.00 % 61.32 % [kworker/u16:25]
>>
>> Which doesn't look like an idle priority to me. And the system sure
>> feels like a system with a lot of heavy io going on. Is there
>> something I'm doing wrong?

When I see the throughput numbers, it lets me think that the filesystem is raid5 or raid6. On single or raid1 or raid10 one easily gets around 100M/s without the notice/feeling of heavy IO ongoing, mostly independent of scrub options.

> Two points:
>
> 1) It appears btrfs scrub start's -c option only takes numeric class,
> so try -c3 instead of -c idle.

Thanks to Duncan for pointing this out. I don't remember exactly, but I think I also had issues with this in the past, but did not realize or have a further look at it.
> Works for me with the numeric class (same results as you with spelled
> out class), tho I'm on ssd with multiple independent btrfs on
> partitions, the biggest of which is 24 GiB, 18.something GiB used,
> which scrubs in all of 20 seconds, so I don't need and hadn't tried
> the -c option at all until now.
>
> 2) What a difference an ssd makes!
>
> $$ sudo btrfs scrub start -c3 /p
> scrub started on /p, [...]
>
> $$ sudo iotop -obn1
> Total DISK READ :     626.53 M/s | Total DISK WRITE :       0.00 B/s
> Actual DISK READ:     596.93 M/s | Actual DISK WRITE:       0.00 B/s
>   TID  PRIO  USER  DISK READ   DISK WRITE  SWAPIN    IO    COMMAND
>   872  idle  root  268.40 M/s  0.00 B/s    0.00 %  0.00 %  btrfs scrub start -c3 /p
>   873  idle  root  358.13 M/s  0.00 B/s    0.00 %  0.00 %  btrfs scrub start -c3 /p
>
> CPU bound, 0% IOWait even at idle IO priority, in addition to the
> hundreds of M/s values per thread/device, here. You OTOH are showing
> under 20 M/s per thread/device on spinning rust, with an IOWait near
> 90%, thus making it IO bound.

This low M/s and high IOWait is the kind of behavior I noticed with 3x 2TB raid5 when scrubbing or balancing (no bcache activated, kernel 4.3.3).
[PATCH] btrfs-progs: device stats: Print devid instead of null
Print e.g. "[devid:4].write_io_errs 6" instead of "[(null)].write_io_errs 6" when a device is missing.

Signed-off-by: Patrik Lundquist
---
 cmds-device.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/cmds-device.c b/cmds-device.c
index b17b6c6..7616c43 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -447,6 +447,13 @@ static int cmd_device_stats(int argc, char **argv)
 
 		canonical_path = canonicalize_path((char *)path);
 
+		/* No path when device is missing. */
+		if (!canonical_path) {
+			canonical_path = malloc(32);
+			snprintf(canonical_path, 32,
+				 "devid:%llu", args.devid);
+		}
+
 		if (args.nr_items >= BTRFS_DEV_STAT_WRITE_ERRS + 1)
 			printf("[%s].write_io_errs %llu\n",
 				canonical_path,
-- 
2.8.0.rc3
[PATCH] [RFC] fix potential access after free: return value of blk_check_plugged() must be used schedule() safe
blk_check_plugged() will return a pointer to an object linked on current->plug->cb_list. That list may "at any time" be implicitly cleared by blk_flush_plug_list() / flush_plug_callbacks(), either as a result of blk_finish_plug(), or implicitly by schedule() [and maybe other implicit mechanisms?].

If there is no protection against an implicit unplug between the call to blk_check_plugged() and using its return value, that implicit unplug may have already happened, even before the plug is actually initialized or populated, and we may be using a pointer to already free()d data.

I suggest that both raid1 and raid10 can easily be fixed by moving the call to blk_check_plugged() inside the spinlock.

For md/raid5 and btrfs/raid56, I'm unsure how (if) this needs to be fixed.

The other current in-tree users of blk_check_plugged() are mm_check_plugged() and mddev_check_plugged().

mm_check_plugged() is already used safely inside a spinlock.

With mddev_check_plugged() I'm unsure, at least on a preempt kernel.

Did I overlook any magic that protects against such implicit unplug?

Also, why pretend that a custom plug struct (such as raid1_plug_cb) may have its member "struct blk_plug_cb cb" at an arbitrary offset? As it is, raid1_check_plugged() below is actually just a cast.
Signed-off-by: Lars Ellenberg
---
 drivers/md/raid1.c  | 19 +--
 drivers/md/raid10.c | 21 +
 drivers/md/raid5.c  |  5 +
 fs/btrfs/raid56.c   |  5 +
 4 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 39fb21e..55dc960 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1044,6 +1044,18 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	kfree(plug);
 }
 
+static struct raid1_plug_cb *raid1_check_plugged(struct mddev *mddev)
+{
+	/* return (struct raid1_plug_cb*)blk_check_plugged(...); */
+	struct blk_plug_cb *cb;
+	struct raid1_plug_cb *plug = NULL;
+
+	cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
+	if (cb)
+		plug = container_of(cb, struct raid1_plug_cb, cb);
+	return plug;
+}
+
 static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 {
 	struct r1conf *conf = mddev->private;
@@ -1060,7 +1072,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 		& (REQ_DISCARD | REQ_SECURE));
 	const unsigned long do_same = (bio->bi_rw & REQ_WRITE_SAME);
 	struct md_rdev *blocked_rdev;
-	struct blk_plug_cb *cb;
 	struct raid1_plug_cb *plug = NULL;
 	int first_clone;
 	int sectors_handled;
@@ -1382,12 +1393,8 @@ read_again:
 
 		atomic_inc(&r1_bio->remaining);
 
-		cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
-		if (cb)
-			plug = container_of(cb, struct raid1_plug_cb, cb);
-		else
-			plug = NULL;
 		spin_lock_irqsave(&conf->device_lock, flags);
+		plug = raid1_check_plugged(mddev);
 		if (plug) {
 			bio_list_add(&plug->pending, mbio);
 			plug->pending_cnt++;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e3fd725..d7d4397 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1052,6 +1052,18 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	kfree(plug);
 }
 
+static struct raid10_plug_cb *raid10_check_plugged(struct mddev *mddev)
+{
+	/* return (struct raid1_plug_cb*)blk_check_plugged(...); */
+	struct blk_plug_cb *cb;
+	struct raid10_plug_cb *plug = NULL;
+
+	cb = blk_check_plugged(raid10_unplug, mddev, sizeof(*plug));
+	if (cb)
+		plug = container_of(cb, struct raid10_plug_cb, cb);
+	return plug;
+}
+
 static void __make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r10conf *conf = mddev->private;
@@ -1066,7 +1078,6 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 	const unsigned long do_same = (bio->bi_rw & REQ_WRITE_SAME);
 	unsigned long flags;
 	struct md_rdev *blocked_rdev;
-	struct blk_plug_cb *cb;
 	struct raid10_plug_cb *plug = NULL;
 	int sectors_handled;
 	int max_sectors;
@@ -1369,14 +1380,8 @@ retry_write:
 
 		atomic_inc(&r10_bio->remaining);
 
-		cb = blk_check_plugged(raid10_unplug, mddev,
-				       sizeof(*plug));
-		if (cb)
-			plug = container_of(cb, struct raid10_plug_cb,
-					    cb);
-		else
-			plug = NULL;
 		spin_lock_irqsave(&conf->device_lock, flags);
+		plug = raid10_check_plugged(mddev);
Re: Another ENOSPC situation
On 2016-04-02 01:43, Chris Murphy wrote:

On Fri, Apr 1, 2016 at 10:55 PM, Duncan <1i5t5.dun...@cox.net> wrote:

Marc Haber posted on Fri, 01 Apr 2016 15:40:29 +0200 as excerpted:

    [4/502]mh@swivel:~$ sudo btrfs fi usage /
    Overall:
        Device size:          600.00GiB
        Device allocated:     600.00GiB
        Device unallocated:     1.00MiB

That's the problem right there. The admin didn't do his job and spot the near full allocation issue

I don't yet agree this is an admin problem. This is the 2nd or 3rd case we've seen only recently where there's plenty of space in all chunk types and yet ENOSPC happens, seemingly only because there's no unallocated space remaining. I don't know that this is a regression for sure, but it sure seems like one.

I personally don't think it's a regression. I've hit this myself before (although I make a point not to anymore, having to jump through hoops to the degree I did to get the FS working again tends to provide a pretty big incentive to not let it happen again), I know a couple of other people who have and never reported it here or on IRC, and I'd be willing to bet that the reason we're seeing it recently is that more 'regular' users (in contrast to system administrators or developers) are using BTRFS, and they tend to be more likely to hit such issues (because they're not as likely to know about them in the first place, let alone how to avoid them).

    Data,single: Size:553.93GiB, Used:405.73GiB
        /dev/mapper/swivelbtr  553.93GiB

    Metadata,DUP: Size:23.00GiB, Used:3.83GiB
        /dev/mapper/swivelbtr   46.00GiB

    System,DUP: Size:32.00MiB, Used:112.00KiB
        /dev/mapper/swivelbtr   64.00MiB

    Unallocated:
        /dev/mapper/swivelbtr    1.00MiB

    [5/503]mh@swivel:~$

Both data and metadata have several GiB free, data ~140 GiB free, and metadata isn't into global reserve, so the system isn't totally wedged, only partially, due to the lack of unallocated space.

Unallocated space alone hasn't ever caused this that I can remember.
It's most often been totally full metadata chunks, with free space in allocated data chunks, but no unallocated space out of which to create another metadata chunk to write out changes.

There should be plenty of space for either a -dusage=1 or -musage=1 balance to free up a bunch of partially allocated chunks. Offhand I don't think the profiles filter is helpful in this case.

OK, so where I could be wrong is that I'm expecting balance doesn't require unallocated space to work. I'd expect that it can COW extents from one chunk into another existing chunk (of the same type) and then, once that's successful, free up that chunk, i.e. revert it back to unallocated. If balance can only copy into newly allocated chunks, that seems like a big problem. I thought that problem had been fixed a very long time ago.

Balance has always allocated new chunks. This is IMHO one of the big issues with the current implementation of it (the other being that it can't be made asynchronous without some creative userspace work). If we aren't converting chunk types and we're on a single-device FS, we should be tail-packing existing chunks before we try to allocate new ones.

And what we don't see from 'usage' that we will see from 'df' is the GlobalReserve values. I'd like to see that.

Anyway, in the meantime there is a workaround: btrfs dev add. Just add a device, even if it's an 8GiB flash drive. But it can be spare space on a partition, or it can be a logical volume, or whatever you want. That'll add some gigs of unallocated space. Now the balance will work, or for absolutely sure there's a bug (and a new one, because this has always worked in the past). After whatever filtered or full balance is done, make sure to 'btrfs dev rem' and confirm it's gone with 'btrfs fi show' before removing the device.
It's a two device volume until that device is successfully removed and is in something of a fragile state until then because any loss of data on that 2nd device has a good chance of face planting the file system. If you can ensure with a relative degree of certainty that you won't lose power or crash, and you have lots of RAM, a small ramdisk (or even zram) works well for this too. I wouldn't use either personally for a critical filesystem (I'd pull out the disk and hook it up internally to another system with spare disk space and handle things there), but both options should work fine. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
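The "plenty of free space, yet ENOSPC" state in the quoted `btrfs fi usage` output can be checked with back-of-the-envelope arithmetic. This is a rough illustration, not btrfs's actual accounting; the figures come from the output quoted earlier in the thread, and the per-device lines for the DUP profiles already reflect that DUP chunks consume twice their logical size on disk:

```python
# Why the 600GiB device has (almost) no unallocated space left,
# even though the data chunks themselves are only ~73% full.
device_size = 600.0                 # GiB
data_raw = 553.93                   # Data,single: raw == logical
metadata_raw = 46.00                # Metadata,DUP: 23GiB logical -> 46GiB raw
system_raw = 64.0 / 1024            # System,DUP: 64MiB raw

allocated = data_raw + metadata_raw + system_raw
unallocated = device_size - allocated

print(f"allocated   ~ {allocated:.2f} GiB")          # essentially the whole device
print(f"unallocated ~ {unallocated * 1024:.0f} MiB") # only a few MiB

# Free space *inside* the allocated chunks is a separate number entirely,
# which is why every chunk type still shows gigabytes free:
data_free = 553.93 - 405.73
print(f"free inside data chunks ~ {data_free:.0f} GiB")
```

Balance needs unallocated space to carve a fresh chunk out of, so with the first number near zero it fails even though the last number is ~148GiB; temporarily adding a device supplies that unallocated space.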
Re: Qgroups wrong after snapshot create
On 04/05/2016 07:06 AM, Mark Fasheh wrote:

Hi,

Making a snapshot gets us the wrong qgroup numbers. This is very easy to reproduce. From a fresh btrfs filesystem, simply enable qgroups and create a snapshot. In this example we have created a fresh filesystem and mounted it at /btrfs:

# btrfs quota enable /btrfs
# btrfs sub sna /btrfs/ /btrfs/snap1
# btrfs qg show /btrfs
qgroupid         rfer         excl
0/5          32.00KiB     32.00KiB
0/257        16.00KiB     16.00KiB

In the example above, the default subvolume (0/5) should read 16KiB referenced and 16KiB exclusive. A rescan fixes things, so we know the rescan process is doing the math right:

# btrfs quota rescan /btrfs
# btrfs qgroup show /btrfs
qgroupid         rfer         excl
0/5          16.00KiB     16.00KiB
0/257        16.00KiB     16.00KiB

The last kernel to get this right was v4.1:

# uname -r
4.1.20
# btrfs quota enable /btrfs
# btrfs sub sna /btrfs/ /btrfs/snap1
Create a snapshot of '/btrfs/' in '/btrfs/snap1'
# btrfs qg show /btrfs
qgroupid         rfer         excl
0/5          16.00KiB     16.00KiB
0/257        16.00KiB     16.00KiB

Which leads me to believe that this was a regression introduced by Qu's rewrite, as that is the biggest change to qgroups during that development period. Going back to upstream, I applied my tracing patch from this list (http://thread.gmane.org/gmane.comp.file-systems.btrfs/54685), with a couple of changes - I'm printing the rfer/excl bytecounts in qgroup_update_counters AND I print them twice - once before we make any changes and once after the changes.
If I enable tracing in btrfs_qgroup_account_extent and qgroup_update_counters just before the snapshot creation, we get the following trace:

# btrfs quota enable /btrfs
# echo 1 > /sys/kernel/debug/tracing/events/btrfs/btrfs_qgroup_account_extent/enable
# echo 1 > /sys/kernel/debug/tracing/events/btrfs/qgroup_update_counters/enable
# btrfs sub sna /btrfs/ /btrfs/snap2
Create a snapshot of '/btrfs/' in '/btrfs/snap2'
# btrfs qg show /btrfs
qgroupid         rfer         excl
0/5          32.00KiB     32.00KiB
0/257        16.00KiB     16.00KiB

fstest1:~ # cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 13/13   #P:2
#
# (standard ftrace column legend elided)
#    TASK-PID   CPU#        TIMESTAMP  FUNCTION
btrfs-10233 [001] 260298.823339: btrfs_qgroup_account_extent: bytenr = 29360128, num_bytes = 16384, nr_old_roots = 1, nr_new_roots = 0
btrfs-10233 [001] 260298.823342: qgroup_update_counters: qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 16384, excl = 16384
btrfs-10233 [001] 260298.823342: qgroup_update_counters: qgid = 5, cur_old_count = 1, cur_new_count = 0, rfer = 0, excl = 0
btrfs-10233 [001] 260298.823343: btrfs_qgroup_account_extent: bytenr = 29720576, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
btrfs-10233 [001] 260298.823345: btrfs_qgroup_account_extent: bytenr = 29736960, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
btrfs-10233 [001] 260298.823347: btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
btrfs-10233 [001] 260298.823347: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
btrfs-10233 [001] 260298.823348: qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
btrfs-10233 [001] 260298.823421: btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0

I think I caught the direct cause.
Here, bytenr 29786112 got quota-dirtied twice: once for the COW to a new block (although I don't see why it got COWed here, as it is COWed again later), and once for the de-ref. But the problem is that qgroup accounting is run twice inside the same transaction. The current qgroup design depends heavily on the transaction to get correct old and new roots: it always gets old roots from the *COMMIT ROOT*, and gets new roots from the current root just before *SWITCHING THE COMMIT ROOT*. However, in this case we got new roots in the middle of a transaction and updated the qgroup accounting there. This means we get a wrong new_roots for an extent which will not be committed to disk. In this case, that is bytenr 29786112. It doesn't reach disk, so it's not on disk in either the transaction before or after the snapshot. Normally, its old and new roots should both
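Mark's tracing patch prints rfer/excl before and after each update, so each qgroup_update_counters pair in the trace is one accounting step. For the single-qgroup case seen here, that step can be modeled roughly as follows. This is a deliberate simplification: the real routine in fs/btrfs/qgroup.c handles shared-vs-exclusive transitions across many qgroups, and only the field names mirror the trace:

```python
def qgroup_update_counters(qgroup, num_bytes, cur_old_count, cur_new_count):
    """Toy model of one qgroup_update_counters() step from the trace.

    cur_old_count: how many of the extent's *old* roots fall under this qgroup;
    cur_new_count: how many of its *new* roots do.
    Only the single-root transitions visible in the trace are modeled.
    """
    if cur_old_count and not cur_new_count:      # extent dropped from the qgroup
        qgroup["rfer"] -= num_bytes
        qgroup["excl"] -= num_bytes
    elif cur_new_count and not cur_old_count:    # extent newly counted
        qgroup["rfer"] += num_bytes
        qgroup["excl"] += num_bytes
    return qgroup

qg5 = {"rfer": 16384, "excl": 16384}
# 260298.823339: bytenr 29360128, old_roots=1, new_roots=0 -> drops to 0
qgroup_update_counters(qg5, 16384, 1, 0)
# 260298.823347: bytenr 29786112, old_roots=0, new_roots=1 -> back to 16384
qgroup_update_counters(qg5, 16384, 0, 1)
print(qg5)  # {'rfer': 16384, 'excl': 16384}
# The bug Qu describes: the +16384 for 29786112 was computed mid-transaction
# against a root that never reaches disk, so these numbers no longer match
# what commit-time accounting later sees (0/5 ends up at 32KiB, not 16KiB).
```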
Re: Scrub priority, am I using it wrong?
On 2016-04-05 00:19, Duncan wrote:

Gareth Pye posted on Tue, 05 Apr 2016 13:44:05 +1000 as excerpted:

On Tue, Apr 5, 2016 at 12:37 PM, Duncan <1i5t5.dun...@cox.net> wrote:

1) It appears btrfs scrub start's -c option only takes the numeric class, so try -c3 instead of -c idle.

Does it count as a bug if it silently accepts the way I was doing it? I've switched to -c3 and at least now the class listed in iotop is idle, so I hope that means it will be more friendly to other processes.

I'd say yes, particularly given that the requirement that the value be the numeric class isn't documented in the manpage at all. Whether the bug is then one of documentation (say it must be numeric) or of implementation (take the class name as well) is then up for debate. I'd call fixing either one a fix. If it must be numeric, document that (and optionally change the implementation to error in some way if a numeric parameter isn't supplied for -c); otherwise change the implementation so the name can be taken as well (and optionally change the documentation to explicitly mention that either one can be used). Doesn't matter to me which.

In general, I'd say:
1. The documentation needs to be updated. Even if it's intended to accept class names eventually, it should match the implementation.
2. The implementation needs error checking added. At the very least, it should say something to the effect of "Hey, this doesn't work the way you appear to think it does.", and possibly bail.

That said, I don't think there's a huge amount of value in adding the ability to accept symbolic names. Nothing that I know of accepts them on the command line, because most things pass the integer directly to the syscall, thus allowing people to use new classes (if any are ever introduced) with old tools.
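For reference, the numeric classes that `-c` expects are the kernel's I/O priority classes (1 = realtime, 2 = best-effort, 3 = idle), the same integers ionice uses. A sketch of how tools pack the class and per-class level into the value handed to the ioprio_set(2) syscall; the shift constant mirrors the kernel's IOPRIO_CLASS_SHIFT:

```python
# I/O priority classes as used by ionice and `btrfs scrub start -c`:
IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE = 0, 1, 2, 3

IOPRIO_CLASS_SHIFT = 13  # the class lives in the top bits of the ioprio value

def ioprio_value(ioprio_class, data=0):
    """Pack class + priority level the way the ioprio_set(2) ABI does."""
    return (ioprio_class << IOPRIO_CLASS_SHIFT) | data

# scrub's `-c3` corresponds to the idle class:
print(ioprio_value(IOPRIO_CLASS_IDLE, 0))   # 24576
# best-effort at level 4, a common default:
print(ioprio_value(IOPRIO_CLASS_BE, 4))     # 16388
```

Since the integer is passed straight through to the syscall, `-c idle` silently parsing as 0 (IOPRIO_CLASS_NONE) rather than erroring is exactly the missing-validation bug discussed above.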
Re: [PATCH] Btrfs: fix missing s_id setting
On 2016/04/05 16:56, Anand Jain wrote:

On 04/05/2016 08:08 AM, Tsutomu Itoh wrote:

When fs_devices->latest_bdev is deleted or replaced, sb->s_id is not updated. As a result, the deleted device name is displayed by btrfs_printk.

[before fix]
# btrfs dev del /dev/sdc4 /mnt2
# btrfs dev add /dev/sdb6 /mnt2
[  217.458249] BTRFS info (device sdc4): found 1 extents
[  217.695798] BTRFS info (device sdc4): disk deleted /dev/sdc4
[  217.941284] BTRFS info (device sdc4): disk added /dev/sdb6

[after fix]
# btrfs dev del /dev/sdc4 /mnt2
# btrfs dev add /dev/sdb6 /mnt2
[   83.835072] BTRFS info (device sdc4): found 1 extents
[   84.080617] BTRFS info (device sdc3): disk deleted /dev/sdc4
[   84.401951] BTRFS info (device sdc3): disk added /dev/sdb6

[PATCH 05/13] "Btrfs: fix fs logging for multi device" - any comments? We would want to keep the logging prefix constant, so that troubleshooters with filters/scripts will find it helpful.

I agree it is good to make the identifier constant for troubleshooting. However, the fsid (UUID) is a little long for printing, I think. (But I can't find a more appropriate value...)
Thanks, Tsutomu

Thanks, Anand

Signed-off-by: Tsutomu Itoh
---
 fs/btrfs/dev-replace.c |  5 ++++-
 fs/btrfs/volumes.c     | 11 +++++++++--
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index a1d6652..11c4198 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -560,8 +560,11 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	tgt_device->commit_bytes_used = src_device->bytes_used;
 	if (fs_info->sb->s_bdev == src_device->bdev)
 		fs_info->sb->s_bdev = tgt_device->bdev;
-	if (fs_info->fs_devices->latest_bdev == src_device->bdev)
+	if (fs_info->fs_devices->latest_bdev == src_device->bdev) {
 		fs_info->fs_devices->latest_bdev = tgt_device->bdev;
+		snprintf(fs_info->sb->s_id, sizeof(fs_info->sb->s_id), "%pg",
+			 tgt_device->bdev);
+	}
 	list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
 	fs_info->fs_devices->rw_devices++;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e2b54d5..a471385 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1846,8 +1846,12 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 			     struct btrfs_device, dev_list);
 	if (device->bdev == root->fs_info->sb->s_bdev)
 		root->fs_info->sb->s_bdev = next_device->bdev;
-	if (device->bdev == root->fs_info->fs_devices->latest_bdev)
+	if (device->bdev == root->fs_info->fs_devices->latest_bdev) {
 		root->fs_info->fs_devices->latest_bdev = next_device->bdev;
+		snprintf(root->fs_info->sb->s_id,
+			 sizeof(root->fs_info->sb->s_id), "%pg",
+			 next_device->bdev);
+	}
 	if (device->bdev) {
 		device->fs_devices->open_devices--;
@@ -2034,8 +2038,11 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 			     struct btrfs_device, dev_list);
 	if (tgtdev->bdev == fs_info->sb->s_bdev)
 		fs_info->sb->s_bdev = next_device->bdev;
-	if (tgtdev->bdev == fs_info->fs_devices->latest_bdev)
+	if (tgtdev->bdev == fs_info->fs_devices->latest_bdev) {
 		fs_info->fs_devices->latest_bdev = next_device->bdev;
+		snprintf(fs_info->sb->s_id, sizeof(fs_info->sb->s_id), "%pg",
+			 next_device->bdev);
+	}
 	list_del_rcu(&tgtdev->dev_list);
 	call_rcu(&tgtdev->rcu, free_device);
Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning
On Tue, Apr 05, 2016 at 09:28:31AM +0800, Qu Wenruo wrote:
> 
> David Sterba wrote on 2016/04/04 13:18 +0200:
> > On Fri, Apr 01, 2016 at 04:50:06PM +0800, Qu Wenruo wrote:
> >>> After another look, why don't we use nodesize directly? Or stripesize
> >>> where it applies. With max_size == 0 the test does not make sense, we ought
> >>> to know the alignment.
> >>
> >> Yes, my first thought was also to use nodesize directly, which should
> >> always be correct.
> >>
> >> But the problem is, the related function call stack doesn't have any
> >> member to reach btrfs_root or btrfs_fs_info.
> >
> > JFYI, there's global_info available, so it's not necessary to pass
> > fs_info down the callstacks.
> 
> Oh, that's good news.
> 
> Do I need to re-submit the patch to use fs_info->tree_root->nodesize to
> avoid the false alert? Or wait for your refactor?

No need to resend, the refactored code is now in devel.
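The alignment test under discussion boils down to checking a metadata extent's start and size against the tree nodesize. A sketch of the idea, not the actual btrfs-progs code; the sample bytenr is borrowed from the qgroup trace earlier in this digest, and 16KiB is the mkfs default nodesize:

```python
def is_aligned(value, nodesize):
    """A metadata extent's bytenr and size should be multiples of nodesize."""
    return value % nodesize == 0

NODESIZE = 16384  # default nodesize for mkfs.btrfs since btrfs-progs 3.12

print(is_aligned(29360128, NODESIZE))         # True: a real tree-block address
print(is_aligned(29360128 + 4096, NODESIZE))  # False: would be a bogus extent
```

Checking against the filesystem's real nodesize (via global_info, per David's note) rather than a guessed max_size is what eliminates the false alert.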
Re: [PATCH] btrfs-progs: fix re-declared get_device_info()
On Tue, Apr 05, 2016 at 03:36:01PM +0800, Anand Jain wrote:
> The other get_device_info() is in the same file, 4 lines above.
> 
> Signed-off-by: Anand Jain

Applied, thanks.
BTRFS warnings in dmesg / scrub errors - Transaction aborted ( error -95 )
Hi, I have a btrfs partition which exhibits some strange behaviours:
- btrfs scrub seems to force the filesystem into read-only mode
- transactions aborted with error -95

For context, this is a partition converted from ext4 which 'survived' through the issues described at http://www.spinics.net/lists/linux-btrfs/msg44660.html

The errors that I receive at runtime, during regular usage, are:

[   49.297851] ------------[ cut here ]------------
[   49.297905] WARNING: CPU: 0 PID: 72 at ../fs/btrfs/inode.c:2937 btrfs_finish_ordered_io+0x5f5/0x650 [btrfs]()
[   49.297911] BTRFS warning (device sda1): btrfs_finish_ordered_io:2937: Aborting unused transaction(unknown).
[   49.297916] BTRFS: Transaction aborted (error -95)
[   49.297919] Modules linked in: bnep af_packet vboxpci(O) vboxnetadp(O) vboxnetflt(O) ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_pkttype nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_raw xt_CT iptable_filter vboxdrv(O) ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables msr xfs libcrc32c snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 raid1 joydev intel_rapl drm_kms_helper fb_sys_fops x86_pkg_temp_thermal syscopyarea intel_powerclamp sysfillrect sysimgblt i2c_algo_bit coretemp kvm_intel i2c_designware_platform snd_hda_intel i2c_designware_core 8250_dw md_mod ppdev kvm snd_hda_codec
[   49.297991] snd_hda_core irqbypass snd_seq crct10dif_pclmul snd_hwdep crc32_pclmul ghash_clmulni_intel aesni_intel snd_pcm aes_x86_64 hci_uart lrw gf128mul btbcm glue_helper ablk_helper cryptd snd_seq_device pcspkr r8169 btqca i2c_i801 btintel snd_timer mii bluetooth parport_pc dm_mod rfkill battery 8250_fintek parport snd idma64 virt_dma pinctrl_sunrisepoint pinctrl_intel intel_lpss_acpi soundcore mei_me intel_lpss_pci intel_lpss mei acpi_als mfd_core kfifo_buf industrialio shpchp tpm_tis thermal acpi_pad fan tpm fjes hid_generic hid_logitech_hidpp hid_logitech_dj usbhid btrfs xor uas usb_storage mxm_wmi raid6_pq crc32c_intel serio_raw nvidia_uvm(PO) nvidia(PO) sr_mod cdrom xhci_pci xhci_hcd usbcore usb_common drm i2c_hid wmi video button sg
[   49.298081] CPU: 0 PID: 72 Comm: kworker/u8:4 Tainted: P O 4.5.0-2-default #1
[   49.298084] Hardware name: MSI MS-7971/H170A PC MATE (MS-7971), BIOS B.60 02/23/2016
[   49.298131] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[   49.298135] 88045b837c98 81395921 88045b837ce0
[   49.298142] a0d096be 88045b837cd0 8107d912 880089ab85c8
[   49.298147] 88045b9c5000 ffa1 88045dea5690 880459efd0c0
[   49.298153] Call Trace:
[   49.298170] [] try_stack_unwind+0x175/0x190
[   49.298185] [] dump_trace+0x69/0x3a0
[   49.298193] [] show_trace_log_lvl+0x4b/0x60
[   49.298201] [] show_stack_log_lvl+0x10c/0x180
[   49.298208] [] show_stack+0x25/0x50
[   49.298218] [] dump_stack+0x63/0x82
[   49.298230] [] warn_slowpath_common+0x82/0xc0
[   49.298240] [] warn_slowpath_fmt+0x4c/0x50
[   49.298281] [] btrfs_finish_ordered_io+0x5f5/0x650 [btrfs]
[   49.298326] [] finish_ordered_fn+0x15/0x20 [btrfs]
[   49.298369] [] btrfs_scrubparity_helper+0xcb/0x350 [btrfs]
[   49.298413] [] btrfs_endio_write_helper+0xe/0x10 [btrfs]
[   49.298421] [] process_one_work+0x15c/0x4b0
[   49.298428] [] worker_thread+0x48/0x4e0
[   49.298436] [] kthread+0xc9/0xe0
[   49.298445] [] ret_from_fork+0x3f/0x70
[   49.302378] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70
[   49.302383] Leftover inexact backtrace:
[   49.302392] [] ? kthread_worker_fn+0x180/0x180
[   49.302488] ---[ end trace a383d63cdc6dbae3 ]---
[   49.302502] BTRFS warning (device sda1): btrfs_finish_ordered_io:2937: Aborting unused transaction(unknown).
I also recently got an error related to scrubbing which (IIRC) caused the fs to become read-only:

Apr 05 10:13:06 mars kernel: ------------[ cut here ]------------
Apr 05 10:13:06 mars kernel: WARNING: CPU: 0 PID: 72 at ../fs/btrfs/inode.c:2937 btrfs_finish_ordered_io+0x5f5/0x650 [btrfs]()
Apr 05 10:13:06 mars kernel: BTRFS warning (device sda1): btrfs_finish_ordered_io:2937: Aborting unused transaction(unknown).
Apr 05 10:13:06 mars kernel: BTRFS: Transaction aborted (error -95)
Apr 05 10:13:06 mars kernel: Modules linked in: bnep af_packet vboxpci(O) vboxnetadp(O) vboxnetflt(O) ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT nf_reject_ipv
Apr 05 10:13:06 mars kernel: snd_hda_core irqbypass snd_seq crct10dif_pclmul snd_hwdep crc32_pclmul ghash_clmulni_intel aesni_intel snd_pcm aes_x86_64 hci_uart lrw gf128mul btbcm glue_helper ablk_helper cryptd
Apr 05 10:13:06 mars kernel: CPU: 0 PID: 72 Comm: kworker/u8:4 Tainted: P O 4.5.0-2-default #1
Apr 05 10:13:06 mars kernel: Hardware name:
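The -95 in "Transaction aborted (error -95)" is a negated errno value; on Linux it maps to EOPNOTSUPP ("Operation not supported"), which is consistent with the filesystem carrying structures from an ext4 conversion that some code path refuses to handle. The mapping can be checked from any Linux box:

```python
import errno
import os

# Kernel error codes in log messages are negated errno values.
code = 95
assert errno.EOPNOTSUPP == code          # true on Linux
print(os.strerror(code))                 # on Linux: 'Operation not supported'
```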
[PATCH] btrfs-progs: fix re-declared get_device_info()
The other get_device_info() is in the same file, 4 lines above.

Signed-off-by: Anand Jain
---
 utils.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/utils.h b/utils.h
index ea54692aca86..e2ff41ad4ec1 100644
--- a/utils.h
+++ b/utils.h
@@ -187,8 +187,6 @@ int get_device_info(int fd, u64 devid,
 		    struct btrfs_ioctl_dev_info_args *di_args);
 int test_uuid_unique(char *fs_uuid);
 u64 disk_size(const char *path);
-int get_device_info(int fd, u64 devid,
-		    struct btrfs_ioctl_dev_info_args *di_args);
 u64 get_partition_size(const char *dev);
 int test_minimum_size(const char *file, u32 leafsize);
-- 
2.7.0
Re: btrfsck: backpointer mismatch (and multiple other errors)
On Mon, 4 Apr 2016 17:09:14 -0600, Chris Murphy wrote:
> > Why couldn't/shouldn't I remove snapshots before detaching the seed
> > device? I want to keep them on the seed but they are useless to me
> > on the sprout.
> 
> You can remove snapshots before or after detaching the seed device, it
> doesn't matter, but such snapshot removal only affects the sprout. You
> wrote:
> 
> "remove all left-over snapshots from the seed"
> 
> The seed is read only, you can't modify the contents of the seed
> device.

Sorry, not a native speaker... What I actually meant was to remove the snapshots that originated from the seed, and which I don't need in the sprout.

-- 
Regards, Kai
Replies to list-only preferred.