Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
For raid5 it's different. No single chunks are created while copying files to a degraded volume. And the scrub produces very noisy kernel messages. Looks like there's a message for each missing block (or stripe?), thousands per file. And also many uncorrectable errors like this:

[267466.792060] f23s.localdomain kernel: BTRFS error (device dm-8): unable to fixup (regular) error at logical 3760582656 on dev /dev/dm-7
[267467.508588] f23s.localdomain kernel: scrub_handle_errored_block: 401 callbacks suppressed

[root@f23s ~]# btrfs scrub start /mnt/1/
ERROR: there are uncorrectable errors
[root@f23s ~]# btrfs scrub status /mnt/1/
scrub status for 51e1efb0-7df3-44d5-8716-9ed4bdadc93e
	scrub started at Fri Apr  8 14:35:25 2016 and finished after 00:11:26
	total bytes scrubbed: 3.21GiB with 45186 errors
	error details: read=95 super=2 verify=8 csum=45081
	corrected errors: 44935, uncorrectable errors: 249, unverified errors: 0

Subsequent balance and scrub have no messages at all. So... uncorrectable? Really? That's confusing.

FYI, a scrub with no errors takes 4m24s, but with the same data and 1/2 of it needing to be rebuilt during the scrub it took 16m4s, so about 4x longer to reconstruct. Seems excessive.

Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/13] btrfs: introduce helper functions to perform hot replace
On Sat, Apr 02, 2016 at 09:30:48AM +0800, Anand Jain wrote:
> Hot replace / auto replace is an important volume manager feature
> and is critical to data center operations, so that the degraded
> volume can be brought back to a healthy state at the earliest and
> without manual intervention.
>
> This modifies the existing replace code to suit the needs of auto
> replace; in the long run I hope both code paths can be merged.
>
> Signed-off-by: Anand Jain
> Tested-by: Austin S. Hemmelgarn
> ---
>  fs/btrfs/dev-replace.c | 43 +++
>  fs/btrfs/dev-replace.h |  1 +
>  2 files changed, 44 insertions(+)
>
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 2b926867d136..ceab4c51db32 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -957,3 +957,46 @@ void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info)
> 			&fs_info->fs_state));
> 	}
> }
> +
> +int btrfs_auto_replace_start(struct btrfs_root *root,
> +				struct btrfs_device *src_device)
> +{
> +	int ret;
> +	char *tgt_path;
> +	char *src_path;
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +
> +	if (fs_info->sb->s_flags & MS_RDONLY)
> +		return -EROFS;
> +
> +	btrfs_dev_replace_lock(&fs_info->dev_replace, 0);
> +	if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace)) {
> +		btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
> +		return -EBUSY;
> +	}
> +	btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
> +
> +	if (btrfs_get_spare_device(&tgt_path)) {
> +		btrfs_err(root->fs_info,
> +			"No spare device found/configured in the kernel");
> +		return -EINVAL;
> +	}
> +
> +	rcu_read_lock();
> +	src_path = kstrdup(rcu_str_deref(src_device->name), GFP_ATOMIC);
> +	rcu_read_unlock();
> +	if (!src_path) {
> +		kfree(tgt_path);
> +		return -ENOMEM;
> +	}
> +	ret = btrfs_dev_replace_start(root, tgt_path,
> +			src_device->devid, src_path,
> +			BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID);
> +	if (ret)
> +		btrfs_put_spare_device(tgt_path);
> +
> +	kfree(tgt_path);
> +	kfree(src_path);
> +
> +	return 0;
> +}

Without the fs_info->mutually_exclusive_operation_running flag set in btrfs_auto_replace_start(), device add/remove/balance etc. can be started in parallel with auto replace. Should these scenarios be permitted?

> diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
> index e922b42d91df..b918b9d6e5df 100644
> --- a/fs/btrfs/dev-replace.h
> +++ b/fs/btrfs/dev-replace.h
> @@ -46,4 +46,5 @@ static inline void btrfs_dev_replace_stats_inc(atomic64_t *stat_value)
>  {
>  	atomic64_inc(stat_value);
>  }
> +int btrfs_auto_replace_start(struct btrfs_root *root, struct btrfs_device *src_device);
>  #endif
> --
> 2.7.0

--
Yauhen Kharuzhy
[GIT PULL] Btrfs
Hi Linus,

We have some fixes queued up in my for-linus-4.6 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.6

These are bug fixes, including a really old fsync bug, and a few trace points to help us track down problems in the quota code.

Mark Fasheh (2) commits (+129/-23):
    btrfs: handle non-fatal errors in btrfs_qgroup_inherit() (+32/-22)
    btrfs: Add qgroup tracing (+97/-1)

Filipe Manana (1) commits (+137/-0):
    Btrfs: fix file/data loss caused by fsync after rename and new inode

Liu Bo (1) commits (+1/-0):
    Btrfs: fix invalid reference in replace_path

Yauhen Kharuzhy (1) commits (+2/-0):
    btrfs: Reset IO error counters before start of device replacing

Qu Wenruo (1) commits (+19/-2):
    btrfs: Output more info for enospc_debug mount option

Davide Italiano (1) commits (+6/-3):
    Btrfs: Improve FL_KEEP_SIZE handling in fallocate

Josef Bacik (1) commits (+1/-1):
    Btrfs: don't use src fd for printk

David Sterba (1) commits (+8/-4):
    btrfs: fallback to vmalloc in btrfs_compare_tree

Total: (9) commits (+303/-33)

 fs/btrfs/ctree.c             |  12 ++--
 fs/btrfs/dev-replace.c       |   2 +
 fs/btrfs/extent-tree.c       |  21 ++-
 fs/btrfs/file.c              |   9 ++-
 fs/btrfs/ioctl.c             |   2 +-
 fs/btrfs/qgroup.c            |  63 +---
 fs/btrfs/relocation.c        |   1 +
 fs/btrfs/tree-log.c          | 137 +++
 include/trace/events/btrfs.h |  89 +++-
 9 files changed, 303 insertions(+), 33 deletions(-)
Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
On Fri, Apr 8, 2016 at 1:27 PM, Austin S. Hemmelgarn wrote:
> On 2016-04-08 14:30, Chris Murphy wrote:
>>
>> On Fri, Apr 8, 2016 at 12:18 PM, Austin S. Hemmelgarn wrote:
>>>
>>> On 2016-04-08 14:05, Chris Murphy wrote:
>>>>
>>>> On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn wrote:
>>>>
>>>>> I entirely agree. If the fix doesn't require any kind of decision to be
>>>>> made other than whether to fix it or not, it should be trivially fixable
>>>>> with the tools. TBH though, this particular issue with devices disappearing
>>>>> and reappearing could be fixed easier in the block layer (at least, there
>>>>> are things that need to be fixed WRT it in the block layer).
>>>>
>>>> Another feature needed for transient failures with large storage, is
>>>> some kind of partial scrub, along the lines of md partial resync when
>>>> there's a bitmap write intent log.
>>>
>>> In this case, I would think the simplest way to do this would be to have
>>> scrub check if generation matches and not further verify anything that does
>>> (I think we might be able to prune anything below objects whose generation
>>> matches, but I'm not 100% certain about how writes cascade up the trees). I
>>> hadn't really thought about this before, but now that I do, it kind of
>>> surprises me that we don't have something to do this.
>>
>> And I need to better qualify this: this scrub (or balance) needs to be
>> initiated automatically, perhaps with some reasonable delay after the
>> block layer informs Btrfs the missing device has reappeared. Both the
>> requirement of a full scrub as well as it being a manual scrub, are
>> pretty big gotchas.
>
> We would still ideally want some way to initiate it manually because:
> 1. It would make it easier to test.
> 2. We should have a way to do it on filesystems that have been reassembled
> after a reboot, not just ones that got the device back in the same boot (or
> it was missing on boot and then appeared).
I'm OK with a mount option, 'autoraidfixup' (not a proposed name!), that permits the mechanism to happen, but which isn't yet the default. However, one day I think it should be, because right now we already allow mounts of devices with different generations and there is no message indicating this at all, even though the superblocks clearly show a discrepancy in generation.

mount with one device missing:
[264466.609093] BTRFS: has skinny extents
[264912.547199] BTRFS info (device dm-6): disk space caching is enabled
[264912.547267] BTRFS: has skinny extents
[264912.606266] BTRFS: failed to read chunk tree on dm-6
[264912.621829] BTRFS: open_ctree failed

mount -o degraded:
[264953.758518] BTRFS info (device dm-6): allowing degraded mounts
[264953.758794] BTRFS info (device dm-6): disk space caching is enabled
[264953.759055] BTRFS: has skinny extents

copy 800MB file, umount, lvchange -ay, mount:
[265082.859201] BTRFS info (device dm-6): disk space caching is enabled
[265082.859474] BTRFS: has skinny extents

btrfs scrub start:
[265260.024267] BTRFS error (device dm-6): bdev /dev/dm-7 errs: wr 0, rd 0, flush 0, corrupt 0, gen 1

# btrfs scrub status /mnt/1
scrub status for b01b3922-4012-4de1-af42-63f5b2f68fc3
	scrub started at Fri Apr  8 14:01:41 2016 and finished after 00:00:18
	total bytes scrubbed: 1.70GiB with 1 errors
	error details: super=1
	corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

After scrubbing and fixing everything and zeroing out the counters, if I fail the device again, I can no longer mount degraded:

[265502.432444] BTRFS: missing devices(1) exceeds the limit(0), writeable mount is not allowed

because of this nonsense:

[root@f23s ~]# btrfs fi df /mnt/1
Data, RAID1: total=1.00GiB, used=458.06MiB
Data, single: total=1.00GiB, used=824.00MiB
System, RAID1: total=64.00MiB, used=16.00KiB
System, single: total=32.00MiB, used=0.00B
Metadata, RAID1: total=2.00GiB, used=576.00KiB
Metadata, single: total=256.00MiB, used=912.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

a.) the device I'm mounting degraded contains the single chunks; it's not like the single chunks are actually missing
b.) the manual scrub only fixed the supers; it did not replicate the newly copied data, since that was placed in new single chunks rather than existing raid1 chunks
c.) this requires a manual balance with convert,soft to actually get everything back to raid1. Very non-obvious.

--
Chris Murphy
Re: 4.4.0 - no space left with >1.7 TB free space left
Roman Mamedov posted on Fri, 08 Apr 2016 16:53:32 +0500 as excerpted:

> It's not in 4.4.6 either. I don't know why it doesn't get included, or
> what we need to do. Last time I asked, it was queued:
> http://www.spinics.net/lists/linux-btrfs/msg52478.html
> But maybe that meant 4.5 or 4.6 only? While the bug is affecting people
> on 4.4.x today.

Patches must make it to the current development kernel before they're eligible for stable. Additionally, they need to be cced to stable as well, in order to be queued there.

So check 4.5 and 4.6-rc. If it's in neither of those, it's not going to be in stable yet. Once it's in the development kernel, see if it was cced to stable and if needed, ask the author and btrfs devs to cc it to stable.

Tho sometimes stable can get a backlog as well. I know earlier this year they were dealing with one, but I follow release or development, not stable, and don't know what stable's current status is.

If it gets to stable, and it wasn't for a bug introduced /after/ 4.4, it should eventually get into 4.4, as that's an LTS kernel. But it might take a while, as the above discussion hints.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Missing device handling (was: 'unable to mount btrfs pool...')
On Fri, Apr 08, 2016 at 03:23:28PM -0400, Austin S. Hemmelgarn wrote:
> On 2016-04-08 12:17, Chris Murphy wrote:
>
> I would personally suggest adding a per-filesystem node in sysfs to handle
> both 2 and 5. Having it open tells BTRFS to not automatically attempt
> countermeasures when degraded, select/epoll on it will return when state
> changes, reads will return (at minimum): what devices comprise the FS, per
> disk state (is it working, failed, missing, a hot-spare, etc), and what
> effective redundancy we have (how many devices we can lose and still be
> mountable, so 1 for raid1, raid10, and raid5, 2 for raid6, and 0 for
> raid0/single/dup, possibly higher for n-way replication (n-1), n-order
> parity (n), or erasure coding). This would make it trivial to write a daemon
> to monitor the filesystem, react when something happens, and handle all the
> policy decisions.

Hm, good proposal. Personally, I tried to use uevents for this but they caused locking troubles, and I didn't continue that attempt.

In any case we need an interface for btrfs-progs to pass FS state information (presence and IDs of missing devices, for example, degraded/good state of RAID etc.). For testing, as a first attempt, I implemented the following interface. It still seems not ideal to me, but acceptable as a starting point.

Additionally, I changed the missing device name reported in btrfs_ioctl_dev_info() to 'missing' to avoid interference with block devices inserted after closing of the failed device (adding a 'missing' field to struct btrfs_ioctl_dev_info_args may be a better way).

So, your opinion?
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d9b147f..f9a2fa6 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2716,12 +2716,17 @@ static long btrfs_ioctl_fs_info(struct btrfs_root *root, void __user *arg)
 	mutex_lock(&fs_devices->device_list_mutex);
 	fi_args->num_devices = fs_devices->num_devices;
+	fi_args->missing_devices = fs_devices->missing_devices;
+	fi_args->open_devices = fs_devices->open_devices;
+	fi_args->rw_devices = fs_devices->rw_devices;
+	fi_args->total_devices = fs_devices->total_devices;
 	memcpy(&fi_args->fsid, root->fs_info->fsid, sizeof(fi_args->fsid));

 	list_for_each_entry(device, &fs_devices->devices, dev_list) {
 		if (device->devid > fi_args->max_id)
 			fi_args->max_id = device->devid;
 	}
+	fi_args->state = root->fs_info->fs_state;
 	mutex_unlock(&fs_devices->device_list_mutex);

 	fi_args->nodesize = root->fs_info->super_copy->nodesize;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dea8931..6808bf2 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -186,8 +186,12 @@ struct btrfs_ioctl_fs_info_args {
 	__u32 nodesize;				/* out */
 	__u32 sectorsize;			/* out */
 	__u32 clone_alignment;			/* out */
-	__u32 reserved32;
-	__u64 reserved[122];			/* pad to 1k */
+	__u32 state;				/* out */
+	__u64 missing_devices;			/* out */
+	__u64 open_devices;			/* out */
+	__u64 rw_devices;			/* out */
+	__u64 total_devices;			/* out */
+	__u64 reserved[118];			/* pad to 1k */
 };

 struct btrfs_ioctl_feature_flags {

--
Yauhen Kharuzhy
Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
On 2016-04-08 14:30, Chris Murphy wrote:
> On Fri, Apr 8, 2016 at 12:18 PM, Austin S. Hemmelgarn wrote:
>> On 2016-04-08 14:05, Chris Murphy wrote:
>>> On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn wrote:
>>>> I entirely agree. If the fix doesn't require any kind of decision to be
>>>> made other than whether to fix it or not, it should be trivially fixable
>>>> with the tools. TBH though, this particular issue with devices disappearing
>>>> and reappearing could be fixed easier in the block layer (at least, there
>>>> are things that need to be fixed WRT it in the block layer).
>>>
>>> Another feature needed for transient failures with large storage, is
>>> some kind of partial scrub, along the lines of md partial resync when
>>> there's a bitmap write intent log.
>>
>> In this case, I would think the simplest way to do this would be to have
>> scrub check if generation matches and not further verify anything that does
>> (I think we might be able to prune anything below objects whose generation
>> matches, but I'm not 100% certain about how writes cascade up the trees). I
>> hadn't really thought about this before, but now that I do, it kind of
>> surprises me that we don't have something to do this.
>
> And I need to better qualify this: this scrub (or balance) needs to be
> initiated automatically, perhaps with some reasonable delay after the
> block layer informs Btrfs the missing device has reappeared. Both the
> requirement of a full scrub as well as it being a manual scrub, are
> pretty big gotchas.

We would still ideally want some way to initiate it manually because:
1. It would make it easier to test.
2. We should have a way to do it on filesystems that have been reassembled after a reboot, not just ones that got the device back in the same boot (or it was missing on boot and then appeared).
Missing device handling (was: 'unable to mount btrfs pool...')
On 2016-04-08 12:17, Chris Murphy wrote:
> On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn wrote:
>> I entirely agree. If the fix doesn't require any kind of decision to be
>> made other than whether to fix it or not, it should be trivially fixable
>> with the tools. TBH though, this particular issue with devices disappearing
>> and reappearing could be fixed easier in the block layer (at least, there
>> are things that need to be fixed WRT it in the block layer).
>
> Right. The block layer needs a way to communicate device missing to
> Btrfs and Btrfs needs to have some tolerance for transience.

Being notified when a device disappears _shouldn't_ be that hard. A uevent gets sent already, and we should be able to associate some kind of callback with that happening for devices we have mounted. The bigger issue is going to be handling the devices _reappearing_ (if we still hold a reference to the device, it appears under a different name/major/minor, and if it's more than one device and we have no references, they may appear in a different order than they were originally), and that is where we really need to fix things. A device disappearing forever is bad and all, but a device losing connection and reconnecting completely ruining the FS is exponentially worse.

Overall, to provide true reliability here, we need:

1. Some way for userspace to disable writeback caching per-device (this is needed for other reasons as well, but those are orthogonal to this discussion). This then needs to be used on all removable devices by default (Windows and OS X do this; it's part of why small transfers appear to complete faster on Linux, and then the disk takes _forever_ to unmount). This would reduce the possibility of data loss when a device disappears.

2. A way for userspace to be notified (instead of having to poll) of state changes in BTRFS. Currently, the only ways for userspace to know something is wrong are either parsing dmesg or polling the filesystem flags (and based on both personal experience, and statements I've seen here and elsewhere, polling the FS flags is not reliable for this). Most normal installations are going to want to trigger handlers for specific state changes (be it e-mail to an admin, or some other notification method, or even doing some kind of maintenance on the FS automatically), and we need some kind of notification if we want to give userspace the ability to properly manage things.

3. A way to tell that a device is gone _when it happens_: not when we try to write to it next, not when a write fails, but the moment the block layer knows it's not there, we need to know as well. This is a prerequisite for the next two items. Sadly, we're probably the only thing that would directly benefit from this (LVM uses uevents and monitoring daemons to handle this; we don't exactly have that luxury), which means it may be hard to get something like this merged.

4. Transparent handling of short, transient loss of a device. This goes together to a certain extent with 1: if something disappears for long enough that the kernel notices, but it reappears before we have any I/O to do on it again, we shouldn't lose our lunch unless userspace tells us to (because we told userspace that it's gone due to item 2). In theory, we should be able to cache a small number of internal pending writes for when it reappears (so for example, if a transaction is being committed, and the USB disk disappears for a second, we should be able to pick up where we left off, after verifying the last write we sent). We should also have an automatic re-sync if it's gone for a short enough period. The max timeout here should probably be configurable, but probably could just be one tunable for the whole system.

5. Give userspace the option to handle degraded states how it wants to, and keep our default of remount RO when degraded when userspace doesn't want to handle it itself. This needs to be configured at run-time (not stored on the media), and it needs to be per-filesystem, otherwise we open up all kinds of other issues. This is a core concept in LVM and many other storage management systems; namely, userspace can choose to handle a degraded RAID array however the hell it wants, and we'll provide a couple of sane default handlers for the common cases.

I would personally suggest adding a per-filesystem node in sysfs to handle both 2 and 5. Having it open tells BTRFS to not automatically attempt countermeasures when degraded, select/epoll on it will return when state changes, reads will return (at minimum): what devices comprise the FS, per-disk state (is it working, failed, missing, a hot-spare, etc), and what effective redundancy we have (how many devices we can lose and still be mountable, so 1 for raid1, raid10, and raid5, 2 for raid6, and 0 for raid0/single/dup, possibly higher for n-way replication (n-1), n-order parity (n), or erasure coding). This would make it trivial to write a daemon to monitor the filesystem, react when something happens, and handle all the policy decisions.
Re: WARN_ON in record_root_in_trans() when deleting freshly renamed subvolume
On Fri, Apr 08, 2016 at 03:10:35PM +0200, Holger Hoffstätte wrote:
> [cc: Mark and Qu]
>
> On 04/08/16 13:51, Holger Hoffstätte wrote:
>> On 04/08/16 13:14, Filipe Manana wrote:
>>> Using Chris' for-linus-4.6 branch, which is 4.5-rc6 + all 4.6 btrfs
>>> patches, it didn't reproduce here:
>>
>> Great, that's good to know (sort of :). Thanks also to Liu Bo.
>>
>>> Are you sure that you are not using some patches not in 4.6?
>
> We have a bingo!
>
> Reverting "qgroup: Fix qgroup accounting when creating snapshot"
> from last Wednesday immediately fixes the problem.

Not surprising, I had some issues testing it out too. I'm pretty sure this patch is corrupting memory, I just haven't found where yet, though my educated guess is that the transaction is being reused improperly.

--Mark

--
Mark Fasheh
Re: Volume stuck after Checking UUID tree
johnathan falk gmail.com> writes:

> The drive mounts perfectly fine when you mount RO, but when you mount
> it rw it gives this (and eventually locks up the system as I can't
> restart it cleanly):
>
> kernel: BTRFS info (device sdb1): disk space caching is enabled
> Apr 06 17:09:16 kernel: BTRFS error (device sdb1): qgroup generation
> mismatch, marked as inconsistent
> Apr 06 17:09:17 kernel: BTRFS: checking UUID tree
> Apr 06 17:12:30 kernel: INFO: task btrfs-transacti:5701 blocked for
> more than 120 seconds.
>
> Bug reported on Launchpad:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1555828
>
> When I mount the disk, in journalctl -f:
> http://pastebin.com/1p6XppRb
>
> Went through Marc Merlin's suggestions to fix my drive and the btrfsck
> -S1/2/3 never finishes and gets stuck on checking quota groups. I also
> created a btrfs-image first thing.
>
> Any suggestions would be welcome as I don't know what to do.

Paste of journalctl:

kernel: Call Trace:
Apr 02 18:55:03 theark kernel: [] schedule+0x35/0x80
Apr 02 18:55:03 theark kernel: [] btrfs_commit_transaction+0x382/0xa90 [btrfs]
Apr 02 18:55:03 theark kernel: [] ? wake_atomic_t_function+0x60/0x60
Apr 02 18:55:03 theark kernel: [] transaction_kthread+0x229/0x240 [btrfs]
Apr 02 18:55:03 theark kernel: [] ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
Apr 02 18:55:03 theark kernel: [] kthread+0xd8/0xf0
Apr 02 18:55:03 theark kernel: [] ret_from_fork+0x22/0x40
Apr 02 18:55:03 theark kernel: [] ? kthread_create_on_node+0x1a0/0x1a0
Apr 02 18:55:03 theark kernel: INFO: task umount:4672 blocked for more than 120 seconds.
Apr 02 18:55:03 theark kernel: Not tainted 4.6.0-040600rc1-generic #201603261930
Apr 02 18:55:03 theark kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 02 18:55:03 theark kernel: umount D 88008e323cd8 0 4672 3434 0x
Apr 02 18:55:03 theark kernel: 88008e323cd8 81e0d540 880343711c80
Apr 02 18:55:03 theark kernel: 88008e324000 88040b4579f0 88040b457800 88040b4579f0
Apr 02 18:55:03 theark kernel: 88008e323cf0 81835e45 8800bbd561b0
Apr 02 18:55:03 theark kernel: Call Trace:
Apr 02 18:55:03 theark kernel: [] schedule+0x35/0x80
Apr 02 18:55:03 theark kernel: [] wait_current_trans.isra.22+0xd3/0x120 [btrfs]
Apr 02 18:55:03 theark kernel: [] ? wake_atomic_t_function+0x60/0x60
Apr 02 18:55:03 theark kernel: [] start_transaction+0x27b/0x4c0 [btrfs]
Apr 02 18:55:03 theark kernel: [] btrfs_attach_transaction_barrier+0x1d/0x50 [btrfs]
Apr 02 18:55:03 theark kernel: [] btrfs_sync_fs+0x42/0x110 [btrfs]
Apr 02 18:55:03 theark kernel: [] sync_filesystem+0x71/0xa0
Apr 02 18:55:03 theark kernel: [] generic_shutdown_super+0x27/0x100
Apr 02 18:55:03 theark kernel: [] kill_anon_super+0x12/0x20
Apr 02 18:55:03 theark kernel: [] btrfs_kill_super+0x18/0x110 [btrfs]
Apr 02 18:55:03 theark kernel: [] deactivate_locked_super+0x43/0x70
Apr 02 18:55:03 theark kernel: [] deactivate_super+0x5c/0x60
Apr 02 18:55:03 theark kernel: [] cleanup_mnt+0x3f/0x90
Apr 02 18:55:03 theark kernel: [] __cleanup_mnt+0x12/0x20
Apr 02 18:55:03 theark kernel: [] task_work_run+0x73/0x90
Apr 02 18:55:03 theark kernel: [] exit_to_usermode_loop+0xc2/0xd0
Apr 02 18:55:03 theark kernel: [] syscall_return_slowpath+0x4e/0x60
Apr 02 18:55:03 theark kernel: [] entry_SYSCALL_64_fastpath+0xa6/0xa8

Currently a btrfsck using btrfs-progs 4.5.1 has run for 3 days now and is currently using:

 pri  ni  virt   res     shr  s  cpu%  mem%  time
 20   0   22.1G  14.36G  188  D  1.9   95.5  47:16.21  btrfs check -s 2 /dev/sdb
Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
On Fri, Apr 8, 2016 at 12:18 PM, Austin S. Hemmelgarn wrote:
> On 2016-04-08 14:05, Chris Murphy wrote:
>>
>> On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn wrote:
>>
>>> I entirely agree. If the fix doesn't require any kind of decision to be
>>> made other than whether to fix it or not, it should be trivially fixable
>>> with the tools. TBH though, this particular issue with devices disappearing
>>> and reappearing could be fixed easier in the block layer (at least, there
>>> are things that need to be fixed WRT it in the block layer).
>>
>> Another feature needed for transient failures with large storage, is
>> some kind of partial scrub, along the lines of md partial resync when
>> there's a bitmap write intent log.
>
> In this case, I would think the simplest way to do this would be to have
> scrub check if generation matches and not further verify anything that does
> (I think we might be able to prune anything below objects whose generation
> matches, but I'm not 100% certain about how writes cascade up the trees). I
> hadn't really thought about this before, but now that I do, it kind of
> surprises me that we don't have something to do this.

And I need to better qualify this: this scrub (or balance) needs to be initiated automatically, perhaps with some reasonable delay after the block layer informs Btrfs the missing device has reappeared. Both the requirement of a full scrub as well as it being a manual scrub are pretty big gotchas.

--
Chris Murphy
Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
On 2016-04-08 14:05, Chris Murphy wrote:
> On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn wrote:
>> I entirely agree. If the fix doesn't require any kind of decision to be
>> made other than whether to fix it or not, it should be trivially fixable
>> with the tools. TBH though, this particular issue with devices disappearing
>> and reappearing could be fixed easier in the block layer (at least, there
>> are things that need to be fixed WRT it in the block layer).
>
> Another feature needed for transient failures with large storage, is
> some kind of partial scrub, along the lines of md partial resync when
> there's a bitmap write intent log.

In this case, I would think the simplest way to do this would be to have scrub check if the generation matches and not further verify anything that does (I think we might be able to prune anything below objects whose generation matches, but I'm not 100% certain about how writes cascade up the trees). I hadn't really thought about this before, but now that I do, it kind of surprises me that we don't have something to do this.
Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn wrote:
> I entirely agree. If the fix doesn't require any kind of decision to be
> made other than whether to fix it or not, it should be trivially fixable
> with the tools. TBH though, this particular issue with devices disappearing
> and reappearing could be fixed easier in the block layer (at least, there
> are things that need to be fixed WRT it in the block layer).

Another feature needed for transient failures with large storage is some kind of partial scrub, along the lines of md partial resync when there's a bitmap write-intent log.

--
Chris Murphy
Re: btrfs send/receive using generation number as source
On Fri, Apr 8, 2016 at 5:01 AM, Martin Steigerwald <mar...@lichtvoll.de> wrote: > Hello! > > As far as I understood, for differential btrfs send/receive – I didn´t use it > yet – I need to keep a snapshot on the source device to then tell btrfs send > to send the differences between the snapshot and the current state. > > Now the BTRFS filesystems on my SSDs are often quite full, thus I do not keep > any snapshots except for one during rsync or borgbackup script run-time. > > Is it possible to tell btrfs send to use generation number xyz to calculate > the difference? This way, I wouldn´t have to keep a snapshot around, I > believe. > > I bet not, at the time cause -c wants a snapshot. Ah and it wants a snapshot > of the same state on the destination as well. Well on the destination I let > the script make a snapshot after the backup so… what I would need is to > remember the generation number of the source snapshot that the script creates > to backup from and then tell btrfs send that generation number + the > destination snapshots. > > Well, or get larger SSDs or get rid of some data on them. Well if you can't even keep one ro snapshot around, it suggests you need more space. Otherwise the minimal strategy is: Yesterday's source has subvols: root.current root.20160406 So you'd do btrfs sub snap -r root.current root.20160407 btrfs send -p root.20160406 root.20160407 | btrfs receive xxx btrfs sub del root.20160406 Today it's btrfs sub snap -r root.current root.20160408 btrfs send -p root.20160407 root.20160408 | btrfs receive xxx btrfs sub del root.20160407 Tomorrow: btrfs sub snap -r root.current root.20160409 btrfs send -p root.20160408 root.20160409 | btrfs receive xxx btrfs sub del root.20160408 Locally you always have one snapshot to rollback to or make selective reflink copies of files from. 
-- Chris Murphy
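The rotate-one-snapshot cycle above is mechanical enough to script. Below is a dry-run sketch in Python that only builds and prints the commands for one backup day, so the plan can be inspected before wiring it to a real filesystem; the subvolume names and destination path are placeholders taken from the example.

```python
# Dry-run sketch of the one-snapshot send/receive rotation: snapshot
# today, send the delta against yesterday's snapshot, delete yesterday's.
# Nothing is executed; the function just returns the command strings.

def rotation_plan(base, prev_date, new_date, dest):
    """Return the three commands for one incremental backup cycle."""
    return [
        f"btrfs sub snap -r {base}.current {base}.{new_date}",
        f"btrfs send -p {base}.{prev_date} {base}.{new_date} | btrfs receive {dest}",
        f"btrfs sub del {base}.{prev_date}",
    ]

for cmd in rotation_plan("root", "20160407", "20160408", "/backup"):
    print(cmd)
```

Replacing the `print` with `subprocess.run(cmd, shell=True, check=True)` would execute the cycle, at which point you would want error handling so the old snapshot is only deleted after the receive succeeds.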
Re: btrfs send/receive using generation number as source
On Fri, Apr 8, 2016 at 1:01 PM, Martin Steigerwald wrote:
> Hello!
>
> As far as I understood, for differential btrfs send/receive – I didn´t use it yet – I need to keep a snapshot on the source device to then tell btrfs send to send the differences between the snapshot and the current state.

During the incremental send operation you need 2 ro snapshots available (a parent and a current snapshot); after that, you just need to keep the current one, promote it to parent snapshot, and keep it around until the next incremental send. So indeed that locks up space, and you might run out of free space if there is a long time before the next incremental send|receive and the changes in the filesystem are large in volume.

Alternatively, you could do a non-incremental send, if the fs is relatively small and you have some method to dedupe on the receiving filesystem. But the rsync method is by far preferred in this case, I would say.

> Now the BTRFS filesystems on my SSDs are often quite full, thus I do not keep any snapshots except for one during rsync or borgbackup script run-time.
>
> Is it possible to tell btrfs send to use generation number xyz to calculate the difference? This way, I wouldn´t have to keep a snapshot around, I believe.
>
> I bet not, at the time cause -c wants a snapshot. Ah and it wants a snapshot of the same state on the destination as well. Well on the destination I let the script make a snapshot after the backup so… what I would need is to remember the generation number of the source snapshot that the script creates to backup from and then tell btrfs send that generation number + the destination snapshots.

You can use -p for incremental send, and you can also send back (new) increments from backup to master.

> Well, or get larger SSDs or get rid of some data on them.
I switched from ext4 to a btrfs rootfs on an old netbook which has only 4G soldered flash and no option for extension (except via USB/SD card, which turned out not to be reliable enough over a longer period of time). Basically, the compress=lzo mount option extends the lifetime of this netbook while still using a modern full-sized linux distro. But I guess you have already compressed/compacted what is possible.
Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn wrote:
>> I can see this happening automatically with up to 2 device failures, so that all subsequent writes are fully intact stripe writes. But the instant there's a 3rd device failure, there's a rather large hole in the file system that can't be reconstructed. It's an invalid file system. I'm not sure what can be gained by allowing writes to continue, other than tying off loose ends (so to speak) with full stripe metadata writes for the purpose of making recovery possible and easier, but after that metadata is written - poof, go read only.
>
> I don't mean writing partial stripes, I mean writing full stripes with a reduced width (so in an 8 device filesystem, if 3 devices fail, we can still technically write a complete stripe across 5 devices, but it will result in less total space we can use).

I understand what you mean; it was clear before. The problem is that once it's below the critical number of drives, the previously existing file system is busted. So it should go read only. But it can't, because it doesn't yet have the concept of faulty devices, *and* also an understanding of how many faulty devices can be tolerated before there's a totally untenable hole in the file system.

> Whether or not this behavior is correct is another argument, but that appears to be what we do currently. Ideally, this should be a mount option, as strictly speaking, it's policy, which therefore shouldn't be in the kernel.

I think we can definitely agree the current behavior is suboptimal, because in fact whatever it wrote to 16 drives was sufficiently confusing that mounting all 20 drives again isn't possible no matter what option is used.

>> I think considering the idea of Btrfs is to be more scalable than past storage and filesystems have been, it needs to be able to deal with transient failures like this. In theory all available information is written on all the disks.
This was a temporary failure. Once all >> devices are made available again, the fs should be able to figure out >> what to do, even so far as salvaging the writes that happened after >> the 4 devices went missing if those were successful full stripe >> writes. > > I entirely agree. If the fix doesn't require any kind of decision to be > made other than whether to fix it or not, it should be trivially fixable > with the tools. TBH though, this particular issue with devices disappearing > and reappearing could be fixed easier in the block layer (at least, there > are things that need to be fixed WRT it in the block layer). Right. The block layer needs a way to communicate device missing to Btrfs and Btrfs needs to have some tolerance for transience. >> >> Of course it is possible there's corruption problems with those four drives having vanished while writes were incomplete. But if you're lucky, data write happen first, then metadata writes second, and only then is the super updated. So the super should point to valid metadata and that should point to valid data. If that order is wrong, then it's bad news and you have to look at backup roots. But *if* you get all the supers correct and on the same page, you can access the backup roots by using -o recovery if corruption is found with a normal mount. >>> >>> >>> This though is where the potential issue is. -o recovery will only go >>> back >>> so many generations before refusing to mount, and I think that may be why >>> it's not working now.. >> >> >> It also looks like none of the tools are considering the stale supers >> on the formerly missing 4 devices. I still think those are the best >> chance to recover because even if their most current data is wrong due >> to reordered writes not making it to stable storage, one of the >> available backups in those supers should be good. >> > Depending on utilization on the other devices though, they may not point to > complete roots either. 
> In this case, they probably will because of the low write frequency. In other cases, they may not though, because we try to reuse space in chunks before allocating new chunks.

Based on the superblock posted, I think the *38 generation tree might be incomplete, but there's a *37 and a *36 generation that should be intact. Chunk generation is the same. What complicates the rollback is whether any deletions were happening at the time. If it's just file additions, I think a rollback has a good chance of working. It's just tedious.

-- Chris Murphy
Re: 4.4.0 - no space left with >1.7 TB free space left
On 2016-04-08 20:53, Roman Mamedov wrote:
>>> Do you snapshot the parent subvolume which holds the databases? Can you correlate that perhaps ENOSPC occurs at the time of snapshotting? If yes, then you should try the patch https://patchwork.kernel.org/patch/7967161/
>>>
>>> (Too bad this was not included into 4.4.1.)
>>
>> By the way - was it included in any later kernel? I'm running 4.4.5 on that server, but still hitting the same issue.
>
> It's not in 4.4.6 either. I don't know why it doesn't get included, or what we need to do. Last time I asked, it was queued: http://www.spinics.net/lists/linux-btrfs/msg52478.html But maybe that meant 4.5 or 4.6 only? While the bug is affecting people on 4.4.x today.

Does it mean 4.5 also doesn't have it yet?

Tomasz Chmielewski
http://wpkg.org
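The "was it included in any later kernel?" question can be answered mechanically with `git tag --contains <commit>` in a kernel tree (e.g. piped through `grep ^v4.4` to check the stable series). A small self-contained demo of that git behavior, building a throwaway repo rather than touching a kernel tree; it assumes `git` is on PATH, and the repo contents are obviously made up:

```python
# Demonstrates `git tag --contains <sha>`: list every tag whose history
# includes a given commit. In a kernel checkout this tells you which
# releases carry a fix. The throwaway repo below makes the demo
# self-contained.

import subprocess
import tempfile

def git(*args, cwd):
    # -c flags supply an identity so `git commit` works in a bare env.
    return subprocess.run(
        ("git", "-c", "user.email=demo@example.com", "-c", "user.name=demo") + args,
        cwd=cwd, check=True, capture_output=True, text=True).stdout

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("commit", "--allow-empty", "-q", "-m", "btrfs: the fix", cwd=repo)
fix_sha = git("rev-parse", "HEAD", cwd=repo).strip()
git("tag", "v4.5", cwd=repo)                    # first tag containing the fix
git("commit", "--allow-empty", "-q", "-m", "later work", cwd=repo)
git("tag", "v4.6", cwd=repo)                    # later tag, also contains it

tags = git("tag", "--contains", fix_sha, cwd=repo).split()
print(tags)
```

For a fix that was merged in 4.5, the 4.4.x tags would simply not appear in the output, which is the situation described above: mainline has it, the 4.4 stable series does not until it is explicitly backported.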
Re: btrfs send/receive using generation number as source
On Friday, 8 April 2016 11:12:54 CEST Hugo Mills wrote:
> On Fri, Apr 08, 2016 at 01:01:03PM +0200, Martin Steigerwald wrote:
>> Hello!
>>
>> As far as I understood, for differential btrfs send/receive – I didn´t use it yet – I need to keep a snapshot on the source device to then tell btrfs send to send the differences between the snapshot and the current state.
>>
>> Now the BTRFS filesystems on my SSDs are often quite full, thus I do not keep any snapshots except for one during rsync or borgbackup script run-time.
>>
>> Is it possible to tell btrfs send to use generation number xyz to calculate the difference? This way, I wouldn´t have to keep a snapshot around, I believe.
>
> btrfs sub find-new
>
> BUT that will only tell you which files have been added or updated. It won't tell you which files have been deleted. It's also unrelated to send/receive, so you'd have to roll your own solution.

I am aware of this one.

>> I bet not, at the time cause -c wants a snapshot. Ah and it wants a snapshot of the same state on the destination as well. Well on the destination I let the script make a snapshot after the backup so… what I would need is to remember the generation number of the source snapshot that the script creates to backup from and then tell btrfs send that generation number + the destination snapshots.
>>
>> Well, or get larger SSDs or get rid of some data on them.
>
> Those are the other options, of course.

Hm, I see.

Thanks,
-- Martin
Re: WARN_ON in record_root_in_trans() when deleting freshly renamed subvolume
[cc: Mark and Qu]

On 04/08/16 13:51, Holger Hoffstätte wrote:
> On 04/08/16 13:14, Filipe Manana wrote:
>> Using Chris' for-linus-4.6 branch, which is 4.5-rc6 + all 4.6 btrfs patches, it didn't reproduce here:
>
> Great, that's good to know (sort of :). Thanks also to Liu Bo.
>
>> Are you sure that you are not using some patches not in 4.6?

We have a bingo! Reverting "qgroup: Fix qgroup accounting when creating snapshot" from last Wednesday immediately fixes the problem. It was quite easy to find - the WARN_ON that triggered was the second one, which complains about a mismatch between roots. The only patch that even remotely did something in that area was said qgroup fix. Looks like something is missing there. Suggestions welcome. :)

Holger
Re: 4.4.0 - no space left with >1.7 TB free space left
On Fri, 08 Apr 2016 20:36:26 +0900 Tomasz Chmielewski wrote:
> On 2016-02-08 20:24, Roman Mamedov wrote:
>>> Linux 4.4.0 - btrfs is mainly used to host lots of test containers, often snapshots, and at times, there is heavy IO in many of them for extended periods of time. btrfs is on HDDs.
>>>
>>> Every few days I'm getting "no space left" in a container running mongo 3.2.1 database. Interestingly, haven't seen this issue in containers with MySQL. All databases have chattr +C set on their directories.
>>
>> Hello,
>>
>> Do you snapshot the parent subvolume which holds the databases? Can you correlate that perhaps ENOSPC occurs at the time of snapshotting? If yes, then you should try the patch https://patchwork.kernel.org/patch/7967161/
>>
>> (Too bad this was not included into 4.4.1.)
>
> By the way - was it included in any later kernel? I'm running 4.4.5 on that server, but still hitting the same issue.

It's not in 4.4.6 either. I don't know why it doesn't get included, or what we need to do. Last time I asked, it was queued: http://www.spinics.net/lists/linux-btrfs/msg52478.html But maybe that meant 4.5 or 4.6 only? While the bug is affecting people on 4.4.x today.

Thanks

--
With respect,
Roman
Re: WARN_ON in record_root_in_trans() when deleting freshly renamed subvolume
On 04/08/16 13:14, Filipe Manana wrote:
> Using Chris' for-linus-4.6 branch, which is 4.5-rc6 + all 4.6 btrfs patches, it didn't reproduce here:

Great, that's good to know (sort of :). Thanks also to Liu Bo.

> Are you sure that you are not using some patches not in 4.6?

Quite a few, but to offset that I also left out some that have diverged too much or were not that important (block/sectorsize, device handling). Those should not have anything to do with this particular bug, though. Except for this, everything works rock-solid; I use it daily. Should be easy to track down..

-h
Re: 4.4.0 - no space left with >1.7 TB free space left
On 2016-02-08 20:24, Roman Mamedov wrote:
>> Linux 4.4.0 - btrfs is mainly used to host lots of test containers, often snapshots, and at times, there is heavy IO in many of them for extended periods of time. btrfs is on HDDs.
>>
>> Every few days I'm getting "no space left" in a container running mongo 3.2.1 database. Interestingly, haven't seen this issue in containers with MySQL. All databases have chattr +C set on their directories.
>
> Hello,
>
> Do you snapshot the parent subvolume which holds the databases? Can you correlate that perhaps ENOSPC occurs at the time of snapshotting? If yes, then you should try the patch https://patchwork.kernel.org/patch/7967161/
>
> (Too bad this was not included into 4.4.1.)

By the way - was it included in any later kernel? I'm running 4.4.5 on that server, but still hitting the same issue.

Tomasz Chmielewski
http://wpkg.org
Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
On 2016-04-07 15:32, Chris Murphy wrote:
> On Thu, Apr 7, 2016 at 5:19 AM, Austin S. Hemmelgarn wrote:
>> On 2016-04-06 19:08, Chris Murphy wrote:
>>> On Wed, Apr 6, 2016 at 9:34 AM, Ank Ular wrote:
>>>> From the output of 'dmesg', the section:
>>>> [ 20.998071] BTRFS: device label FSgyroA devid 9 transid 625039 /dev/sdm
>>>> [ 20.84] BTRFS: device label FSgyroA devid 10 transid 625039 /dev/sdn
>>>> [ 21.004127] BTRFS: device label FSgyroA devid 11 transid 625039 /dev/sds
>>>> [ 21.011808] BTRFS: device label FSgyroA devid 12 transid 625039 /dev/sdu
>>>> bothers me because the transid value of these four devices doesn't match the other 16 devices in the pool {should be 625065}. In theory, I believe these should all have the same transid value. These four devices are all on a single USB 3.0 port and this is the link I believe went down and came back up.
>>>
>>> This is effectively a 4 disk failure and raid6 only allows for 2. Now, a valid complaint is that as soon as Btrfs is seeing write failures for 3 devices, it needs to go read-only. Specifically, it would go read only upon 3 or more write errors affecting a single full raid stripe (data and parity strips combined); and that's because such a write is fully failed.
>>
>> AFAIUI, currently, BTRFS will fail that stripe, but not retry it, _but_ after that, it will start writing out narrower stripes across the remaining disks if there are enough for it to maintain the data consistency (so if there's at least 3 for raid6 (I think, I don't remember if our lower limit is 3 (which is degenerate), or 4 (which isn't, but most other software won't let you use it for some stupid reason))). Based on this, if the FS does get recovered, make sure to run a balance on it too, otherwise you might have some sub-optimal striping for some data.
>
> I can see this happening automatically with up to 2 device failures, so that all subsequent writes are fully intact stripe writes.
But the instant there's a 3rd device failure, there's a rather large hole in the file system that can't be reconstructed. It's an invalid file system. I'm not sure what can be gained by allowing writes to continue, other than tying off loose ends (so to speak) with full stripe metadata writes for the purpose of making recovery possible and easier, but after that metadata is written - poof, go read only. I don't mean writing partial stripes, I mean writing full stripes with a reduced width (so in an 8 device filesystem, if 3 devices fail, we can still technically write a complete stripe across 5 devices, but it will result in less total space we can use). Whether or not this behavior is correct is another argument, but that appears to be what we do currently. Ideally, this should be a mount option, as strictly speaking, it's policy, which therefore shouldn't be in the kernel. You literally might have to splice superblocks and write them to 16 drives in exactly 3 locations per drive (well, maybe just one of them, and then delete the magic from the other two, and then 'btrfs rescue super-recover' should then use the one good copy to fix the two bad copies). Sigh maybe? In theory it's possible, I just don't know the state of the tools. But I'm fairly sure the best chance of recovery is going to be on the 4 drives that abruptly vanished. Their supers will be mostly correct or close to it: and that's what has all the roots in it: tree, fs, chunk, extent and csum. And all of those states are better farther in the past, rather than the 16 drives that have much newer writes. FWIW, it is actually possible to do this, I've done it before myself on much smaller raid1 filesystems with single drives disappearing, and once with a raid6 filesystem with a double drive failure. It is by no means easy, and there's not much in the tools that helps with it, but it is possible (although I sincerely hope I never have to do it again myself). 
I think considering the idea of Btrfs is to be more scalable than past storage and filesystems have been, it needs to be able to deal with transient failures like this. In theory all available information is written on all the disks. This was a temporary failure. Once all devices are made available again, the fs should be able to figure out what to do, even so far as salvaging the writes that happened after the 4 devices went missing if those were successful full stripe writes. I entirely agree. If the fix doesn't require any kind of decision to be made other than whether to fix it or not, it should be trivially fixable with the tools. TBH though, this particular issue with devices disappearing and reappearing could be fixed easier in the block layer (at least, there are things that need to be fixed WRT it in the block layer). Of course it is possible there's corruption problems with those four drives having vanished while writes were incomplete. But if you're lucky, data write happen first, then metadata writes second,
Re: WARN_ON in record_root_in_trans() when deleting freshly renamed subvolume
On Thu, Apr 7, 2016 at 5:44 PM, Holger Hoffstätte wrote:
> Hi,
>
> Looks like I just found an exciting new corner case.
> kernel 4.4.6 with btrfs ~4.6, so 4.6 should reproduce.

Using Chris' for-linus-4.6 branch, which is 4.5-rc6 + all 4.6 btrfs patches, it didn't reproduce here:

#!/bin/bash
dmesg -C
mkfs.btrfs -f /dev/sdi
mount /dev/sdi /mnt/sdi
cd /mnt/sdi
btrfs subvolume create foo
sync
btrfs subvolume snapshot foo foo-1
sync
mv foo-1 foo.new
btrfs subvolume delete foo.new
cd -
umount /dev/sdi

dmesg gives:

btrfs-progs v4.5.1-dirty
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (100.00GiB) ...
Label: (null)
UUID: 76cebc54-0ae1-4f53-91fd-3f9438bdfb50
Node size: 16384
Sector size: 4096
Filesystem size: 100.00GiB
Block group profiles:
  Data: single 8.00MiB
  Metadata: DUP 1.01GiB
  System: DUP 12.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Number of devices: 1
Devices:
  ID SIZE PATH
  1 100.00GiB /dev/sdi

Create subvolume './foo'
Create a snapshot of 'foo' in './foo-1'
Delete subvolume (no-commit): '/mnt/sdi/foo.new'
/mnt

[75015.529626] systemd-journald[578]: Sent WATCHDOG=1 notification.
[75015.756407] BTRFS: device fsid 76cebc54-0ae1-4f53-91fd-3f9438bdfb50 devid 1 transid 3 /dev/sdi
[75015.932527] BTRFS info (device sdi): disk space caching is enabled
[75015.937674] BTRFS: has skinny extents
[75015.938470] BTRFS: flagging fs with big metadata feature
[75015.962601] BTRFS: creating UUID tree

Are you sure that you are not using some patches not in 4.6?

Also tried my own integration branch, and no issue either.
> > Try on a fresh volume: > > $btrfs subvolume create foo > Create subvolume './foo' > $sync > $btrfs subvolume snapshot foo foo-1 > Create a snapshot of 'foo' in './foo-1' > $sync > $mv foo-1 foo.new > $btrfs subvolume delete foo.new > Delete subvolume (no-commit): '/mnt/test/foo.new' > $dmesg > [ 226.923316] [ cut here ] > [ 226.923339] WARNING: CPU: 1 PID: 5863 at fs/btrfs/transaction.c:319 > record_root_in_trans+0xd6/0x100 [btrfs]() > [ 226.923340] Modules linked in: auth_rpcgss oid_registry nfsv4 btrfs xor > raid6_pq loop nfs lockd grace sunrpc autofs4 sch_fq_codel radeon > snd_hda_codec_realtek x86_pkg_temp_thermal snd_hda_codec_generic coretemp > crc32_pclmul crc32c_intel aesni_intel i2c_algo_bit uvcvideo > snd_hda_codec_hdmi aes_x86_64 drm_kms_helper videobuf2_vmalloc glue_helper > videobuf2_memops syscopyarea lrw sysfillrect gf128mul videobuf2_v4l2 > sysimgblt snd_usb_audio fb_sys_fops ablk_helper snd_hda_intel videobuf2_core > ttm cryptd snd_hwdep v4l2_common usbhid snd_hda_codec snd_usbmidi_lib > videodev snd_rawmidi drm snd_hda_core snd_seq_device i2c_i801 snd_pcm > i2c_core snd_timer snd r8169 soundcore mii parport_pc parport > [ 226.923365] CPU: 1 PID: 5863 Comm: ls Not tainted 4.4.6 #1 > [ 226.923366] Hardware name: Gigabyte Technology Co., Ltd. 
> P67-DS3-B3/P67-DS3-B3, BIOS F1 05/06/2011 > [ 226.923367] 8800da677d20 813181a8 > > [ 226.923368] a0aacdbf 8800da677d58 810507b2 > 880601e90800 > [ 226.923369] 8800dacf10a0 880601e90800 880601e909f0 > 0001 > [ 226.923371] Call Trace: > [ 226.923374] [] dump_stack+0x4d/0x65 > [ 226.923376] [] warn_slowpath_common+0x82/0xc0 > [ 226.923378] [] warn_slowpath_null+0x1a/0x20 > [ 226.923387] [] record_root_in_trans+0xd6/0x100 [btrfs] > [ 226.923395] [] btrfs_record_root_in_trans+0x44/0x70 > [btrfs] > [ 226.923404] [] start_transaction+0x9e/0x4c0 [btrfs] > [ 226.923412] [] btrfs_join_transaction+0x17/0x20 [btrfs] > [ 226.923421] [] btrfs_dirty_inode+0x35/0xd0 [btrfs] > [ 226.923430] [] btrfs_update_time+0x7d/0xb0 [btrfs] > [ 226.923432] [] touch_atime+0x88/0xa0 > [ 226.923434] [] iterate_dir+0xdb/0x120 > [ 226.923435] [] SyS_getdents+0x88/0xf0 > [ 226.923437] [] ? fillonedir+0xd0/0xd0 > [ 226.923439] [] entry_SYSCALL_64_fastpath+0x12/0x6a > [ 226.923440] ---[ end trace 9c78caf253e284fe ]--- > > Code looks like: > > .. > static int record_root_in_trans(struct btrfs_trans_handle *trans, >struct btrfs_root *root) > { > if (test_bit(BTRFS_ROOT_REF_COWS, >state) && > root->last_trans < trans->transid) { > WARN_ON(root == root->fs_info->extent_root); > WARN_ON(root->commit_root != root->node); > .. > > There's been a few journal/recovery/directory consistency patches recently, > so maybe it's a corner case or an older problem. I'll try to bisect, but > meanwhile wanted to report it for discussion. > > Holger > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at
Re: btrfs send/receive using generation number as source
On Fri, Apr 08, 2016 at 01:01:03PM +0200, Martin Steigerwald wrote:
> Hello!
>
> As far as I understood, for differential btrfs send/receive – I didn´t use it yet – I need to keep a snapshot on the source device to then tell btrfs send to send the differences between the snapshot and the current state.
>
> Now the BTRFS filesystems on my SSDs are often quite full, thus I do not keep any snapshots except for one during rsync or borgbackup script run-time.
>
> Is it possible to tell btrfs send to use generation number xyz to calculate the difference? This way, I wouldn´t have to keep a snapshot around, I believe.

btrfs sub find-new

BUT that will only tell you which files have been added or updated. It won't tell you which files have been deleted. It's also unrelated to send/receive, so you'd have to roll your own solution.

> I bet not, at the time cause -c wants a snapshot. Ah and it wants a snapshot of the same state on the destination as well. Well on the destination I let the script make a snapshot after the backup so… what I would need is to remember the generation number of the source snapshot that the script creates to backup from and then tell btrfs send that generation number + the destination snapshots.
>
> Well, or get larger SSDs or get rid of some data on them.

Those are the other options, of course.

Hugo.

--
Hugo Mills             | The trouble with you, Ibid, is you think you know
hugo@... carfax.org.uk | everything.
http://carfax.org.uk/  | PGP: E2AB1DE4
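The limitation Hugo points out can be seen with a small model: a generation-based scan like find-new only walks the *current* tree, so it can report additions and updates but never deletions; computing deletions requires the old state, which is exactly what a kept snapshot provides. Toy Python with invented file names and generation numbers:

```python
# Why a find-new-style (generation-based) scan cannot report deletions:
# it visits only entries that still exist, filtered by generation.
# The numbers are made-up generations; 10 is the "last backup" point.

old_state = {"a.txt": 5, "b.txt": 7, "c.txt": 9}    # snapshot taken at gen 10
current   = {"a.txt": 5, "c.txt": 12, "d.txt": 15}  # live fs: b deleted, c changed, d added

# What a scan of the live tree can report on its own:
new_or_changed = {name for name, gen in current.items() if gen > 10}

# What needs the old state (i.e. a retained snapshot) to compute:
deleted = set(old_state) - set(current)

print(sorted(new_or_changed), sorted(deleted))  # ['c.txt', 'd.txt'] ['b.txt']
```

This is why incremental send insists on a parent snapshot rather than a bare generation number: the parent is the only record of what used to exist.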
btrfs send/receive using generation number as source
Hello!

As far as I understood, for differential btrfs send/receive – I didn´t use it yet – I need to keep a snapshot on the source device to then tell btrfs send to send the differences between the snapshot and the current state.

Now the BTRFS filesystems on my SSDs are often quite full, thus I do not keep any snapshots except for one during rsync or borgbackup script run-time.

Is it possible to tell btrfs send to use generation number xyz to calculate the difference? This way, I wouldn´t have to keep a snapshot around, I believe.

I bet not, at the time cause -c wants a snapshot. Ah and it wants a snapshot of the same state on the destination as well. Well on the destination I let the script make a snapshot after the backup so… what I would need is to remember the generation number of the source snapshot that the script creates to backup from and then tell btrfs send that generation number + the destination snapshots.

Well, or get larger SSDs or get rid of some data on them.

Thanks,
-- Martin
[RFC PATCH v1] block: avoid to call .bi_end_io() recursively
There were reports about heavy stack use caused by recursive calling of .bi_end_io(). [1][2][3] Patches [1][2][3] were posted to address the issue, and the idea is basically similar: all of them serialize the recursive calling of .bi_end_io() via a percpu list.

This patch takes the same idea, but uses bio_list to implement it, which turns out simpler and makes the code more readable.

xfstests (-g auto) is run with this patch and no regression is found on ext4, but when testing btrfs, generic/224 and generic/323 cause a kernel oops.

[1] http://marc.info/?t=12142850204=1=2
[2] http://marc.info/?l=dm-devel=139595190620008=2
[3] http://marc.info/?t=14597464411=1=2

Cc: Shaun Tancheff
Cc: Christoph Hellwig
Cc: Mikulas Patocka
Signed-off-by: Ming Lei
---
V1:
	- change to RFC
	- fix when unwind_bio_endio() is called recursively
	- run xfstest again: no regression found on ext4, but generic/323 and generic/224 cause kernel oops

 block/bio.c | 44 ++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f124a0a..e2d0970 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -68,6 +68,8 @@ static DEFINE_MUTEX(bio_slab_lock);
 static struct bio_slab *bio_slabs;
 static unsigned int bio_slab_nr, bio_slab_max;
 
+static DEFINE_PER_CPU(struct bio_list *, bio_end_list) = { NULL };
+
 static struct kmem_cache *bio_find_or_create_slab(unsigned int extra_size)
 {
 	unsigned int sz = sizeof(struct bio) + extra_size;
@@ -1737,6 +1739,45 @@ static inline bool bio_remaining_done(struct bio *bio)
 	return false;
 }
 
+/* disable local irq when manipulating the percpu bio_list */
+static void unwind_bio_endio(struct bio *bio)
+{
+	struct bio_list *bl;
+	unsigned long flags;
+	bool clear_list = false;
+
+	preempt_disable();
+	local_irq_save(flags);
+
+	bl = this_cpu_read(bio_end_list);
+	if (!bl) {
+		struct bio_list bl_in_stack;
+
+		bl = &bl_in_stack;
+		bio_list_init(bl);
+		this_cpu_write(bio_end_list, bl);
+		clear_list = true;
+	} else {
+		bio_list_add(bl, bio);
+		goto out;
+	}
+
+	while (bio) {
+		local_irq_restore(flags);
+
+		if (bio->bi_end_io)
+			bio->bi_end_io(bio);
+
+		local_irq_save(flags);
+		bio = bio_list_pop(bl);
+	}
+	if (clear_list)
+		this_cpu_write(bio_end_list, NULL);
+ out:
+	local_irq_restore(flags);
+	preempt_enable();
+}
+
 /**
  * bio_endio - end I/O on a bio
  * @bio:	bio
@@ -1765,8 +1806,7 @@ again:
 		goto again;
 	}
 
-	if (bio->bi_end_io)
-		bio->bi_end_io(bio);
+	unwind_bio_endio(bio);
 }
 EXPORT_SYMBOL(bio_endio);
-- 
1.9.1
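The pattern in unwind_bio_endio() can be sketched in a language-neutral way: the first completion on a context becomes the "unwinder" and drains a pending list iteratively, while nested completions only append to the list and return. A chain of N completions therefore runs at constant stack depth instead of N frames deep. Python stands in for the kernel C here, `threading.local` plays the role of the per-CPU `bio_end_list`, and `end_io`/`make_chain` are invented names.

```python
# Iterative unwinding of recursive completion callbacks, modeled on the
# percpu bio_list trick above: re-entrant calls defer work to a pending
# list owned by the outermost invocation.

import inspect
import threading

_ctx = threading.local()   # stands in for the per-CPU bio_end_list

def end_io(callback):
    pending = getattr(_ctx, "pending", None)
    if pending is not None:
        # Re-entered from inside a callback: defer instead of recursing.
        pending.append(callback)
        return
    _ctx.pending = pending = [callback]
    try:
        while pending:
            pending.pop()()        # may re-enter end_io() and append
    finally:
        _ctx.pending = None        # analogous to clear_list

order, depths = [], []

def make_chain(n):
    def callback():
        order.append(n)
        depths.append(len(inspect.stack()))   # record call depth
        if n:
            end_io(make_chain(n - 1))  # would recurse without the list
    return callback

end_io(make_chain(200))
print(order[0], order[-1], len(set(depths)))  # 200 0 1
```

Every callback records the same stack depth because each one is invoked from the single drain loop, which is exactly the property the kernel patch is after. Note the sketch omits the irq-disable/preempt-disable discipline the real code needs, since those have no Python analogue.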