Instant reboot while defragging on 3.13-rc5
Dear list members,

I upgraded to the 3.13-rc5 kernel and started to defrag my whole file system with the following commands:

cd /
for i in */*; do
    if [[ $i != windows* ]]; then
        echo --$i--
        btrfs fi defrag -clzo -r $i
    fi
done 2>&1 | tee /root/defrag

After 10 or 15 seconds my computer reboots itself without any warning. In dmesg I see only the following:

[  531.866566] btrfs: device fsid 2cc2b266-3f04-4f1c-8477-cf7efd6ac139 devid 1 transid 396679 /dev/sda7
[  532.001919] watchdog watchdog0: watchdog did not stop!
[  532.002719] watchdog watchdog0: watchdog did not stop!

Together with this I'm constantly getting sysfs warnings in dmesg, but I don't think they are related:

WARNING: CPU: 5 PID: 1644 at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.13.rc5/linux-3.13-rc5/fs/sysfs/group.c:214 device_del+0x3b/0x1b0()
sysfs group 81c7e040 not found for kobject 'hidraw2'

Versions:
OS: openSUSE 13.1
Kernel: 3.13.0-rc5-1.g7127d5f-desktop
btrfs: v3.12+20131125

Ákos
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is anyone using btrfs send/receive for backups instead of rsync?
Chris Murphy posted on Sat, 28 Dec 2013 16:11:37 -0700 as excerpted:

> I am slightly bugged about a 16MB file having nearly 2000 extents, basically it's being turned into a bunch of 8KB fragments. I know nothing of the pros and cons of how systemd is writing journals, but they don't seem very big so I don't understand why they're preallocated, which on btrfs appears instantly defeated by COW upon the journal being modified. It seems to me either the journal doesn't need to be preallocated (at least on btrfs) or maybe systemd should set xattr +C on /var/log/journal? That does disable checksumming though, along with data cow.

While I don't (yet?) use systemd here, from what I understand of its journal, it's essentially a binary database, and isn't necessarily even sequentially written as is traditional with log files. That would explain the preallocation. And given your mention of 8KB fragments, I wonder if that's its record-size?

Meanwhile, as with any pre-allocated-then-written-into file, including VM images and pre-allocated bittorrent files, the systemd journal is a known worst-case for COW filesystems including btrfs. And setting NOCOW on the file (the +C file attribute, set with chattr +C) before those write-intos is the knob btrfs exposes to deal with the problem... for all the above cases.

Yes, it does turn off checksumming as well as COW, but given the write-into scenario, that's actually best anyway, because otherwise btrfs has to keep updating the checksums as the internal writes are occurring, and that's both CPU intensive and potentially rate-limited, and an invitation to race conditions since the writing application and btrfs' checksumming are constantly lock-fighting, the one to update the file, the other to update the checksum based on the new data.
But in all these cases, it's also quite common for the application doing the writing to have its own checksumming/error-detection and possible correction -- it pretty much comes with the territory -- in which case btrfs attempting to do the same is simply superfluous even if it weren't a race-condition trigger.

Certainly torrents include checksumming -- that automatically guaranteed download integrity is part of what makes the protocol as popular as it is. And databases too. I don't actually know enough about VMs to know if it's the case there or not, but certainly, unexpected bit-flipping is likely to corrupt and crash the VM, just as it tends to do with on-the-metal operating systems.

If/when the file reaches effective stasis, as a torrented file once it's fully downloaded, /then/ it's reasonable to kill the NOCOW and do a final (sequential-write) copy/move so btrfs has it checksummed too. And database and VM backups... if they're not being actively used, btrfs checksumming can guard against bitrot there too. Similarly systemd's binary journal, once those are taken out of active logging, yeah, let btrfs do its normal thing.

But for all these cases, as long as the files are being actively written into, NOCOW, including its NOSUM implications, is exactly and precisely what they SHOULD be when the filesystem hosting them is btrfs.

And I'm predicting that since btrfs is the assumed successor to the ext* series as the Linux default filesystem, and systemd is targeting Linux default initsystem status as well, it's only logical that at some point systemd will detect what filesystem it's logging to, and will automatically set NOCOW on the journal file when that filesystem is btrfs.

Most Linux-targeted databases and file-preallocating torrent clients will no doubt do exactly the same thing. Either that, or in their documentation, they'll suggest setting NOCOW on the target directory when setting up the app in the first place.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
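
Editor's sketch of the "set NOCOW on the target directory" suggestion above. It does roughly what chattr +C does: read the inode flags, add the NOCOW bit, write them back. The ioctl numbers and the flag value are assumptions taken from <linux/fs.h> on x86_64 Linux, and the helper names with_nocow and set_nocow are made up for illustration. It is not something systemd or btrfs-progs ships.

```python
import fcntl
import os
import struct

# Assumed values from <linux/fs.h> on x86_64 Linux; verify them for other ABIs.
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL = 0x00800000  # the flag that lsattr shows as 'C'


def with_nocow(flags: int) -> int:
    """Return the inode flag word with the NOCOW bit added."""
    return flags | FS_NOCOW_FL


def set_nocow(path: str) -> int:
    """Set NOCOW on an existing directory (or empty file); return the new flags."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = bytearray(8)  # room for a C long; the kernel fills in an int
        fcntl.ioctl(fd, FS_IOC_GETFLAGS, buf)
        flags = with_nocow(struct.unpack_from("I", buf)[0])
        struct.pack_into("I", buf, 0, flags)
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, buf)
        return flags
    finally:
        os.close(fd)
```

Calling set_nocow("/var/log/journal") as root would be the equivalent of chattr +C on that directory. Only files created there afterwards inherit the flag; journal files that already contain data are not converted, which is why the flag has to be set before the write-into pattern starts.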
[PATCH 1/2] Btrfs: return free space to global_rsv as much as possible
@full is not protected within global_rsv.lock, so we may think global_rsv is already full but in fact it's not, so we miss the opportunity to return free space to global_rsv directly when we release other block_rsvs.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9c01509..009980c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4637,7 +4637,7 @@ void btrfs_block_rsv_release(struct btrfs_root *root,
 			     u64 num_bytes)
 {
 	struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
-	if (global_rsv->full || global_rsv == block_rsv ||
+	if (global_rsv == block_rsv ||
 	    block_rsv->space_info != global_rsv->space_info)
 		global_rsv = NULL;
 	block_rsv_release_bytes(root->fs_info, block_rsv, global_rsv,
--
1.8.2.1
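
Editor's sketch of why the unlocked check matters. This is a single-threaded toy model, not the kernel code: BlockRsv and release are illustrative names, and the observed_full parameter is a hypothetical stand-in for whatever a lock-free read of global_rsv->full happened to return. Passing None models the patched caller, which no longer looks at the flag before the lock is taken.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockRsv:
    size: int      # bytes this reservation wants to hold
    reserved: int  # bytes it currently holds


def release(block_rsv: BlockRsv, global_rsv: BlockRsv, num_bytes: int,
            observed_full: Optional[bool] = None) -> int:
    """Shrink block_rsv by num_bytes; return the bytes handed to global_rsv."""
    dest: Optional[BlockRsv] = global_rsv
    # Old behaviour: skip the global reservation if the unlocked read said "full".
    if dest is block_rsv or observed_full:
        dest = None

    block_rsv.size -= num_bytes
    excess = max(block_rsv.reserved - block_rsv.size, 0)
    block_rsv.reserved -= excess

    given = 0
    if dest is not None:
        # In the real code this part runs under the destination's lock.
        room = max(dest.size - dest.reserved, 0)
        given = min(excess, room)
        dest.reserved += given
    return given  # anything not given goes back to the general pool


# Stale "full" observation: the global reservation misses out on 30 bytes.
g_old, b_old = BlockRsv(100, 60), BlockRsv(50, 50)
missed = release(b_old, g_old, 30, observed_full=True)

# Patched caller: the same release tops up the global reservation.
g_new, b_new = BlockRsv(100, 60), BlockRsv(50, 50)
returned = release(b_new, g_new, 30)
```

In the toy run the global reservation stays at 60 of 100 bytes when the stale flag is trusted, and rises to 90 when the decision is left to the code that holds the lock.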
[PATCH 2/2] Btrfs: release subvolume's block_rsv before transaction commit
We don't have to keep subvolume's block_rsv during transaction commit, and within transaction commit, we may also need the free space reclaimed from this block_rsv to process delayed refs.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/ioctl.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 21da576..347bf61 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -417,7 +417,8 @@ static noinline int create_subvol(struct inode *dir,
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
-		goto out;
+		btrfs_subvolume_release_metadata(root, &block_rsv, qgroup_reserved);
+		return ret;
 	}
 	trans->block_rsv = &block_rsv;
 	trans->bytes_reserved = block_rsv.size;
@@ -542,6 +543,8 @@ static noinline int create_subvol(struct inode *dir,
 fail:
 	trans->block_rsv = NULL;
 	trans->bytes_reserved = 0;
+	btrfs_subvolume_release_metadata(root, &block_rsv, qgroup_reserved);
+
 	if (async_transid) {
 		*async_transid = trans->transid;
 		err = btrfs_commit_transaction_async(trans, root, 1);
@@ -555,8 +558,6 @@ fail:
 	if (!ret)
 		d_instantiate(dentry, btrfs_lookup_dentry(dir, dentry));
 
-out:
-	btrfs_subvolume_release_metadata(root, &block_rsv, qgroup_reserved);
 	return ret;
 }
 
--
1.8.2.1
Migrate to bcache: A few questions
Hello list!

I'm planning to buy a small SSD (around 60GB) and use it for bcache in front of my 3x 1TB HDD btrfs setup (mraid1+draid0) using write-back caching. Btrfs is my root device, thus the system must be able to boot from bcache using an init ramdisk. My /boot is a separate filesystem outside of btrfs and will be outside of bcache. I am using Gentoo as my system.

I have a few questions:

* How stable is it? I've read about some csum errors lately...

* I want to migrate my current storage to bcache without replaying a backup. Is it possible?

* Did others already use it? What is the perceived performance for desktop workloads in comparison to not using bcache?

* How well does bcache handle power outages? Btrfs has handled them very well for many months now.

* How well does it play with dracut as initrd? Is it as simple as telling it the new device nodes or is there something complicated to configure?

* How does bcache handle a failing SSD when it starts to wear out in a few years?

* Is it worth waiting for hot-relocation support in btrfs to natively use an SSD as cache?

* Would you recommend going with a bigger/smaller SSD? I'm planning to use only 75% of it for bcache so wear-leveling can work better, maybe use another part of it for hibernation (suspend to disk).

Regards,
Kai
Re: systemd-journal, nodatacow, was: Is anyone using btrfs send/receive for backups instead of rsync?
On Dec 29, 2013, at 5:39 AM, Duncan <1i5t5.dun...@cox.net> wrote:

> Yes, it does turn off checksumming as well as COW, but given the write-into scenario, that's actually best anyway, because otherwise btrfs has to keep updating the checksums

On second thought, I'm less concerned with bitrot and checksumming being lost with nodatacow, than I am with significantly increasing the chance the journal is irreparably lost due to corruption during an unclean shutdown.

> But in all these cases, it's also quite common for the application doing the writing to have its own checksumming/error-detection and possible correction -- it pretty much comes with the territory -- in which case btrfs attempting to do the same is simply superfluous even if it weren't a race-condition trigger.

I don't know what kind of checksumming systemd performs on the journal, but whenever Btrfs has found corruption in the journal file(s), systemd-journald has also found the corruption and started a new log. So it makes sense to rely on its own mechanisms rather than Btrfs's.

> And I'm predicting that since btrfs is the assumed successor to the ext* series as the Linux default filesystem, and systemd is targeting Linux default initsystem status as well, it's only logical that at some point systemd will detect what filesystem it's logging to, and will automatically set NOCOW on the journal file when that filesystem is btrfs.

Is this something that should be brought up on the systemd-devel@ list? Or maybe file it as an RFE against systemd at freedesktop.org?

Chris Murphy
Re: Migrate to bcache: A few questions
On Dec 29, 2013, at 2:11 PM, Kai Krakow <hurikhan77+bt...@gmail.com> wrote:

> * How stable is it? I've read about some csum errors lately…

Seems like bcache devs are still looking into the recent btrfs csum issues.

> * I want to migrate my current storage to bcache without replaying a backup. Is it possible?
>
> * Did others already use it? What is the perceived performance for desktop workloads in comparison to not using bcache?
>
> * How well does bcache handle power outages? Btrfs has handled them very well for many months now.
>
> * How well does it play with dracut as initrd? Is it as simple as telling it the new device nodes or is there something complicated to configure?
>
> * How does bcache handle a failing SSD when it starts to wear out in a few years?

I think most of these questions are better suited for the bcache list.

I think there are still many uncertainties about the behavior of SSDs during power failures when they aren't explicitly designed with power failure protection in mind. At best I'd hope for a rollback involving data loss, but hopefully not a corrupt file system. I'd rather lose the last minute of data supposedly written to the drive, than have to do a full restore from backup.

> * Is it worth waiting for hot-relocation support in btrfs to natively use an SSD as cache?

I haven't read anything about it. Don't see it listed in project ideas.

> * Would you recommend going with a bigger/smaller SSD? I'm planning to use only 75% of it for bcache so wear-leveling can work better, maybe use another part of it for hibernation (suspend to disk).

I think that depends greatly on workload. If you're writing or reading a lot of disparate files, or a lot of small file random writes (mail server), I'd go bigger. By default sequential IO isn't cached. So I think you can get a big boost in responsiveness with a relatively small bcache size.

Chris Murphy
Re: Migrate to bcache: A few questions
Chris Murphy <li...@colorremedies.com> wrote:

> I think most of these questions are better suited for the bcache list.

Ah yes, you are right. I will repost the non-btrfs related questions to the bcache list. But actually I am most interested in using bcache together with btrfs, so getting a general picture of its current state in this combination would be nice - and so these questions may be partially appropriate here.

> I think there are still many uncertainties about the behavior of SSDs during power failures when they aren't explicitly designed with power failure protection in mind. At best I'd hope for a rollback involving data loss, but hopefully not a corrupt file system. I'd rather lose the last minute of data supposedly written to the drive, than have to do a full restore from backup.

These thoughts are actually quite interesting. So you are saying that data may not be fully written to SSD although the kernel thinks so? This is probably very dangerous. The bcache module could not ensure coherence between its backing devices and its own contents - and data loss will occur and probably destroy important file system structures.

I understand you to mean that data may be only partially written. This, of course, may happen to HDDs as well. But usually a file system works with transactions so the last incomplete transaction can simply be thrown away. I hope bcache implements the same architecture. But what does it mean for the stacked write-back architecture?

As I understand, bcache may use write-through for sequential writes, but write-back for random writes. In this case, part of the data may have hit the backing device, while other data exists only in the bcache. If that last transaction is not closed due to power loss, and then thrown away, we have part of the transaction already written to the backing device that the filesystem does not know of after resume.

I'd appreciate some thoughts about it but this topic is probably also best moved over to the bcache list.

Thanks,
Kai
Re: Migrate to bcache: A few questions
On Dec 29, 2013, at 6:22 PM, Kai Krakow <hurikhan77+bt...@gmail.com> wrote:

> So you are saying that data may not be fully written to SSD although the kernel thinks so?

Drives shouldn't lie when asked to flush to disk, but they do. An older article about this at LWN is a decent primer on the subject of write barriers.

http://lwn.net/Articles/283161/

> This is probably very dangerous. The bcache module could not ensure coherence between its backing devices and its own contents - and data loss will occur and probably destroy important file system structures.

I don't know the details; there's more detail on lkml.org and the bcache lists. My impression is that short of bugs, it should be much safer than you describe. It's not like a linear/concat md or LVM device fail scenario. There's good info in the bcache.h file:

http://lxr.free-electrons.com/source/drivers/md/bcache/bcache.h

If anything, once the kinks are worked out, under heavy random write IO I'd expect bcache to improve the likelihood data isn't lost. Faster speed of SSD means we get a faster commit of the data to stable media. Also bcache assumes the cache is always dirty on startup, no matter whether the shutdown was clean or dirty, so the code is explicitly designed to resolve the state of the cache relative to the backing device. It's actually pretty fascinating work.

It may not be required, but I'd expect we'd want the write cache on the backing device disabled. It should still honor write barriers but it kinda seems unnecessary and riskier to have it enabled (which is the default with consumer drives).

> As I understand, bcache may use write-through for sequential writes, but write-back for random writes. In this case, part of the data may have hit the backing device, other data does only exist in the bcache. If that last transaction is not closed due to power-loss, and then thrown away, we have part of the transaction already written to the backing device that the filesystem does not know of after resume.

In the write-through case we should be no worse off than the bare drive in a power loss. In the write-back case the SSD should have committed more data than the HDD could have in the same situation. I don't understand the details of how partially successful writes to the backing media are handled when the system comes back up. Since bcache is also COW, SSD blocks aren't reused until data is committed to the backing device.

Chris Murphy
Re: Migrate to bcache: A few questions
Kai Krakow posted on Sun, 29 Dec 2013 22:11:16 +0100 as excerpted:

> Hello list!
>
> I'm planning to buy a small SSD (around 60GB) and use it for bcache in front of my 3x 1TB HDD btrfs setup (mraid1+draid0) using write-back caching. Btrfs is my root device, thus the system must be able to boot from bcache using init ramdisk. My /boot is a separate filesystem outside of btrfs and will be outside of bcache. I am using Gentoo as my system.

Gentooer here too. =:^)

> I have a few questions:
>
> * How stable is it? I've read about some csum errors lately...

FWIW, both bcache and btrfs are new and still developing technology. While I'm using btrfs here, I have tested usable backups (which for root means either directly bootable, or that you have tested booting to a recovery image and restoring from there; I do the former here), as STRONGLY recommended for btrfs in its current state, but haven't had to use them.

And I considered bcache previously and might otherwise be using it, but at least personally, I'm not willing to try BOTH of them at once, since neither one is mature yet and if there are problems as there very well might be, I'd have the additional issue of figuring out which one was the problem, and I'm personally not prepared to deal with that.

Instead, at this point I'd recommend choosing /either/ bcache /or/ btrfs, and using bcache with a more mature filesystem like ext4 or (what I used for years previous and still use for spinning rust) reiserfs.

And as I said, keep your backups as current as you're willing to deal with losing what's not backed up, and tested usable and (for root) either bootable or restorable from alternate boot, because while at least btrfs is /reasonably/ stable for /ordinary/ daily use, there remain corner-cases and you never know when your case is going to BE a corner-case!

> * I want to migrate my current storage to bcache without replaying a backup. Is it possible?
Since I've not actually used bcache, I won't try to answer some of these, but will answer based on what I've seen on the list where I can... I don't know on this one.

> * Did others already use it? What is the perceived performance for desktop workloads in comparison to not using bcache?

Others are indeed already using it. I've seen some btrfs/bcache problems reported on this list, but as mentioned above, when both are in use that means figuring out which is the problem, and at least from the btrfs side I've not seen a lot of resolution in that regard. From here it /looks/ like that's simply being punted at this time, as there's still more easily traceable problems without the additional bcache variable to work on first. But it's quite possible the bcache list is actively tackling btrfs/bcache combination problems, as I'm not subscribed there.

So I can't answer the desktop performance comparison question directly, but given that I /am/ running btrfs on SSD, I /can/ say I'm quite happy with that. =:^)

Keep in mind... We're talking storage cache here. Given the cost of memory and common system configurations these days, 4-16 gig of memory on a desktop isn't unusual or cost prohibitive, and a common desktop working set should well fit.

I suspect my desktop setup, 16 gigs memory backing a 6-core AMD fx6100 (bulldozer-1) @ 3.6 GHz, is probably a bit toward the high side even for a gentooer, but not inordinately so. Based on my usage...

Typical app memory usage runs 1-2 GiB (that's with KDE 4.12.49. from the gentoo/kde overlay, but USE=-semantic-desktop, etc). Buffer memory runs a few MiB but isn't normally significant, so it can fold into that same 1-2 GiB too.

That leaves a full 14 GiB for cache. But at least with /my/ usage, normal non-update cache memory usage tends to be below ~6 GiB too, so total apps/buffer/cache memory usage tends to be below 8 GiB as well.
When I'm doing multi-job builds or working with big media files, I'll sometimes go above 8 gig usage, and that occasional cache-spill was why I upgraded to 16 gig. But in practice, 10 gig would take care of that most of the time, and were it not for the accident of powers-of-two meaning 16 gig is the notch above 8 gig, 10 or 12 gig would be plenty. Truth be told, I so seldom use that last 4 gig that it's almost embarrassing.

* Tho if I ran multi-GiB VMs that'd use up that extra memory real fast! But while that /is/ becoming more common, I'm not exactly sure I'd classify 4 gigs plus of VM usage as desktop usage just yet. Workstation, yes, and definitely server, but not really desktop.

All that as background to this...

* Cache works only after first access. If you only access something occasionally, it may not be worth caching at all.

* Similarly, if access isn't time critical, think of playing a huge video file where only a few meg in memory at once is plenty, and where storage access is several times faster than play-speed, cache isn't particularly useful.

* Bcache
[PATCH] Btrfs: use WARN_ON_ONCE instead for btrfs_invalidate_inodes
After a transaction is aborted, we need to clean up inode resources by calling btrfs_invalidate_inodes(). btrfs_invalidate_inodes() used to expect the root's refs to be zero and has a WARN_ON() for that. However, this is not always true while cleaning up a transaction, so WARN_ON_ONCE() is better, and we won't get another syslog message bomb.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f1a7744..ef7f6af 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4761,7 +4761,7 @@ void btrfs_invalidate_inodes(struct btrfs_root *root)
 	struct inode *inode;
 	u64 objectid = 0;
 
-	WARN_ON(btrfs_root_refs(&root->root_item) != 0);
+	WARN_ON_ONCE(btrfs_root_refs(&root->root_item) != 0);
 
 	spin_lock(&root->inode_lock);
 again:
--
1.8.2.1
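
Editor's sketch of the difference the patch relies on. It is a userspace toy in Python, not the kernel macros: warn_on and warn_on_once are illustrative names, and the call-site string stands in for the per-call-site static flag that the real WARN_ON_ONCE() keeps. Both helpers still return the condition, so callers can keep branching on it.

```python
_already_warned = set()  # stands in for the kernel's per-call-site static flag


def warn_on(condition: bool, site: str, log: list) -> bool:
    """Log every time the condition is true, like WARN_ON()."""
    if condition:
        log.append(f"WARNING at {site}")
    return condition


def warn_on_once(condition: bool, site: str, log: list) -> bool:
    """Log only the first time the condition is true at this call site."""
    if condition and site not in _already_warned:
        _already_warned.add(site)
        log.append(f"WARNING at {site}")
    return condition


noisy_log, quiet_log = [], []
for _ in range(1000):  # e.g. many roots cleaned up after one aborted transaction
    refs = 1           # non-zero refs, the case the patch description mentions
    warn_on(refs != 0, "btrfs_invalidate_inodes", noisy_log)
    warn_on_once(refs != 0, "btrfs_invalidate_inodes", quiet_log)
```

After the loop, noisy_log holds one entry per call while quiet_log holds a single entry, which is the "no syslog message bomb" effect described in the commit message.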